Kohaku Hub API Documentation
This document explains how Kohaku Hub's API works, the data flow, and key endpoints.
System Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Client Request │
│ (huggingface_hub Python) │
└────────────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Layer │
│ (kohakuhub/api/*) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ basic │ │ file │ │ lfs │ │ utils │ │
│ │ .py │ │ .py │ │ .py │ │ .py │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────┬────────────────────────────────┘
│
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────┐ ┌─────────────┐
│ LakeFS │ │ SQLite/ │ │ MinIO │
│ │ │ Postgres │ │ (S3) │
│ Versioning │ │ Metadata │ │ Storage │
│ Branches │ │ Dedup │ │ Objects │
└─────────────┘ └──────────┘ └─────────────┘
Core Concepts
File Size Thresholds
File Size Decision Tree:
Is file > 10MB?
│
┌───────┴───────┐
│ │
NO YES
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ Regular │ │ LFS │
│ Mode │ │ Mode │
└─────────┘ └─────────┘
│ │
▼ ▼
Base64 in S3 Direct
Commit Upload
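The decision tree above can be sketched as a tiny helper (the 10 MB cutoff comes from the tree; the function name is illustrative, not from the Kohaku Hub codebase):

```python
LFS_THRESHOLD = 10 * 1024 * 1024  # 10 MiB cutoff from the decision tree above

def upload_mode(size_bytes: int) -> str:
    """Return "lfs" for files over the threshold, else "regular"."""
    return "lfs" if size_bytes > LFS_THRESHOLD else "regular"
```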
Storage Layout
S3 Bucket Structure:
s3://hub-storage/
│
├── hf-model-org-repo/ ← LakeFS managed repository
│ └── main/ ← Branch
│ ├── config.json
│ └── model.safetensors
│
└── lfs/ ← LFS objects (content-addressable)
└── ab/ ← First 2 chars of SHA256
└── cd/ ← Next 2 chars
└── abcd1234... ← Full SHA256 hash
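The content-addressable layout above maps a SHA256 digest to a fixed key. A minimal sketch of that sharding scheme (function name is illustrative):

```python
def lfs_object_key(sha256_hex: str) -> str:
    """Map a SHA256 hex digest to its content-addressable key under lfs/.

    The first two hex chars form the first directory, the next two the
    second, followed by the full digest, as shown in the layout above.
    """
    return f"lfs/{sha256_hex[:2]}/{sha256_hex[2:4]}/{sha256_hex}"
```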
Upload Workflow
Overview
┌────────┐ ┌──────────┐ ┌─────────┐ ┌────────┐
│ Client │────▶│ Preupload│────▶│ Upload │────▶│ Commit │
└────────┘ └──────────┘ └─────────┘ └────────┘
User Check if Upload Atomic
Request file exists file(s) commit
(dedup) (S3/inline) (LakeFS)
Step 1: Preupload Check
Purpose: Determine upload mode and check for duplicates
Endpoint: POST /api/{repo_type}s/{repo_id}/preupload/{revision}
Request:
{
"files": [
{
"path": "config.json",
"size": 1024,
"sha256": "abc123..."
},
{
"path": "model.bin",
"size": 52428800,
"sha256": "def456..."
}
]
}
Response:
{
"files": [
{
"path": "config.json",
"uploadMode": "regular",
"shouldIgnore": false
},
{
"path": "model.bin",
"uploadMode": "lfs",
"shouldIgnore": true // Already exists!
}
]
}
Decision Logic:
For each file:
1. Check size:
- ≤ 10MB → "regular"
- > 10MB → "lfs"
2. Check if exists (deduplication):
- Query DB for matching SHA256 + size
- If match found → shouldIgnore: true
- If no match → shouldIgnore: false
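The decision logic above can be sketched as a pure function. This is a simplified stand-in: the real server queries its database for matching SHA256 + size, represented here by a plain set.

```python
LFS_THRESHOLD = 10 * 1024 * 1024  # 10 MiB

def preupload_response(files, existing):
    """Apply the preupload decision logic described above.

    files:    list of {"path", "size", "sha256"} dicts from the request.
    existing: set of (sha256, size) pairs already recorded in the DB
              (a stand-in for the real query; illustrative only).
    """
    out = []
    for f in files:
        out.append({
            "path": f["path"],
            # size check: over the threshold means LFS mode
            "uploadMode": "lfs" if f["size"] > LFS_THRESHOLD else "regular",
            # dedup check: matching SHA256 + size means skip the upload
            "shouldIgnore": (f["sha256"], f["size"]) in existing,
        })
    return {"files": out}
```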
Step 2a: Regular Upload (≤10MB)
Files are sent inline in the commit payload as base64.
┌────────┐ ┌────────┐
│ Client │───── base64 ──────>│ Commit │
└────────┘ (embedded) └────────┘
No separate upload step needed - proceed directly to Step 3.
Step 2b: LFS Upload (>10MB)
Phase 1: Request Upload URLs
Endpoint: POST /{repo_id}.git/info/lfs/objects/batch
Request:
{
"operation": "upload",
"transfers": ["basic", "multipart"],
"objects": [
{
"oid": "sha256_hash",
"size": 52428800
}
]
}
Response (if file needs upload):
{
"transfer": "basic",
"objects": [
{
"oid": "sha256_hash",
"size": 52428800,
"actions": {
"upload": {
"href": "https://s3.../presigned_url",
"expires_at": "2025-10-02T00:00:00Z"
}
}
}
]
}
Response (if file already exists):
{
"transfer": "basic",
"objects": [
{
"oid": "sha256_hash",
"size": 52428800
// No "actions" field = already exists
}
]
}
Phase 2: Upload to S3
┌────────┐ ┌─────────┐
│ Client │──── PUT file ────────>│ S3 │
└────────┘ (presigned URL) └─────────┘
Direct upload lfs/ab/cd/
(no proxy!) abcd123...
Key Point: Client uploads directly to S3 using the presigned URL. Kohaku Hub server is NOT involved in data transfer.
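A client deciding what still needs uploading only has to look for the `actions` field in the batch response, per the two response shapes above. A hedged sketch (function name is illustrative):

```python
def pending_uploads(batch_response):
    """Return {oid: presigned_url} for objects that still need uploading.

    Objects without an "actions" field already exist server-side and can
    be skipped, matching the two batch response shapes shown above.
    """
    todo = {}
    for obj in batch_response.get("objects", []):
        actions = obj.get("actions")
        if actions and "upload" in actions:
            todo[obj["oid"]] = actions["upload"]["href"]
    return todo
```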
Step 3: Commit
Purpose: Atomically commit all changes to the repository
Endpoint: POST /api/{repo_type}s/{repo_id}/commit/{revision}
Format: NDJSON (Newline-Delimited JSON)
Example Payload:
{"key":"header","value":{"summary":"Add model files","description":"Initial upload"}}
{"key":"file","value":{"path":"config.json","content":"eyJtb2RlbCI6...","encoding":"base64"}}
{"key":"lfsFile","value":{"path":"model.bin","algo":"sha256","oid":"abc123...","size":52428800}}
{"key":"deletedFile","value":{"path":"old_config.json"}}
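A payload like the example above can be assembled mechanically. This sketch covers the header, inline, LFS, and delete line types (the helper name and argument shapes are illustrative, not from the Kohaku Hub codebase):

```python
import base64
import json

def build_commit_ndjson(summary, small_files, lfs_files, deleted=()):
    """Assemble an NDJSON commit payload like the example above.

    small_files: {path: raw bytes}            -> inline base64 "file" lines
    lfs_files:   {path: (sha256_oid, size)}   -> "lfsFile" reference lines
    deleted:     iterable of paths            -> "deletedFile" lines
    """
    lines = [{"key": "header", "value": {"summary": summary}}]
    for path, data in small_files.items():
        lines.append({"key": "file", "value": {
            "path": path,
            "content": base64.b64encode(data).decode(),
            "encoding": "base64",
        }})
    for path, (oid, size) in lfs_files.items():
        lines.append({"key": "lfsFile", "value": {
            "path": path, "algo": "sha256", "oid": oid, "size": size,
        }})
    for path in deleted:
        lines.append({"key": "deletedFile", "value": {"path": path}})
    return "\n".join(json.dumps(line) for line in lines)
```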
Operation Types:
| Key | Description | Usage |
|---|---|---|
| `header` | Commit metadata | Required, must be first line |
| `file` | Small file (inline base64) | For files ≤ 10MB |
| `lfsFile` | Large file (LFS reference) | For files > 10MB, already uploaded to S3 |
| `deletedFile` | Delete a single file | Remove file from repo |
| `deletedFolder` | Delete folder recursively | Remove all files in folder |
| `copyFile` | Copy file within repo | Duplicate file (deduplication-aware) |
Response:
{
"commitUrl": "https://hub.example.com/repo/commit/abc123",
"commitOid": "abc123def456",
"pullRequestUrl": null
}
What Happens:
1. Regular files:
┌─────────┐
│ Decode │ Base64 → Binary
└────┬────┘
│
▼
┌─────────┐
│ Upload │ To LakeFS
└────┬────┘
│
▼
┌─────────┐
│ Update │ Database record
└─────────┘
2. LFS files:
┌─────────┐
│ Link │ S3 physical address → LakeFS
└────┬────┘
│
▼
┌─────────┐
│ Update │ Database record
└─────────┘
3. Commit:
┌─────────┐
│ LakeFS │ Create commit with all changes
└─────────┘
Download Workflow
┌────────┐ ┌──────────┐ ┌─────────┐
│ Client │────>│ HEAD │────>│ GET │
└────────┘ └──────────┘ └─────────┘
Request Get metadata Download
(size, hash) (redirect)
Step 1: Get Metadata (HEAD)
Endpoint: HEAD /{repo_id}/resolve/{revision}/{filename}
Response Headers:
X-Repo-Commit: abc123def456
X-Linked-Etag: "sha256:abc123..."
X-Linked-Size: 52428800
ETag: "abc123..."
Content-Length: 52428800
Location: https://s3.../presigned_download_url
Purpose: Client checks if file needs re-download (by comparing ETag)
Step 2: Download (GET)
Endpoint: GET /{repo_id}/resolve/{revision}/{filename}
Response: HTTP 302 Redirect
HTTP/1.1 302 Found
Location: https://s3.example.com/presigned_url?expires=...
X-Repo-Commit: abc123def456
X-Linked-Etag: "sha256:abc123..."
Flow:
┌────────┐ ┌──────────┐
│ Client │───── GET ─────>│ Kohaku │
└────────┘ │ Hub │
▲ └─────┬────┘
│ │
│ 302 Redirect │ Generate
│ (presigned URL) │ presigned
│<─────────────────────────┘ URL
│
│ ┌──────────┐
└───>│ S3 │
│ Direct │
│ Download │
└──────────┘
Key Point: Client downloads directly from S3. Kohaku Hub only provides the redirect URL.
Repository Privacy & Filtering
Kohaku Hub respects repository privacy settings when listing repositories. Repository visibility depends on authentication:
Privacy Rules
For Unauthenticated Users:
- Can only see public repositories
For Authenticated Users:
- Can see all public repositories
- Can see their own private repositories
- Can see private repositories in organizations they belong to
List Repositories Endpoint
Pattern: /api/{type}s where type is model, dataset, or space
Query Parameters:
- author: Filter by author/namespace (username or organization)
- limit: Maximum results (default: 50, max: 1000)
Examples:
# List all public models
GET /api/models
# List models by author (respects privacy)
GET /api/models?author=my-org
# Authenticated user sees their private repos too
GET /api/models?author=my-org
Authorization: Bearer YOUR_TOKEN
List User's All Repositories
Endpoint: GET /api/users/{username}/repos
Returns all repositories for a user/organization, grouped by type.
Response:
{
"models": [
{"id": "user/model-1", "private": false, ...},
{"id": "user/model-2", "private": true, ...}
],
"datasets": [
{"id": "user/dataset-1", "private": false, ...}
],
"spaces": []
}
Note: Private repositories are only included if:
- The requesting user is the owner, OR
- The requesting user is a member of the organization
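The rules above reduce to a small visibility predicate. A sketch under the assumption that a repo record carries its namespace and a private flag (names are illustrative):

```python
def can_view(repo, user=None, user_orgs=()):
    """Visibility predicate for repository listings, per the rules above.

    repo:      dict with "namespace" and "private" keys (illustrative shape).
    user:      username of the requester, or None if unauthenticated.
    user_orgs: organizations the requester belongs to.
    """
    if not repo["private"]:
        return True   # public repos are visible to everyone
    if user is None:
        return False  # unauthenticated users see only public repos
    # private: visible to the owner or to members of the owning org
    return repo["namespace"] == user or repo["namespace"] in user_orgs
```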
Repository Management
Create Repository
Endpoint: POST /api/repos/create
Request:
{
"type": "model",
"name": "my-model",
"organization": "my-org",
"private": false
}
What Happens:
1. Check if exists
└─ Query DB for repo
2. Create LakeFS repo
└─ Repository: hf-model-my-org-my-model
└─ Storage: s3://bucket/hf-model-my-org-my-model
└─ Default branch: main
3. Record in DB
└─ INSERT INTO repository (...)
Response:
{
"url": "https://hub.example.com/models/my-org/my-model",
"repo_id": "my-org/my-model"
}
List Repository Files
Endpoint: GET /api/{repo_type}s/{repo_id}/tree/{revision}/{path}
Query Parameters:
- recursive: List all files recursively (default: false)
- expand: Include LFS metadata (default: false)
Response:
[
{
"type": "file",
"oid": "abc123",
"size": 1024,
"path": "config.json"
},
{
"type": "file",
"oid": "def456",
"size": 52428800,
"path": "model.bin",
"lfs": {
"oid": "def456",
"size": 52428800,
"pointerSize": 134
}
},
{
"type": "directory",
"oid": "",
"size": 0,
"path": "configs"
}
]
Delete Repository
Endpoint: DELETE /api/repos/delete
Request:
{
"type": "model",
"name": "my-model",
"organization": "my-org"
}
What Happens:
1. Delete from LakeFS
└─ Remove repository metadata
└─ (Objects remain in S3 for safety)
2. Delete from DB
├─ DELETE FROM file WHERE repo_full_id = ...
├─ DELETE FROM staging_upload WHERE repo_full_id = ...
└─ DELETE FROM repository WHERE full_id = ...
3. Return success
Database Schema
Repository Table
┌──────────────┬──────────────┬─────────────┐
│ Column │ Type │ Index? │
├──────────────┼──────────────┼─────────────┤
│ id │ INTEGER PK │ Primary │
│ repo_type │ VARCHAR │ Yes │
│ namespace │ VARCHAR │ Yes │
│ name │ VARCHAR │ Yes │
│ full_id │ VARCHAR │ Unique │
│ private │ BOOLEAN │ No │
│ created_at │ TIMESTAMP │ No │
└──────────────┴──────────────┴─────────────┘
Example:
repo_type: "model"
namespace: "myorg"
name: "mymodel"
full_id: "myorg/mymodel"
File Table (Deduplication)
┌──────────────┬──────────────┬─────────────┐
│ Column │ Type │ Index? │
├──────────────┼──────────────┼─────────────┤
│ id │ INTEGER PK │ Primary │
│ repo_full_id │ VARCHAR │ Yes │
│ path_in_repo │ VARCHAR │ Yes │
│ size │ INTEGER │ No │
│ sha256 │ VARCHAR │ Yes │
│ lfs │ BOOLEAN │ No │
│ created_at │ TIMESTAMP │ No │
│ updated_at │ TIMESTAMP │ No │
└──────────────┴──────────────┴─────────────┘
Unique constraint: (repo_full_id, path_in_repo)
Purpose:
- Track file SHA256 hashes for deduplication
- Check if file changed before upload
- Maintain file metadata
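The dedup lookup this table enables can be sketched with an in-memory SQLite database. Column names follow the schema above; the real server's ORM and indexes may differ.

```python
import sqlite3

def make_db():
    """Create an in-memory sketch of the file table described above."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE file (
            id INTEGER PRIMARY KEY,
            repo_full_id TEXT,
            path_in_repo TEXT,
            size INTEGER,
            sha256 TEXT,
            lfs BOOLEAN,
            UNIQUE (repo_full_id, path_in_repo)
        )""")
    db.execute("CREATE INDEX idx_file_sha256 ON file (sha256)")
    return db

def is_duplicate(db, sha256, size):
    """Preupload dedup check: does any file match this SHA256 + size?"""
    row = db.execute(
        "SELECT 1 FROM file WHERE sha256 = ? AND size = ? LIMIT 1",
        (sha256, size),
    ).fetchone()
    return row is not None
```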
StagingUpload Table (Optional)
┌──────────────┬──────────────┬─────────────┐
│ Column │ Type │ Index? │
├──────────────┼──────────────┼─────────────┤
│ id │ INTEGER PK │ Primary │
│ repo_full_id │ VARCHAR │ Yes │
│ revision │ VARCHAR │ Yes │
│ path_in_repo │ VARCHAR │ No │
│ sha256 │ VARCHAR │ No │
│ size │ INTEGER │ No │
│ upload_id │ VARCHAR │ No │
│ storage_key │ VARCHAR │ No │
│ lfs │ BOOLEAN │ No │
│ created_at │ TIMESTAMP │ No │
└──────────────┴──────────────┴─────────────┘
Purpose:
- Track ongoing multipart uploads
- Enable upload resume
- Clean up failed uploads
LakeFS Integration
Repository Naming Convention
Pattern: hf-{repo_type}-{namespace}-{name}
Examples:
HuggingFace repo: "myorg/mymodel"
LakeFS repo: "hf-model-myorg-mymodel"
HuggingFace repo: "johndoe/dataset"
LakeFS repo: "hf-dataset-johndoe-dataset"
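The mapping shown in the examples can be sketched as a one-liner (the `hf` prefix matches the examples above; the function name is illustrative):

```python
def lakefs_repo_name(repo_type: str, repo_id: str, prefix: str = "hf") -> str:
    """Map a HuggingFace-style "namespace/name" id to its LakeFS repo name."""
    namespace, name = repo_id.split("/", 1)
    return f"{prefix}-{repo_type}-{namespace}-{name}"
```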
Key Operations
| Operation | LakeFS API | Purpose |
|---|---|---|
| Create Repo | `repositories.create_repository()` | Initialize new repository |
| Upload Small File | `objects.upload_object()` | Direct content upload |
| Link LFS File | `staging.link_physical_address()` | Link S3 object to LakeFS |
| Commit | `commits.commit()` | Create atomic commit |
| List Files | `objects.list_objects()` | Browse repository |
| Get File Info | `objects.stat_object()` | Get file metadata |
| Delete File | `objects.delete_object()` | Remove file |
Physical Address Linking
When uploading LFS file:
1. Client uploads to S3:
s3://bucket/lfs/ab/cd/abcd1234...
2. Kohaku Hub links to LakeFS:
┌──────────────────────────────────┐
│ StagingMetadata │
├──────────────────────────────────┤
│ physical_address: │
│ "s3://bucket/lfs/ab/cd/abc..." │
│ checksum: "sha256:abc..." │
│ size_bytes: 52428800 │
└──────────────────────────────────┘
│
▼
┌──────────────────────────────────┐
│ LakeFS: model.bin │
│ → Points to S3 object │
└──────────────────────────────────┘
3. On commit:
LakeFS records this link in its metadata
API Endpoint Summary
Repository Operations
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/api/repos/create` | POST | ✓ | Create new repository |
| `/api/repos/delete` | DELETE | ✓ | Delete repository |
| `/api/{type}s` | GET | ○ | List repositories (respects privacy) |
| `/api/{type}s/{id}` | GET | ○ | Get repo info |
| `/api/{type}s/{id}/tree/{rev}/{path}` | GET | ○ | List files |
| `/api/{type}s/{id}/revision/{rev}` | GET | ○ | Get revision info |
| `/api/{type}s/{id}/paths-info/{rev}` | POST | ○ | Get info for specific paths |
| `/api/users/{username}/repos` | GET | ○ | List all repos for a user/org (grouped by type) |
File Operations
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/api/{type}s/{id}/preupload/{rev}` | POST | ✓ | Check before upload |
| `/api/{type}s/{id}/commit/{rev}` | POST | ✓ | Atomic commit |
| `/{id}/resolve/{rev}/{file}` | GET | ○ | Download file |
| `/{id}/resolve/{rev}/{file}` | HEAD | ○ | Get file metadata |
| `/{type}s/{id}/resolve/{rev}/{file}` | GET | ○ | Download file (with type) |
| `/{type}s/{id}/resolve/{rev}/{file}` | HEAD | ○ | Get file metadata (with type) |
LFS Operations
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/{id}.git/info/lfs/objects/batch` | POST | ✓ | LFS batch API |
| `/api/{id}.git/info/lfs/verify` | POST | ✓ | Verify upload |
Authentication Operations
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/api/auth/register` | POST | ✗ | Register new user |
| `/api/auth/login` | POST | ✗ | Login and create session |
| `/api/auth/logout` | POST | ✓ | Logout and destroy session |
| `/api/auth/verify-email` | GET | ✗ | Verify email with token |
| `/api/auth/me` | GET | ✓ | Get current user info |
| `/api/auth/tokens` | GET | ✓ | List user's API tokens |
| `/api/auth/tokens/create` | POST | ✓ | Create new API token |
| `/api/auth/tokens/{id}` | DELETE | ✓ | Revoke API token |
Organization Operations
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/org/create` | POST | ✓ | Create new organization |
| `/org/{name}` | GET | ✗ | Get organization details |
| `/org/{name}/members` | POST | ✓ | Add member to organization |
| `/org/{name}/members/{username}` | DELETE | ✓ | Remove member from organization |
| `/org/{name}/members/{username}` | PUT | ✓ | Update member role |
| `/org/users/{username}/orgs` | GET | ✗ | List user's organizations |
Utility Operations
| Endpoint | Method | Auth | Description |
|---|---|---|---|
| `/api/validate-yaml` | POST | ✗ | Validate YAML content |
| `/api/whoami-v2` | GET | ✓ | Get user info |
| `/health` | GET | ✗ | Health check |
| `/` | GET | ✗ | API information |
Auth Legend:
- ✓ = Required
- ○ = Optional (public repos)
- ✗ = Not required
Content Deduplication
Kohaku Hub implements content-addressable storage for LFS files:
Same file uploaded to different repos:
Repo A: myorg/model-v1
└─ model.bin (sha256: abc123...)
Repo B: myorg/model-v2
└─ model.bin (sha256: abc123...)
S3 Storage:
└─ lfs/ab/c1/abc123... ← SINGLE COPY
▲ ▲
│ │
Repo A Repo B
(linked) (linked)
Benefits:
- Save storage space
- Faster uploads (skip if exists)
- Efficient for model variants
Deduplication Points:
- Preupload Check: Query DB by SHA256
- LFS Batch API: Check if OID exists
- Commit: Link existing S3 object instead of uploading
Error Handling
Kohaku Hub uses HuggingFace-compatible error headers:
HTTP Response Headers:
X-Error-Code: RepoNotFound
X-Error-Message: Repository 'org/repo' not found
Error Codes:
| Code | HTTP Status | Description |
|---|---|---|
| `RepoNotFound` | 404 | Repository doesn't exist |
| `RepoExists` | 400 | Repository already exists |
| `RevisionNotFound` | 404 | Branch/commit not found |
| `EntryNotFound` | 404 | File not found |
| `GatedRepo` | 403 | Need permission |
| `BadRequest` | 400 | Invalid request |
| `ServerError` | 500 | Internal error |
These error codes are parsed by huggingface_hub client to raise appropriate Python exceptions.
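Server-side, producing these headers amounts to a small lookup. A sketch of the status/header mapping (names are illustrative, not the actual Kohaku Hub implementation):

```python
# HTTP status for each HuggingFace-compatible error code (from the table above)
ERROR_STATUS = {
    "RepoNotFound": 404,
    "RepoExists": 400,
    "RevisionNotFound": 404,
    "EntryNotFound": 404,
    "GatedRepo": 403,
    "BadRequest": 400,
    "ServerError": 500,
}

def error_headers(code: str, message: str):
    """Return (status, headers) for a HuggingFace-compatible error response."""
    return ERROR_STATUS[code], {
        "X-Error-Code": code,
        "X-Error-Message": message,
    }
```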
Performance Considerations
Upload Performance
Small Files (≤10MB):
Client → FastAPI → LakeFS → S3
(Proxied through server)
Large Files (>10MB):
Client ─────────────────────→ S3
(Direct upload, no proxy)
↓
Kohaku Hub (only metadata link)
Why this matters: Large files bypass the application server entirely, so throughput is limited only by client and S3 bandwidth.
Download Performance
All Downloads:
Client → Kohaku Hub → 302 Redirect → S3
(metadata) (direct)
Why this matters: After initial redirect, all data transfer is direct from S3/CDN. Server only generates presigned URLs.
Recommended S3 Providers
| Provider | Best For | Pricing Model |
|---|---|---|
| Cloudflare R2 | High download | Free egress, $0.015/GB storage |
| Wasabi | Archive/backup | $6/TB/month, free egress if download < storage |
| MinIO | Self-hosted | Free (your hardware/bandwidth) |
| AWS S3 | Enterprise | Pay per GB + egress |