# Kohaku Hub API Documentation

This document explains how Kohaku Hub's API works, the data flow, and key endpoints.

## System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Client Request                          │
│                    (huggingface_hub Python)                     │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          FastAPI Layer                          │
│                        (kohakuhub/api/*)                        │
│                                                                 │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐     │
│   │  basic   │   │   file   │   │   lfs    │   │  utils   │     │
│   │   .py    │   │   .py    │   │   .py    │   │   .py    │     │
│   └──────────┘   └──────────┘   └──────────┘   └──────────┘     │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                    ┌────────────┼────────────┐
                    │            │            │
                    ▼            ▼            ▼
            ┌─────────────┐ ┌──────────┐ ┌─────────────┐
            │   LakeFS    │ │ SQLite/  │ │    MinIO    │
            │             │ │ Postgres │ │    (S3)     │
            │ Versioning  │ │ Metadata │ │   Storage   │
            │ Branches    │ │  Dedup   │ │   Objects   │
            └─────────────┘ └──────────┘ └─────────────┘
```

## Core Concepts

### File Size Thresholds

```
File Size Decision Tree:

        Is file > 10MB?
               │
       ┌───────┴───────┐
       │               │
      NO              YES
       │               │
       ▼               ▼
  ┌─────────┐     ┌─────────┐
  │ Regular │     │   LFS   │
  │  Mode   │     │  Mode   │
  └─────────┘     └─────────┘
       │               │
       ▼               ▼
   Base64 in      S3 Direct
    Commit         Upload
```

### Storage Layout

```
S3 Bucket Structure:

s3://hub-storage/
│
├── hf-model-org-repo/        ← LakeFS managed repository
│   └── main/                 ← Branch
│       ├── config.json
│       └── model.safetensors
│
└── lfs/                      ← LFS objects (content-addressable)
    └── ab/                   ← First 2 chars of SHA256
        └── cd/               ← Next 2 chars
            └── abcd1234...   ← Full SHA256 hash
```

## Upload Workflow

### Overview

```
┌────────┐     ┌───────────┐     ┌─────────┐     ┌────────┐
│ Client │────▶│ Preupload │────▶│ Upload  │────▶│ Commit │
└────────┘     └───────────┘     └─────────┘     └────────┘
   User         Check if          Upload          Atomic
   request      file exists       file(s)         commit
                (dedup)           (S3/inline)     (LakeFS)
```

### Step 1: Preupload Check

**Purpose**: Determine the upload mode and check for duplicates.

**Endpoint**: `POST /api/{repo_type}s/{repo_id}/preupload/{revision}`

**Request**:

```json
{
  "files": [
    { "path": "config.json", "size": 1024, "sha256": "abc123..." },
    { "path": "model.bin", "size": 52428800, "sha256": "def456..." }
  ]
}
```

**Response**:

```json
{
  "files": [
    { "path": "config.json", "uploadMode": "regular", "shouldIgnore": false },
    { "path": "model.bin", "uploadMode": "lfs", "shouldIgnore": true }  // Already exists!
  ]
}
```

**Decision Logic**:

```
For each file:

1. Check size:
   - ≤ 10MB → "regular"
   - > 10MB → "lfs"

2. Check if exists (deduplication):
   - Query DB for matching SHA256 + size
   - If match found → shouldIgnore: true
   - If no match    → shouldIgnore: false
```

### Step 2a: Regular Upload (≤10MB)

Files are sent inline in the commit payload as base64.

```
┌────────┐                     ┌────────┐
│ Client │───── base64 ──────> │ Commit │
└────────┘    (embedded)       └────────┘
```

**No separate upload step needed** - proceed directly to Step 3.
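The preupload decision logic from Step 1 can be sketched as a small function. The 10 MB threshold and the response fields follow the spec above; `existing_files` stands in for the SHA256 + size database query, and all names here are illustrative, not Kohaku Hub's actual code:

```python
# Sketch of the preupload decision logic (illustrative names).
LFS_THRESHOLD = 10 * 1024 * 1024  # 10 MB cutoff between regular and LFS mode


def preupload_decision(files, existing_files):
    """Return upload mode and dedup flag for each requested file.

    files: list of dicts with "path", "size", "sha256" (request payload)
    existing_files: set of (sha256, size) pairs already stored
        (stands in for the DB query by SHA256 + size)
    """
    results = []
    for f in files:
        # 1. Size check: > 10MB goes through LFS, otherwise inline base64.
        mode = "lfs" if f["size"] > LFS_THRESHOLD else "regular"
        # 2. Dedup check: identical content can be skipped entirely.
        should_ignore = (f["sha256"], f["size"]) in existing_files
        results.append({
            "path": f["path"],
            "uploadMode": mode,
            "shouldIgnore": should_ignore,
        })
    return results
```

A 50 MB file whose hash is already known would come back as `"uploadMode": "lfs", "shouldIgnore": true`, matching the example response above.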
### Step 2b: LFS Upload (>10MB)

#### Phase 1: Request Upload URLs

**Endpoint**: `POST /{repo_id}.git/info/lfs/objects/batch`

**Request**:

```json
{
  "operation": "upload",
  "transfers": ["basic", "multipart"],
  "objects": [
    { "oid": "sha256_hash", "size": 52428800 }
  ]
}
```

**Response** (if file needs upload):

```json
{
  "transfer": "basic",
  "objects": [
    {
      "oid": "sha256_hash",
      "size": 52428800,
      "actions": {
        "upload": {
          "href": "https://s3.../presigned_url",
          "expires_at": "2025-10-02T00:00:00Z"
        }
      }
    }
  ]
}
```

**Response** (if file already exists):

```json
{
  "transfer": "basic",
  "objects": [
    {
      "oid": "sha256_hash",
      "size": 52428800
      // No "actions" field = already exists
    }
  ]
}
```

#### Phase 2: Upload to S3

```
┌────────┐                        ┌─────────┐
│ Client │──── PUT file ────────> │   S3    │
└────────┘   (presigned URL)      └─────────┘
  Direct upload                   lfs/ab/cd/
  (no proxy!)                     abcd123...
```

**Key Point**: The client uploads directly to S3 using the presigned URL. The Kohaku Hub server is NOT involved in the data transfer.

### Step 3: Commit

**Purpose**: Atomically commit all changes to the repository.

**Endpoint**: `POST /api/{repo_type}s/{repo_id}/commit/{revision}`

**Format**: NDJSON (Newline-Delimited JSON)

**Example Payload**:

```
{"key":"header","value":{"summary":"Add model files","description":"Initial upload"}}
{"key":"file","value":{"path":"config.json","content":"eyJtb2RlbCI6...","encoding":"base64"}}
{"key":"lfsFile","value":{"path":"model.bin","algo":"sha256","oid":"abc123...","size":52428800}}
{"key":"deletedFile","value":{"path":"old_config.json"}}
```

**Operation Types**:

| Key | Description | Usage |
|-----|-------------|-------|
| `header` | Commit metadata | Required, must be first line |
| `file` | Small file (inline base64) | For files ≤ 10MB |
| `lfsFile` | Large file (LFS reference) | For files > 10MB, already uploaded to S3 |
| `deletedFile` | Delete a single file | Remove file from repo |
| `deletedFolder` | Delete folder recursively | Remove all files in folder |
| `copyFile` | Copy file within repo | Duplicate file (deduplication-aware) |

**Response**:

```json
{
  "commitUrl": "https://hub.example.com/repo/commit/abc123",
  "commitOid": "abc123def456",
  "pullRequestUrl": null
}
```

**What Happens**:

```
1. Regular files:

   ┌─────────┐
   │ Decode  │  Base64 → Binary
   └────┬────┘
        │
        ▼
   ┌─────────┐
   │ Upload  │  To LakeFS
   └────┬────┘
        │
        ▼
   ┌─────────┐
   │ Update  │  Database record
   └─────────┘

2. LFS files:

   ┌─────────┐
   │  Link   │  S3 physical address → LakeFS
   └────┬────┘
        │
        ▼
   ┌─────────┐
   │ Update  │  Database record
   └─────────┘

3. Commit:

   ┌─────────┐
   │ LakeFS  │  Create commit with all changes
   └─────────┘
```

## Download Workflow

```
┌────────┐     ┌──────────┐     ┌─────────┐
│ Client │────▶│   HEAD   │────▶│   GET   │
└────────┘     └──────────┘     └─────────┘
  Request      Get metadata     Download
               (size, hash)     (redirect)
```

### Step 1: Get Metadata (HEAD)

**Endpoint**: `HEAD /{repo_id}/resolve/{revision}/{filename}`

**Response Headers**:

```
X-Repo-Commit: abc123def456
X-Linked-Etag: "sha256:abc123..."
X-Linked-Size: 52428800
ETag: "abc123..."
Content-Length: 52428800
Location: https://s3.../presigned_download_url
```

**Purpose**: The client checks whether the file needs re-downloading (by comparing the ETag).

### Step 2: Download (GET)

**Endpoint**: `GET /{repo_id}/resolve/{revision}/{filename}`

**Response**: HTTP 302 Redirect

```
HTTP/1.1 302 Found
Location: https://s3.example.com/presigned_url?expires=...
X-Repo-Commit: abc123def456
X-Linked-Etag: "sha256:abc123..."
```

**Flow**:

```
┌────────┐                  ┌──────────┐
│ Client │───── GET ─────>  │  Kohaku  │
└────────┘                  │   Hub    │
    ▲                       └─────┬────┘
    │                             │
    │  302 Redirect               │ Generate
    │  (presigned URL)            │ presigned
    │<────────────────────────────┘ URL
    │
    │    ┌──────────┐
    └───>│    S3    │
         │  Direct  │
         │ Download │
         └──────────┘
```

**Key Point**: The client downloads directly from S3. Kohaku Hub only provides the redirect URL.
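The HEAD-then-GET pattern above can be sketched from the client side: compare a locally cached ETag against the metadata headers to decide whether a re-download is needed. The header names match the responses shown above; the function itself is an illustrative sketch, not huggingface_hub's actual cache logic:

```python
def needs_download(head_headers, cached_etag):
    """Decide from HEAD metadata whether a cached file is stale.

    head_headers: dict of response headers from the HEAD request
    cached_etag: ETag stored from a previous download, or None

    Prefers X-Linked-Etag (the LFS sha256 ETag) and falls back to the
    plain ETag header, as in the response headers above.
    """
    remote_etag = head_headers.get("X-Linked-Etag") or head_headers.get("ETag")
    if remote_etag is None:
        return True  # no metadata available; download to be safe
    # Header values are quoted per HTTP convention; strip before comparing.
    return remote_etag.strip('"') != (cached_etag or "")
```

If the ETags match, the client skips the GET entirely; otherwise it issues the GET and follows the 302 redirect to S3.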
## Repository Management

### Create Repository

**Endpoint**: `POST /api/repos/create`

**Request**:

```json
{
  "type": "model",
  "name": "my-model",
  "organization": "my-org",
  "private": false
}
```

**What Happens**:

```
1. Check if exists
   └─ Query DB for repo

2. Create LakeFS repo
   └─ Repository: hf-model-my-org-my-model
   └─ Storage: s3://bucket/hf-model-my-org-my-model
   └─ Default branch: main

3. Record in DB
   └─ INSERT INTO repository (...)
```

**Response**:

```json
{
  "url": "https://hub.example.com/models/my-org/my-model",
  "repo_id": "my-org/my-model"
}
```

### List Repository Files

**Endpoint**: `GET /api/{repo_type}s/{repo_id}/tree/{revision}/{path}`

**Query Parameters**:

- `recursive`: List all files recursively (default: false)
- `expand`: Include LFS metadata (default: false)

**Response**:

```json
[
  {
    "type": "file",
    "oid": "abc123",
    "size": 1024,
    "path": "config.json"
  },
  {
    "type": "file",
    "oid": "def456",
    "size": 52428800,
    "path": "model.bin",
    "lfs": {
      "oid": "def456",
      "size": 52428800,
      "pointerSize": 134
    }
  },
  {
    "type": "directory",
    "oid": "",
    "size": 0,
    "path": "configs"
  }
]
```

### Delete Repository

**Endpoint**: `DELETE /api/repos/delete`

**Request**:

```json
{
  "type": "model",
  "name": "my-model",
  "organization": "my-org"
}
```

**What Happens**:

```
1. Delete from LakeFS
   └─ Remove repository metadata
   └─ (Objects remain in S3 for safety)

2. Delete from DB
   ├─ DELETE FROM file WHERE repo_full_id = ...
   ├─ DELETE FROM staging_upload WHERE repo_full_id = ...
   └─ DELETE FROM repository WHERE full_id = ...

3. Return success
```

## Database Schema

### Repository Table

```
┌──────────────┬──────────────┬─────────────┐
│ Column       │ Type         │ Index?      │
├──────────────┼──────────────┼─────────────┤
│ id           │ INTEGER PK   │ Primary     │
│ repo_type    │ VARCHAR      │ Yes         │
│ namespace    │ VARCHAR      │ Yes         │
│ name         │ VARCHAR      │ Yes         │
│ full_id      │ VARCHAR      │ Unique      │
│ private      │ BOOLEAN      │ No          │
│ created_at   │ TIMESTAMP    │ No          │
└──────────────┴──────────────┴─────────────┘

Example:
  repo_type: "model"
  namespace: "myorg"
  name:      "mymodel"
  full_id:   "myorg/mymodel"
```

### File Table (Deduplication)

```
┌──────────────┬──────────────┬─────────────┐
│ Column       │ Type         │ Index?      │
├──────────────┼──────────────┼─────────────┤
│ id           │ INTEGER PK   │ Primary     │
│ repo_full_id │ VARCHAR      │ Yes         │
│ path_in_repo │ VARCHAR      │ Yes         │
│ size         │ INTEGER      │ No          │
│ sha256       │ VARCHAR      │ Yes         │
│ lfs          │ BOOLEAN      │ No          │
│ created_at   │ TIMESTAMP    │ No          │
│ updated_at   │ TIMESTAMP    │ No          │
└──────────────┴──────────────┴─────────────┘

Unique constraint: (repo_full_id, path_in_repo)

Purpose:
- Track file SHA256 hashes for deduplication
- Check if file changed before upload
- Maintain file metadata
```

### StagingUpload Table (Optional)

```
┌──────────────┬──────────────┬─────────────┐
│ Column       │ Type         │ Index?      │
├──────────────┼──────────────┼─────────────┤
│ id           │ INTEGER PK   │ Primary     │
│ repo_full_id │ VARCHAR      │ Yes         │
│ revision     │ VARCHAR      │ Yes         │
│ path_in_repo │ VARCHAR      │ No          │
│ sha256       │ VARCHAR      │ No          │
│ size         │ INTEGER      │ No          │
│ upload_id    │ VARCHAR      │ No          │
│ storage_key  │ VARCHAR      │ No          │
│ lfs          │ BOOLEAN      │ No          │
│ created_at   │ TIMESTAMP    │ No          │
└──────────────┴──────────────┴─────────────┘

Purpose:
- Track ongoing multipart uploads
- Enable upload resume
- Clean up failed uploads
```

## LakeFS Integration

### Repository Naming Convention

```
Pattern: hf-{repo_type}-{namespace}-{name}

Examples:
  HuggingFace repo: "myorg/mymodel"
  LakeFS repo:      "hf-model-myorg-mymodel"

  HuggingFace repo: "johndoe/dataset"
  LakeFS repo:      "hf-dataset-johndoe-dataset"
```

### Key Operations

| Operation | LakeFS API | Purpose |
|-----------|------------|---------|
| Create Repo | `repositories.create_repository()` | Initialize new repository |
| Upload Small File | `objects.upload_object()` | Direct content upload |
| Link LFS File | `staging.link_physical_address()` | Link S3 object to LakeFS |
| Commit | `commits.commit()` | Create atomic commit |
| List Files | `objects.list_objects()` | Browse repository |
| Get File Info | `objects.stat_object()` | Get file metadata |
| Delete File | `objects.delete_object()` | Remove file |

### Physical Address Linking

```
When uploading an LFS file:

1. Client uploads to S3:
   s3://bucket/lfs/ab/cd/abcd1234...

2. Kohaku Hub links to LakeFS:

   ┌──────────────────────────────────┐
   │ StagingMetadata                  │
   ├──────────────────────────────────┤
   │ physical_address:                │
   │   "s3://bucket/lfs/ab/cd/abc..." │
   │ checksum: "sha256:abc..."        │
   │ size_bytes: 52428800             │
   └──────────────────────────────────┘
                  │
                  ▼
   ┌──────────────────────────────────┐
   │ LakeFS: model.bin                │
   │   → Points to S3 object          │
   └──────────────────────────────────┘

3. On commit: LakeFS records this link in its metadata
```

## API Endpoint Summary

### Repository Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/repos/create` | POST | ✓ | Create new repository |
| `/api/repos/delete` | DELETE | ✓ | Delete repository |
| `/api/{type}s` | GET | ○ | List repositories |
| `/api/{type}s/{id}` | GET | ○ | Get repo info |
| `/api/{type}s/{id}/tree/{rev}/{path}` | GET | ○ | List files |
| `/api/{type}s/{id}/revision/{rev}` | GET | ○ | Get revision info |

### File Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/{type}s/{id}/preupload/{rev}` | POST | ✓ | Check before upload |
| `/api/{type}s/{id}/commit/{rev}` | POST | ✓ | Atomic commit |
| `/{id}/resolve/{rev}/{file}` | GET | ○ | Download file |
| `/{id}/resolve/{rev}/{file}` | HEAD | ○ | Get file metadata |

### LFS Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/{id}.git/info/lfs/objects/batch` | POST | ✓ | LFS batch API |
| `/api/{id}.git/info/lfs/verify` | POST | ✓ | Verify upload |

### Utility Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/validate-yaml` | POST | ✗ | Validate YAML content |
| `/api/whoami-v2` | GET | ✓ | Get user info |
| `/health` | GET | ✗ | Health check |

**Auth Legend**:

- ✓ = Required
- ○ = Optional (public repos)
- ✗ = Not required

## Content Deduplication

Kohaku Hub implements content-addressable storage for LFS files:

```
Same file uploaded to different repos:

Repo A: myorg/model-v1
└─ model.bin (sha256: abc123...)

Repo B: myorg/model-v2
└─ model.bin (sha256: abc123...)

S3 Storage:
└─ lfs/ab/c1/abc123...  ← SINGLE COPY
        ▲       ▲
        │       │
     Repo A  Repo B
    (linked) (linked)

Benefits:
- Save storage space
- Faster uploads (skip if exists)
- Efficient for model variants
```

**Deduplication Points**:

1. **Preupload Check**: Query DB by SHA256
2. **LFS Batch API**: Check if OID exists
3. **Commit**: Link existing S3 object instead of uploading

## Error Handling

Kohaku Hub uses HuggingFace-compatible error headers:

```
HTTP Response Headers:
X-Error-Code: RepoNotFound
X-Error-Message: Repository 'org/repo' not found
```

**Error Codes**:

| Code | HTTP Status | Description |
|------|-------------|-------------|
| `RepoNotFound` | 404 | Repository doesn't exist |
| `RepoExists` | 400 | Repository already exists |
| `RevisionNotFound` | 404 | Branch/commit not found |
| `EntryNotFound` | 404 | File not found |
| `GatedRepo` | 403 | Need permission |
| `BadRequest` | 400 | Invalid request |
| `ServerError` | 500 | Internal error |

These error codes are parsed by the `huggingface_hub` client to raise appropriate Python exceptions.

## Performance Considerations

### Upload Performance

```
Small Files (≤10MB):
Client → FastAPI → LakeFS → S3
(Proxied through server)

Large Files (>10MB):
Client ─────────────────────→ S3
(Direct upload, no proxy)
        ↓
   Kohaku Hub
(only metadata link)
```

**Why this matters**: Large files bypass the application server entirely, so throughput is limited only by client and S3 bandwidth.

### Download Performance

```
All Downloads:
Client → Kohaku Hub → 302 Redirect → S3
         (metadata)                (direct)
```

**Why this matters**: After the initial redirect, all data transfer comes directly from S3/CDN. The server only generates presigned URLs.

### Recommended S3 Providers

| Provider | Best For | Pricing Model |
|----------|----------|---------------|
| Cloudflare R2 | High download volume | Free egress, $0.015/GB storage |
| Wasabi | Archive/backup | $6/TB/month, free egress if downloads < storage |
| MinIO | Self-hosted | Free (your hardware/bandwidth) |
| AWS S3 | Enterprise | Pay per GB + egress |
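Returning to the error headers described earlier: a client can map `X-Error-Code` to typed exceptions. The sketch below shows one way to do it; the exception class names are placeholders, not huggingface_hub's real hierarchy:

```python
# Illustrative client-side mapping of Kohaku Hub's HuggingFace-compatible
# error headers to Python exceptions. Class names are placeholders.

class HubError(Exception):
    """Base class for Kohaku Hub API errors (illustrative)."""

class RepoNotFoundError(HubError): pass
class RevisionNotFoundError(HubError): pass
class EntryNotFoundError(HubError): pass
class GatedRepoError(HubError): pass

# Maps X-Error-Code values from the error-code table to exception types.
ERROR_CLASSES = {
    "RepoNotFound": RepoNotFoundError,
    "RevisionNotFound": RevisionNotFoundError,
    "EntryNotFound": EntryNotFoundError,
    "GatedRepo": GatedRepoError,
}

def raise_for_hub_error(status, headers):
    """Raise a typed exception for an error response; no-op on success."""
    if status < 400:
        return None
    code = headers.get("X-Error-Code", "")
    message = headers.get("X-Error-Message", f"HTTP {status}")
    # Unknown codes (e.g. BadRequest, ServerError) fall back to HubError.
    raise ERROR_CLASSES.get(code, HubError)(message)
```

This mirrors what `huggingface_hub` does internally when it converts error responses into `RepositoryNotFoundError` and friends.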