# Kohaku Hub API Documentation
This document explains how Kohaku Hub's API works, the data flow, and key endpoints.
## System Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                         Client Request                          │
│                    (huggingface_hub Python)                     │
└────────────────────────────────┬────────────────────────────────┘
                                 │
┌────────────────────────────────┴────────────────────────────────┐
│                          FastAPI Layer                          │
│                        (kohakuhub/api/*)                        │
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │  basic   │  │   file   │  │   lfs    │  │  utils   │         │
│  │   .py    │  │   .py    │  │   .py    │  │   .py    │         │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘         │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                   ┌─────────────┼──────────────┐
                   │             │              │
                   ▼             ▼              ▼
            ┌─────────────┐ ┌──────────┐ ┌─────────────┐
            │   LakeFS    │ │ SQLite/  │ │    MinIO    │
            │             │ │ Postgres │ │    (S3)     │
            │ Versioning  │ │ Metadata │ │   Storage   │
            │  Branches   │ │  Dedup   │ │   Objects   │
            └─────────────┘ └──────────┘ └─────────────┘
```
## Core Concepts
### File Size Thresholds
```
File Size Decision Tree:

           Is file > 10MB?
           ┌───────┴───────┐
           │               │
          NO               YES
           │               │
           ▼               ▼
      ┌─────────┐     ┌─────────┐
      │ Regular │     │   LFS   │
      │  Mode   │     │  Mode   │
      └─────────┘     └─────────┘
           │               │
           ▼               ▼
       Base64 in       S3 Direct
        Commit          Upload
```
### Storage Layout
```
S3 Bucket Structure:

s3://hub-storage/
├── hf-model-org-repo/        ← LakeFS managed repository
│   └── main/                 ← Branch
│       ├── config.json
│       └── model.safetensors
└── lfs/                      ← LFS objects (content-addressable)
    └── ab/                   ← First 2 chars of SHA256
        └── cd/               ← Next 2 chars
            └── abcd1234...   ← Full SHA256 hash
```
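Given a file's SHA-256 digest, the content-addressable key can be derived directly from this layout. A minimal sketch; the `lfs_key` helper name is illustrative, not from the codebase:

```python
def lfs_key(sha256_hex: str) -> str:
    """Map a SHA-256 digest to its content-addressable LFS storage key."""
    if len(sha256_hex) < 4:
        raise ValueError("expected a full SHA-256 hex digest")
    # First two chars, next two chars, then the full digest.
    return f"lfs/{sha256_hex[:2]}/{sha256_hex[2:4]}/{sha256_hex}"

# A digest starting with "abcd1234" lands under lfs/ab/cd/
print(lfs_key("abcd1234" + "0" * 56))
```

The two-level fan-out keeps any single S3 prefix from accumulating millions of objects.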
## Upload Workflow
### Overview
```
┌────────┐     ┌──────────┐     ┌─────────┐     ┌────────┐
│ Client │────▶│ Preupload│────▶│ Upload  │────▶│ Commit │
└────────┘     └──────────┘     └─────────┘     └────────┘
   User         Check if         Upload          Atomic
  Request       file exists      file(s)         commit
                (dedup)          (S3/inline)     (LakeFS)
```
### Step 1: Preupload Check
**Purpose**: Determine upload mode and check for duplicates
**Endpoint**: `POST /api/{repo_type}s/{repo_id}/preupload/{revision}`
**Request**:
```json
{
  "files": [
    {
      "path": "config.json",
      "size": 1024,
      "sha256": "abc123..."
    },
    {
      "path": "model.bin",
      "size": 52428800,
      "sha256": "def456..."
    }
  ]
}
```
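Client-side, each entry of this request can be derived from the file's bytes using only the standard library. A sketch; `preupload_entry` is a hypothetical helper:

```python
import hashlib
import json

def preupload_entry(path: str, content: bytes) -> dict:
    """Build one entry of the preupload request from in-memory file content."""
    return {
        "path": path,
        "size": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }

payload = {"files": [preupload_entry("config.json", b'{"model": "demo"}')]}
print(json.dumps(payload, indent=2))
```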
**Response**:
```json
{
  "files": [
    {
      "path": "config.json",
      "uploadMode": "regular",
      "shouldIgnore": false
    },
    {
      "path": "model.bin",
      "uploadMode": "lfs",
      "shouldIgnore": true  // Already exists!
    }
  ]
}
```
**Decision Logic**:
```
For each file:

1. Check size:
   - ≤ 10MB → "regular"
   - > 10MB → "lfs"

2. Check if exists (deduplication):
   - Query DB for matching SHA256 + size
   - If match found → shouldIgnore: true
   - If no match    → shouldIgnore: false
```
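This decision logic can be sketched as a pure function, with a set of known `(sha256, size)` pairs standing in for the database query. `plan_upload` is illustrative, not the server's actual implementation:

```python
LFS_THRESHOLD = 10 * 1024 * 1024  # 10 MB

def plan_upload(size: int, sha256: str,
                known_hashes: set[tuple[str, int]]) -> dict:
    """Decide upload mode and dedup skip for one file.

    `known_hashes` stands in for the DB lookup of (sha256, size) pairs.
    """
    mode = "regular" if size <= LFS_THRESHOLD else "lfs"
    return {"uploadMode": mode, "shouldIgnore": (sha256, size) in known_hashes}

existing = {("def456", 52428800)}
print(plan_upload(1024, "abc123", existing))      # small file, not yet stored
print(plan_upload(52428800, "def456", existing))  # large file, already stored
```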
### Step 2a: Regular Upload (≤10MB)
Files are sent inline in the commit payload as base64.
```
┌────────┐                       ┌────────┐
│ Client │────── base64 ────────▶│ Commit │
└────────┘     (embedded)        └────────┘
```
**No separate upload step needed** - proceed directly to Step 3.
### Step 2b: LFS Upload (>10MB)
#### Phase 1: Request Upload URLs
**Endpoint**: `POST /{repo_id}.git/info/lfs/objects/batch`
**Request**:
```json
{
  "operation": "upload",
  "transfers": ["basic", "multipart"],
  "objects": [
    {
      "oid": "sha256_hash",
      "size": 52428800
    }
  ]
}
```
**Response** (if file needs upload):
```json
{
  "transfer": "basic",
  "objects": [
    {
      "oid": "sha256_hash",
      "size": 52428800,
      "actions": {
        "upload": {
          "href": "https://s3.../presigned_url",
          "expires_at": "2025-10-02T00:00:00Z"
        }
      }
    }
  ]
}
```
**Response** (if file already exists):
```json
{
  "transfer": "basic",
  "objects": [
    {
      "oid": "sha256_hash",
      "size": 52428800
      // No "actions" field = already exists
    }
  ]
}
```
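Per the Git LFS batch API, a client partitions the response by the presence of `actions`: objects without it can be skipped. A sketch; `needs_upload` is a hypothetical helper:

```python
def needs_upload(batch_response: dict) -> list[dict]:
    """Return the objects from an LFS batch response that still need uploading.

    An object with no "actions" field already exists on the server.
    """
    return [o for o in batch_response.get("objects", []) if "actions" in o]

response = {
    "transfer": "basic",
    "objects": [
        {"oid": "aaa", "size": 1, "actions": {"upload": {"href": "https://s3/..."}}},
        {"oid": "bbb", "size": 2},  # no actions → already stored
    ],
}
print([o["oid"] for o in needs_upload(response)])
```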
#### Phase 2: Upload to S3
```
┌────────┐                        ┌─────────┐
│ Client │────── PUT file ───────▶│   S3    │
└────────┘   (presigned URL)      └─────────┘
  Direct upload                   lfs/ab/cd/
  (no proxy!)                     abcd123...
```
**Key Point**: Client uploads directly to S3 using the presigned URL. Kohaku Hub server is NOT involved in data transfer.
### Step 3: Commit
**Purpose**: Atomically commit all changes to the repository
**Endpoint**: `POST /api/{repo_type}s/{repo_id}/commit/{revision}`
**Format**: NDJSON (Newline-Delimited JSON)
**Example Payload**:
```
{"key":"header","value":{"summary":"Add model files","description":"Initial upload"}}
{"key":"file","value":{"path":"config.json","content":"eyJtb2RlbCI6...","encoding":"base64"}}
{"key":"lfsFile","value":{"path":"model.bin","algo":"sha256","oid":"abc123...","size":52428800}}
{"key":"deletedFile","value":{"path":"old_config.json"}}
```
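Assembling such a payload is mechanical: header line first, then one JSON object per operation. A client-side sketch; `build_commit_ndjson` is illustrative and emits only the `summary` header field:

```python
import base64
import json

def build_commit_ndjson(summary: str, small_files: dict[str, bytes],
                        lfs_files: list[dict]) -> str:
    """Assemble an NDJSON commit payload: header first, then one line per op."""
    lines = [{"key": "header", "value": {"summary": summary}}]
    for path, content in small_files.items():
        lines.append({"key": "file", "value": {
            "path": path,
            "content": base64.b64encode(content).decode(),
            "encoding": "base64",
        }})
    for obj in lfs_files:
        lines.append({"key": "lfsFile", "value": {
            "path": obj["path"], "algo": "sha256",
            "oid": obj["oid"], "size": obj["size"],
        }})
    return "\n".join(json.dumps(line) for line in lines)

payload = build_commit_ndjson(
    "Add model files",
    {"config.json": b"{}"},
    [{"path": "model.bin", "oid": "abc123", "size": 52428800}],
)
print(payload)
```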
**Operation Types**:
| Key | Description | Usage |
|-----|-------------|-------|
| `header` | Commit metadata | Required, must be first line |
| `file` | Small file (inline base64) | For files ≤ 10MB |
| `lfsFile` | Large file (LFS reference) | For files > 10MB, already uploaded to S3 |
| `deletedFile` | Delete a single file | Remove file from repo |
| `deletedFolder` | Delete folder recursively | Remove all files in folder |
| `copyFile` | Copy file within repo | Duplicate file (deduplication-aware) |
**Response**:
```json
{
  "commitUrl": "https://hub.example.com/repo/commit/abc123",
  "commitOid": "abc123def456",
  "pullRequestUrl": null
}
```
**What Happens**:
```
1. Regular files:

     ┌─────────┐
     │ Decode  │   Base64 → Binary
     └────┬────┘
          │
     ┌────▼────┐
     │ Upload  │   To LakeFS
     └────┬────┘
          │
     ┌────▼────┐
     │ Update  │   Database record
     └─────────┘

2. LFS files:

     ┌─────────┐
     │  Link   │   S3 physical address → LakeFS
     └────┬────┘
          │
     ┌────▼────┐
     │ Update  │   Database record
     └─────────┘

3. Commit:

     ┌─────────┐
     │ LakeFS  │   Create commit with all changes
     └─────────┘
```
## Download Workflow
```
┌────────┐     ┌──────────┐     ┌─────────┐
│ Client │────▶│   HEAD   │────▶│   GET   │
└────────┘     └──────────┘     └─────────┘
  Request      Get metadata      Download
               (size, hash)      (redirect)
```
### Step 1: Get Metadata (HEAD)
**Endpoint**: `HEAD /{repo_id}/resolve/{revision}/{filename}`
**Response Headers**:
```
X-Repo-Commit: abc123def456
X-Linked-Etag: "sha256:abc123..."
X-Linked-Size: 52428800
ETag: "abc123..."
Content-Length: 52428800
Location: https://s3.../presigned_download_url
```
**Purpose**: Client checks if file needs re-download (by comparing ETag)
### Step 2: Download (GET)
**Endpoint**: `GET /{repo_id}/resolve/{revision}/{filename}`
**Response**: HTTP 302 Redirect
```
HTTP/1.1 302 Found
Location: https://s3.example.com/presigned_url?expires=...
X-Repo-Commit: abc123def456
X-Linked-Etag: "sha256:abc123..."
```
**Flow**:
```
┌────────┐                        ┌──────────┐
│ Client │────────── GET ────────▶│  Kohaku  │
└────────┘                        │   Hub    │
    ▲                             └─────┬────┘
    │                                   │
    │   302 Redirect                    │ Generate
    │   (presigned URL)                 │ presigned
    │◀──────────────────────────────────┘ URL
    │
    │          ┌──────────┐
    └─────────▶│    S3    │
               │  Direct  │
               │ Download │
               └──────────┘
```
**Key Point**: Client downloads directly from S3. Kohaku Hub only provides the redirect URL.
## Repository Management
### Create Repository
**Endpoint**: `POST /api/repos/create`
**Request**:
```json
{
  "type": "model",
  "name": "my-model",
  "organization": "my-org",
  "private": false
}
```
**What Happens**:
```
1. Check if exists
   └─ Query DB for repo

2. Create LakeFS repo
   ├─ Repository: hf-model-my-org-my-model
   ├─ Storage: s3://bucket/hf-model-my-org-my-model
   └─ Default branch: main

3. Record in DB
   └─ INSERT INTO repository (...)
```
**Response**:
```json
{
  "url": "https://hub.example.com/models/my-org/my-model",
  "repo_id": "my-org/my-model"
}
```
### List Repository Files
**Endpoint**: `GET /api/{repo_type}s/{repo_id}/tree/{revision}/{path}`
**Query Parameters**:
- `recursive`: List all files recursively (default: false)
- `expand`: Include LFS metadata (default: false)
**Response**:
```json
[
  {
    "type": "file",
    "oid": "abc123",
    "size": 1024,
    "path": "config.json"
  },
  {
    "type": "file",
    "oid": "def456",
    "size": 52428800,
    "path": "model.bin",
    "lfs": {
      "oid": "def456",
      "size": 52428800,
      "pointerSize": 134
    }
  },
  {
    "type": "directory",
    "oid": "",
    "size": 0,
    "path": "configs"
  }
]
```
### Delete Repository
**Endpoint**: `DELETE /api/repos/delete`
**Request**:
```json
{
  "type": "model",
  "name": "my-model",
  "organization": "my-org"
}
```
**What Happens**:
```
1. Delete from LakeFS
   ├─ Remove repository metadata
   └─ (Objects remain in S3 for safety)

2. Delete from DB
   ├─ DELETE FROM file WHERE repo_full_id = ...
   ├─ DELETE FROM staging_upload WHERE repo_full_id = ...
   └─ DELETE FROM repository WHERE full_id = ...

3. Return success
```
## Database Schema
### Repository Table
```
┌──────────────┬──────────────┬─────────────┐
│ Column │ Type │ Index? │
├──────────────┼──────────────┼─────────────┤
│ id │ INTEGER PK │ Primary │
│ repo_type │ VARCHAR │ Yes │
│ namespace │ VARCHAR │ Yes │
│ name │ VARCHAR │ Yes │
│ full_id │ VARCHAR │ Unique │
│ private │ BOOLEAN │ No │
│ created_at │ TIMESTAMP │ No │
└──────────────┴──────────────┴─────────────┘
Example:
repo_type: "model"
namespace: "myorg"
name: "mymodel"
full_id: "myorg/mymodel"
```
### File Table (Deduplication)
```
┌──────────────┬──────────────┬─────────────┐
│ Column │ Type │ Index? │
├──────────────┼──────────────┼─────────────┤
│ id │ INTEGER PK │ Primary │
│ repo_full_id │ VARCHAR │ Yes │
│ path_in_repo │ VARCHAR │ Yes │
│ size │ INTEGER │ No │
│ sha256 │ VARCHAR │ Yes │
│ lfs │ BOOLEAN │ No │
│ created_at │ TIMESTAMP │ No │
│ updated_at │ TIMESTAMP │ No │
└──────────────┴──────────────┴─────────────┘
Unique constraint: (repo_full_id, path_in_repo)
Purpose:
- Track file SHA256 hashes for deduplication
- Check if file changed before upload
- Maintain file metadata
```
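The preupload dedup lookup can be demonstrated against an in-memory SQLite stand-in for this table (schema abbreviated; the index name and sample row are illustrative):

```python
import sqlite3

# In-memory stand-in for the file table, with the unique constraint above.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE file (
        id INTEGER PRIMARY KEY,
        repo_full_id TEXT,
        path_in_repo TEXT,
        size INTEGER,
        sha256 TEXT,
        lfs BOOLEAN,
        UNIQUE (repo_full_id, path_in_repo)
    )
""")
db.execute("CREATE INDEX idx_file_sha256 ON file (sha256)")
db.execute("INSERT INTO file (repo_full_id, path_in_repo, size, sha256, lfs) "
           "VALUES ('myorg/mymodel', 'model.bin', 52428800, 'abc123', 1)")

def is_duplicate(sha256: str, size: int) -> bool:
    """The preupload dedup check: any file with matching SHA-256 and size?"""
    row = db.execute("SELECT 1 FROM file WHERE sha256 = ? AND size = ? LIMIT 1",
                     (sha256, size)).fetchone()
    return row is not None

print(is_duplicate("abc123", 52428800))  # duplicate → shouldIgnore: true
```

Matching on both hash and size guards against the (unlikely) case of a truncated upload sharing a digest prefix record.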
### StagingUpload Table (Optional)
```
┌──────────────┬──────────────┬─────────────┐
│ Column │ Type │ Index? │
├──────────────┼──────────────┼─────────────┤
│ id │ INTEGER PK │ Primary │
│ repo_full_id │ VARCHAR │ Yes │
│ revision │ VARCHAR │ Yes │
│ path_in_repo │ VARCHAR │ No │
│ sha256 │ VARCHAR │ No │
│ size │ INTEGER │ No │
│ upload_id │ VARCHAR │ No │
│ storage_key │ VARCHAR │ No │
│ lfs │ BOOLEAN │ No │
│ created_at │ TIMESTAMP │ No │
└──────────────┴──────────────┴─────────────┘
Purpose:
- Track ongoing multipart uploads
- Enable upload resume
- Clean up failed uploads
```
## LakeFS Integration
### Repository Naming Convention
```
Pattern: hf-{repo_type}-{namespace}-{name}

Examples:
  HuggingFace repo: "myorg/mymodel"
  LakeFS repo:      "hf-model-myorg-mymodel"

  HuggingFace repo: "johndoe/dataset"
  LakeFS repo:      "hf-dataset-johndoe-dataset"
```
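The mapping is a one-line function; treat the `hf` prefix and the `lakefs_repo_name` helper as illustrative rather than the codebase's actual API:

```python
def lakefs_repo_name(repo_type: str, repo_id: str, prefix: str = "hf") -> str:
    """Map a HuggingFace-style repo id to its LakeFS repository name."""
    namespace, name = repo_id.split("/", 1)
    return f"{prefix}-{repo_type}-{namespace}-{name}"

print(lakefs_repo_name("model", "myorg/mymodel"))      # hf-model-myorg-mymodel
print(lakefs_repo_name("dataset", "johndoe/dataset"))  # hf-dataset-johndoe-dataset
```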
### Key Operations
| Operation | LakeFS API | Purpose |
|-----------|------------|---------|
| Create Repo | `repositories.create_repository()` | Initialize new repository |
| Upload Small File | `objects.upload_object()` | Direct content upload |
| Link LFS File | `staging.link_physical_address()` | Link S3 object to LakeFS |
| Commit | `commits.commit()` | Create atomic commit |
| List Files | `objects.list_objects()` | Browse repository |
| Get File Info | `objects.stat_object()` | Get file metadata |
| Delete File | `objects.delete_object()` | Remove file |
### Physical Address Linking
```
When uploading LFS file:

1. Client uploads to S3:
   s3://bucket/lfs/ab/cd/abcd1234...

2. Kohaku Hub links to LakeFS:

   ┌──────────────────────────────────┐
   │ StagingMetadata                  │
   ├──────────────────────────────────┤
   │ physical_address:                │
   │   "s3://bucket/lfs/ab/cd/abc..." │
   │ checksum: "sha256:abc..."        │
   │ size_bytes: 52428800             │
   └──────────────────────────────────┘
                    │
                    ▼
   ┌──────────────────────────────────┐
   │ LakeFS: model.bin                │
   │  → Points to S3 object           │
   └──────────────────────────────────┘

3. On commit:
   LakeFS records this link in its metadata
```
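The staging metadata can be sketched as a plain dict builder; the field names follow the diagram above, while the actual linking goes through LakeFS's `link_physical_address` staging API. `staging_metadata` is a hypothetical helper:

```python
def staging_metadata(bucket: str, sha256_hex: str, size: int) -> dict:
    """Build the staging-metadata payload used to link an uploaded LFS object."""
    # Content-addressable key: lfs/<first 2 chars>/<next 2 chars>/<digest>
    key = f"lfs/{sha256_hex[:2]}/{sha256_hex[2:4]}/{sha256_hex}"
    return {
        "physical_address": f"s3://{bucket}/{key}",
        "checksum": f"sha256:{sha256_hex}",
        "size_bytes": size,
    }

meta = staging_metadata("hub-storage", "abcd1234" + "0" * 56, 52428800)
print(meta["physical_address"])
```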
## API Endpoint Summary
### Repository Operations
| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/repos/create` | POST | ✓ | Create new repository |
| `/api/repos/delete` | DELETE | ✓ | Delete repository |
| `/api/{type}s` | GET | ○ | List repositories |
| `/api/{type}s/{id}` | GET | ○ | Get repo info |
| `/api/{type}s/{id}/tree/{rev}/{path}` | GET | ○ | List files |
| `/api/{type}s/{id}/revision/{rev}` | GET | ○ | Get revision info |
### File Operations
| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/{type}s/{id}/preupload/{rev}` | POST | ✓ | Check before upload |
| `/api/{type}s/{id}/commit/{rev}` | POST | ✓ | Atomic commit |
| `/{id}/resolve/{rev}/{file}` | GET | ○ | Download file |
| `/{id}/resolve/{rev}/{file}` | HEAD | ○ | Get file metadata |
### LFS Operations
| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/{id}.git/info/lfs/objects/batch` | POST | ✓ | LFS batch API |
| `/api/{id}.git/info/lfs/verify` | POST | ✓ | Verify upload |
### Utility Operations
| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/validate-yaml` | POST | ✗ | Validate YAML content |
| `/api/whoami-v2` | GET | ✓ | Get user info |
| `/health` | GET | ✗ | Health check |
**Auth Legend**:
- ✓ = Required
- ○ = Optional (public repos)
- ✗ = Not required
## Content Deduplication
Kohaku Hub implements content-addressable storage for LFS files:
```
Same file uploaded to different repos:

  Repo A: myorg/model-v1
  └─ model.bin (sha256: abc123...)

  Repo B: myorg/model-v2
  └─ model.bin (sha256: abc123...)

S3 Storage:
  └─ lfs/ab/c1/abc123...   ← SINGLE COPY
         ▲         ▲
         │         │
      Repo A    Repo B
     (linked)  (linked)

Benefits:
- Save storage space
- Faster uploads (skip if exists)
- Efficient for model variants
```
**Deduplication Points**:
1. **Preupload Check**: Query DB by SHA256
2. **LFS Batch API**: Check if OID exists
3. **Commit**: Link existing S3 object instead of uploading
## Error Handling
Kohaku Hub uses HuggingFace-compatible error headers:
```
HTTP Response Headers:
X-Error-Code: RepoNotFound
X-Error-Message: Repository 'org/repo' not found
```
**Error Codes**:
| Code | HTTP Status | Description |
|------|-------------|-------------|
| `RepoNotFound` | 404 | Repository doesn't exist |
| `RepoExists` | 400 | Repository already exists |
| `RevisionNotFound` | 404 | Branch/commit not found |
| `EntryNotFound` | 404 | File not found |
| `GatedRepo` | 403 | Need permission |
| `BadRequest` | 400 | Invalid request |
| `ServerError` | 500 | Internal error |
These error codes are parsed by `huggingface_hub` client to raise appropriate Python exceptions.
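A sketch of that client-side mapping from `X-Error-Code` to a typed exception; the exception classes here are illustrative (huggingface_hub defines its own hierarchy):

```python
class HubError(Exception): pass
class RepoNotFoundError(HubError): pass
class RevisionNotFoundError(HubError): pass
class EntryNotFoundError(HubError): pass

# Illustrative subset of the error-code table above.
ERROR_MAP = {
    "RepoNotFound": RepoNotFoundError,
    "RevisionNotFound": RevisionNotFoundError,
    "EntryNotFound": EntryNotFoundError,
}

def raise_for_hub_error(headers: dict) -> None:
    """Raise a typed exception when an X-Error-Code header is present."""
    code = headers.get("X-Error-Code")
    if code is not None:
        exc = ERROR_MAP.get(code, HubError)
        raise exc(headers.get("X-Error-Message", code))

try:
    raise_for_hub_error({"X-Error-Code": "RepoNotFound",
                         "X-Error-Message": "Repository 'org/repo' not found"})
except RepoNotFoundError as e:
    print(e)
```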
## Performance Considerations
### Upload Performance
```
Small Files (≤10MB):
  Client → FastAPI → LakeFS → S3
  (Proxied through server)

Large Files (>10MB):
  Client ─────────────────────→ S3
  (Direct upload, no proxy)
  Kohaku Hub handles only the metadata link
```
**Why this matters**: Large files bypass the application server entirely, so upload throughput is limited only by client and S3 bandwidth.
### Download Performance
```
All Downloads:
  Client → Kohaku Hub → 302 Redirect → S3
            (metadata)       (direct)
```
**Why this matters**: After initial redirect, all data transfer is direct from S3/CDN. Server only generates presigned URLs.
### Recommended S3 Providers
| Provider | Best For | Pricing Model |
|----------|----------|---------------|
| Cloudflare R2 | High download | Free egress, $0.015/GB storage |
| Wasabi | Archive/backup | $6/TB/month, free egress if download < storage |
| MinIO | Self-hosted | Free (your hardware/bandwidth) |
| AWS S3 | Enterprise | Pay per GB + egress |