# Kohaku Hub API Documentation

*Last Updated: January 2025*

This document explains how Kohaku Hub's API works, the data flow, and key endpoints.

## System Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        Client["Client<br/>(huggingface_hub, git, browser)"]
    end

    subgraph "Entry Point"
        Nginx["Nginx (Port 28080)<br/>- Serves static files<br/>- Reverse proxy"]
    end

    subgraph "Application Layer"
        FastAPI["FastAPI (Port 48888)<br/>- Auth & Permissions<br/>- HF-compatible API<br/>- Git Smart HTTP"]
    end

    subgraph "Storage Backend"
        LakeFS["LakeFS<br/>- Git-like versioning<br/>- Branch management<br/>- Commit history"]
        DB["PostgreSQL/SQLite<br/>- User data<br/>- Metadata<br/>- Deduplication<br/>- Synchronous with db.atomic()"]
        S3["MinIO/S3<br/>- Object storage<br/>- LFS files<br/>- Presigned URLs"]
    end

    Client -->|HTTP/Git/LFS| Nginx
    Nginx -->|Static files| Client
    Nginx -->|"/api, /org, resolve"| FastAPI
    FastAPI -->|"REST API (async)"| LakeFS
    FastAPI -->|"Sync queries with db.atomic()"| DB
    FastAPI -->|Async wrappers| S3
    LakeFS -->|Stores objects| S3
```

## Core Concepts

### File Size Thresholds

```mermaid
graph TD
    Start[File Upload] --> Check{File size > 5MB?}
    Check -->|No| Regular[Regular Mode]
    Check -->|Yes| LFS[LFS Mode]
    Regular --> Base64[Base64 in commit payload]
    LFS --> Presigned[S3 presigned URL]
    Base64 --> FastAPI[FastAPI processes]
    Presigned --> Direct[Direct S3 upload]
    FastAPI --> LakeFS1[LakeFS stores object]
    Direct --> Link[FastAPI links S3 object]
    Link --> LakeFS2[LakeFS commit with physical address]
```

**Note:** The LFS threshold is configurable via `KOHAKU_HUB_LFS_THRESHOLD_BYTES` (default: 5MB = 5,242,880 bytes).

### Storage Layout

```
S3 Bucket Structure:

s3://hub-storage/
│
├── hf-model-org-repo/        ← LakeFS managed repository
│   └── main/                 ← Branch
│       ├── config.json
│       └── model.safetensors
│
└── lfs/                      ← LFS objects (content-addressable)
    └── ab/                   ← First 2 chars of SHA256
        └── cd/               ← Next 2 chars
            └── abcd1234...   ← Full SHA256 hash
```

## Git Clone Support

### Overview

KohakuHub supports native Git clone operations using a **pure Python implementation** (no pygit2/libgit2).

**Git URL Format:**
```
http://hub.example.com/{namespace}/{repo-name}.git
```

**Git Endpoints:**
- `GET /{namespace}/{name}.git/info/refs?service=git-upload-pack` - Service advertisement
- `POST /{namespace}/{name}.git/git-upload-pack` - Clone/fetch/pull
- `GET /{namespace}/{name}.git/HEAD` - Get HEAD reference
- `POST /{namespace}/{name}.git/git-receive-pack` - Push (in progress)

### LFS Integration

**Automatic LFS Pointers:**
- Files **<1MB**: Included in Git pack as regular blobs
- Files **>=1MB**: Converted to LFS pointers (small text files, ~130 bytes)

**LFS Pointer Format:**
```
version https://git-lfs.github.com/spec/v1
oid sha256:abc123...
size 10737418240
```
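
For illustration, a minimal Python sketch of how such a pointer can be generated from file bytes (the `make_lfs_pointer` helper is hypothetical, not part of KohakuHub's API):

```python
import hashlib

def make_lfs_pointer(data: bytes) -> str:
    """Build a Git LFS pointer (spec v1) for a blob, following the format above."""
    oid = hashlib.sha256(data).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(data)}\n"
    )

print(make_lfs_pointer(b"example file content"))
```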

**Client Workflow:**
```bash
# 1. Clone (gets pointers for large files)
git clone http://hub.example.com/org/repo.git

# 2. Download large files via LFS
cd repo
git lfs install
git lfs pull  # Uses existing /info/lfs/ endpoints
```

**Benefits:**
- Fast clones (only metadata + small files)
- No memory issues (LFS pointers are tiny)
- Leverages existing HuggingFace LFS infrastructure
- Pure Python (no native dependencies)

See [Git.md](./Git.md) for complete Git clone documentation and implementation details.

---

## Upload Workflow

### Overview

```mermaid
sequenceDiagram
    participant Client
    participant API as FastAPI
    participant LakeFS
    participant S3

    Note over Client,S3: Phase 1: Preupload Check
    Client->>API: POST /preupload (file hashes & sizes)
    API->>API: Check DB for existing SHA256
    API-->>Client: Upload mode (regular/lfs) & dedup info

    alt Small Files (≤5MB)
        Note over Client,S3: Phase 2a: Regular Upload
        Client->>API: POST /commit (base64 content)
        API->>LakeFS: Upload object
        LakeFS->>S3: Store object
    else Large Files (>5MB)
        Note over Client,S3: Phase 2b: LFS Upload
        Client->>API: POST /info/lfs/objects/batch
        API->>S3: Generate presigned URL
        API-->>Client: Presigned URL
        Client->>S3: PUT file (direct upload)
        Client->>API: POST /commit (lfsFile entry)
        API->>LakeFS: Link physical address
    end

    Note over Client,S3: Phase 3: Commit
    API->>LakeFS: Commit with message
    LakeFS-->>API: Commit ID
    API-->>Client: Commit URL & OID
```

### Step 1: Preupload Check

**Purpose**: Determine upload mode and check for duplicates

**Endpoint**: `POST /api/{repo_type}s/{repo_id}/preupload/{revision}`

**Request**:
```json
{
  "files": [
    {
      "path": "config.json",
      "size": 1024,
      "sha256": "abc123..."
    },
    {
      "path": "model.bin",
      "size": 52428800,
      "sha256": "def456..."
    }
  ]
}
```

**Response**:
```json
{
  "files": [
    {
      "path": "config.json",
      "uploadMode": "regular",
      "shouldIgnore": false
    },
    {
      "path": "model.bin",
      "uploadMode": "lfs",
      "shouldIgnore": true  // Already exists!
    }
  ]
}
```

**Decision Logic**:
```
For each file:
1. Check size:
   - ≤ 5MB → "regular"
   - > 5MB → "lfs"

2. Check if exists (deduplication):
   - Query DB for matching SHA256 + size
   - If match found → shouldIgnore: true
   - If no match → shouldIgnore: false
```
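
As a sketch of the client side, the preupload call can be made directly with `httpx` (the hub URL, token, and repo id below are placeholders; `huggingface_hub` normally does this for you):

```python
import hashlib
from pathlib import Path

import httpx

HUB = "http://hub.example.com"   # placeholder endpoint
TOKEN = "hf_..."                 # placeholder API token

def preupload_check(repo_id: str, paths: list[Path]) -> dict:
    """Ask the hub which files need uploading and in which mode."""
    files = []
    for p in paths:
        data = p.read_bytes()
        files.append({
            "path": p.name,
            "size": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    resp = httpx.post(
        f"{HUB}/api/models/{repo_id}/preupload/main",
        json={"files": files},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()  # per-file uploadMode and shouldIgnore flags
```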

### Step 2a: Regular Upload (≤5MB)

Files are sent inline in the commit payload as base64.

```
┌────────┐                        ┌────────┐
│ Client │────── base64 ─────────>│ Commit │
└────────┘      (embedded)        └────────┘
```

**No separate upload step needed** - proceed directly to Step 3.

### Step 2b: LFS Upload (>5MB)

#### Phase 1: Request Upload URLs

**Endpoint**: `POST /{repo_id}.git/info/lfs/objects/batch`

**Request**:
```json
{
  "operation": "upload",
  "transfers": ["basic", "multipart"],
  "objects": [
    {
      "oid": "sha256_hash",
      "size": 52428800
    }
  ]
}
```

**Response** (if file needs upload):
```json
{
  "transfer": "basic",
  "objects": [
    {
      "oid": "sha256_hash",
      "size": 52428800,
      "actions": {
        "upload": {
          "href": "https://s3.../presigned_url",
          "expires_at": "2025-10-02T00:00:00Z"
        }
      }
    }
  ]
}
```

**Response** (if file already exists):
```json
{
  "transfer": "basic",
  "objects": [
    {
      "oid": "sha256_hash",
      "size": 52428800
      // No "actions" field = already exists
    }
  ]
}
```

#### Phase 2: Upload to S3

```
┌────────┐                        ┌─────────┐
│ Client │───── PUT file ────────>│   S3    │
└────────┘   (presigned URL)      └─────────┘
             Direct upload         lfs/ab/cd/
             (no proxy!)           abcd123...
```

**Key Point**: Client uploads directly to S3 using the presigned URL. Kohaku Hub server is NOT involved in data transfer.
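
A minimal client-side sketch of both phases, again with `httpx` (hub URL, token, and file path are placeholders):

```python
import hashlib
from pathlib import Path

import httpx

HUB = "http://hub.example.com"   # placeholder endpoint
TOKEN = "hf_..."                 # placeholder API token

def lfs_upload(repo_id: str, path: Path) -> None:
    """Request a presigned URL via the LFS batch API, then PUT directly to S3."""
    data = path.read_bytes()
    oid = hashlib.sha256(data).hexdigest()
    batch = httpx.post(
        f"{HUB}/{repo_id}.git/info/lfs/objects/batch",
        json={
            "operation": "upload",
            "transfers": ["basic"],
            "objects": [{"oid": oid, "size": len(data)}],
        },
        headers={"Authorization": f"Bearer {TOKEN}"},
    ).json()
    actions = batch["objects"][0].get("actions")
    if actions is None:
        return  # object already exists on S3 - nothing to upload
    # Direct upload: the hub server never sees the file content.
    httpx.put(actions["upload"]["href"], content=data).raise_for_status()
```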

### Step 3: Commit

**Purpose**: Atomically commit all changes to the repository

**Endpoint**: `POST /api/{repo_type}s/{repo_id}/commit/{revision}`

**Format**: NDJSON (Newline-Delimited JSON)

**Example Payload**:
```
{"key":"header","value":{"summary":"Add model files","description":"Initial upload"}}
{"key":"file","value":{"path":"config.json","content":"eyJtb2RlbCI6...","encoding":"base64"}}
{"key":"lfsFile","value":{"path":"model.bin","algo":"sha256","oid":"abc123...","size":52428800}}
{"key":"deletedFile","value":{"path":"old_config.json"}}
```
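
Building such a payload is mechanical; a sketch (the `build_commit_payload` helper is hypothetical, field names follow the example above):

```python
import base64
import json

def build_commit_payload(summary: str, small_files: dict[str, bytes],
                         lfs_files: list[dict]) -> str:
    """Assemble an NDJSON commit body: header first, then one line per operation."""
    lines = [json.dumps({"key": "header", "value": {"summary": summary}})]
    for path, data in small_files.items():
        lines.append(json.dumps({
            "key": "file",
            "value": {
                "path": path,
                "content": base64.b64encode(data).decode(),
                "encoding": "base64",
            },
        }))
    for obj in lfs_files:  # e.g. {"path": ..., "oid": ..., "size": ...}
        lines.append(json.dumps({
            "key": "lfsFile",
            "value": {"path": obj["path"], "algo": "sha256",
                      "oid": obj["oid"], "size": obj["size"]},
        }))
    return "\n".join(lines)
```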

**Operation Types**:

| Key | Description | Usage |
|-----|-------------|-------|
| `header` | Commit metadata | Required, must be first line |
| `file` | Small file (inline base64) | For files ≤ 5MB |
| `lfsFile` | Large file (LFS reference) | For files > 5MB, already uploaded to S3 |
| `deletedFile` | Delete a single file | Remove file from repo |
| `deletedFolder` | Delete folder recursively | Remove all files in folder |
| `copyFile` | Copy file within repo | Duplicate file (deduplication-aware) |

**Response**:
```json
{
  "commitUrl": "https://hub.example.com/repo/commit/abc123",
  "commitOid": "abc123def456",
  "pullRequestUrl": null
}
```

**What Happens**:
```
1. Regular files:
   ┌─────────┐
   │ Decode  │  Base64 -> Binary
   └────┬────┘
        │
        v
   ┌─────────┐
   │ Upload  │  To LakeFS
   └────┬────┘
        │
        v
   ┌─────────┐
   │ Update  │  Database record
   └─────────┘

2. LFS files:
   ┌─────────┐
   │ Link    │  S3 physical address -> LakeFS
   └────┬────┘
        │
        v
   ┌─────────┐
   │ Update  │  Database record
   └─────────┘

3. Commit:
   ┌─────────┐
   │ LakeFS  │  Create commit with all changes
   └─────────┘
```

## Download Workflow

```mermaid
sequenceDiagram
    participant Client
    participant API as FastAPI
    participant LakeFS
    participant S3

    Note over Client,S3: Optional: HEAD request for metadata
    Client->>API: HEAD /resolve/{revision}/{filename}
    API->>LakeFS: Stat object
    LakeFS-->>API: Object metadata (SHA256, size)
    API-->>Client: Headers (ETag, Content-Length, X-Repo-Commit)

    Note over Client,S3: Download: GET request
    Client->>API: GET /resolve/{revision}/{filename}
    API->>LakeFS: Get object metadata
    API->>S3: Generate presigned URL
    API-->>Client: 302 Redirect (presigned URL)
    Client->>S3: Direct download
    S3-->>Client: File content

    Note over Client: No proxy - direct S3 download
```

### Step 1: Get Metadata (HEAD)

**Endpoint**: `HEAD /{repo_id}/resolve/{revision}/{filename}`

**Response Headers**:
```
X-Repo-Commit: abc123def456
X-Linked-Etag: "sha256:abc123..."
X-Linked-Size: 52428800
ETag: "abc123..."
Content-Length: 52428800
Location: https://s3.../presigned_download_url
```

**Purpose**: Client checks if file needs re-download (by comparing ETag)

### Step 2: Download (GET)

**Endpoint**: `GET /{repo_id}/resolve/{revision}/{filename}`

**Response**: HTTP 302 Redirect

```
HTTP/1.1 302 Found
Location: https://s3.example.com/presigned_url?expires=...
X-Repo-Commit: abc123def456
X-Linked-Etag: "sha256:abc123..."
```

**Flow**:
```
┌────────┐                ┌──────────┐
│ Client │───── GET ─────>│  Kohaku  │
└────────┘                │   Hub    │
    ▲                     └─────┬────┘
    │                           │
    │  302 Redirect             │ Generate
    │  (presigned URL)          │ presigned
    │<──────────────────────────┘ URL
    │
    │           ┌──────────┐
    └──────────>│    S3    │
                │  Direct  │
                │ Download │
                └──────────┘
```

**Key Point**: Client downloads directly from S3. Kohaku Hub only provides the redirect URL.
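
A client-side sketch of this two-step flow (placeholder hub URL; `follow_redirects=True` makes `httpx` fetch from S3 transparently):

```python
import httpx

HUB = "http://hub.example.com"  # placeholder endpoint

def download_file(repo_id: str, revision: str, filename: str) -> bytes:
    """Resolve a file: the hub answers with a 302 to a presigned S3 URL."""
    url = f"{HUB}/{repo_id}/resolve/{revision}/{filename}"
    # HEAD first to read metadata, e.g. to compare ETag against a local cache.
    head = httpx.head(url)
    print("commit:", head.headers.get("X-Repo-Commit"))
    # GET with redirects enabled downloads straight from S3.
    resp = httpx.get(url, follow_redirects=True)
    resp.raise_for_status()
    return resp.content
```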

## Repository Privacy & Filtering

KohakuHub respects repository privacy settings when listing repositories. The visibility of repositories depends on authentication:

### Privacy Rules

**For Unauthenticated Users:**
- Can only see **public** repositories

**For Authenticated Users:**
- Can see all **public** repositories
- Can see their **own private** repositories
- Can see **private repositories** in organizations they belong to

### List Repositories Endpoint

**Pattern**: `/api/{type}s` where type is `model`, `dataset`, or `space`

**Query Parameters:**
- `author`: Filter by author/namespace (username or organization)
- `limit`: Maximum results (default: 50, max: 1000)

**Examples:**
```bash
# List all public models
GET /api/models

# List models by author (respects privacy)
GET /api/models?author=my-org

# Authenticated user sees their private repos too
GET /api/models?author=my-org
Authorization: Bearer YOUR_TOKEN
```

### List All Repositories for a User

**Endpoint**: `GET /api/users/{username}/repos`

Returns all repositories for a user/organization, grouped by type.

**Response:**
```json
{
  "models": [
    {"id": "user/model-1", "private": false, ...},
    {"id": "user/model-2", "private": true, ...}
  ],
  "datasets": [
    {"id": "user/dataset-1", "private": false, ...}
  ],
  "spaces": []
}
```

**Note**: Private repositories are only included if:
1. The requesting user is the owner, OR
2. The requesting user is a member of the organization

## Repository Management

### Create Repository

**Endpoint**: `POST /api/repos/create`

**Request**:
```json
{
  "type": "model",
  "name": "my-model",
  "organization": "my-org",
  "private": false
}
```

**What Happens**:
```
1. Check if exists
   └─ Query DB for repo

2. Create LakeFS repo
   └─ Repository: hf-model-my-org-my-model
   └─ Storage: s3://bucket/hf-model-my-org-my-model
   └─ Default branch: main

3. Record in DB
   └─ INSERT INTO repository (...)
```

**Response**:
```json
{
  "url": "https://hub.example.com/models/my-org/my-model",
  "repo_id": "my-org/my-model"
}
```
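
Because the API is HF-compatible, repository creation should also work through the standard `huggingface_hub` client pointed at your deployment (endpoint and token below are placeholders):

```python
from huggingface_hub import HfApi

# Point the standard client at a KohakuHub deployment (placeholder URL/token).
api = HfApi(endpoint="http://hub.example.com", token="hf_...")

api.create_repo(repo_id="my-org/my-model", repo_type="model", private=False)
print(api.list_repo_files("my-org/my-model"))
```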

### List Repository Files

**Endpoint**: `GET /api/{repo_type}s/{repo_id}/tree/{revision}/{path}`

**Query Parameters**:
- `recursive`: List all files recursively (default: false)
- `expand`: Include LFS metadata (default: false)

**Response**:
```json
[
  {
    "type": "file",
    "oid": "abc123",
    "size": 1024,
    "path": "config.json"
  },
  {
    "type": "file",
    "oid": "def456",
    "size": 52428800,
    "path": "model.bin",
    "lfs": {
      "oid": "def456",
      "size": 52428800,
      "pointerSize": 134
    }
  },
  {
    "type": "directory",
    "oid": "",
    "size": 0,
    "path": "configs"
  }
]
```

### Delete Repository

**Endpoint**: `DELETE /api/repos/delete`

**Request**:
```json
{
  "type": "model",
  "name": "my-model",
  "organization": "my-org"
}
```

**What Happens**:
```
1. Delete from LakeFS
   └─ Remove repository metadata
   └─ (Objects remain in S3 for safety)

2. Delete from DB
   ├─ DELETE FROM file WHERE repo_full_id = ...
   ├─ DELETE FROM staging_upload WHERE repo_full_id = ...
   └─ DELETE FROM repository WHERE full_id = ...

3. Return success
```

## Database Schema

```mermaid
erDiagram
    USER ||--o{ REPOSITORY : owns
    USER ||--o{ SESSION : has
    USER ||--o{ TOKEN : has
    USER ||--o{ SSHKEY : has
    USER }o--o{ ORGANIZATION : member_of
    ORGANIZATION ||--o{ REPOSITORY : owns
    REPOSITORY ||--o{ FILE : contains
    REPOSITORY ||--o{ COMMIT : has
    REPOSITORY ||--o{ STAGINGUPLOAD : has
    COMMIT ||--o{ LFSOBJECTHISTORY : references

    USER {
        int id PK
        string username UK
        string email UK
        string password_hash
        boolean email_verified
        boolean is_active
        bigint private_quota_bytes
        bigint public_quota_bytes
        bigint private_used_bytes
        bigint public_used_bytes
        datetime created_at
    }

    REPOSITORY {
        int id PK
        string repo_type
        string namespace
        string name
        string full_id
        boolean private
        int owner_id FK
        datetime created_at
    }

    FILE {
        int id PK
        string repo_full_id
        string path_in_repo
        int size
        string sha256
        boolean lfs
        datetime created_at
        datetime updated_at
    }

    COMMIT {
        int id PK
        string commit_id
        string repo_full_id
        string repo_type
        string branch
        int user_id FK
        string username
        text message
        text description
        datetime created_at
    }

    ORGANIZATION {
        int id PK
        string name UK
        text description
        bigint private_quota_bytes
        bigint public_quota_bytes
        bigint private_used_bytes
        bigint public_used_bytes
        datetime created_at
    }

    TOKEN {
        int id PK
        int user_id FK
        string token_hash UK
        string name
        datetime last_used
        datetime created_at
    }

    SESSION {
        int id PK
        string session_id UK
        int user_id FK
        string secret
        datetime expires_at
        datetime created_at
    }

    SSHKEY {
        int id PK
        int user_id FK
        string key_type
        text public_key
        string fingerprint UK
        string title
        datetime last_used
        datetime created_at
    }

    STAGINGUPLOAD {
        int id PK
        string repo_full_id
        string repo_type
        string revision
        string path_in_repo
        string sha256
        int size
        string upload_id
        string storage_key
        boolean lfs
        datetime created_at
    }

    LFSOBJECTHISTORY {
        int id PK
        string repo_full_id
        string path_in_repo
        string sha256
        int size
        string commit_id
        datetime created_at
    }
```

### Key Tables

**Repository Table** - Stores repository metadata:
- Unique constraint on `(repo_type, namespace, name)`
- Allows same `full_id` across different `repo_type`
- Example: `model:myorg/mymodel`, `dataset:myorg/mymodel`

**File Table** - Deduplication and metadata (see the sketch below):
- Unique constraint on `(repo_full_id, path_in_repo)`
- `sha256` indexed for fast deduplication lookups
- `lfs` flag indicates if file uses LFS storage
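
As a rough illustration of how this table might look with the Peewee ORM the backend uses (field names follow the schema above; the actual model definition may differ):

```python
from peewee import (BooleanField, CharField, DateTimeField, IntegerField,
                    Model, SqliteDatabase)

db = SqliteDatabase("hub.db")  # placeholder; production uses PostgreSQL

class File(Model):
    """One row per (repo, path); sha256 is indexed for dedup lookups."""
    repo_full_id = CharField()
    path_in_repo = CharField()
    size = IntegerField()
    sha256 = CharField(index=True)
    lfs = BooleanField(default=False)
    created_at = DateTimeField()
    updated_at = DateTimeField()

    class Meta:
        database = db
        indexes = ((("repo_full_id", "path_in_repo"), True),)  # unique pair

# Deduplication lookup: does this exact content already exist at some row?
def find_existing(repo_full_id: str, sha256: str, size: int):
    return File.get_or_none(
        (File.repo_full_id == repo_full_id)
        & (File.sha256 == sha256)
        & (File.size == size)
    )
```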

**Commit Table** - User commit tracking:
- `commit_id` is LakeFS commit SHA
- Indexed by `(repo_full_id, branch)` for fast queries
- Denormalized `username` for performance

**LFSObjectHistory Table** - LFS garbage collection:
- Tracks which commits reference which LFS objects
- Enables preserving K versions of each file (default: 5)
- Used for auto-cleanup of old LFS objects

**StagingUpload Table** - Multipart upload tracking:
- Tracks ongoing multipart uploads
- Enables upload resume
- Cleans up failed uploads

## LakeFS Integration

### Repository Naming Convention

```
Pattern: hf-{repo_type}-{namespace}-{name}

Examples:
  HuggingFace repo: "myorg/mymodel"
  LakeFS repo:      "hf-model-myorg-mymodel"

  HuggingFace repo: "johndoe/dataset"
  LakeFS repo:      "hf-dataset-johndoe-dataset"
```
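
The mapping is mechanical; a sketch (hypothetical helper, matching the examples above):

```python
def lakefs_repo_name(repo_type: str, repo_id: str) -> str:
    """Map an HF-style repo id like 'myorg/mymodel' to its LakeFS repository name."""
    namespace, name = repo_id.split("/", 1)
    return f"hf-{repo_type}-{namespace}-{name}"

assert lakefs_repo_name("model", "myorg/mymodel") == "hf-model-myorg-mymodel"
assert lakefs_repo_name("dataset", "johndoe/dataset") == "hf-dataset-johndoe-dataset"
```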

### Implementation Notes

**Database Operations:**
- **Synchronous:** Uses Peewee ORM with synchronous operations
- **Transactions:** `db.atomic()` ensures ACID compliance across concurrent workers
- **Multi-Worker Safe:** Designed for horizontal scaling (4-8 workers recommended)
- **Future:** Migration to peewee-async planned for improved concurrency

**LakeFS Operations:**
- **Pure Async:** All operations use the REST API via httpx (no thread pools!)
- **No Deprecated Library:** Uses the direct REST API instead of lakefs-client

### Key Operations

**All LakeFS operations use the pure async REST API via httpx (no thread pools!):**

| Operation | LakeFS REST Endpoint | KohakuHub Method | Purpose |
|-----------|---------------------|------------------|---------|
| Create Repo | `POST /repositories` | `create_repository()` | Initialize new repository |
| Upload Small File | `POST /repositories/{repo}/branches/{branch}/objects` | `upload_object()` | Direct content upload |
| Link LFS File | `PUT /repositories/{repo}/branches/{branch}/staging/backing` | `link_physical_address()` | Link S3 object to LakeFS |
| Commit | `POST /repositories/{repo}/branches/{branch}/commits` | `commit()` | Create atomic commit |
| List Files | `GET /repositories/{repo}/refs/{ref}/objects/ls` | `list_objects()` | Browse repository |
| Get File Info | `GET /repositories/{repo}/refs/{ref}/objects/stat` | `stat_object()` | Get file metadata |
| Get File Content | `GET /repositories/{repo}/refs/{ref}/objects` | `get_object()` | Download file |
| Delete File | `DELETE /repositories/{repo}/branches/{branch}/objects` | `delete_object()` | Remove file |
| Create Branch | `POST /repositories/{repo}/branches` | `create_branch()` | Create new branch |
| Delete Branch | `DELETE /repositories/{repo}/branches/{branch}` | `delete_branch()` | Delete branch |
| Create Tag | `POST /repositories/{repo}/tags` | `create_tag()` | Create tag |
| Delete Tag | `DELETE /repositories/{repo}/tags/{tag}` | `delete_tag()` | Delete tag |
| Revert | `POST /repositories/{repo}/branches/{branch}/revert` | `revert_branch()` | Revert commit |
| Merge | `POST /repositories/{repo}/refs/{source}/merge/{dest}` | `merge_into_branch()` | Merge branches |
| Hard Reset | `PUT /repositories/{repo}/branches/{branch}/hard_reset` | `hard_reset_branch()` | Reset branch to commit |

### Physical Address Linking

```
When uploading LFS file:

1. Client uploads to S3:
   s3://bucket/lfs/ab/cd/abcd1234...

2. Kohaku Hub links to LakeFS:
   ┌──────────────────────────────────┐
   │ StagingMetadata                  │
   ├──────────────────────────────────┤
   │ physical_address:                │
   │   "s3://bucket/lfs/ab/cd/abc..." │
   │ checksum: "sha256:abc..."        │
   │ size_bytes: 52428800             │
   └──────────────────────────────────┘
                  │
                  ▼
   ┌──────────────────────────────────┐
   │ LakeFS: model.bin                │
   │ → Points to S3 object            │
   └──────────────────────────────────┘

3. On commit:
   LakeFS records this link in its metadata
```
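
A sketch of what the `link_physical_address()` call might look like against the endpoint in the table above (LakeFS URL, credentials, and the exact request body are assumptions inferred from the StagingMetadata fields shown):

```python
import httpx

LAKEFS = "http://lakefs:8000/api/v1"   # placeholder LakeFS endpoint
AUTH = ("ACCESS_KEY", "SECRET_KEY")     # placeholder credentials

async def link_physical_address(repo: str, branch: str, path: str,
                                physical_address: str, sha256: str, size: int):
    """Stage an already-uploaded S3 object as `path` on `branch` (sketch)."""
    async with httpx.AsyncClient(auth=AUTH) as client:
        resp = await client.put(
            f"{LAKEFS}/repositories/{repo}/branches/{branch}/staging/backing",
            params={"path": path},
            json={
                # Assumed body shape, mirroring the StagingMetadata box above.
                "staging": {"physical_address": physical_address},
                "checksum": f"sha256:{sha256}",
                "size_bytes": size,
            },
        )
        resp.raise_for_status()
```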

## API Endpoint Summary

### Repository Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/repos/create` | POST | ✓ | Create new repository |
| `/api/repos/delete` | DELETE | ✓ | Delete repository |
| `/api/repos/move` | POST | ✓ | Move/rename repository |
| `/api/{type}s` | GET | ○ | List repositories (respects privacy) |
| `/api/{type}s/{id}` | GET | ○ | Get repo info |
| `/api/{type}s/{id}/tree/{rev}/{path}` | GET | ○ | List files |
| `/api/{type}s/{id}/revision/{rev}` | GET | ○ | Get revision info |
| `/api/{type}s/{id}/paths-info/{rev}` | POST | ○ | Get info for specific paths |
| `/api/users/{username}/repos` | GET | ○ | List all repos for a user/org (grouped by type) |

### File Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/{type}s/{id}/preupload/{rev}` | POST | ✓ | Check before upload |
| `/api/{type}s/{id}/commit/{rev}` | POST | ✓ | Atomic commit |
| `/{id}/resolve/{rev}/{file}` | GET | ○ | Download file |
| `/{id}/resolve/{rev}/{file}` | HEAD | ○ | Get file metadata |
| `/{type}s/{id}/resolve/{rev}/{file}` | GET | ○ | Download file (with type) |
| `/{type}s/{id}/resolve/{rev}/{file}` | HEAD | ○ | Get file metadata (with type) |

### LFS Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/{id}.git/info/lfs/objects/batch` | POST | ✓ | LFS batch API |
| `/api/{id}.git/info/lfs/verify` | POST | ✓ | Verify upload |

### Commit History

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/{type}s/{namespace}/{name}/commits/{branch}` | GET | ○ | List commits on a branch with pagination |

### Branch and Tag Management

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/{type}s/{namespace}/{name}/branch` | POST | ✓ | Create a new branch |
| `/{type}s/{namespace}/{name}/branch/{branch}` | DELETE | ✓ | Delete a branch |
| `/{type}s/{namespace}/{name}/tag` | POST | ✓ | Create a new tag |
| `/{type}s/{namespace}/{name}/tag/{tag}` | DELETE | ✓ | Delete a tag |

### Settings Management

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/users/{username}/settings` | PUT | ✓ | Update user settings |
| `/organizations/{org_name}/settings` | PUT | ✓ | Update organization settings |
| `/{type}s/{namespace}/{name}/settings` | PUT | ✓ | Update repository settings (private, gated) |

### Authentication Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/auth/register` | POST | ✗ | Register new user |
| `/api/auth/login` | POST | ✗ | Login and create session |
| `/api/auth/logout` | POST | ✓ | Logout and destroy session |
| `/api/auth/verify-email` | GET | ✗ | Verify email with token |
| `/api/auth/me` | GET | ✓ | Get current user info |
| `/api/auth/tokens` | GET | ✓ | List user's API tokens |
| `/api/auth/tokens/create` | POST | ✓ | Create new API token |
| `/api/auth/tokens/{token_id}` | DELETE | ✓ | Revoke API token |

### Organization Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/org/create` | POST | ✓ | Create new organization |
| `/org/{org_name}` | GET | ✗ | Get organization details |
| `/org/{org_name}/members` | GET | ○ | List organization members |
| `/org/{org_name}/members` | POST | ✓ | Add member to organization |
| `/org/{org_name}/members/{username}` | DELETE | ✓ | Remove member from organization |
| `/org/{org_name}/members/{username}` | PUT | ✓ | Update member role |
| `/org/users/{username}/orgs` | GET | ✗ | List user's organizations |

### Utility Operations

| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/validate-yaml` | POST | ✗ | Validate YAML content |
| `/api/whoami-v2` | GET | ✓ | Get detailed current user info |
| `/api/version` | GET | ✗ | Get API version information |
| `/health` | GET | ✗ | Health check |
| `/` | GET | ✗ | API information |

**Auth Legend**:
- ✓ = Required
- ○ = Optional (public repos)
- ✗ = Not required

---

## Detailed Endpoint Documentation

### Commit History API

The commit history API allows you to retrieve the commit log for a specific branch in a repository.

**Endpoint**: `GET /{repo_type}s/{namespace}/{name}/commits/{branch}`

**Query Parameters**:
- `page`: Page number for pagination (default: 1)
- `limit`: Number of commits per page (default: 20)

**Example Request**:
```bash
GET /models/myorg/mymodel/commits/main?page=1&limit=20
```

**Response**:
```json
{
  "commits": [
    {
      "id": "abc123def456",
      "message": "Update model config",
      "author": "john@example.com",
      "committer": "john@example.com",
      "createdAt": "2025-10-05T12:00:00Z",
      "parents": ["parent123"]
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 150,
    "hasMore": true
  }
}
```

### Branch and Tag Management

#### Create Branch

**Endpoint**: `POST /{repo_type}s/{namespace}/{name}/branch`

**Request**:
```json
{
  "branch": "feature-branch",
  "startPoint": "main"
}
```

**Response**:
```json
{
  "success": true,
  "branch": "feature-branch",
  "ref": "refs/heads/feature-branch"
}
```

#### Delete Branch

**Endpoint**: `DELETE /{repo_type}s/{namespace}/{name}/branch/{branch}`

**Example**: `DELETE /models/myorg/mymodel/branch/feature-branch`

**Response**:
```json
{
  "success": true,
  "deleted": "feature-branch"
}
```

**Note**: Cannot delete the default branch (usually `main`).

#### Create Tag

**Endpoint**: `POST /{repo_type}s/{namespace}/{name}/tag`

**Request**:
```json
{
  "tag": "v1.0.0",
  "ref": "main",
  "message": "Release version 1.0.0"
}
```

**Response**:
```json
{
  "success": true,
  "tag": "v1.0.0",
  "ref": "refs/tags/v1.0.0"
}
```

#### Delete Tag

**Endpoint**: `DELETE /{repo_type}s/{namespace}/{name}/tag/{tag}`

**Example**: `DELETE /models/myorg/mymodel/tag/v1.0.0`

**Response**:
```json
{
  "success": true,
  "deleted": "v1.0.0"
}
```

### Settings Management

#### Update User Settings

**Endpoint**: `PUT /users/{username}/settings`

**Request**:
```json
{
  "email": "newemail@example.com",
  "displayName": "John Doe",
  "bio": "ML Engineer",
  "website": "https://example.com"
}
```

**Response**:
```json
{
  "success": true,
  "user": {
    "username": "johndoe",
    "email": "newemail@example.com",
    "displayName": "John Doe"
  }
}
```

#### Update Organization Settings

**Endpoint**: `PUT /organizations/{org_name}/settings`

**Request**:
```json
{
  "displayName": "My Organization",
  "description": "Building amazing ML models",
  "website": "https://example.com",
  "avatar": "https://cdn.example.com/avatar.png"
}
```

**Response**:
```json
{
  "success": true,
  "organization": {
    "name": "my-org",
    "displayName": "My Organization",
    "description": "Building amazing ML models"
  }
}
```

#### Update Repository Settings

**Endpoint**: `PUT /{repo_type}s/{namespace}/{name}/settings`

**Request**:
```json
{
  "private": true,
  "gated": false,
  "description": "A state-of-the-art language model",
  "tags": ["nlp", "transformers", "llm"]
}
```

**Response**:
```json
{
  "success": true,
  "repository": {
    "id": "myorg/mymodel",
    "private": true,
    "gated": false,
    "description": "A state-of-the-art language model"
  }
}
```

**Privacy Options**:
- `private: false` - Public repository, visible to everyone
- `private: true` - Private repository, only visible to owner and organization members
- `gated: true` - Requires explicit permission to access (for controlled releases)

#### Move/Rename Repository

**Endpoint**: `POST /api/repos/move`

**Request**:
```json
{
  "fromRepo": {
    "type": "model",
    "namespace": "oldorg",
    "name": "oldname"
  },
  "toRepo": {
    "type": "model",
    "namespace": "neworg",
    "name": "newname"
  }
}
```

**Response**:
```json
{
  "success": true,
  "url": "https://hub.example.com/models/neworg/newname",
  "message": "Repository moved successfully"
}
```

**What Happens**:
1. Validates that source repository exists and user has permission
2. Checks that destination doesn't already exist
3. Updates LakeFS repository name
4. Updates all database records
5. Creates redirect from old URL to new URL

**Note**: This operation is atomic - either everything succeeds or everything rolls back.

### Version and Utility Endpoints

#### Get API Version

**Endpoint**: `GET /api/version`

**Response**:
```json
{
  "version": "1.0.0",
  "apiVersion": "v1",
  "lfsVersion": "2.0",
  "features": {
    "lfs": true,
    "multipart": true,
    "deduplication": true,
    "organizations": true
  },
  "limits": {
    "maxFileSize": 107374182400,
    "lfsThreshold": 5242880
  }
}
```

#### Validate YAML

**Endpoint**: `POST /api/validate-yaml`

**Request**:
```json
{
  "content": "model:\n name: gpt-2\n version: 1.0"
}
```

**Response** (if valid):
```json
{
  "valid": true,
  "parsed": {
    "model": {
      "name": "gpt-2",
      "version": "1.0"
    }
  }
}
```

**Response** (if invalid):
```json
{
  "valid": false,
  "error": "Invalid YAML syntax at line 2: unexpected character",
  "line": 2,
  "column": 10
}
```

**Use Case**: Validate README.md frontmatter, model card YAML, or configuration files before upload.

#### Get Detailed User Info (whoami-v2)

**Endpoint**: `GET /api/whoami-v2`

**Response**:
```json
{
  "type": "user",
  "id": "12345",
  "name": "johndoe",
  "fullname": "John Doe",
  "email": "john@example.com",
  "emailVerified": true,
  "canPay": true,
  "isPro": false,
  "periodEnd": null,
  "avatarUrl": "https://cdn.example.com/avatars/johndoe.png",
  "orgs": [
    {
      "name": "my-org",
      "fullname": "My Organization",
      "email": "contact@my-org.com",
      "avatarUrl": "https://cdn.example.com/orgs/my-org.png",
      "roleInOrg": "admin"
    }
  ],
  "auth": {
    "accessToken": {
      "displayName": "API Token",
      "role": "write"
    }
  }
}
```

**Compared to `/api/auth/me`**: This endpoint provides more detailed information, including:
- Organization memberships with roles
- Token information
- Subscription/payment status
- Email verification status

## Content Deduplication

Kohaku Hub implements content-addressable storage for LFS files:

```
Same file uploaded to different repos:

Repo A: myorg/model-v1
└─ model.bin (sha256: abc123...)

Repo B: myorg/model-v2
└─ model.bin (sha256: abc123...)

S3 Storage:
└─ lfs/ab/c1/abc123...  ← SINGLE COPY
        ▲        ▲
        │        │
     Repo A   Repo B
    (linked) (linked)

Benefits:
- Save storage space
- Faster uploads (skip if exists)
- Efficient for model variants
```

**Deduplication Points**:

1. **Preupload Check**: Query DB by SHA256
2. **LFS Batch API**: Check if OID exists
3. **Commit**: Link existing S3 object instead of uploading
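
The content-addressable key layout makes the dedup check a pure function of the hash; a sketch (hypothetical helper mirroring the `lfs/ab/cd/...` layout from the Storage Layout section):

```python
def lfs_storage_key(sha256: str) -> str:
    """Derive the content-addressable S3 key for an LFS object."""
    return f"lfs/{sha256[:2]}/{sha256[2:4]}/{sha256}"

# The same bytes always map to the same key, so a second repo
# uploading identical content hits an existing object and is skipped.
assert lfs_storage_key("abcd1234" + "0" * 56).startswith("lfs/ab/cd/")
```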

## Error Handling

Kohaku Hub uses HuggingFace-compatible error headers:

```
HTTP Response Headers:
X-Error-Code: RepoNotFound
X-Error-Message: Repository 'org/repo' not found
```

**Error Codes**:

| Code | HTTP Status | Description |
|------|-------------|-------------|
| `RepoNotFound` | 404 | Repository doesn't exist |
| `RepoExists` | 400 | Repository already exists |
| `RevisionNotFound` | 404 | Branch/commit not found |
| `EntryNotFound` | 404 | File not found |
| `GatedRepo` | 403 | Need permission |
| `BadRequest` | 400 | Invalid request |
| `ServerError` | 500 | Internal error |

These error codes are parsed by the `huggingface_hub` client to raise appropriate Python exceptions.
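
A custom client can do the same translation itself; a sketch (the `HubError` class is hypothetical, not part of any library):

```python
import httpx

class HubError(Exception):
    """Hypothetical client-side error carrying the hub's error code."""
    def __init__(self, code: str, message: str, status: int):
        super().__init__(f"{code} ({status}): {message}")
        self.code, self.status = code, status

def raise_for_hub_error(resp: httpx.Response) -> None:
    """Translate X-Error-Code/X-Error-Message headers into a Python exception."""
    if resp.is_success:
        return
    raise HubError(
        resp.headers.get("X-Error-Code", "ServerError"),
        resp.headers.get("X-Error-Message", resp.text),
        resp.status_code,
    )
```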

## Performance Considerations

### Upload Performance

```
Small Files (≤5MB):
Client → FastAPI → LakeFS → S3
(Proxied through server)

Large Files (>5MB):
Client ─────────────────────→ S3
(Direct upload, no proxy)
        ↓
Kohaku Hub (only metadata link)
```

**Why this matters**: Large files bypass the application server entirely, so throughput is limited only by client and S3 bandwidth.

### Download Performance

```
All Downloads:
Client → Kohaku Hub → 302 Redirect → S3
         (metadata)                 (direct)
```

**Why this matters**: After the initial redirect, all data transfer is direct from S3/CDN. The server only generates presigned URLs.

### Recommended S3 Providers

| Provider | Best For | Pricing Model | Notes |
|----------|----------|---------------|-------|
| Cloudflare R2 | High download | Free egress, $0.015/GB storage | Best for public datasets |
| Wasabi | Archive/backup | $6/TB/month, free egress* | *if download < storage |
| MinIO | Self-hosted | Free (your hardware/bandwidth) | Full control, privacy |
| AWS S3 | Enterprise | Pay per GB + egress | Most features, expensive egress |
| Backblaze B2 | Budget | $6/TB storage, $0.01/GB egress | Good for mixed workloads |

**Recommendation for KohakuHub:**
- **Development**: MinIO (included in docker-compose)
- **Public Hub**: Cloudflare R2 (free egress saves costs)
- **Private/Enterprise**: Self-hosted MinIO or AWS S3 with VPC endpoints