# Kohaku Hub API Documentation *Last Updated: January 2025* This document explains how Kohaku Hub's API works, the data flow, and key endpoints. ## System Architecture ```mermaid graph TB subgraph "Client Layer" Client["Client
(huggingface_hub, git, browser)"] end subgraph "Entry Point" Nginx["Nginx (Port 28080)
- Serves static files
- Reverse proxy"] end subgraph "Application Layer" FastAPI["FastAPI (Port 48888)
- Auth & Permissions
- HF-compatible API
- Git Smart HTTP"] end subgraph "Storage Backend" LakeFS["LakeFS
- Git-like versioning
- Branch management
- Commit history"] DB["PostgreSQL/SQLite
- User data
- Metadata
- Deduplication
- Synchronous with db.atomic()"] S3["MinIO/S3
- Object storage
- LFS files
- Presigned URLs"] end Client -->|HTTP/Git/LFS| Nginx Nginx -->|Static files| Client Nginx -->|/api, /org, resolve| FastAPI FastAPI -->|REST API (async)| LakeFS FastAPI -->|Sync queries with db.atomic()| DB FastAPI -->|Async wrappers| S3 LakeFS -->|Stores objects| S3 ``` ## Core Concepts ### File Size Thresholds ```mermaid graph TD Start[File Upload] --> Check{File size > 5MB?} Check -->|No| Regular[Regular Mode] Check -->|Yes| LFS[LFS Mode] Regular --> Base64[Base64 in commit payload] LFS --> Presigned[S3 presigned URL] Base64 --> FastAPI[FastAPI processes] Presigned --> Direct[Direct S3 upload] FastAPI --> LakeFS1[LakeFS stores object] Direct --> Link[FastAPI links S3 object] Link --> LakeFS2[LakeFS commit with physical address] ``` **Note:** The LFS threshold is configurable via `KOHAKU_HUB_LFS_THRESHOLD_BYTES` (default: 5MB = 5,242,880 bytes). ### Storage Layout ``` S3 Bucket Structure: s3://hub-storage/ │ ├── hf-model-org-repo/ ← LakeFS managed repository │ └── main/ ← Branch │ ├── config.json │ └── model.safetensors │ └── lfs/ ← LFS objects (content-addressable) └── ab/ ← First 2 chars of SHA256 └── cd/ ← Next 2 chars └── abcd1234... ← Full SHA256 hash ``` ## Git Clone Support ### Overview KohakuHub supports native Git clone operations using **pure Python implementation** (no pygit2/libgit2). **Git URL Format:** ``` http://hub.example.com/{namespace}/{repo-name}.git ``` **Git Endpoints:** - `GET /{namespace}/{name}.git/info/refs?service=git-upload-pack` - Service advertisement - `POST /{namespace}/{name}.git/git-upload-pack` - Clone/fetch/pull - `GET /{namespace}/{name}.git/HEAD` - Get HEAD reference - `POST /{namespace}/{name}.git/git-receive-pack` - Push (in progress) ### LFS Integration **Automatic LFS Pointers:** - Files **<1MB**: Included in Git pack as regular blobs - Files **>=1MB**: Converted to LFS pointers (100-byte text files) **LFS Pointer Format:** ``` version https://git-lfs.github.com/spec/v1 oid sha256:abc123... 
size 10737418240 ``` **Client Workflow:** ```bash # 1. Clone (gets pointers for large files) git clone http://hub.example.com/org/repo.git # 2. Download large files via LFS cd repo git lfs install git lfs pull # Uses existing /info/lfs/ endpoints ``` **Benefits:** - Fast clones (only metadata + small files) - No memory issues (LFS pointers are tiny) - Leverages existing HuggingFace LFS infrastructure - Pure Python (no native dependencies) See [Git.md](./Git.md) for complete Git clone documentation and implementation details. --- ## Upload Workflow ### Overview ```mermaid sequenceDiagram participant Client participant API as FastAPI participant LakeFS participant S3 Note over Client,S3: Phase 1: Preupload Check Client->>API: POST /preupload (file hashes & sizes) API->>API: Check DB for existing SHA256 API-->>Client: Upload mode (regular/lfs) & dedup info alt Small Files (<5MB) Note over Client,S3: Phase 2a: Regular Upload Client->>API: POST /commit (base64 content) API->>LakeFS: Upload object LakeFS->>S3: Store object else Large Files (>=5MB) Note over Client,S3: Phase 2b: LFS Upload Client->>API: POST /info/lfs/objects/batch API->>S3: Generate presigned URL API-->>Client: Presigned URL Client->>S3: PUT file (direct upload) Client->>API: POST /commit (lfsFile entry) API->>LakeFS: Link physical address end Note over Client,S3: Phase 3: Commit API->>LakeFS: Commit with message LakeFS-->>API: Commit ID API-->>Client: Commit URL & OID ``` ### Step 1: Preupload Check **Purpose**: Determine upload mode and check for duplicates **Endpoint**: `POST /api/{repo_type}s/{repo_id}/preupload/{revision}` **Request**: ```json { "files": [ { "path": "config.json", "size": 1024, "sha256": "abc123..." }, { "path": "model.bin", "size": 52428800, "sha256": "def456..." } ] } ``` **Response**: ```json { "files": [ { "path": "config.json", "uploadMode": "regular", "shouldIgnore": false }, { "path": "model.bin", "uploadMode": "lfs", "shouldIgnore": true // Already exists! 
} ] } ``` **Decision Logic**: ``` For each file: 1. Check size: - ≤ 5MB → "regular" - > 5MB → "lfs" 2. Check if exists (deduplication): - Query DB for matching SHA256 + size - If match found → shouldIgnore: true - If no match → shouldIgnore: false ``` ### Step 2a: Regular Upload (≤5MB) Files are sent inline in the commit payload as base64. ``` ┌────────┐ ┌────────┐ │ Client │───── base64 ──────>│ Commit │ └────────┘ (embedded) └────────┘ ``` **No separate upload step needed** - proceed directly to Step 3. ### Step 2b: LFS Upload (>5MB) #### Phase 1: Request Upload URLs **Endpoint**: `POST /{repo_id}.git/info/lfs/objects/batch` **Request**: ```json { "operation": "upload", "transfers": ["basic", "multipart"], "objects": [ { "oid": "sha256_hash", "size": 52428800 } ] } ``` **Response** (if file needs upload): ```json { "transfer": "basic", "objects": [ { "oid": "sha256_hash", "size": 52428800, "actions": { "upload": { "href": "https://s3.../presigned_url", "expires_at": "2025-10-02T00:00:00Z" } } } ] } ``` **Response** (if file already exists): ```json { "transfer": "basic", "objects": [ { "oid": "sha256_hash", "size": 52428800 // No "actions" field = already exists } ] } ``` #### Phase 2: Upload to S3 ``` ┌────────┐ ┌─────────┐ │ Client │---- PUT file -------->│ S3 │ └────────┘ (presigned URL) └─────────┘ Direct upload lfs/ab/cd/ (no proxy!) abcd123... ``` **Key Point**: Client uploads directly to S3 using the presigned URL. Kohaku Hub server is NOT involved in data transfer. 
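The decision logic and the LFS batch exchange above can be sketched client-side as follows. This is a minimal illustration, not KohakuHub's own client code: the helper names (`upload_mode`, `lfs_storage_key`, `build_batch_request`, `pending_uploads`) are hypothetical, and the key layout simply follows the `lfs/ab/cd/<sha256>` structure shown under Storage Layout.

```python
import json

# Documented default for KOHAKU_HUB_LFS_THRESHOLD_BYTES (5MB = 5,242,880 bytes).
LFS_THRESHOLD_BYTES = 5 * 1024 * 1024


def upload_mode(size: int, threshold: int = LFS_THRESHOLD_BYTES) -> str:
    """Mirror the documented rule: <= threshold is "regular", larger is "lfs"."""
    return "regular" if size <= threshold else "lfs"


def lfs_storage_key(sha256_hex: str) -> str:
    """Content-addressable key: lfs/<first 2 chars>/<next 2 chars>/<full hash>."""
    return f"lfs/{sha256_hex[:2]}/{sha256_hex[2:4]}/{sha256_hex}"


def build_batch_request(oid: str, size: int) -> str:
    """JSON body for POST /{repo_id}.git/info/lfs/objects/batch."""
    return json.dumps(
        {
            "operation": "upload",
            "transfers": ["basic", "multipart"],
            "objects": [{"oid": oid, "size": size}],
        }
    )


def pending_uploads(batch_response: dict) -> dict:
    """Map oid -> presigned URL for objects that still need a PUT.

    Objects returned without an "actions" field already exist and are skipped.
    """
    return {
        obj["oid"]: obj["actions"]["upload"]["href"]
        for obj in batch_response.get("objects", [])
        if "upload" in obj.get("actions", {})
    }
```

A client would call `pending_uploads()` on the parsed batch response, PUT each file to its presigned URL, and then reference every object (uploaded or already present) as an `lfsFile` entry in the commit.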
### Step 3: Commit **Purpose**: Atomically commit all changes to the repository **Endpoint**: `POST /api/{repo_type}s/{repo_id}/commit/{revision}` **Format**: NDJSON (Newline-Delimited JSON) **Example Payload**: ``` {"key":"header","value":{"summary":"Add model files","description":"Initial upload"}} {"key":"file","value":{"path":"config.json","content":"eyJtb2RlbCI6...","encoding":"base64"}} {"key":"lfsFile","value":{"path":"model.bin","algo":"sha256","oid":"abc123...","size":52428800}} {"key":"deletedFile","value":{"path":"old_config.json"}} ``` **Operation Types**: | Key | Description | Usage | |-----|-------------|-------| | `header` | Commit metadata | Required, must be first line | | `file` | Small file (inline base64) | For files ≤ 5MB | | `lfsFile` | Large file (LFS reference) | For files > 5MB, already uploaded to S3 | | `deletedFile` | Delete a single file | Remove file from repo | | `deletedFolder` | Delete folder recursively | Remove all files in folder | | `copyFile` | Copy file within repo | Duplicate file (deduplication-aware) | **Response**: ```json { "commitUrl": "https://hub.example.com/repo/commit/abc123", "commitOid": "abc123def456", "pullRequestUrl": null } ``` **What Happens**: ``` 1. Regular files: ┌─────────┐ │ Decode │ Base64 -> Binary └────┬────┘ | v ┌─────────┐ │ Upload │ To LakeFS └────┬────┘ | v ┌─────────┐ │ Update │ Database record └─────────┘ 2. LFS files: ┌─────────┐ │ Link │ S3 physical address -> LakeFS └────┬────┘ | v ┌─────────┐ │ Update │ Database record └─────────┘ 3. 
Commit: ┌─────────┐ │ LakeFS │ Create commit with all changes └─────────┘ ``` ## Download Workflow ```mermaid sequenceDiagram participant Client participant API as FastAPI participant LakeFS participant S3 Note over Client,S3: Optional: HEAD request for metadata Client->>API: HEAD /resolve/{revision}/{filename} API->>LakeFS: Stat object LakeFS-->>API: Object metadata (SHA256, size) API-->>Client: Headers (ETag, Content-Length, X-Repo-Commit) Note over Client,S3: Download: GET request Client->>API: GET /resolve/{revision}/{filename} API->>LakeFS: Get object metadata API->>S3: Generate presigned URL API-->>Client: 302 Redirect (presigned URL) Client->>S3: Direct download S3-->>Client: File content Note over Client: No proxy - direct S3 download ``` ### Step 1: Get Metadata (HEAD) **Endpoint**: `HEAD /{repo_id}/resolve/{revision}/{filename}` **Response Headers**: ``` X-Repo-Commit: abc123def456 X-Linked-Etag: "sha256:abc123..." X-Linked-Size: 52428800 ETag: "abc123..." Content-Length: 52428800 Location: https://s3.../presigned_download_url ``` **Purpose**: Client checks if file needs re-download (by comparing ETag) ### Step 2: Download (GET) **Endpoint**: `GET /{repo_id}/resolve/{revision}/{filename}` **Response**: HTTP 302 Redirect ``` HTTP/1.1 302 Found Location: https://s3.example.com/presigned_url?expires=... X-Repo-Commit: abc123def456 X-Linked-Etag: "sha256:abc123..." ``` **Flow**: ``` ┌────────┐ ┌──────────┐ │ Client │───── GET ─────>│ Kohaku │ └────────┘ │ Hub │ ▲ └─────┬────┘ │ │ │ 302 Redirect │ Generate │ (presigned URL) │ presigned │<─────────────────────────┘ URL │ │ ┌──────────┐ └───>│ S3 │ │ Direct │ │ Download │ └──────────┘ ``` **Key Point**: Client downloads directly from S3. Kohaku Hub only provides the redirect URL. ## Repository Privacy & Filtering KohakuHub respects repository privacy settings when listing repositories. 
The visibility of repositories depends on authentication: ### Privacy Rules **For Unauthenticated Users:** - Can only see **public** repositories **For Authenticated Users:** - Can see all **public** repositories - Can see their **own private** repositories - Can see **private repositories** in organizations they belong to ### List Repositories Endpoint **Pattern**: `/api/{type}s` where type is `model`, `dataset`, or `space` **Query Parameters:** - `author`: Filter by author/namespace (username or organization) - `limit`: Maximum results (default: 50, max: 1000) **Examples:** ```bash # List all public models GET /api/models # List models by author (respects privacy) GET /api/models?author=my-org # Authenticated user sees their private repos too GET /api/models?author=my-org Authorization: Bearer YOUR_TOKEN ``` ### List User's All Repositories **Endpoint**: `GET /api/users/{username}/repos` Returns all repositories for a user/organization, grouped by type. **Response:** ```json { "models": [ {"id": "user/model-1", "private": false, ...}, {"id": "user/model-2", "private": true, ...} ], "datasets": [ {"id": "user/dataset-1", "private": false, ...} ], "spaces": [] } ``` **Note**: Private repositories are only included if: 1. The requesting user is the owner, OR 2. The requesting user is a member of the organization ## Repository Management ### Create Repository **Endpoint**: `POST /api/repos/create` **Request**: ```json { "type": "model", "name": "my-model", "organization": "my-org", "private": false } ``` **What Happens**: ``` 1. Check if exists └─ Query DB for repo 2. Create LakeFS repo └─ Repository: hf-model-my-org-my-model └─ Storage: s3://bucket/hf-model-my-org-my-model └─ Default branch: main 3. Record in DB └─ INSERT INTO repository (...) 
``` **Response**: ```json { "url": "https://hub.example.com/models/my-org/my-model", "repo_id": "my-org/my-model" } ``` ### List Repository Files **Endpoint**: `GET /api/{repo_type}s/{repo_id}/tree/{revision}/{path}` **Query Parameters**: - `recursive`: List all files recursively (default: false) - `expand`: Include LFS metadata (default: false) **Response**: ```json [ { "type": "file", "oid": "abc123", "size": 1024, "path": "config.json" }, { "type": "file", "oid": "def456", "size": 52428800, "path": "model.bin", "lfs": { "oid": "def456", "size": 52428800, "pointerSize": 134 } }, { "type": "directory", "oid": "", "size": 0, "path": "configs" } ] ``` ### Delete Repository **Endpoint**: `DELETE /api/repos/delete` **Request**: ```json { "type": "model", "name": "my-model", "organization": "my-org" } ``` **What Happens**: ``` 1. Delete from LakeFS └─ Remove repository metadata └─ (Objects remain in S3 for safety) 2. Delete from DB ├─ DELETE FROM file WHERE repo_full_id = ... ├─ DELETE FROM staging_upload WHERE repo_full_id = ... └─ DELETE FROM repository WHERE full_id = ... 3. 
Return success ``` ## Database Schema ```mermaid erDiagram USER ||--o{ REPOSITORY : owns USER ||--o{ SESSION : has USER ||--o{ TOKEN : has USER ||--o{ SSHKEY : has USER }o--o{ ORGANIZATION : member_of ORGANIZATION ||--o{ REPOSITORY : owns REPOSITORY ||--o{ FILE : contains REPOSITORY ||--o{ COMMIT : has REPOSITORY ||--o{ STAGINGUPLOAD : has COMMIT ||--o{ LFSOBJECTHISTORY : references USER { int id PK string username UK string email UK string password_hash boolean email_verified boolean is_active bigint private_quota_bytes bigint public_quota_bytes bigint private_used_bytes bigint public_used_bytes datetime created_at } REPOSITORY { int id PK string repo_type string namespace string name string full_id boolean private int owner_id FK datetime created_at } FILE { int id PK string repo_full_id string path_in_repo int size string sha256 boolean lfs datetime created_at datetime updated_at } COMMIT { int id PK string commit_id string repo_full_id string repo_type string branch int user_id FK string username text message text description datetime created_at } ORGANIZATION { int id PK string name UK text description bigint private_quota_bytes bigint public_quota_bytes bigint private_used_bytes bigint public_used_bytes datetime created_at } TOKEN { int id PK int user_id FK string token_hash UK string name datetime last_used datetime created_at } SESSION { int id PK string session_id UK int user_id FK string secret datetime expires_at datetime created_at } SSHKEY { int id PK int user_id FK string key_type text public_key string fingerprint UK string title datetime last_used datetime created_at } STAGINGUPLOAD { int id PK string repo_full_id string repo_type string revision string path_in_repo string sha256 int size string upload_id string storage_key boolean lfs datetime created_at } LFSOBJECTHISTORY { int id PK string repo_full_id string path_in_repo string sha256 int size string commit_id datetime created_at } ``` ### Key Tables **Repository Table** - Stores repository 
metadata:
- Unique constraint on `(repo_type, namespace, name)`
- Allows same `full_id` across different `repo_type`
- Example: `model:myorg/mymodel`, `dataset:myorg/mymodel`

**File Table** - Deduplication and metadata:
- Unique constraint on `(repo_full_id, path_in_repo)`
- `sha256` indexed for fast deduplication lookups
- `lfs` flag indicates if file uses LFS storage

**Commit Table** - User commit tracking:
- `commit_id` is LakeFS commit SHA
- Indexed by `(repo_full_id, branch)` for fast queries
- Denormalized `username` for performance

**LFSObjectHistory Table** - LFS garbage collection:
- Tracks which commits reference which LFS objects
- Enables preserving K versions of each file (default: 5)
- Used for auto-cleanup of old LFS objects

**StagingUpload Table** - Multipart upload tracking:
- Tracks ongoing multipart uploads
- Enables upload resume
- Cleans up failed uploads

## LakeFS Integration

### Repository Naming Convention

```
Pattern: {namespace}-{repo_type}-{org}-{name}

Examples:
  HuggingFace repo: "myorg/mymodel"
  LakeFS repo:      "hf-model-myorg-mymodel"

  HuggingFace repo: "johndoe/dataset"
  LakeFS repo:      "hf-dataset-johndoe-dataset"
```

In this pattern, `{namespace}` is the LakeFS repository prefix (`hf` in the examples above), not the user or organization namespace used in API paths; `{org}` is that user or organization.

### Implementation Notes

**Database Operations:**
- **Synchronous:** Uses Peewee ORM with synchronous operations
- **Transactions:** `db.atomic()` ensures ACID compliance across concurrent workers
- **Multi-Worker Safe:** Designed for horizontal scaling (4-8 workers recommended)
- **Future:** Migration to peewee-async planned for improved concurrency

**LakeFS Operations:**
- **Pure Async:** All operations use REST API via httpx (no thread pools!)
- **No Deprecated Library:** Uses direct REST API instead of lakefs-client ### Key Operations **All LakeFS operations use pure async REST API via httpx (no thread pools!):** | Operation | LakeFS REST Endpoint | KohakuHub Method | Purpose | |-----------|---------------------|------------------|---------| | Create Repo | `POST /repositories` | `create_repository()` | Initialize new repository | | Upload Small File | `POST /repositories/{repo}/branches/{branch}/objects` | `upload_object()` | Direct content upload | | Link LFS File | `PUT /repositories/{repo}/branches/{branch}/staging/backing` | `link_physical_address()` | Link S3 object to LakeFS | | Commit | `POST /repositories/{repo}/branches/{branch}/commits` | `commit()` | Create atomic commit | | List Files | `GET /repositories/{repo}/refs/{ref}/objects/ls` | `list_objects()` | Browse repository | | Get File Info | `GET /repositories/{repo}/refs/{ref}/objects/stat` | `stat_object()` | Get file metadata | | Get File Content | `GET /repositories/{repo}/refs/{ref}/objects` | `get_object()` | Download file | | Delete File | `DELETE /repositories/{repo}/branches/{branch}/objects` | `delete_object()` | Remove file | | Create Branch | `POST /repositories/{repo}/branches` | `create_branch()` | Create new branch | | Delete Branch | `DELETE /repositories/{repo}/branches/{branch}` | `delete_branch()` | Delete branch | | Create Tag | `POST /repositories/{repo}/tags` | `create_tag()` | Create tag | | Delete Tag | `DELETE /repositories/{repo}/tags/{tag}` | `delete_tag()` | Delete tag | | Revert | `POST /repositories/{repo}/branches/{branch}/revert` | `revert_branch()` | Revert commit | | Merge | `POST /repositories/{repo}/refs/{source}/merge/{dest}` | `merge_into_branch()` | Merge branches | | Hard Reset | `PUT /repositories/{repo}/branches/{branch}/hard_reset` | `hard_reset_branch()` | Reset branch to commit | ### Physical Address Linking ``` When uploading LFS file: 1. Client uploads to S3: s3://bucket/lfs/ab/cd/abcd1234... 
2. Kohaku Hub links to LakeFS: ┌──────────────────────────────────┐ │ StagingMetadata │ ├──────────────────────────────────┤ │ physical_address: │ │ "s3://bucket/lfs/ab/cd/abc..." │ │ checksum: "sha256:abc..." │ │ size_bytes: 52428800 │ └──────────────────────────────────┘ │ ▼ ┌──────────────────────────────────┐ │ LakeFS: model.bin │ │ → Points to S3 object │ └──────────────────────────────────┘ 3. On commit: LakeFS records this link in its metadata ``` ## API Endpoint Summary ### Repository Operations | Endpoint | Method | Auth | Description | |----------|--------|------|-------------| | `/api/repos/create` | POST | ✓ | Create new repository | | `/api/repos/delete` | DELETE | ✓ | Delete repository | | `/api/repos/move` | POST | ✓ | Move/rename repository | | `/api/{type}s` | GET | ○ | List repositories (respects privacy) | | `/api/{type}s/{id}` | GET | ○ | Get repo info | | `/api/{type}s/{id}/tree/{rev}/{path}` | GET | ○ | List files | | `/api/{type}s/{id}/revision/{rev}` | GET | ○ | Get revision info | | `/api/{type}s/{id}/paths-info/{rev}` | POST | ○ | Get info for specific paths | | `/api/users/{username}/repos` | GET | ○ | List all repos for a user/org (grouped by type) | ### File Operations | Endpoint | Method | Auth | Description | |----------|--------|------|-------------| | `/api/{type}s/{id}/preupload/{rev}` | POST | ✓ | Check before upload | | `/api/{type}s/{id}/commit/{rev}` | POST | ✓ | Atomic commit | | `/{id}/resolve/{rev}/{file}` | GET | ○ | Download file | | `/{id}/resolve/{rev}/{file}` | HEAD | ○ | Get file metadata | | `/{type}s/{id}/resolve/{rev}/{file}` | GET | ○ | Download file (with type) | | `/{type}s/{id}/resolve/{rev}/{file}` | HEAD | ○ | Get file metadata (with type) | ### LFS Operations | Endpoint | Method | Auth | Description | |----------|--------|------|-------------| | `/{id}.git/info/lfs/objects/batch` | POST | ✓ | LFS batch API | | `/api/{id}.git/info/lfs/verify` | POST | ✓ | Verify upload | ### Commit History | Endpoint | Method 
| Auth | Description | |----------|--------|------|-------------| | `/{type}s/{namespace}/{name}/commits/{branch}` | GET | ○ | List commits on a branch with pagination | ### Branch and Tag Management | Endpoint | Method | Auth | Description | |----------|--------|------|-------------| | `/{type}s/{namespace}/{name}/branch` | POST | ✓ | Create a new branch | | `/{type}s/{namespace}/{name}/branch/{branch}` | DELETE | ✓ | Delete a branch | | `/{type}s/{namespace}/{name}/tag` | POST | ✓ | Create a new tag | | `/{type}s/{namespace}/{name}/tag/{tag}` | DELETE | ✓ | Delete a tag | ### Settings Management | Endpoint | Method | Auth | Description | |----------|--------|------|-------------| | `/users/{username}/settings` | PUT | ✓ | Update user settings | | `/organizations/{org_name}/settings` | PUT | ✓ | Update organization settings | | `/{type}s/{namespace}/{name}/settings` | PUT | ✓ | Update repository settings (private, gated) | ### Authentication Operations | Endpoint | Method | Auth | Description | |----------|--------|------|-------------| | `/api/auth/register` | POST | ✗ | Register new user | | `/api/auth/login` | POST | ✗ | Login and create session | | `/api/auth/logout` | POST | ✓ | Logout and destroy session | | `/api/auth/verify-email` | GET | ✗ | Verify email with token | | `/api/auth/me` | GET | ✓ | Get current user info | | `/api/auth/tokens` | GET | ✓ | List user's API tokens | | `/api/auth/tokens/create` | POST | ✓ | Create new API token | | `/api/auth/tokens/{token_id}` | DELETE | ✓ | Revoke API token | ### Organization Operations | Endpoint | Method | Auth | Description | |----------|--------|------|-------------| | `/org/create` | POST | ✓ | Create new organization | | `/org/{org_name}` | GET | ✗ | Get organization details | | `/org/{org_name}/members` | GET | ○ | List organization members | | `/org/{org_name}/members` | POST | ✓ | Add member to organization | | `/org/{org_name}/members/{username}` | DELETE | ✓ | Remove member from organization | | 
`/org/{org_name}/members/{username}` | PUT | ✓ | Update member role | | `/org/users/{username}/orgs` | GET | ✗ | List user's organizations | ### Utility Operations | Endpoint | Method | Auth | Description | |----------|--------|------|-------------| | `/api/validate-yaml` | POST | ✗ | Validate YAML content | | `/api/whoami-v2` | GET | ✓ | Get detailed current user info | | `/api/version` | GET | ✗ | Get API version information | | `/health` | GET | ✗ | Health check | | `/` | GET | ✗ | API information | **Auth Legend**: - ✓ = Required - ○ = Optional (public repos) - ✗ = Not required --- ## Detailed Endpoint Documentation ### Commit History API The commit history API allows you to retrieve the commit log for a specific branch in a repository. **Endpoint**: `GET /{repo_type}s/{namespace}/{name}/commits/{branch}` **Query Parameters**: - `page`: Page number for pagination (default: 1) - `limit`: Number of commits per page (default: 20) **Example Request**: ```bash GET /models/myorg/mymodel/commits/main?page=1&limit=20 ``` **Response**: ```json { "commits": [ { "id": "abc123def456", "message": "Update model config", "author": "john@example.com", "committer": "john@example.com", "createdAt": "2025-10-05T12:00:00Z", "parents": ["parent123"] } ], "pagination": { "page": 1, "limit": 20, "total": 150, "hasMore": true } } ``` ### Branch and Tag Management #### Create Branch **Endpoint**: `POST /{repo_type}s/{namespace}/{name}/branch` **Request**: ```json { "branch": "feature-branch", "startPoint": "main" } ``` **Response**: ```json { "success": true, "branch": "feature-branch", "ref": "refs/heads/feature-branch" } ``` #### Delete Branch **Endpoint**: `DELETE /{repo_type}s/{namespace}/{name}/branch/{branch}` **Example**: `DELETE /models/myorg/mymodel/branch/feature-branch` **Response**: ```json { "success": true, "deleted": "feature-branch" } ``` **Note**: Cannot delete the default branch (usually `main`). 
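The branch endpoints above can be wrapped in a few small client-side helpers. The sketch below only builds paths and request bodies; the function names are hypothetical, namespace, repository, and branch names are assumed to be URL-safe, and the default-branch guard merely mirrors the server-side rule from the note above so the error surfaces before a request is sent.

```python
import json


def branch_collection_path(repo_type: str, namespace: str, name: str) -> str:
    """Path for POST /{repo_type}s/{namespace}/{name}/branch."""
    return f"/{repo_type}s/{namespace}/{name}/branch"


def create_branch_body(branch: str, start_point: str = "main") -> str:
    """JSON body matching the documented create-branch request."""
    return json.dumps({"branch": branch, "startPoint": start_point})


def delete_branch_path(
    repo_type: str,
    namespace: str,
    name: str,
    branch: str,
    default_branch: str = "main",  # assumption: the repository's default branch
) -> str:
    """Path for DELETE /{repo_type}s/{namespace}/{name}/branch/{branch}."""
    if branch == default_branch:
        raise ValueError(f"refusing to delete default branch {branch!r}")
    return f"{branch_collection_path(repo_type, namespace, name)}/{branch}"
```

For example, `delete_branch_path("model", "myorg", "mymodel", "feature-branch")` yields the path used in the Delete Branch example above.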
#### Create Tag **Endpoint**: `POST /{repo_type}s/{namespace}/{name}/tag` **Request**: ```json { "tag": "v1.0.0", "ref": "main", "message": "Release version 1.0.0" } ``` **Response**: ```json { "success": true, "tag": "v1.0.0", "ref": "refs/tags/v1.0.0" } ``` #### Delete Tag **Endpoint**: `DELETE /{repo_type}s/{namespace}/{name}/tag/{tag}` **Example**: `DELETE /models/myorg/mymodel/tag/v1.0.0` **Response**: ```json { "success": true, "deleted": "v1.0.0" } ``` ### Settings Management #### Update User Settings **Endpoint**: `PUT /users/{username}/settings` **Request**: ```json { "email": "newemail@example.com", "displayName": "John Doe", "bio": "ML Engineer", "website": "https://example.com" } ``` **Response**: ```json { "success": true, "user": { "username": "johndoe", "email": "newemail@example.com", "displayName": "John Doe" } } ``` #### Update Organization Settings **Endpoint**: `PUT /organizations/{org_name}/settings` **Request**: ```json { "displayName": "My Organization", "description": "Building amazing ML models", "website": "https://example.com", "avatar": "https://cdn.example.com/avatar.png" } ``` **Response**: ```json { "success": true, "organization": { "name": "my-org", "displayName": "My Organization", "description": "Building amazing ML models" } } ``` #### Update Repository Settings **Endpoint**: `PUT /{repo_type}s/{namespace}/{name}/settings` **Request**: ```json { "private": true, "gated": false, "description": "A state-of-the-art language model", "tags": ["nlp", "transformers", "llm"] } ``` **Response**: ```json { "success": true, "repository": { "id": "myorg/mymodel", "private": true, "gated": false, "description": "A state-of-the-art language model" } } ``` **Privacy Options**: - `private: false` - Public repository, visible to everyone - `private: true` - Private repository, only visible to owner and organization members - `gated: true` - Requires explicit permission to access (for controlled releases) #### Move/Rename Repository **Endpoint**: 
`POST /api/repos/move`

**Request**:
```json
{
  "fromRepo": {
    "type": "model",
    "namespace": "oldorg",
    "name": "oldname"
  },
  "toRepo": {
    "type": "model",
    "namespace": "neworg",
    "name": "newname"
  }
}
```

**Response**:
```json
{
  "success": true,
  "url": "https://hub.example.com/models/neworg/newname",
  "message": "Repository moved successfully"
}
```

**What Happens**:
1. Validates that source repository exists and user has permission
2. Checks that destination doesn't already exist
3. Updates LakeFS repository name
4. Updates all database records
5. Creates redirect from old URL to new URL

**Note**: This operation is atomic - either everything succeeds or everything rolls back.

### Version and Utility Endpoints

#### Get API Version

**Endpoint**: `GET /api/version`

**Response**:
```json
{
  "version": "1.0.0",
  "apiVersion": "v1",
  "lfsVersion": "2.0",
  "features": {
    "lfs": true,
    "multipart": true,
    "deduplication": true,
    "organizations": true
  },
  "limits": {
    "maxFileSize": 107374182400,
    "lfsThreshold": 5242880
  }
}
```

#### Validate YAML

**Endpoint**: `POST /api/validate-yaml`

**Request**:
```json
{
  "content": "model:\n name: gpt-2\n version: 1.0"
}
```

**Response** (if valid):
```json
{
  "valid": true,
  "parsed": {
    "model": {
      "name": "gpt-2",
      "version": 1.0
    }
  }
}
```

**Response** (if invalid):
```json
{
  "valid": false,
  "error": "Invalid YAML syntax at line 2: unexpected character",
  "line": 2,
  "column": 10
}
```

**Use Case**: Validate README.md frontmatter, model card YAML, or configuration files before upload.
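For the README.md use case, the client first has to isolate the front matter before sending it. The following is a minimal sketch under the assumption that front matter is the block between a leading `---` line and the next `---` line; both helper names are hypothetical, and only the `{"content": ...}` request shape comes from the endpoint description above.

```python
import json
from typing import Optional


def extract_front_matter(readme_text: str) -> Optional[str]:
    """Return the YAML between the leading '---' lines, or None if absent."""
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            return "\n".join(lines[1:index])
    return None  # opening delimiter without a closing one


def validate_yaml_body(yaml_text: str) -> str:
    """JSON body for POST /api/validate-yaml."""
    return json.dumps({"content": yaml_text})
```

A caller would skip validation when `extract_front_matter()` returns `None` and otherwise POST the result of `validate_yaml_body()` before committing the file.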
#### Get Detailed User Info (whoami-v2) **Endpoint**: `GET /api/whoami-v2` **Response**: ```json { "type": "user", "id": "12345", "name": "johndoe", "fullname": "John Doe", "email": "john@example.com", "emailVerified": true, "canPay": true, "isPro": false, "periodEnd": null, "avatarUrl": "https://cdn.example.com/avatars/johndoe.png", "orgs": [ { "name": "my-org", "fullname": "My Organization", "email": "contact@my-org.com", "avatarUrl": "https://cdn.example.com/orgs/my-org.png", "roleInOrg": "admin" } ], "auth": { "accessToken": { "displayName": "API Token", "role": "write" } } } ``` **Compared to `/api/auth/me`**: This endpoint provides more detailed information including: - Organization memberships with roles - Token information - Subscription/payment status - Email verification status ## Content Deduplication Kohaku Hub implements content-addressable storage for LFS files: ``` Same file uploaded to different repos: Repo A: myorg/model-v1 └─ model.bin (sha256: abc123...) Repo B: myorg/model-v2 └─ model.bin (sha256: abc123...) S3 Storage: └─ lfs/ab/c1/abc123... ← SINGLE COPY ▲ ▲ │ │ Repo A Repo B (linked) (linked) Benefits: - Save storage space - Faster uploads (skip if exists) - Efficient for model variants ``` **Deduplication Points**: 1. **Preupload Check**: Query DB by SHA256 2. **LFS Batch API**: Check if OID exists 3. 
**Commit**: Link existing S3 object instead of uploading

## Error Handling

Kohaku Hub uses HuggingFace-compatible error headers:

```
HTTP Response Headers:
  X-Error-Code: RepoNotFound
  X-Error-Message: Repository 'org/repo' not found
```

**Error Codes**:

| Code | HTTP Status | Description |
|------|-------------|-------------|
| `RepoNotFound` | 404 | Repository doesn't exist |
| `RepoExists` | 400 | Repository already exists |
| `RevisionNotFound` | 404 | Branch/commit not found |
| `EntryNotFound` | 404 | File not found |
| `GatedRepo` | 403 | Need permission |
| `BadRequest` | 400 | Invalid request |
| `ServerError` | 500 | Internal error |

These error codes are parsed by the `huggingface_hub` client to raise the appropriate Python exceptions.

## Performance Considerations

### Upload Performance

```
Small Files (≤5MB by default):
  Client → FastAPI → LakeFS → S3
  (Proxied through server)

Large Files (>5MB by default):
  Client ─────────────────────→ S3
  (Direct upload, no proxy)
         ↓
  Kohaku Hub (only metadata link)
```

**Why this matters**: Large-file data never passes through the application server, so upload throughput is limited only by client and S3 bandwidth. The cutoff is the LFS threshold (`KOHAKU_HUB_LFS_THRESHOLD_BYTES`).

### Download Performance

```
All Downloads:
  Client → Kohaku Hub → 302 Redirect → S3
           (metadata)                  (direct)
```

**Why this matters**: After the initial redirect, all data transfer comes directly from S3/CDN. The server only generates presigned URLs.
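A client can avoid even the redirected transfer by comparing a local file against the headers of a HEAD request first. The sketch below assumes that `X-Linked-Etag` (or `ETag`) carries the file's full SHA256 hex digest, optionally quoted and prefixed with `sha256:` as in the header examples in the Download Workflow section; this may not hold for every file type. The function names are hypothetical, and `headers` is a plain dict, so keys are matched case-sensitively here.

```python
import hashlib
from typing import Mapping, Optional


def normalize_etag(value: str) -> str:
    """Strip a weak-validator prefix, quotes, and an optional 'sha256:' prefix."""
    value = value.strip()
    if value.startswith("W/"):
        value = value[2:]
    value = value.strip('"')
    if value.startswith("sha256:"):
        value = value[len("sha256:"):]
    return value.lower()


def needs_download(local_bytes: Optional[bytes], headers: Mapping[str, str]) -> bool:
    """Return True unless the local content hash matches the remote ETag."""
    if local_bytes is None:
        return True  # nothing cached locally
    remote = headers.get("X-Linked-Etag") or headers.get("ETag")
    if remote is None:
        return True  # no validator available, download to be safe
    return hashlib.sha256(local_bytes).hexdigest() != normalize_etag(remote)
```

When `needs_download()` returns `False`, the client can skip the GET request and reuse its cached copy.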
### Recommended S3 Providers

| Provider | Best For | Pricing Model | Notes |
|----------|----------|---------------|-------|
| Cloudflare R2 | High download | Free egress, $0.015/GB storage | Best for public datasets |
| Wasabi | Archive/backup | $6/TB/month, free egress* | *if download < storage |
| MinIO | Self-hosted | Free (your hardware/bandwidth) | Full control, privacy |
| AWS S3 | Enterprise | Pay per GB + egress | Most features, expensive egress |
| Backblaze B2 | Budget | $6/TB storage, $0.01/GB egress | Good for mixed workloads |

**Recommendation for KohakuHub:**
- **Development**: MinIO (included in docker-compose)
- **Public Hub**: Cloudflare R2 (free egress saves costs)
- **Private/Enterprise**: Self-hosted MinIO or AWS S3 with VPC endpoints