mirror of https://github.com/KohakuBlueleaf/KohakuHub.git synced 2026-05-24 04:01:04 -05:00

Kohaku-Blueleaf 88a2e3c328 update document for APIs

2025-10-24 18:45:55 +08:00

title	description	icon
File Upload & Commit API	Direct file uploads via NDJSON commit protocol	i-carbon-document-add

File Upload & Commit API

Direct file upload and commit operations using the HuggingFace-compatible NDJSON commit protocol.

Preupload Check

Check Files Before Upload

Pattern: POST /{repo_type}s/{namespace}/{name}/preupload/{revision}

Authentication: Required (write permission)

Purpose:

Determine upload mode (regular vs LFS)
Check for duplicate files (content deduplication)
Validate quota before upload

Request Body:

{
  "files": [
    {
      "path": "model.safetensors",
      "size": 5368709120,
      "sha256": "abc123def456...",
      "sample": "base64_encoded_first_512_bytes"
    },
    {
      "path": "config.json",
      "size": 512
    }
  ]
}

Field Explanations:

path: File path in repository (required)
size: File size in bytes (required)
sha256: SHA256 hash of file (optional, enables deduplication)
sample: Base64 encoded sample of file content (optional, for small files)
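
A preupload entry with all four fields can be assembled from a local file roughly as follows. This is a minimal sketch: the helper name `build_preupload_entry` is hypothetical, and the 512-byte sample length is an assumption taken from the `base64_encoded_first_512_bytes` placeholder in the request body above.

```python
import base64
import hashlib


def build_preupload_entry(local_path, repo_path, sample_bytes=512):
    """Return a dict with path, size, sha256, and a base64 sample of the file."""
    digest = hashlib.sha256()
    sample = b""
    size = 0
    with open(local_path, "rb") as f:
        # Hash in 1 MiB chunks so large files are never fully loaded into memory.
        while chunk := f.read(1024 * 1024):
            if len(sample) < sample_bytes:
                # Keep only the leading bytes of the file as the sample (assumed length).
                sample += chunk[: sample_bytes - len(sample)]
            digest.update(chunk)
            size += len(chunk)
    return {
        "path": repo_path,
        "size": size,
        "sha256": digest.hexdigest(),
        "sample": base64.b64encode(sample).decode("ascii"),
    }
```

The resulting dicts can be placed directly in the `files` array of the request body.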

Response:

{
  "files": [
    {
      "path": "model.safetensors",
      "uploadMode": "lfs",
      "shouldIgnore": false
    },
    {
      "path": "config.json",
      "uploadMode": "regular",
      "shouldIgnore": true
    }
  ]
}

Upload Modes:

"lfs": File matches LFS criteria (size ≥ threshold OR suffix match)
"regular": File is small enough for inline base64 upload

Should Ignore:

true: File with same content already exists (skip upload)
false: File is new or changed (upload required)
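
The two response fields together decide what a client does with each file. The sketch below groups the response from the example above into files to skip, files to send inline, and files to send through LFS. The function name `plan_uploads` is hypothetical.

```python
def plan_uploads(preupload_response):
    """Group response entries into paths to skip, send inline, or send via LFS."""
    plan = {"skip": [], "regular": [], "lfs": []}
    for entry in preupload_response["files"]:
        if entry["shouldIgnore"]:
            # Same content already exists, so no upload is needed.
            plan["skip"].append(entry["path"])
        else:
            # uploadMode is either "regular" or "lfs".
            plan[entry["uploadMode"]].append(entry["path"])
    return plan


response = {
    "files": [
        {"path": "model.safetensors", "uploadMode": "lfs", "shouldIgnore": False},
        {"path": "config.json", "uploadMode": "regular", "shouldIgnore": True},
    ]
}
print(plan_uploads(response))
# {'skip': ['config.json'], 'regular': [], 'lfs': ['model.safetensors']}
```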

Status Codes:

200 OK - Success
400 Bad Request - Invalid payload
404 Not Found - Repository not found
413 Payload Too Large - Quota exceeded

Commit Operation

Create Commit with Multiple File Operations

Pattern: POST /{repo_type}s/{namespace}/{name}/commit/{revision}

Authentication: Required (write permission)

Content-Type: application/x-ndjson or application/json

Purpose: Atomic commit with multiple file operations (add/modify/delete/copy)

Request Format:

NDJSON (Newline-Delimited JSON) - one JSON object per line:

{"key": "header", "value": {"summary": "Update model", "description": "Improved accuracy"}}
{"key": "file", "value": {"path": "config.json", "content": "base64_content", "encoding": "base64"}}
{"key": "lfsFile", "value": {"path": "model.safetensors", "oid": "sha256_hash", "size": 5368709120, "algo": "sha256"}}
{"key": "deletedFile", "value": {"path": "old_file.txt"}}
{"key": "deletedFolder", "value": {"path": "old_folder/"}}
{"key": "copyFile", "value": {"path": "new_location.txt", "srcPath": "source.txt", "srcRevision": "main"}}
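
A payload in this format can be produced with the standard `json` module. The following is a minimal sketch; `build_ndjson` is a hypothetical helper name, and the operations are passed as `(key, value)` pairs.

```python
import json


def build_ndjson(summary, operations, description=""):
    """Serialize a header plus (key, value) operations, one JSON object per line."""
    header = {"key": "header", "value": {"summary": summary, "description": description}}
    lines = [json.dumps(header)]
    for key, value in operations:
        # json.dumps escapes newlines inside strings, so each object stays on one line.
        lines.append(json.dumps({"key": key, "value": value}))
    return "\n".join(lines)


payload = build_ndjson(
    "Update model",
    [
        ("deletedFile", {"path": "old_file.txt"}),
        ("deletedFolder", {"path": "old_folder/"}),
    ],
    description="Improved accuracy",
)
```

The header is always emitted first, which matches the ordering requirement of the protocol.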

Operation Types

1. Header (Required)

The first line must be the header:

{
  "key": "header",
  "value": {
    "summary": "Commit message",
    "description": "Optional detailed description"
  }
}

2. Regular File (Inline Base64)

For files < LFS threshold:

{
  "key": "file",
  "value": {
    "path": "config.json",
    "content": "eyJtb2RlbCI6ICJiZXJ0In0=",
    "encoding": "base64"
  }
}

Rules:

File size MUST be < LFS threshold
Content is base64 encoded
Encoding must be "base64"
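
These rules can be enforced on the client before the request is sent. The sketch below is an illustration only: `make_file_op` is a hypothetical helper, and the default threshold of 5,000,000 bytes is an assumption borrowed from the error example in this section. The real threshold is set by the server, and the preupload response remains the authoritative source for the upload mode.

```python
import base64

# Assumed value; the server's configured LFS threshold may differ.
ASSUMED_LFS_THRESHOLD = 5_000_000


def make_file_op(path, data, lfs_threshold=ASSUMED_LFS_THRESHOLD):
    """Build a 'file' operation, refusing content at or above the LFS threshold."""
    if len(data) >= lfs_threshold:
        raise ValueError(
            f"{path} is {len(data)} bytes; use an 'lfsFile' operation instead"
        )
    return {
        "key": "file",
        "value": {
            "path": path,
            "content": base64.b64encode(data).decode("ascii"),
            "encoding": "base64",
        },
    }


op = make_file_op("config.json", b'{"model": "bert"}')
print(op["value"]["content"])  # eyJtb2RlbCI6ICJiZXJ0In0=
```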

Error if file too large:

{
  "error": "File config.json should use LFS (size: 10000000 bytes, threshold: 5000000 bytes). Use 'lfsFile' operation instead.",
  "file_size": 10000000,
  "lfs_threshold": 5000000,
  "suggested_operation": "lfsFile"
}

3. LFS File (Already Uploaded to S3)

For files uploaded via LFS batch API:

{
  "key": "lfsFile",
  "value": {
    "path": "model.safetensors",
    "oid": "abc123def456789...",
    "size": 5368709120,
    "algo": "sha256"
  }
}

Prerequisites:

File must be uploaded to S3 via LFS batch API first
OID (SHA256) must match uploaded file
Size must match actual file size

Server validates:

File exists in S3 at lfs/{oid[:2]}/{oid[2:4]}/{oid}
Size matches S3 object size
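
The storage key pattern above shards objects by the first two byte pairs of the digest. A short sketch of the same mapping follows; `lfs_object_key` is a hypothetical name, and the input validation is an added assumption rather than documented server behavior.

```python
import hashlib


def lfs_object_key(oid):
    """Build the key lfs/{oid[:2]}/{oid[2:4]}/{oid} for a SHA256 hex digest."""
    if len(oid) != 64 or any(c not in "0123456789abcdef" for c in oid):
        raise ValueError("oid must be a 64-character lowercase hex SHA256 digest")
    return f"lfs/{oid[:2]}/{oid[2:4]}/{oid}"


oid = hashlib.sha256(b"").hexdigest()
print(lfs_object_key(oid))
```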

4. Delete File

Remove a single file:

{
  "key": "deletedFile",
  "value": {
    "path": "old_model.bin"
  }
}

Behavior:

Marks file as deleted in database (soft delete)
Removes file from LakeFS branch
Preserves LFS history for quota tracking

5. Delete Folder

Remove all files in a folder recursively:

{
  "key": "deletedFolder",
  "value": {
    "path": "old_experiments/"
  }
}

Behavior:

Lists all files under folder recursively
Deletes each file in parallel
Marks files as deleted in database

6. Copy File

Copy a file from the same revision or a different one:

{
  "key": "copyFile",
  "value": {
    "path": "backup/model.safetensors",
    "srcPath": "model.safetensors",
    "srcRevision": "main"
  }
}

Fields:

path: Destination path
srcPath: Source file path
srcRevision: Source revision (branch or commit, defaults to current revision)

Behavior:

Links physical S3 address (no duplication)
Copies database metadata
Works for both regular and LFS files
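
A copy operation can be built with a small helper such as the one below. This is a sketch under one assumption: since `srcRevision` is described as defaulting to the current revision, the helper leaves the field out when no revision is given. `make_copy_op` is a hypothetical name.

```python
def make_copy_op(dest_path, src_path, src_revision=None):
    """Build a 'copyFile' operation; srcRevision is omitted when not provided."""
    value = {"path": dest_path, "srcPath": src_path}
    if src_revision is not None:
        # Only sent when copying from another branch or commit.
        value["srcRevision"] = src_revision
    return {"key": "copyFile", "value": value}


backup_op = make_copy_op("backup/model.safetensors", "model.safetensors", "main")
```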

Response

Success:

{
  "commitUrl": "http://localhost:28080/username/my-model/commit/abc123def",
  "commitOid": "abc123def456789...",
  "pullRequestUrl": null
}

No Changes (the response points at the existing commit instead of a new one):

{
  "commitUrl": "http://localhost:28080/username/my-model/commit/previous_commit",
  "commitOid": "previous_commit_hash",
  "pullRequestUrl": null
}

Complete Upload Workflows

Example 1: Upload Large Model with Config (Single-Part LFS)

import requests
import base64
import hashlib
import json

API_BASE = "http://localhost:28080/api"
REPO_ID = "username/my-model"
TOKEN = "your_token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Step 1: Preupload check
files_info = [
    {"path": "config.json", "size": 512},
    {"path": "model.safetensors", "size": 52428800, "sha256": "abc123..."}  # 50MB
]

preupload_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/preupload/main",
    json={"files": files_info},
    headers=HEADERS
).json()

# Step 2: Upload files based on preupload response
lfs_files = []
regular_files = []

for file_info, preupload in zip(files_info, preupload_resp["files"]):
    if preupload["shouldIgnore"]:
        continue  # File already exists, skip

    if preupload["uploadMode"] == "lfs":
        # Upload via LFS batch API
        with open(file_info["path"], "rb") as f:
            content = f.read()
            sha256 = hashlib.sha256(content).hexdigest()

        # LFS batch request
        batch_resp = requests.post(
            f"{API_BASE}/{REPO_ID}.git/info/lfs/objects/batch",
            json={
                "operation": "upload",
                "transfers": ["basic"],
                "objects": [{"oid": sha256, "size": file_info["size"]}]
            },
            headers=HEADERS
        ).json()

        obj = batch_resp["objects"][0]
        if "actions" not in obj:
            # File already exists in LFS
            lfs_files.append({"path": file_info["path"], "oid": sha256, "size": file_info["size"]})
            continue

        # Single-part upload to S3 (fail loudly if the PUT is rejected)
        with open(file_info["path"], "rb") as f:
            requests.put(obj["actions"]["upload"]["href"], data=f).raise_for_status()

        # Verify
        requests.post(
            obj["actions"]["verify"]["href"],
            json={"oid": sha256, "size": file_info["size"]}
        ).raise_for_status()

        lfs_files.append({"path": file_info["path"], "oid": sha256, "size": file_info["size"]})

    else:
        # Regular file (base64)
        with open(file_info["path"], "rb") as f:
            content_b64 = base64.b64encode(f.read()).decode()

        regular_files.append({
            "path": file_info["path"],
            "content": content_b64,
            "encoding": "base64"
        })

# Step 3: Create commit with all operations
ndjson_lines = [
    json.dumps({"key": "header", "value": {"summary": "Upload model", "description": "Initial upload"}})
]

for f in regular_files:
    ndjson_lines.append(json.dumps({"key": "file", "value": f}))

for f in lfs_files:
    ndjson_lines.append(json.dumps({"key": "lfsFile", "value": {
        "path": f["path"],
        "oid": f["oid"],
        "size": f["size"],
        "algo": "sha256"
    }}))

ndjson_payload = "\n".join(ndjson_lines)

commit_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/commit/main",
    data=ndjson_payload,
    headers={**HEADERS, "Content-Type": "application/x-ndjson"}
).json()

print(f"Committed: {commit_resp['commitUrl']}")

Example 2: Upload Very Large Model (Multipart LFS)

For files ≥ 100MB (default multipart threshold):

import requests
import base64
import hashlib
import json
import math
from concurrent.futures import ThreadPoolExecutor, as_completed

API_BASE = "http://localhost:28080/api"
REPO_ID = "username/my-model"
TOKEN = "your_token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def upload_large_file_multipart(file_path, repo_path):
    """Upload large file using LFS multipart protocol"""

    # Calculate SHA256
    print(f"Calculating SHA256 for {file_path}...")
    sha256_hash = hashlib.sha256()
    file_size = 0
    with open(file_path, "rb") as f:
        while chunk := f.read(8192):
            sha256_hash.update(chunk)
            file_size += len(chunk)

    sha256 = sha256_hash.hexdigest()
    print(f"SHA256: {sha256}, Size: {file_size:,} bytes")

    # Step 1: LFS batch request
    print("Requesting LFS batch upload URLs...")
    batch_resp = requests.post(
        f"{API_BASE}/{REPO_ID}.git/info/lfs/objects/batch",
        json={
            "operation": "upload",
            "transfers": ["basic"],
            "objects": [{"oid": sha256, "size": file_size}],
            "hash_algo": "sha256"
        },
        headers=HEADERS
    ).json()

    obj = batch_resp["objects"][0]

    # Check if file already exists
    if "actions" not in obj:
        print("File already exists in LFS storage (deduplication)")
        return {"oid": sha256, "size": file_size, "path": repo_path}

    upload_action = obj["actions"]["upload"]
    verify_action = obj["actions"]["verify"]

    # Check if multipart
    if "header" in upload_action and "chunk_size" in upload_action["header"]:
        # Multipart upload
        header = upload_action["header"]
        chunk_size = int(header["chunk_size"])
        upload_id = header["upload_id"]

        print(f"Multipart upload: chunk_size={chunk_size:,} bytes")

        # Calculate number of parts
        num_parts = math.ceil(file_size / chunk_size)
        print(f"Uploading {num_parts} parts in parallel...")

        # Upload parts in parallel
        parts = []

        def upload_part(part_number):
            """Upload a single part"""
            part_url = header[str(part_number)]

            # Read chunk
            with open(file_path, "rb") as f:
                f.seek((part_number - 1) * chunk_size)
                chunk = f.read(chunk_size)

            # Upload
            resp = requests.put(part_url, data=chunk)
            resp.raise_for_status()

            # Extract ETag (remove quotes if present)
            etag = resp.headers["ETag"].strip('"')

            print(f"  Part {part_number}/{num_parts} uploaded (ETag: {etag[:8]}...)")
            return {"PartNumber": part_number, "ETag": etag}

        # Upload parts concurrently (max 10 parallel)
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [executor.submit(upload_part, i) for i in range(1, num_parts + 1)]

            for future in as_completed(futures):
                parts.append(future.result())

        # Sort parts by part number
        parts.sort(key=lambda p: p["PartNumber"])

        # Step 2: Complete multipart upload
        print("Completing multipart upload...")
        complete_resp = requests.post(
            f"{API_BASE}/{REPO_ID}.git/info/lfs/complete/{upload_id}",
            json={
                "oid": sha256,
                "size": file_size,
                "upload_id": upload_id,
                "parts": parts
            }
        )
        complete_resp.raise_for_status()
        print("Multipart upload completed")

        # Step 3: Verify with multipart info
        print("Verifying upload...")
        verify_resp = requests.post(
            verify_action["href"],
            json={
                "oid": sha256,
                "size": file_size,
                "upload_id": upload_id,
                "parts": parts
            }
        )
        verify_resp.raise_for_status()
        print("Upload verified")

    else:
        # Single-part upload (file is below the multipart threshold)
        print("Single-part upload...")
        with open(file_path, "rb") as f:
            requests.put(upload_action["href"], data=f).raise_for_status()

        # Verify
        requests.post(
            verify_action["href"],
            json={"oid": sha256, "size": file_size}
        ).raise_for_status()
        print("Upload complete")

    return {"oid": sha256, "size": file_size, "path": repo_path}

# Upload large model file
model_info = upload_large_file_multipart(
    "large_model.safetensors",  # Local file (e.g., 5GB)
    "model.safetensors"         # Path in repo
)

# Upload config (regular file): base64-encode the raw bytes unchanged
with open("config.json", "rb") as f:
    config_b64_str = base64.b64encode(f.read()).decode()

# Create commit
ndjson_lines = [
    json.dumps({"key": "header", "value": {"summary": "Upload large model", "description": "5GB model with config"}}),
    json.dumps({"key": "lfsFile", "value": {
        "path": model_info["path"],
        "oid": model_info["oid"],
        "size": model_info["size"],
        "algo": "sha256"
    }}),
    json.dumps({"key": "file", "value": {
        "path": "config.json",
        "content": config_b64_str,
        "encoding": "base64"
    }})
]

commit_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/commit/main",
    data="\n".join(ndjson_lines),
    headers={**HEADERS, "Content-Type": "application/x-ndjson"}
).json()

print(f"Committed: {commit_resp['commitUrl']}")

Key Points:

Chunk size: 50MB default (configurable)
Parallel uploads: Up to 10 parts concurrently
Progress tracking: Each part reports completion
Resume support: A failed part can be retried on its own without restarting the whole upload (the example above does not implement retries)
ETags: Required for multipart completion
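
Retrying a single failed part can be sketched with a generic wrapper like the one below. The name `with_retries`, the attempt count, and the linear backoff are all assumptions for illustration. Presigned part URLs may also expire, which this sketch does not handle.

```python
import time


def with_retries(func, attempts=3, delay_seconds=0.0):
    """Call func() until it succeeds or attempts run out, then re-raise the last error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                # Linear backoff between attempts: delay, 2 * delay, ...
                time.sleep(delay_seconds * attempt)
    raise last_error
```

In Example 2 this could wrap each part, for instance `executor.submit(with_retries, functools.partial(upload_part, i))`, so that only the failing part is sent again.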

Multipart Upload Details

Thresholds (configurable via environment variables):

# When to use multipart (default: 100MB)
KOHAKU_HUB_LFS_MULTIPART_THRESHOLD_BYTES=104857600

# Size of each part (default: 50MB, min: 5MB)
KOHAKU_HUB_LFS_MULTIPART_CHUNK_SIZE_BYTES=52428800

Multipart Flow:

Batch Request → Server returns part URLs in header object
Upload Parts → PUT each part in parallel, collect ETags
Complete → POST to /lfs/complete/{upload_id} with ETags
Verify → POST to /lfs/verify with upload_id and parts
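
Before uploading, the client has to split the file into byte ranges that match the chunk size returned by the server. The following sketch shows the arithmetic; `plan_parts` is a hypothetical helper, and the sizes are the ones used elsewhere on this page.

```python
import math


def plan_parts(file_size, chunk_size):
    """Return (part_number, offset, length) tuples; part numbers start at 1."""
    num_parts = math.ceil(file_size / chunk_size)
    parts = []
    for number in range(1, num_parts + 1):
        offset = (number - 1) * chunk_size
        # The last part carries whatever remains and may be shorter.
        length = min(chunk_size, file_size - offset)
        parts.append((number, offset, length))
    return parts


parts = plan_parts(5368709120, 52428800)  # 5 GiB file, 50 MiB chunks
print(len(parts))   # 103
print(parts[-1])    # (103, 5347737600, 20971520)
```

Each tuple maps to one `seek` and `read` on the local file followed by a PUT to the URL for that part number.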

Part URL Format:

{
  "header": {
    "chunk_size": "52428800",
    "upload_id": "s3_upload_id_xxx",
    "1": "https://s3.../uploadId=xxx&partNumber=1&...",
    "2": "https://s3.../uploadId=xxx&partNumber=2&...",
    "10": "https://s3.../uploadId=xxx&partNumber=10&..."
  }
}
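
Because the part URLs share the `header` object with `chunk_size` and `upload_id`, and their keys are strings, a client needs to pick out the numeric keys and order them as numbers (a plain string sort would place "10" before "2"). A minimal sketch, with `extract_part_urls` as a hypothetical name:

```python
def extract_part_urls(header):
    """Return [(part_number, url), ...] sorted by numeric part number."""
    parts = [(int(key), url) for key, url in header.items() if key.isdigit()]
    parts.sort()
    return parts


header = {
    "chunk_size": "52428800",
    "upload_id": "s3_upload_id_xxx",
    "10": "https://example.invalid/part-10",  # placeholder URLs for illustration
    "2": "https://example.invalid/part-2",
    "1": "https://example.invalid/part-1",
}
print([number for number, _ in extract_part_urls(header)])  # [1, 2, 10]
```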

ETag Collection:

# From each part upload response
etag = response.headers["ETag"].strip('"')
parts.append({"PartNumber": part_num, "ETag": etag})

Complete Request:

{
  "oid": "sha256_hash",
  "size": 524288000,
  "upload_id": "s3_upload_id",
  "parts": [
    {"PartNumber": 1, "ETag": "etag1"},
    {"PartNumber": 2, "ETag": "etag2"}
  ]
}

Browser File Upload

Upload from Web Browser

Pattern: Same as above, but set is_browser: true in LFS batch request

Why?

Browser uploads need the Content-Type header to be part of the presigned URL
The server includes it automatically when is_browser: true is set

Example:

// LFS batch request from browser
const batchResp = await fetch('/api/models/user/repo.git/info/lfs/objects/batch', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    operation: 'upload',
    transfers: ['basic'],
    objects: [{oid: sha256, size: fileSize}],
    is_browser: true  // Important!
  })
});

const {objects} = await batchResp.json();
const uploadUrl = objects[0].actions.upload.href;

// Upload file (browser automatically adds Content-Type)
await fetch(uploadUrl, {
  method: 'PUT',
  body: file
});

Deduplication

Content-Based Deduplication

How it works:

Client provides SHA256 hash in preupload
Server checks if file with same hash already exists
If exists: shouldIgnore: true (skip upload)
If not exists: shouldIgnore: false (upload required)

Benefits:

Saves bandwidth (no redundant uploads)
Saves storage (same content = same S3 object)
Faster uploads (skip unchanged files)

Example:

{
  "files": [
    {
      "path": "config.json",
      "size": 512,
      "sha256": "abc123...",
      "sample": "eyJtb2RlbCI6..."
    }
  ]
}

Response if duplicate:

{
  "files": [
    {
      "path": "config.json",
      "uploadMode": "regular",
      "shouldIgnore": true
    }
  ]
}

Quota Management

Quota Checks

During preupload:

The total upload size is calculated from all files in the request
The total is checked against the quota of the owning namespace (user or org)
The applicable quota depends on repository privacy (public vs private)

Error if quota exceeded:

{
  "error": "Storage quota exceeded",
  "message": "You have used 9.5 GB of your 10 GB quota. This upload requires 2.5 GB."
}

Status Code: 413 Payload Too Large
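
A client can run a rough version of this check before calling preupload. The sketch below is only an estimate: `quota_shortfall` is a hypothetical helper, how the client learns its current usage and limit is not covered here, and the server's own calculation remains authoritative. The figures are the ones from the error message above, in decimal gigabytes (an assumption).

```python
def quota_shortfall(used_bytes, quota_bytes, files):
    """Return how many bytes the upload would exceed the quota by (0 if it fits)."""
    required = sum(f["size"] for f in files)
    return max(0, used_bytes + required - quota_bytes)


files = [{"path": "model.safetensors", "size": 2_500_000_000}]
print(quota_shortfall(9_500_000_000, 10_000_000_000, files))  # 2000000000
```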

Error Handling

Common Errors

400 Bad Request - File too large for inline:

{
  "error": "File should use LFS (size: 10000000 bytes, threshold: 5000000 bytes)",
  "suggested_operation": "lfsFile"
}

400 Bad Request - LFS object not found:

{
  "error": "LFS object abc123... not found in storage. Upload to S3 may have failed."
}

404 Not Found - Repository:

{
  "error": "Repository not found"
}

403 Forbidden - Permission:

{
  "error": "You don't have write access to this repository"
}

413 Payload Too Large - Quota:

{
  "error": "Storage quota exceeded",
  "message": "..."
}
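
One way a client might turn these responses into a next action is sketched below. The action labels and the function name `classify_commit_error` are hypothetical, and matching on the wording of the error message is brittle. Treat this as an illustration rather than a stable contract.

```python
def classify_commit_error(status_code, body):
    """Map a failed response (status code plus parsed JSON body) to a suggested action."""
    error = body.get("error", "")
    if status_code == 400 and body.get("suggested_operation") == "lfsFile":
        return "retry_as_lfs"
    if status_code == 400 and "not found in storage" in error:
        return "reupload_lfs_object"
    if status_code == 403:
        return "check_permissions"
    if status_code == 404:
        return "check_repository"
    if status_code == 413:
        return "free_quota"
    return "unknown"


print(classify_commit_error(413, {"error": "Storage quota exceeded", "message": "..."}))
# free_quota
```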

Performance Tips

For small repos (<100 files):

Use regular files when possible (< 5MB)
Batch operations in single commit
Enable deduplication with SHA256

For large repos (100+ files):

Always use LFS for files >5MB
Upload LFS files in parallel
Use multipart for files >100MB
Commit frequently (don't batch 1000+ files)
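
Splitting a long list of operations into several commits can be sketched as follows. The batch size of 500 is an arbitrary assumption and `chunk_operations` is a hypothetical helper. Note that atomicity then holds per commit, not across the whole upload.

```python
def chunk_operations(operations, max_per_commit=500):
    """Split a list of commit operations into consecutive batches."""
    if max_per_commit < 1:
        raise ValueError("max_per_commit must be at least 1")
    return [
        operations[start:start + max_per_commit]
        for start in range(0, len(operations), max_per_commit)
    ]


ops = [{"key": "deletedFile", "value": {"path": f"old/{i}.txt"}} for i in range(1200)]
print([len(batch) for batch in chunk_operations(ops)])  # [500, 500, 200]
```

Each batch still needs its own header line when it is serialized into a commit request.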

For CI/CD:

Cache LFS objects locally
Skip unchanged files with deduplication
Use shallow clones
Upload only changed files

Next Steps

Git LFS API - LFS batch protocol details
Branches API - Branch/tag management
Commits API - Commit history
