KohakuHub/docs/api/file-upload.md
2025-10-24 18:45:55 +08:00


File Upload & Commit API

Direct file upload and commit operations using the HuggingFace-compatible NDJSON protocol.


Preupload Check

Check Files Before Upload

Pattern: POST /{repo_type}s/{namespace}/{name}/preupload/{revision}

Authentication: Required (write permission)

Purpose:

  • Determine upload mode (regular vs LFS)
  • Check for duplicate files (content deduplication)
  • Validate quota before upload

Request Body:

{
  "files": [
    {
      "path": "model.safetensors",
      "size": 5368709120,
      "sha256": "abc123def456...",
      "sample": "base64_encoded_first_512_bytes"
    },
    {
      "path": "config.json",
      "size": 512
    }
  ]
}

Field Explanations:

  • path: File path in repository (required)
  • size: File size in bytes (required)
  • sha256: SHA256 hash of file (optional, enables deduplication)
  • sample: Base64-encoded sample of the start of the file, such as its first 512 bytes (optional, for small files)
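
As a minimal sketch of how a client might assemble one entry of the files array, the helper below streams a local file once to collect its size, SHA256, and a base64 sample. The function name build_preupload_entry and the 512-byte sample length are assumptions for illustration (the latter is taken from the placeholder in the request example above), not part of the API.

```python
import base64
import hashlib
import os
import tempfile

SAMPLE_BYTES = 512  # assumption: the sample covers the first 512 bytes


def build_preupload_entry(local_path: str, repo_path: str) -> dict:
    """Build one entry of the preupload "files" array for a local file."""
    digest = hashlib.sha256()
    size = 0
    sample = b""
    with open(local_path, "rb") as f:
        # Stream in 1 MiB chunks so large files are never fully in memory.
        while chunk := f.read(1024 * 1024):
            if size == 0:
                sample = chunk[:SAMPLE_BYTES]
            digest.update(chunk)
            size += len(chunk)
    return {
        "path": repo_path,
        "size": size,
        "sha256": digest.hexdigest(),
        "sample": base64.b64encode(sample).decode("ascii"),
    }


# Demo with a throwaway file
with tempfile.NamedTemporaryFile(delete=False, suffix=".json") as tmp:
    tmp.write(b'{"model": "bert"}')
entry = build_preupload_entry(tmp.name, "config.json")
os.unlink(tmp.name)
print(entry["path"], entry["size"], entry["sample"])
```

Computing the hash while reading avoids a second pass over multi-gigabyte files.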

Response:

{
  "files": [
    {
      "path": "model.safetensors",
      "uploadMode": "lfs",
      "shouldIgnore": false
    },
    {
      "path": "config.json",
      "uploadMode": "regular",
      "shouldIgnore": true
    }
  ]
}

Upload Modes:

  • "lfs": File matches LFS criteria (size ≥ threshold OR suffix match)
  • "regular": File is small enough for inline base64 upload

Should Ignore:

  • true: File with same content already exists (skip upload)
  • false: File is new or changed (upload required)
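
The uploadMode and shouldIgnore fields together decide what the client does with each file. A small sketch of that decision, using the sample response above as input (plan_uploads is a hypothetical helper name):

```python
def plan_uploads(preupload_response: dict) -> dict:
    """Group file paths by what the client should do with them."""
    plan = {"lfs": [], "regular": [], "skipped": []}
    for entry in preupload_response["files"]:
        if entry["shouldIgnore"]:
            # Same content already exists, so no upload is needed.
            plan["skipped"].append(entry["path"])
        else:
            # uploadMode is either "lfs" or "regular".
            plan[entry["uploadMode"]].append(entry["path"])
    return plan


response = {
    "files": [
        {"path": "model.safetensors", "uploadMode": "lfs", "shouldIgnore": False},
        {"path": "config.json", "uploadMode": "regular", "shouldIgnore": True},
    ]
}
plan = plan_uploads(response)
print(plan)
```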

Status Codes:

  • 200 OK - Success
  • 400 Bad Request - Invalid payload
  • 404 Not Found - Repository not found
  • 413 Payload Too Large - Quota exceeded

Commit Operation

Create Commit with Multiple File Operations

Pattern: POST /{repo_type}s/{namespace}/{name}/commit/{revision}

Authentication: Required (write permission)

Content-Type: application/x-ndjson or application/json

Purpose: Atomic commit with multiple file operations (add/modify/delete/copy)

Request Format:

NDJSON (Newline-Delimited JSON) - one JSON object per line:

{"key": "header", "value": {"summary": "Update model", "description": "Improved accuracy"}}
{"key": "file", "value": {"path": "config.json", "content": "base64_content", "encoding": "base64"}}
{"key": "lfsFile", "value": {"path": "model.safetensors", "oid": "sha256_hash", "size": 5368709120, "algo": "sha256"}}
{"key": "deletedFile", "value": {"path": "old_file.txt"}}
{"key": "deletedFolder", "value": {"path": "old_folder/"}}
{"key": "copyFile", "value": {"path": "new_location.txt", "srcPath": "source.txt", "srcRevision": "main"}}

Operation Types

1. Header (Required)

The first line must be the header:

{
  "key": "header",
  "value": {
    "summary": "Commit message",
    "description": "Optional detailed description"
  }
}
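
Because the header has to come first, it can help to build the NDJSON body through one function that always emits it before any operation. This is a sketch only; build_commit_payload and its (key, value) tuple convention are assumptions, not part of the API.

```python
import json


def build_commit_payload(summary: str, operations: list, description: str = "") -> str:
    """Serialize a header plus (key, value) operations as NDJSON."""
    header = {"key": "header", "value": {"summary": summary, "description": description}}
    lines = [json.dumps(header)]
    for key, value in operations:
        # json.dumps never emits raw newlines, so each object stays on one line.
        lines.append(json.dumps({"key": key, "value": value}))
    return "\n".join(lines)


payload = build_commit_payload(
    "Remove stale file",
    [("deletedFile", {"path": "old_file.txt"})],
)
print(payload)
```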

2. Regular File (Inline Base64)

For files < LFS threshold:

{
  "key": "file",
  "value": {
    "path": "config.json",
    "content": "eyJtb2RlbCI6ICJiZXJ0In0=",
    "encoding": "base64"
  }
}

Rules:

  • File size MUST be < LFS threshold
  • Content is base64 encoded
  • Encoding must be "base64"
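
A client can enforce these rules before sending anything. The sketch below assumes a threshold of 5,000,000 bytes, which is only the value shown in the sample error in this section; the server's configured threshold may differ, and the preupload response remains the authoritative source. make_file_operation is a hypothetical helper.

```python
import base64

LFS_THRESHOLD_BYTES = 5_000_000  # assumption: value from the sample error message


def make_file_operation(repo_path: str, content: bytes,
                        threshold: int = LFS_THRESHOLD_BYTES) -> dict:
    """Build a "file" operation, refusing content that should go through LFS."""
    if len(content) >= threshold:
        raise ValueError(
            f"{repo_path} is {len(content)} bytes; use an 'lfsFile' operation instead"
        )
    return {
        "key": "file",
        "value": {
            "path": repo_path,
            "content": base64.b64encode(content).decode("ascii"),
            "encoding": "base64",
        },
    }


op = make_file_operation("config.json", b'{"model": "bert"}')
print(op["value"]["content"])
```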

Error if file too large:

{
  "error": "File config.json should use LFS (size: 10000000 bytes, threshold: 5000000 bytes). Use 'lfsFile' operation instead.",
  "file_size": 10000000,
  "lfs_threshold": 5000000,
  "suggested_operation": "lfsFile"
}

3. LFS File (Already Uploaded to S3)

For files uploaded via LFS batch API:

{
  "key": "lfsFile",
  "value": {
    "path": "model.safetensors",
    "oid": "abc123def456789...",
    "size": 5368709120,
    "algo": "sha256"
  }
}

Prerequisites:

  1. File must be uploaded to S3 via LFS batch API first
  2. OID (SHA256) must match uploaded file
  3. Size must match actual file size

Server validates:

  • File exists in S3 at lfs/{oid[:2]}/{oid[2:4]}/{oid}
  • Size matches S3 object size
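
The key layout shards objects by the first four hex characters of the OID. A short sketch of that mapping (lfs_object_key is a hypothetical name, and the hex validation is an added safety check rather than documented server behavior):

```python
import hashlib


def lfs_object_key(oid: str) -> str:
    """Return the S3 key for an LFS object, following lfs/{oid[:2]}/{oid[2:4]}/{oid}."""
    if len(oid) != 64 or any(c not in "0123456789abcdef" for c in oid):
        raise ValueError("oid must be a 64-character lowercase hex SHA256 digest")
    return f"lfs/{oid[:2]}/{oid[2:4]}/{oid}"


oid = hashlib.sha256(b"").hexdigest()
key = lfs_object_key(oid)
print(key)
```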

4. Delete File

Remove a single file:

{
  "key": "deletedFile",
  "value": {
    "path": "old_model.bin"
  }
}

Behavior:

  • Marks file as deleted in database (soft delete)
  • Removes file from LakeFS branch
  • Preserves LFS history for quota tracking

5. Delete Folder

Remove all files in a folder recursively:

{
  "key": "deletedFolder",
  "value": {
    "path": "old_experiments/"
  }
}

Behavior:

  • Lists all files under folder recursively
  • Deletes each file in parallel
  • Marks files as deleted in database

6. Copy File

Copy a file from the same or a different revision:

{
  "key": "copyFile",
  "value": {
    "path": "backup/model.safetensors",
    "srcPath": "model.safetensors",
    "srcRevision": "main"
  }
}

Fields:

  • path: Destination path
  • srcPath: Source file path
  • srcRevision: Source revision (branch or commit, defaults to current revision)

Behavior:

  • Links physical S3 address (no duplication)
  • Copies database metadata
  • Works for both regular and LFS files

Response

Success:

{
  "commitUrl": "http://localhost:28080/username/my-model/commit/abc123def",
  "commitOid": "abc123def456789...",
  "pullRequestUrl": null
}

No Changes:

{
  "commitUrl": "http://localhost:28080/username/my-model/commit/previous_commit",
  "commitOid": "previous_commit_hash",
  "pullRequestUrl": null
}

Complete Upload Workflows

Example 1: Upload Large Model with Config (Single-Part LFS)

import requests
import base64
import hashlib
import json

API_BASE = "http://localhost:28080/api"
REPO_ID = "username/my-model"
TOKEN = "your_token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Step 1: Preupload check
files_info = [
    {"path": "config.json", "size": 512},
    {"path": "model.safetensors", "size": 52428800, "sha256": "abc123..."}  # 50MB
]

preupload_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/preupload/main",
    json={"files": files_info},
    headers=HEADERS
).json()

# Step 2: Upload files based on preupload response
lfs_files = []
regular_files = []

for file_info, preupload in zip(files_info, preupload_resp["files"]):
    if preupload["shouldIgnore"]:
        continue  # File already exists, skip

    if preupload["uploadMode"] == "lfs":
        # Upload via LFS batch API
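        # Hash in chunks so large files are not loaded into memory at once
        digest = hashlib.sha256()
        with open(file_info["path"], "rb") as f:
            while chunk := f.read(1024 * 1024):
                digest.update(chunk)
        sha256 = digest.hexdigest()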

        # LFS batch request
        batch_resp = requests.post(
            f"{API_BASE}/{REPO_ID}.git/info/lfs/objects/batch",
            json={
                "operation": "upload",
                "transfers": ["basic"],
                "objects": [{"oid": sha256, "size": file_info["size"]}]
            },
            headers=HEADERS
        ).json()

        obj = batch_resp["objects"][0]
        if "actions" not in obj:
            # File already exists in LFS
            lfs_files.append({"path": file_info["path"], "oid": sha256, "size": file_info["size"]})
            continue

        # Single-part upload to S3 (streamed from disk)
        with open(file_info["path"], "rb") as f:
            requests.put(obj["actions"]["upload"]["href"], data=f).raise_for_status()

        # Verify
        requests.post(
            obj["actions"]["verify"]["href"],
            json={"oid": sha256, "size": file_info["size"]}
        ).raise_for_status()

        lfs_files.append({"path": file_info["path"], "oid": sha256, "size": file_info["size"]})

    else:
        # Regular file (base64)
        with open(file_info["path"], "rb") as f:
            content_b64 = base64.b64encode(f.read()).decode()

        regular_files.append({
            "path": file_info["path"],
            "content": content_b64,
            "encoding": "base64"
        })

# Step 3: Create commit with all operations
ndjson_lines = [
    json.dumps({"key": "header", "value": {"summary": "Upload model", "description": "Initial upload"}})
]

for f in regular_files:
    ndjson_lines.append(json.dumps({"key": "file", "value": f}))

for f in lfs_files:
    ndjson_lines.append(json.dumps({"key": "lfsFile", "value": {
        "path": f["path"],
        "oid": f["oid"],
        "size": f["size"],
        "algo": "sha256"
    }}))

ndjson_payload = "\n".join(ndjson_lines)

commit_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/commit/main",
    data=ndjson_payload,
    headers={**HEADERS, "Content-Type": "application/x-ndjson"}
).json()

print(f"Committed: {commit_resp['commitUrl']}")

Example 2: Upload Very Large Model (Multipart LFS)

For files ≥ 100MB (default multipart threshold):

import requests
import base64
import hashlib
import json
import math
from concurrent.futures import ThreadPoolExecutor, as_completed

API_BASE = "http://localhost:28080/api"
REPO_ID = "username/my-model"
TOKEN = "your_token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def upload_large_file_multipart(file_path, repo_path):
    """Upload large file using LFS multipart protocol"""

    # Calculate SHA256
    print(f"Calculating SHA256 for {file_path}...")
    sha256_hash = hashlib.sha256()
    file_size = 0
    with open(file_path, "rb") as f:
        while chunk := f.read(8192):
            sha256_hash.update(chunk)
            file_size += len(chunk)

    sha256 = sha256_hash.hexdigest()
    print(f"SHA256: {sha256}, Size: {file_size:,} bytes")

    # Step 1: LFS batch request
    print("Requesting LFS batch upload URLs...")
    batch_resp = requests.post(
        f"{API_BASE}/{REPO_ID}.git/info/lfs/objects/batch",
        json={
            "operation": "upload",
            "transfers": ["basic"],
            "objects": [{"oid": sha256, "size": file_size}],
            "hash_algo": "sha256"
        },
        headers=HEADERS
    ).json()

    obj = batch_resp["objects"][0]

    # Check if file already exists
    if "actions" not in obj:
        print("File already exists in LFS storage (deduplication)")
        return {"oid": sha256, "size": file_size, "path": repo_path}

    upload_action = obj["actions"]["upload"]
    verify_action = obj["actions"]["verify"]

    # Check if multipart
    if "header" in upload_action and "chunk_size" in upload_action["header"]:
        # Multipart upload
        header = upload_action["header"]
        chunk_size = int(header["chunk_size"])
        upload_id = header["upload_id"]

        print(f"Multipart upload: chunk_size={chunk_size:,} bytes")

        # Calculate number of parts
        num_parts = math.ceil(file_size / chunk_size)
        print(f"Uploading {num_parts} parts in parallel...")

        # Upload parts in parallel
        parts = []

        def upload_part(part_number):
            """Upload a single part"""
            part_url = header[str(part_number)]

            # Read chunk
            with open(file_path, "rb") as f:
                f.seek((part_number - 1) * chunk_size)
                chunk = f.read(chunk_size)

            # Upload
            resp = requests.put(part_url, data=chunk)
            resp.raise_for_status()

            # Extract ETag (remove quotes if present)
            etag = resp.headers["ETag"].strip('"')

            print(f"  Part {part_number}/{num_parts} uploaded (ETag: {etag[:8]}...)")
            return {"PartNumber": part_number, "ETag": etag}

        # Upload parts concurrently (max 10 parallel)
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [executor.submit(upload_part, i) for i in range(1, num_parts + 1)]

            for future in as_completed(futures):
                parts.append(future.result())

        # Sort parts by part number
        parts.sort(key=lambda p: p["PartNumber"])

        # Step 2: Complete multipart upload
        print("Completing multipart upload...")
        complete_resp = requests.post(
            f"{API_BASE}/{REPO_ID}.git/info/lfs/complete/{upload_id}",
            json={
                "oid": sha256,
                "size": file_size,
                "upload_id": upload_id,
                "parts": parts
            }
        )
        complete_resp.raise_for_status()
        print("Multipart upload completed")

        # Step 3: Verify with multipart info
        print("Verifying upload...")
        verify_resp = requests.post(
            verify_action["href"],
            json={
                "oid": sha256,
                "size": file_size,
                "upload_id": upload_id,
                "parts": parts
            }
        )
        verify_resp.raise_for_status()
        print("Upload verified")

    else:
        # Single-part upload (< 100MB)
        print("Single-part upload...")
        with open(file_path, "rb") as f:
            requests.put(upload_action["href"], data=f).raise_for_status()

        # Verify
        requests.post(
            verify_action["href"],
            json={"oid": sha256, "size": file_size}
        ).raise_for_status()
        print("Upload complete")

    return {"oid": sha256, "size": file_size, "path": repo_path}

# Upload large model file
model_info = upload_large_file_multipart(
    "large_model.safetensors",  # Local file (e.g., 5GB)
    "model.safetensors"         # Path in repo
)

# Upload config (regular file): base64-encode the raw file bytes
with open("config.json", "rb") as f:
    config_b64_str = base64.b64encode(f.read()).decode()

# Create commit
ndjson_lines = [
    json.dumps({"key": "header", "value": {"summary": "Upload large model", "description": "5GB model with config"}}),
    json.dumps({"key": "lfsFile", "value": {
        "path": model_info["path"],
        "oid": model_info["oid"],
        "size": model_info["size"],
        "algo": "sha256"
    }}),
    json.dumps({"key": "file", "value": {
        "path": "config.json",
        "content": config_b64_str,
        "encoding": "base64"
    }})
]

commit_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/commit/main",
    data="\n".join(ndjson_lines),
    headers={**HEADERS, "Content-Type": "application/x-ndjson"}
).json()

print(f"Committed: {commit_resp['commitUrl']}")

Key Points:

  • Chunk size: 50MB default (configurable)
  • Parallel uploads: Up to 10 parts concurrently
  • Progress tracking: Each part reports completion
  • Resume support: Can retry failed parts without restarting
  • ETags: Required for multipart completion
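
Example 2 does not itself retry a failed part. One way to add that, sketched here under the assumption that each part upload is wrapped in a callable like upload_part, is a small retry helper with exponential backoff. The names upload_with_retry and flaky_upload are illustrative, and the simulated failure stands in for a real network error.

```python
import time


def upload_with_retry(upload_part, part_number: int,
                      attempts: int = 3, base_delay: float = 0.5) -> dict:
    """Call upload_part(part_number), retrying with exponential backoff."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return upload_part(part_number)
        except Exception as exc:  # in real code, catch requests.RequestException
            last_error = exc
            if attempt < attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"part {part_number} failed after {attempts} attempts"
    ) from last_error


# Demo: a stand-in uploader that fails twice, then succeeds.
calls = {"n": 0}


def flaky_upload(part_number: int) -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return {"PartNumber": part_number, "ETag": "etag-demo"}


result = upload_with_retry(flaky_upload, 1, base_delay=0.0)
print(result, "after", calls["n"], "calls")
```

Only the failed part is re-sent; parts that already returned an ETag are left alone.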

Multipart Upload Details

Thresholds (configurable via environment variables):

# When to use multipart (default: 100MB)
KOHAKU_HUB_LFS_MULTIPART_THRESHOLD_BYTES=104857600

# Size of each part (default: 50MB, min: 5MB)
KOHAKU_HUB_LFS_MULTIPART_CHUNK_SIZE_BYTES=52428800
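
A test script or tool that needs to mirror these settings could read the same variables with the documented defaults. This is a sketch of such a reader, not the server's own configuration code; load_multipart_settings and uses_multipart are hypothetical names.

```python
import os

DEFAULT_MULTIPART_THRESHOLD = 104_857_600  # 100MB default from above
DEFAULT_CHUNK_SIZE = 52_428_800            # 50MB default from above
MIN_CHUNK_SIZE = 5 * 1024 * 1024           # 5MB minimum from above


def load_multipart_settings(env=None) -> tuple:
    """Return (threshold_bytes, chunk_size_bytes) from the environment."""
    env = os.environ if env is None else env
    threshold = int(env.get("KOHAKU_HUB_LFS_MULTIPART_THRESHOLD_BYTES",
                            DEFAULT_MULTIPART_THRESHOLD))
    chunk_size = int(env.get("KOHAKU_HUB_LFS_MULTIPART_CHUNK_SIZE_BYTES",
                             DEFAULT_CHUNK_SIZE))
    if chunk_size < MIN_CHUNK_SIZE:
        raise ValueError("chunk size must be at least 5MB")
    return threshold, chunk_size


def uses_multipart(file_size: int, threshold: int) -> bool:
    """Files at or above the threshold use multipart upload."""
    return file_size >= threshold


settings = load_multipart_settings({})  # empty mapping, so defaults apply
print(settings, uses_multipart(5_368_709_120, settings[0]))
```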

Multipart Flow:

  1. Batch Request → Server returns part URLs in header object
  2. Upload Parts → PUT each part in parallel, collect ETags
  3. Complete → POST to /lfs/complete/{upload_id} with ETags
  4. Verify → POST to /lfs/verify with upload_id and parts
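
Step 2 depends on slicing the file into parts of chunk_size bytes, with a shorter final part. The arithmetic can be sketched as follows (part_ranges is a hypothetical helper; integer ceiling division is used to avoid float rounding on very large files):

```python
def part_ranges(file_size: int, chunk_size: int) -> list:
    """Return (part_number, offset, length) for each part, numbered from 1."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    num_parts = -(-file_size // chunk_size)  # ceiling division on integers
    ranges = []
    for part_number in range(1, num_parts + 1):
        offset = (part_number - 1) * chunk_size
        length = min(chunk_size, file_size - offset)
        ranges.append((part_number, offset, length))
    return ranges


ranges = part_ranges(120_000_000, 52_428_800)
for part_number, offset, length in ranges:
    print(part_number, offset, length)
```

Each tuple maps directly onto a seek(offset) followed by read(length) when uploading that part.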

Part URL Format:

{
  "header": {
    "chunk_size": "52428800",
    "upload_id": "s3_upload_id_xxx",
    "1": "https://s3.../uploadId=xxx&partNumber=1&...",
    "2": "https://s3.../uploadId=xxx&partNumber=2&...",
    "10": "https://s3.../uploadId=xxx&partNumber=10&..."
  }
}
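
The header object mixes metadata keys with part numbers stored as strings, so plain string sorting would place "10" before "2". A sketch of pulling the part URLs out in numeric order (extract_part_urls is a hypothetical helper, and the example.invalid URLs are placeholders):

```python
def extract_part_urls(header: dict) -> list:
    """Return part URLs ordered by part number (index 0 holds part 1)."""
    numbered = {int(k): v for k, v in header.items() if k.isdigit()}
    expected = list(range(1, len(numbered) + 1))
    if sorted(numbered) != expected:
        raise ValueError("part numbers are not contiguous starting at 1")
    return [numbered[n] for n in expected]


header = {"chunk_size": "52428800", "upload_id": "demo-upload-id"}
# Placeholder URLs for 11 parts; real ones are presigned S3 URLs.
header.update({str(n): f"https://example.invalid/part/{n}" for n in range(1, 12)})

urls = extract_part_urls(header)
print(len(urls), urls[1], urls[9])
```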

ETag Collection:

# From each part upload response
etag = response.headers["ETag"].strip('"')
parts.append({"PartNumber": part_num, "ETag": etag})

Complete Request:

{
  "oid": "sha256_hash",
  "size": 524288000,
  "upload_id": "s3_upload_id",
  "parts": [
    {"PartNumber": 1, "ETag": "etag1"},
    {"PartNumber": 2, "ETag": "etag2"}
  ]
}

Browser File Upload

Upload from Web Browser

Pattern: Same flow as above, but set is_browser: true in the LFS batch request

Why?

  • Browser uploads need Content-Type header in presigned URL
  • Server includes it automatically when is_browser: true

Example:

// LFS batch request from browser
const batchResp = await fetch('/api/models/user/repo.git/info/lfs/objects/batch', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    operation: 'upload',
    transfers: ['basic'],
    objects: [{oid: sha256, size: fileSize}],
    is_browser: true  // Important!
  })
});

const {objects} = await batchResp.json();
const uploadUrl = objects[0].actions.upload.href;

// Upload file (browser automatically adds Content-Type)
await fetch(uploadUrl, {
  method: 'PUT',
  body: file
});

Deduplication

Content-Based Deduplication

How it works:

  1. Client provides SHA256 hash in preupload
  2. Server checks if file with same hash already exists
  3. If exists: shouldIgnore: true (skip upload)
  4. If not exists: shouldIgnore: false (upload required)

Benefits:

  • Saves bandwidth (no redundant uploads)
  • Saves storage (same content = same S3 object)
  • Faster uploads (skip unchanged files)

Example:

{
  "files": [
    {
      "path": "config.json",
      "size": 512,
      "sha256": "abc123...",
      "sample": "eyJtb2RlbCI6..."
    }
  ]
}

Response if duplicate:

{
  "files": [
    {
      "path": "config.json",
      "uploadMode": "regular",
      "shouldIgnore": true
    }
  ]
}

Quota Management

Quota Checks

During preupload:

  • Total upload size calculated from all files
  • Quota checked against namespace (user or org)
  • Based on repository privacy (public vs private)

Error if quota exceeded:

{
  "error": "Storage quota exceeded",
  "message": "You have used 9.5 GB of your 10 GB quota. This upload requires 2.5 GB."
}

Status Code: 413 Payload Too Large
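
The numbers in the sample message can be reproduced with simple arithmetic. The sketch below assumes decimal gigabytes and a single flat quota per namespace; the server's real accounting (public versus private quotas, deduplicated objects) is not described here, so treat quota_shortfall as an illustration only.

```python
GB = 1000 ** 3  # assumption: decimal gigabytes; the server may count differently


def quota_shortfall(used_bytes: int, limit_bytes: int, upload_bytes: int) -> int:
    """Return how many bytes the upload exceeds the remaining quota by (0 if it fits)."""
    remaining = limit_bytes - used_bytes
    return max(0, upload_bytes - remaining)


# Values from the sample error: 9.5 GB used, 10 GB quota, 2.5 GB upload.
shortfall = quota_shortfall(9_500_000_000, 10 * GB, 2_500_000_000)
print(f"{shortfall / GB:.1f} GB over quota")
```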


Error Handling

Common Errors

400 Bad Request - File too large for inline:

{
  "error": "File should use LFS (size: 10000000 bytes, threshold: 5000000 bytes)",
  "suggested_operation": "lfsFile"
}

400 Bad Request - LFS object not found:

{
  "error": "LFS object abc123... not found in storage. Upload to S3 may have failed."
}

404 Not Found - Repository:

{
  "error": "Repository not found"
}

403 Forbidden - Permission:

{
  "error": "You don't have write access to this repository"
}

413 Payload Too Large - Quota:

{
  "error": "Storage quota exceeded",
  "message": "..."
}
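
One way a client might react to these responses is to map each status code and body to a follow-up action. The action labels and the next_action helper below are assumptions for illustration; matching on the error text in particular is brittle and only mirrors the sample messages above.

```python
def next_action(status_code: int, body: dict) -> str:
    """Map an error response to a suggested client action (illustrative labels)."""
    error = body.get("error", "")
    if status_code == 400 and body.get("suggested_operation") == "lfsFile":
        return "resubmit-as-lfs"
    if status_code == 400 and "not found in storage" in error:
        return "reupload-lfs-object"
    if status_code == 403:
        return "check-token-permissions"
    if status_code == 404:
        return "check-repository-id"
    if status_code == 413:
        return "free-quota-or-shrink-upload"
    return "abort"


samples = [
    (400, {"error": "File should use LFS (size: 10000000 bytes, threshold: 5000000 bytes)",
           "suggested_operation": "lfsFile"}),
    (400, {"error": "LFS object abc123... not found in storage. Upload to S3 may have failed."}),
    (413, {"error": "Storage quota exceeded", "message": "..."}),
]
actions = [next_action(code, body) for code, body in samples]
print(actions)
```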

Performance Tips

For small repos (<100 files):

  • Use regular files when possible (< 5MB)
  • Batch operations in single commit
  • Enable deduplication with SHA256

For large repos (100+ files):

  • Always use LFS for files >5MB
  • Upload LFS files in parallel
  • Use multipart for files >100MB
  • Commit frequently (don't batch 1000+ files)
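
Splitting a long list of operations into several commits can be sketched as below. The batch size of 200 is an arbitrary assumption, and batch_operations is a hypothetical helper; note that each batch becomes its own commit, so atomicity only holds within a batch.

```python
def batch_operations(operations: list, batch_size: int = 200) -> list:
    """Split operations into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        operations[i:i + batch_size]
        for i in range(0, len(operations), batch_size)
    ]


ops = [("file", {"path": f"data/{i}.txt"}) for i in range(450)]
batches = batch_operations(ops)
print([len(b) for b in batches])  # [200, 200, 50]
```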

For CI/CD:

  • Cache LFS objects locally
  • Skip unchanged files with deduplication
  • Use shallow clones
  • Upload only changed files

Next Steps