File Upload & Commit API
Direct file upload and commit operations using HuggingFace-compatible NDJSON protocol.
Preupload Check
Check Files Before Upload
Pattern: POST /{repo_type}s/{namespace}/{name}/preupload/{revision}
Authentication: Required (write permission)
Purpose:
- Determine upload mode (regular vs LFS)
- Check for duplicate files (content deduplication)
- Validate quota before upload
Request Body:
{
"files": [
{
"path": "model.safetensors",
"size": 5368709120,
"sha256": "abc123def456...",
"sample": "base64_encoded_first_512_bytes"
},
{
"path": "config.json",
"size": 512
}
]
}
Field Explanations:
- path: File path in repository (required)
- size: File size in bytes (required)
- sha256: SHA256 hash of file (optional, enables deduplication)
- sample: Base64-encoded sample of file content (optional, for small files)
Response:
{
"files": [
{
"path": "model.safetensors",
"uploadMode": "lfs",
"shouldIgnore": false
},
{
"path": "config.json",
"uploadMode": "regular",
"shouldIgnore": true
}
]
}
Upload Modes:
- "lfs": File matches LFS criteria (size ≥ threshold OR suffix match)
- "regular": File is small enough for inline base64 upload
Should Ignore:
- true: File with same content already exists (skip upload)
- false: File is new or changed (upload required)
Status Codes:
- 200 OK - Success
- 400 Bad Request - Invalid payload
- 404 Not Found - Repository not found
- 413 Payload Too Large - Quota exceeded
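The request fields above can be computed from a local file in one streaming pass. The following is a minimal sketch, not the reference client: the helper name `build_preupload_entry` is hypothetical, and the 512-byte sample length is an assumption taken from the `base64_encoded_first_512_bytes` placeholder in the request example.

```python
import base64
import hashlib

SAMPLE_BYTES = 512  # assumption: taken from the "first 512 bytes" placeholder above


def build_preupload_entry(local_path: str, repo_path: str) -> dict:
    """Build one entry of the preupload "files" list (hypothetical helper)."""
    digest = hashlib.sha256()
    sample = b""
    size = 0
    with open(local_path, "rb") as f:
        # Stream in 1 MiB chunks so large files are never fully held in memory.
        while chunk := f.read(1024 * 1024):
            if len(sample) < SAMPLE_BYTES:
                sample += chunk[: SAMPLE_BYTES - len(sample)]
            digest.update(chunk)
            size += len(chunk)
    return {
        "path": repo_path,
        "size": size,
        "sha256": digest.hexdigest(),
        "sample": base64.b64encode(sample).decode("ascii"),
    }
```

The resulting dicts can be passed as the `files` list of the preupload request body.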
Commit Operation
Create Commit with Multiple File Operations
Pattern: POST /{repo_type}s/{namespace}/{name}/commit/{revision}
Authentication: Required (write permission)
Content-Type: application/x-ndjson or application/json
Purpose: Atomic commit with multiple file operations (add/modify/delete/copy)
Request Format:
NDJSON (Newline-Delimited JSON) - one JSON object per line:
{"key": "header", "value": {"summary": "Update model", "description": "Improved accuracy"}}
{"key": "file", "value": {"path": "config.json", "content": "base64_content", "encoding": "base64"}}
{"key": "lfsFile", "value": {"path": "model.safetensors", "oid": "sha256_hash", "size": 5368709120, "algo": "sha256"}}
{"key": "deletedFile", "value": {"path": "old_file.txt"}}
{"key": "deletedFolder", "value": {"path": "old_folder/"}}
{"key": "copyFile", "value": {"path": "new_location.txt", "srcPath": "source.txt", "srcRevision": "main"}}
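A payload in this format can be assembled with the standard `json` module. This is a small sketch; `build_ndjson` is a hypothetical helper name, and it relies only on the key/value line layout shown above. `json.dumps` escapes newlines inside strings, so each operation stays on a single line.

```python
import json


def build_ndjson(summary: str, description: str, operations: list[tuple[str, dict]]) -> str:
    """Serialize a header plus (key, value) operations as NDJSON (hypothetical helper)."""
    header = {"key": "header", "value": {"summary": summary, "description": description}}
    lines = [json.dumps(header)]  # the header must be the first line
    for key, value in operations:
        lines.append(json.dumps({"key": key, "value": value}))
    return "\n".join(lines)


payload = build_ndjson(
    "Update model",
    "Improved accuracy",
    [
        ("deletedFile", {"path": "old_file.txt"}),
        ("copyFile", {"path": "new_location.txt", "srcPath": "source.txt", "srcRevision": "main"}),
    ],
)
```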
Operation Types
1. Header (Required)
First line must be header:
{
"key": "header",
"value": {
"summary": "Commit message",
"description": "Optional detailed description"
}
}
2. Regular File (Inline Base64)
For files < LFS threshold:
{
"key": "file",
"value": {
"path": "config.json",
"content": "eyJtb2RlbCI6ICJiZXJ0In0=",
"encoding": "base64"
}
}
Rules:
- File size MUST be < LFS threshold
- Content is base64 encoded
- Encoding must be "base64"
Error if file too large:
{
"error": "File config.json should use LFS (size: 10000000 bytes, threshold: 5000000 bytes). Use 'lfsFile' operation instead.",
"file_size": 10000000,
"lfs_threshold": 5000000,
"suggested_operation": "lfsFile"
}
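A client can apply the same size rule before sending the commit. The sketch below assumes a threshold of 5,000,000 bytes, which is only the value shown in the error example above; the real threshold is server-side configuration. The helper name `inline_file_operation` is hypothetical.

```python
import base64

LFS_THRESHOLD_BYTES = 5_000_000  # assumption: value from the error example above


def inline_file_operation(path: str, content: bytes, threshold: int = LFS_THRESHOLD_BYTES) -> dict:
    """Build a "file" operation, refusing content that should go through LFS."""
    if len(content) >= threshold:
        raise ValueError(
            f"{path} is {len(content)} bytes; use 'lfsFile' (threshold: {threshold} bytes)"
        )
    return {
        "key": "file",
        "value": {
            "path": path,
            "content": base64.b64encode(content).decode("ascii"),
            "encoding": "base64",
        },
    }
```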
3. LFS File (Already Uploaded to S3)
For files uploaded via LFS batch API:
{
"key": "lfsFile",
"value": {
"path": "model.safetensors",
"oid": "abc123def456789...",
"size": 5368709120,
"algo": "sha256"
}
}
Prerequisites:
- File must be uploaded to S3 via LFS batch API first
- OID (SHA256) must match uploaded file
- Size must match actual file size
Server validates:
- File exists in S3 at lfs/{oid[:2]}/{oid[2:4]}/{oid}
- Size matches S3 object size
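The storage key pattern above can be reproduced with simple slicing. A minimal sketch; the function name `lfs_object_key` and the 64-character hex validation are assumptions added for illustration.

```python
import string


def lfs_object_key(oid: str) -> str:
    """Return the S3 key for an LFS object, following lfs/{oid[:2]}/{oid[2:4]}/{oid}."""
    # Assumption: the OID is a lowercase hex SHA256 digest (64 characters).
    if len(oid) != 64 or any(c not in string.hexdigits for c in oid):
        raise ValueError("oid must be a 64-character hex SHA256 digest")
    return f"lfs/{oid[:2]}/{oid[2:4]}/{oid}"
```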
4. Delete File
Remove a single file:
{
"key": "deletedFile",
"value": {
"path": "old_model.bin"
}
}
Behavior:
- Marks file as deleted in database (soft delete)
- Removes file from LakeFS branch
- Preserves LFS history for quota tracking
5. Delete Folder
Remove all files in a folder recursively:
{
"key": "deletedFolder",
"value": {
"path": "old_experiments/"
}
}
Behavior:
- Lists all files under folder recursively
- Deletes each file in parallel
- Marks files as deleted in database
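To illustrate which paths a recursive folder delete would cover, here is a sketch of prefix matching over a list of repository paths. It is not the server's implementation; `files_under_folder` is a hypothetical name. Normalizing the trailing slash keeps a sibling such as `old_experiments_v2/` from being matched by accident.

```python
def files_under_folder(folder: str, all_paths: list[str]) -> list[str]:
    """Return every path located under `folder`, at any depth."""
    prefix = folder.rstrip("/") + "/"  # "old_experiments" and "old_experiments/" behave the same
    return sorted(p for p in all_paths if p.startswith(prefix))
```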
6. Copy File
Copy file from same or different revision:
{
"key": "copyFile",
"value": {
"path": "backup/model.safetensors",
"srcPath": "model.safetensors",
"srcRevision": "main"
}
}
Fields:
- path: Destination path
- srcPath: Source file path
- srcRevision: Source revision (branch or commit, defaults to current revision)
Behavior:
- Links physical S3 address (no duplication)
- Copies database metadata
- Works for both regular and LFS files
Response
Success:
{
"commitUrl": "http://localhost:28080/username/my-model/commit/abc123def",
"commitOid": "abc123def456789...",
"pullRequestUrl": null
}
No Changes:
{
"commitUrl": "http://localhost:28080/username/my-model/commit/previous_commit",
"commitOid": "previous_commit_hash",
"pullRequestUrl": null
}
Complete Upload Workflows
Example 1: Upload Large Model with Config (Single-Part LFS)
import base64
import hashlib
import json

import requests

API_BASE = "http://localhost:28080/api"
REPO_ID = "username/my-model"
TOKEN = "your_token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Step 1: Preupload check
files_info = [
    {"path": "config.json", "size": 512},
    {"path": "model.safetensors", "size": 52428800, "sha256": "abc123..."},  # 50MB
]
preupload_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/preupload/main",
    json={"files": files_info},
    headers=HEADERS,
).json()

# Step 2: Upload files based on preupload response
lfs_files = []
regular_files = []
for file_info, preupload in zip(files_info, preupload_resp["files"]):
    if preupload["shouldIgnore"]:
        continue  # File already exists, skip

    if preupload["uploadMode"] == "lfs":
        # Upload via LFS batch API
        with open(file_info["path"], "rb") as f:
            content = f.read()
        sha256 = hashlib.sha256(content).hexdigest()

        # LFS batch request
        batch_resp = requests.post(
            f"{API_BASE}/{REPO_ID}.git/info/lfs/objects/batch",
            json={
                "operation": "upload",
                "transfers": ["basic"],
                "objects": [{"oid": sha256, "size": file_info["size"]}],
            },
            headers=HEADERS,
        ).json()
        obj = batch_resp["objects"][0]

        if "actions" not in obj:
            # File already exists in LFS
            lfs_files.append({"path": file_info["path"], "oid": sha256, "size": file_info["size"]})
            continue

        # Single-part upload to S3 (fail loudly instead of committing a missing object)
        with open(file_info["path"], "rb") as f:
            requests.put(obj["actions"]["upload"]["href"], data=f).raise_for_status()

        # Verify
        requests.post(
            obj["actions"]["verify"]["href"],
            json={"oid": sha256, "size": file_info["size"]},
        ).raise_for_status()
        lfs_files.append({"path": file_info["path"], "oid": sha256, "size": file_info["size"]})
    else:
        # Regular file (base64)
        with open(file_info["path"], "rb") as f:
            content_b64 = base64.b64encode(f.read()).decode()
        regular_files.append({
            "path": file_info["path"],
            "content": content_b64,
            "encoding": "base64",
        })

# Step 3: Create commit with all operations
ndjson_lines = [
    json.dumps({"key": "header", "value": {"summary": "Upload model", "description": "Initial upload"}})
]
for f in regular_files:
    ndjson_lines.append(json.dumps({"key": "file", "value": f}))
for f in lfs_files:
    ndjson_lines.append(json.dumps({"key": "lfsFile", "value": {
        "path": f["path"],
        "oid": f["oid"],
        "size": f["size"],
        "algo": "sha256",
    }}))

ndjson_payload = "\n".join(ndjson_lines)
commit_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/commit/main",
    data=ndjson_payload,
    headers={**HEADERS, "Content-Type": "application/x-ndjson"},
).json()
print(f"Committed: {commit_resp['commitUrl']}")
Example 2: Upload Very Large Model (Multipart LFS)
For files ≥ 100MB (default multipart threshold):
import base64
import hashlib
import json
import math
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

API_BASE = "http://localhost:28080/api"
REPO_ID = "username/my-model"
TOKEN = "your_token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}


def upload_large_file_multipart(file_path, repo_path):
    """Upload large file using LFS multipart protocol"""
    # Calculate SHA256 (streamed, so the file is never fully loaded into memory)
    print(f"Calculating SHA256 for {file_path}...")
    sha256_hash = hashlib.sha256()
    file_size = 0
    with open(file_path, "rb") as f:
        while chunk := f.read(8192):
            sha256_hash.update(chunk)
            file_size += len(chunk)
    sha256 = sha256_hash.hexdigest()
    print(f"SHA256: {sha256}, Size: {file_size:,} bytes")

    # Step 1: LFS batch request
    print("Requesting LFS batch upload URLs...")
    batch_resp = requests.post(
        f"{API_BASE}/{REPO_ID}.git/info/lfs/objects/batch",
        json={
            "operation": "upload",
            "transfers": ["basic"],
            "objects": [{"oid": sha256, "size": file_size}],
            "hash_algo": "sha256",
        },
        headers=HEADERS,
    ).json()
    obj = batch_resp["objects"][0]

    # Check if file already exists
    if "actions" not in obj:
        print("File already exists in LFS storage (deduplication)")
        return {"oid": sha256, "size": file_size, "path": repo_path}

    upload_action = obj["actions"]["upload"]
    verify_action = obj["actions"]["verify"]

    # Check if multipart
    if "header" in upload_action and "chunk_size" in upload_action["header"]:
        # Multipart upload
        header = upload_action["header"]
        chunk_size = int(header["chunk_size"])
        upload_id = header["upload_id"]
        print(f"Multipart upload: chunk_size={chunk_size:,} bytes")

        # Calculate number of parts
        num_parts = math.ceil(file_size / chunk_size)
        print(f"Uploading {num_parts} parts in parallel...")

        # Upload parts in parallel
        parts = []

        def upload_part(part_number):
            """Upload a single part"""
            part_url = header[str(part_number)]
            # Read chunk
            with open(file_path, "rb") as f:
                f.seek((part_number - 1) * chunk_size)
                chunk = f.read(chunk_size)
            # Upload
            resp = requests.put(part_url, data=chunk)
            resp.raise_for_status()
            # Extract ETag (remove quotes if present)
            etag = resp.headers["ETag"].strip('"')
            print(f"  Part {part_number}/{num_parts} uploaded (ETag: {etag[:8]}...)")
            return {"PartNumber": part_number, "ETag": etag}

        # Upload parts concurrently (max 10 parallel)
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [executor.submit(upload_part, i) for i in range(1, num_parts + 1)]
            for future in as_completed(futures):
                parts.append(future.result())

        # Sort parts by part number
        parts.sort(key=lambda p: p["PartNumber"])

        # Step 2: Complete multipart upload
        print("Completing multipart upload...")
        complete_resp = requests.post(
            f"{API_BASE}/{REPO_ID}.git/info/lfs/complete/{upload_id}",
            json={
                "oid": sha256,
                "size": file_size,
                "upload_id": upload_id,
                "parts": parts,
            },
        )
        complete_resp.raise_for_status()
        print("Multipart upload completed")

        # Step 3: Verify with multipart info
        print("Verifying upload...")
        verify_resp = requests.post(
            verify_action["href"],
            json={
                "oid": sha256,
                "size": file_size,
                "upload_id": upload_id,
                "parts": parts,
            },
        )
        verify_resp.raise_for_status()
        print("Upload verified")
    else:
        # Single-part upload (< 100MB)
        print("Single-part upload...")
        with open(file_path, "rb") as f:
            requests.put(upload_action["href"], data=f).raise_for_status()
        # Verify
        requests.post(
            verify_action["href"],
            json={"oid": sha256, "size": file_size},
        ).raise_for_status()
        print("Upload complete")

    return {"oid": sha256, "size": file_size, "path": repo_path}


# Upload large model file
model_info = upload_large_file_multipart(
    "large_model.safetensors",  # Local file (e.g., 5GB)
    "model.safetensors",  # Path in repo
)

# Upload config (regular file): base64-encode the raw file bytes
with open("config.json", "rb") as f:
    config_b64_str = base64.b64encode(f.read()).decode()

# Create commit
ndjson_lines = [
    json.dumps({"key": "header", "value": {"summary": "Upload large model", "description": "5GB model with config"}}),
    json.dumps({"key": "lfsFile", "value": {
        "path": model_info["path"],
        "oid": model_info["oid"],
        "size": model_info["size"],
        "algo": "sha256",
    }}),
    json.dumps({"key": "file", "value": {
        "path": "config.json",
        "content": config_b64_str,
        "encoding": "base64",
    }}),
]
commit_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/commit/main",
    data="\n".join(ndjson_lines),
    headers={**HEADERS, "Content-Type": "application/x-ndjson"},
).json()
print(f"Committed: {commit_resp['commitUrl']}")
Key Points:
- Chunk size: 50MB default (configurable)
- Parallel uploads: Up to 10 parts concurrently
- Progress tracking: Each part reports completion
- Resume support: Can retry failed parts without restarting
- ETags: Required for multipart completion
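The resume point can be made concrete with a small retry loop that re-sends only the parts that failed. This is a sketch under assumptions: `upload_parts_with_retry` is a hypothetical helper, `upload_part` stands for any callable that uploads one part and returns its `{"PartNumber", "ETag"}` dict, and the limit of three attempts is arbitrary. It catches `OSError`, which is the base class of the exceptions raised by `requests`.

```python
def upload_parts_with_retry(part_numbers, upload_part, max_attempts=3):
    """Upload every part, retrying only the failed ones (hypothetical helper)."""
    completed = {}
    pending = list(part_numbers)
    for _ in range(max_attempts):
        still_failing = []
        for number in pending:
            try:
                completed[number] = upload_part(number)
            except OSError:  # requests.RequestException derives from OSError
                still_failing.append(number)
        pending = still_failing
        if not pending:
            break
    if pending:
        raise RuntimeError(f"parts {pending} failed after {max_attempts} attempts")
    # Return the parts ordered by part number, as the complete request expects.
    return [completed[number] for number in sorted(completed)]
```

Parts that already succeeded keep their ETags, so a transient failure costs one extra request for that part rather than a full restart.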
Multipart Upload Details
Thresholds (configurable via environment variables):
# When to use multipart (default: 100MB)
KOHAKU_HUB_LFS_MULTIPART_THRESHOLD_BYTES=104857600
# Size of each part (default: 50MB, min: 5MB)
KOHAKU_HUB_LFS_MULTIPART_CHUNK_SIZE_BYTES=52428800
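To see how these two settings interact, the sketch below computes the byte range of each part for a given file size using the defaults above. It is illustrative only: `plan_parts` is a hypothetical helper, and in a real upload the client should use the `chunk_size` returned by the server rather than a local constant.

```python
MULTIPART_THRESHOLD_BYTES = 104_857_600  # default shown above (100MB)
MULTIPART_CHUNK_SIZE_BYTES = 52_428_800  # default shown above (50MB)


def plan_parts(file_size, threshold=MULTIPART_THRESHOLD_BYTES, chunk_size=MULTIPART_CHUNK_SIZE_BYTES):
    """Return (part_number, offset, length) tuples, or [] for a single-part upload."""
    if file_size < threshold:
        return []  # below the threshold: single-part upload
    parts = []
    offset = 0
    part_number = 1  # part numbers start at 1
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts
```

With the defaults, a 524,288,000-byte file splits into exactly 10 parts of 52,428,800 bytes each.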
Multipart Flow:
- Batch Request → Server returns part URLs in header object
- Upload Parts → PUT each part in parallel, collect ETags
- Complete → POST to /lfs/complete/{upload_id} with ETags
- Verify → POST to /lfs/verify with upload_id and parts
Part URL Format:
{
"header": {
"chunk_size": "52428800",
"upload_id": "s3_upload_id_xxx",
"1": "https://s3.../uploadId=xxx&partNumber=1&...",
"2": "https://s3.../uploadId=xxx&partNumber=2&...",
"10": "https://s3.../uploadId=xxx&partNumber=10&..."
}
}
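Because the part URLs share the `header` object with `chunk_size` and `upload_id`, a client has to pick out the numeric keys and order them as integers. A minimal sketch; `part_urls` is a hypothetical helper name. Sorting the keys as strings would place "10" before "2".

```python
def part_urls(header: dict) -> list[tuple[int, str]]:
    """Extract (part_number, url) pairs from the header object, in numeric order."""
    return sorted((int(key), url) for key, url in header.items() if key.isdigit())
```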
ETag Collection:
# From each part upload response
etag = response.headers["ETag"].strip('"')
parts.append({"PartNumber": part_num, "ETag": etag})
Complete Request:
{
"oid": "sha256_hash",
"size": 524288000,
"upload_id": "s3_upload_id",
"parts": [
{"PartNumber": 1, "ETag": "etag1"},
{"PartNumber": 2, "ETag": "etag2"}
]
}
Browser File Upload
Upload from Web Browser
Pattern: Same as above, but set is_browser: true in LFS batch request
Why?
- Browser uploads need Content-Type header in presigned URL
- Server includes it automatically when is_browser: true
Example:
// LFS batch request from browser
const batchResp = await fetch('/api/models/user/repo.git/info/lfs/objects/batch', {
method: 'POST',
headers: {
'Authorization': `Bearer ${token}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
operation: 'upload',
transfers: ['basic'],
objects: [{oid: sha256, size: fileSize}],
is_browser: true // Important!
})
});
const {objects} = await batchResp.json();
const uploadUrl = objects[0].actions.upload.href;
// Upload file (browser automatically adds Content-Type)
await fetch(uploadUrl, {
method: 'PUT',
body: file
});
Deduplication
Content-Based Deduplication
How it works:
- Client provides SHA256 hash in preupload
- Server checks if file with same hash already exists
- If exists: shouldIgnore: true (skip upload)
- If not exists: shouldIgnore: false (upload required)
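On the client side, the preupload response can be turned into an upload plan. The sketch below matches entries by `path` instead of relying on response order, which is a defensive choice rather than a documented requirement; `select_uploads` is a hypothetical helper name.

```python
def select_uploads(files_info: list[dict], preupload_files: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split files into (lfs, regular) upload lists, dropping entries the server says to ignore."""
    decisions = {entry["path"]: entry for entry in preupload_files}
    lfs, regular = [], []
    for info in files_info:
        decision = decisions[info["path"]]
        if decision["shouldIgnore"]:
            continue  # identical content already exists, skip the upload
        if decision["uploadMode"] == "lfs":
            lfs.append(info)
        else:
            regular.append(info)
    return lfs, regular
```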
Benefits:
- Saves bandwidth (no redundant uploads)
- Saves storage (same content = same S3 object)
- Faster uploads (skip unchanged files)
Example:
{
"files": [
{
"path": "config.json",
"size": 512,
"sha256": "abc123...",
"sample": "eyJtb2RlbCI6..."
}
]
}
Response if duplicate:
{
"files": [
{
"path": "config.json",
"uploadMode": "regular",
"shouldIgnore": true
}
]
}
Quota Management
Quota Checks
During preupload:
- Total upload size calculated from all files
- Quota checked against namespace (user or org)
- The applicable quota depends on repository privacy (public vs private)
Error if quota exceeded:
{
"error": "Storage quota exceeded",
"message": "You have used 9.5 GB of your 10 GB quota. This upload requires 2.5 GB."
}
Status Code: 413 Payload Too Large
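The arithmetic behind this error can be sketched as a simple sum-and-compare. This is an illustration of the check described above, not the server's code; the function name `fits_in_quota` and the use of decimal gigabytes are assumptions.

```python
GB = 1000**3  # assumption: decimal gigabytes, for illustration only


def fits_in_quota(used_bytes: int, quota_bytes: int, upload_sizes: list[int]) -> bool:
    """Return True if the total upload size fits in the remaining quota."""
    required = sum(upload_sizes)  # total size of all files in the preupload request
    return used_bytes + required <= quota_bytes
```

With the figures from the error message, 9.5 GB used plus a 2.5 GB upload exceeds a 10 GB quota, so the check fails and the request would be rejected.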
Error Handling
Common Errors
400 Bad Request - File too large for inline:
{
"error": "File should use LFS (size: 10000000 bytes, threshold: 5000000 bytes)",
"suggested_operation": "lfsFile"
}
400 Bad Request - LFS object not found:
{
"error": "LFS object abc123... not found in storage. Upload to S3 may have failed."
}
404 Not Found - Repository:
{
"error": "Repository not found"
}
403 Forbidden - Permission:
{
"error": "You don't have write access to this repository"
}
413 Payload Too Large - Quota:
{
"error": "Storage quota exceeded",
"message": "..."
}
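A client can branch on the status code and the optional `suggested_operation` field to decide what to do next. The sketch below is one possible mapping; the function name `next_action` and the returned action labels are hypothetical and not part of the API.

```python
def next_action(status_code: int, body: dict) -> str:
    """Map an error response to a client-side action label (labels are hypothetical)."""
    if status_code == 400 and body.get("suggested_operation") == "lfsFile":
        return "retry_as_lfs"  # file was too large for an inline "file" operation
    actions = {
        400: "inspect_error",  # e.g. invalid payload or missing LFS object
        403: "check_permissions",
        404: "check_repository",
        413: "free_quota",
    }
    return actions.get(status_code, "unknown")
```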
Performance Tips
For small repos (<100 files):
- Use regular files when possible (< 5MB)
- Batch operations in single commit
- Enable deduplication with SHA256
For large repos (100+ files):
- Always use LFS for files >5MB
- Upload LFS files in parallel
- Use multipart for files >100MB
- Commit frequently (don't batch 1000+ files)
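One way to follow the last tip is to split a long list of operations into fixed-size batches and send one commit per batch. A minimal sketch; `chunked` is a hypothetical helper and the batch size of 500 is an arbitrary choice, not a documented limit.

```python
from typing import Iterable, Iterator

COMMIT_BATCH_SIZE = 500  # assumption: arbitrary value below the 1000+ mark mentioned above


def chunked(items: Iterable, size: int = COMMIT_BATCH_SIZE) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch
```

Each yielded batch would get its own header line and commit request.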
For CI/CD:
- Cache LFS objects locally
- Skip unchanged files with deduplication
- Use shallow clones
- Upload only changed files
Next Steps
- Git LFS API - LFS batch protocol details
- Branches API - Branch/tag management
- Commits API - Commit history