---
title: File Upload & Commit API
description: Direct file uploads via NDJSON commit protocol
icon: i-carbon-document-add
---

# File Upload & Commit API

Direct file upload and commit operations using the HuggingFace-compatible NDJSON protocol.

---

## Preupload Check

### Check Files Before Upload

**Pattern:** `POST /{repo_type}s/{namespace}/{name}/preupload/{revision}`

**Authentication:** Required (write permission)

**Purpose:**
- Determine upload mode (regular vs LFS)
- Check for duplicate files (content deduplication)
- Validate quota before upload

**Request Body:**
```json
{
  "files": [
    {
      "path": "model.safetensors",
      "size": 5368709120,
      "sha256": "abc123def456...",
      "sample": "base64_encoded_first_512_bytes"
    },
    {
      "path": "config.json",
      "size": 512
    }
  ]
}
```

**Field Explanations:**
- `path`: File path in repository (required)
- `size`: File size in bytes (required)
- `sha256`: SHA256 hash of file (optional, enables deduplication)
- `sample`: Base64-encoded sample of the file content (optional, for small files)
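
A minimal sketch of how a client might assemble one entry of the `files` array from a local file. The helper name `build_preupload_entry` is hypothetical, and the 512-byte sample length is an assumption taken from the placeholder in the request example above, not a confirmed server requirement.

```python
import base64
import hashlib
import os

# Assumption: the sample is the first 512 bytes, as the request example's placeholder suggests.
SAMPLE_BYTES = 512


def build_preupload_entry(local_path, repo_path):
    """Return one entry for the "files" array of a preupload request."""
    sha256 = hashlib.sha256()
    sample = b""
    with open(local_path, "rb") as f:
        # Stream in 1 MiB chunks so large files are never fully loaded into memory.
        while chunk := f.read(1024 * 1024):
            if len(sample) < SAMPLE_BYTES:
                sample += chunk[: SAMPLE_BYTES - len(sample)]
            sha256.update(chunk)
    return {
        "path": repo_path,
        "size": os.path.getsize(local_path),
        "sha256": sha256.hexdigest(),
        "sample": base64.b64encode(sample).decode("ascii"),
    }
```

Since `sha256` and `sample` are optional, a client that cannot afford to hash a file up front could send only `path` and `size`.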

**Response:**
```json
{
  "files": [
    {
      "path": "model.safetensors",
      "uploadMode": "lfs",
      "shouldIgnore": false
    },
    {
      "path": "config.json",
      "uploadMode": "regular",
      "shouldIgnore": true
    }
  ]
}
```

**Upload Modes:**
- `"lfs"`: File matches LFS criteria (size ≥ threshold OR suffix match)
- `"regular"`: File is small enough for inline base64 upload

**Should Ignore:**
- `true`: File with same content already exists (skip upload)
- `false`: File is new or changed (upload required)
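
As a rough illustration of how the two response fields combine, the sketch below sorts a preupload response into paths to upload via LFS, paths to send inline, and paths to skip. The function name `plan_uploads` is hypothetical.

```python
def plan_uploads(preupload_response):
    """Split a preupload response into LFS, regular, and skipped paths."""
    plan = {"lfs": [], "regular": [], "skipped": []}
    for entry in preupload_response["files"]:
        if entry["shouldIgnore"]:
            # Identical content already exists, so no upload is needed.
            plan["skipped"].append(entry["path"])
        else:
            # uploadMode is "lfs" or "regular", matching the plan keys.
            plan[entry["uploadMode"]].append(entry["path"])
    return plan
```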

**Status Codes:**
- `200 OK` - Success
- `400 Bad Request` - Invalid payload
- `404 Not Found` - Repository not found
- `413 Payload Too Large` - Quota exceeded

---

## Commit Operation

### Create Commit with Multiple File Operations

**Pattern:** `POST /{repo_type}s/{namespace}/{name}/commit/{revision}`

**Authentication:** Required (write permission)

**Content-Type:** `application/x-ndjson` or `application/json`

**Purpose:** Atomic commit with multiple file operations (add/modify/delete/copy)

**Request Format:**

NDJSON (Newline-Delimited JSON) - one JSON object per line:

```ndjson
{"key": "header", "value": {"summary": "Update model", "description": "Improved accuracy"}}
{"key": "file", "value": {"path": "config.json", "content": "base64_content", "encoding": "base64"}}
{"key": "lfsFile", "value": {"path": "model.safetensors", "oid": "sha256_hash", "size": 5368709120, "algo": "sha256"}}
{"key": "deletedFile", "value": {"path": "old_file.txt"}}
{"key": "deletedFolder", "value": {"path": "old_folder/"}}
{"key": "copyFile", "value": {"path": "new_location.txt", "srcPath": "source.txt", "srcRevision": "main"}}
```
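
A small sketch of how such a payload could be serialized on the client. The helper name `build_commit_payload` is hypothetical; it only relies on `json.dumps`, which never emits raw newlines by default, so each operation stays on one line.

```python
import json


def build_commit_payload(summary, operations, description=""):
    """Serialize a header plus (key, value) operations as NDJSON text."""
    # The header must be the first line of the payload.
    lines = [{"key": "header", "value": {"summary": summary, "description": description}}]
    lines += [{"key": key, "value": value} for key, value in operations]
    return "\n".join(json.dumps(line) for line in lines)
```

For example, `build_commit_payload("Remove old file", [("deletedFile", {"path": "old_file.txt"})])` yields a two-line payload suitable for the `application/x-ndjson` content type.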

---

### Operation Types

#### 1. Header (Required)

**First line must be header:**
```json
{
  "key": "header",
  "value": {
    "summary": "Commit message",
    "description": "Optional detailed description"
  }
}
```

---

#### 2. Regular File (Inline Base64)

**For files < LFS threshold:**
```json
{
  "key": "file",
  "value": {
    "path": "config.json",
    "content": "eyJtb2RlbCI6ICJiZXJ0In0=",
    "encoding": "base64"
  }
}
```

**Rules:**
- File size MUST be < LFS threshold
- Content is base64 encoded
- Encoding must be "base64"
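
The rules above can be checked on the client before sending anything. The sketch below is illustrative only: `make_file_operation` is a hypothetical helper, and the 5,000,000-byte default is an assumption borrowed from the error example in this section, since the real threshold is set by the server.

```python
import base64

# Assumption: default taken from the error example; the server's threshold may differ.
LFS_THRESHOLD_BYTES = 5_000_000


def make_file_operation(repo_path, content, threshold=LFS_THRESHOLD_BYTES):
    """Build a "file" operation, refusing content that should go through LFS."""
    if len(content) >= threshold:
        raise ValueError(
            f"{repo_path} is {len(content)} bytes; use an 'lfsFile' operation instead"
        )
    return {
        "key": "file",
        "value": {
            "path": repo_path,
            "content": base64.b64encode(content).decode("ascii"),
            "encoding": "base64",
        },
    }
```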

**Error if file too large:**
```json
{
  "error": "File config.json should use LFS (size: 10000000 bytes, threshold: 5000000 bytes). Use 'lfsFile' operation instead.",
  "file_size": 10000000,
  "lfs_threshold": 5000000,
  "suggested_operation": "lfsFile"
}
```

---

#### 3. LFS File (Already Uploaded to S3)

**For files uploaded via LFS batch API:**
```json
{
  "key": "lfsFile",
  "value": {
    "path": "model.safetensors",
    "oid": "abc123def456789...",
    "size": 5368709120,
    "algo": "sha256"
  }
}
```

**Prerequisites:**
1. File must be uploaded to S3 via LFS batch API first
2. OID (SHA256) must match uploaded file
3. Size must match actual file size

**Server validates:**
- File exists in S3 at `lfs/{oid[:2]}/{oid[2:4]}/{oid}`
- Size matches S3 object size
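
To make the storage layout concrete, the sketch below derives the object key from an OID using the pattern above. The function name `lfs_object_key` and the hex validation are illustrative additions, not part of the documented API.

```python
def lfs_object_key(oid):
    """Return the storage key lfs/{oid[:2]}/{oid[2:4]}/{oid} for a SHA256 OID."""
    oid = oid.lower()
    # A SHA256 digest is 64 hexadecimal characters.
    if len(oid) != 64 or any(c not in "0123456789abcdef" for c in oid):
        raise ValueError("oid must be a 64-character hex SHA256 digest")
    return f"lfs/{oid[:2]}/{oid[2:4]}/{oid}"
```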

---

#### 4. Delete File

**Remove a single file:**
```json
{
  "key": "deletedFile",
  "value": {
    "path": "old_model.bin"
  }
}
```

**Behavior:**
- Marks file as deleted in database (soft delete)
- Removes file from LakeFS branch
- Preserves LFS history for quota tracking

---

#### 5. Delete Folder

**Remove all files in a folder recursively:**
```json
{
  "key": "deletedFolder",
  "value": {
    "path": "old_experiments/"
  }
}
```

**Behavior:**
- Lists all files under folder recursively
- Deletes each file in parallel
- Marks files as deleted in database

---

#### 6. Copy File

**Copy file from same or different revision:**
```json
{
  "key": "copyFile",
  "value": {
    "path": "backup/model.safetensors",
    "srcPath": "model.safetensors",
    "srcRevision": "main"
  }
}
```

**Fields:**
- `path`: Destination path
- `srcPath`: Source file path
- `srcRevision`: Source revision (branch or commit, defaults to current revision)

**Behavior:**
- Links physical S3 address (no duplication)
- Copies database metadata
- Works for both regular and LFS files

---

### Response

**Success:**
```json
{
  "commitUrl": "http://localhost:28080/username/my-model/commit/abc123def",
  "commitOid": "abc123def456789...",
  "pullRequestUrl": null
}
```

**No Changes (response points at the existing commit):**
```json
{
  "commitUrl": "http://localhost:28080/username/my-model/commit/previous_commit",
  "commitOid": "previous_commit_hash",
  "pullRequestUrl": null
}
```

---

## Complete Upload Workflows

### Example 1: Upload Large Model with Config (Single-Part LFS)

```python
import requests
import base64
import hashlib
import json

API_BASE = "http://localhost:28080/api"
REPO_ID = "username/my-model"
TOKEN = "your_token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Step 1: Preupload check
files_info = [
    {"path": "config.json", "size": 512},
    {"path": "model.safetensors", "size": 52428800, "sha256": "abc123..."}  # 50MB
]

preupload_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/preupload/main",
    json={"files": files_info},
    headers=HEADERS
).json()

# Step 2: Upload files based on preupload response
lfs_files = []
regular_files = []

for file_info, preupload in zip(files_info, preupload_resp["files"]):
    if preupload["shouldIgnore"]:
        continue  # File already exists, skip

    if preupload["uploadMode"] == "lfs":
        # Upload via LFS batch API
        with open(file_info["path"], "rb") as f:
            content = f.read()
            sha256 = hashlib.sha256(content).hexdigest()

        # LFS batch request
        batch_resp = requests.post(
            f"{API_BASE}/{REPO_ID}.git/info/lfs/objects/batch",
            json={
                "operation": "upload",
                "transfers": ["basic"],
                "objects": [{"oid": sha256, "size": file_info["size"]}]
            },
            headers=HEADERS
        ).json()

        obj = batch_resp["objects"][0]
        if "actions" not in obj:
            # File already exists in LFS
            lfs_files.append({"path": file_info["path"], "oid": sha256, "size": file_info["size"]})
            continue

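        # Single-part upload to S3
        with open(file_info["path"], "rb") as f:
            upload_resp = requests.put(obj["actions"]["upload"]["href"], data=f)
        upload_resp.raise_for_status()  # Stop before committing if the upload failed

        # Verify
        verify_resp = requests.post(
            obj["actions"]["verify"]["href"],
            json={"oid": sha256, "size": file_info["size"]}
        )
        verify_resp.raise_for_status()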

        lfs_files.append({"path": file_info["path"], "oid": sha256, "size": file_info["size"]})

    else:
        # Regular file (base64)
        with open(file_info["path"], "rb") as f:
            content_b64 = base64.b64encode(f.read()).decode()

        regular_files.append({
            "path": file_info["path"],
            "content": content_b64,
            "encoding": "base64"
        })

# Step 3: Create commit with all operations
ndjson_lines = [
    json.dumps({"key": "header", "value": {"summary": "Upload model", "description": "Initial upload"}})
]

for f in regular_files:
    ndjson_lines.append(json.dumps({"key": "file", "value": f}))

for f in lfs_files:
    ndjson_lines.append(json.dumps({"key": "lfsFile", "value": {
        "path": f["path"],
        "oid": f["oid"],
        "size": f["size"],
        "algo": "sha256"
    }}))

ndjson_payload = "\n".join(ndjson_lines)

commit_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/commit/main",
    data=ndjson_payload,
    headers={**HEADERS, "Content-Type": "application/x-ndjson"}
).json()

print(f"Committed: {commit_resp['commitUrl']}")
```

---

### Example 2: Upload Very Large Model (Multipart LFS)

**For files ≥ 100MB (default multipart threshold):**

```python
import requests
import base64
import hashlib
import json
import math
from concurrent.futures import ThreadPoolExecutor, as_completed

API_BASE = "http://localhost:28080/api"
REPO_ID = "username/my-model"
TOKEN = "your_token"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def upload_large_file_multipart(file_path, repo_path):
    """Upload large file using LFS multipart protocol"""

    # Calculate SHA256
    print(f"Calculating SHA256 for {file_path}...")
    sha256_hash = hashlib.sha256()
    file_size = 0
    with open(file_path, "rb") as f:
        while chunk := f.read(8192):
            sha256_hash.update(chunk)
            file_size += len(chunk)

    sha256 = sha256_hash.hexdigest()
    print(f"SHA256: {sha256}, Size: {file_size:,} bytes")

    # Step 1: LFS batch request
    print("Requesting LFS batch upload URLs...")
    batch_resp = requests.post(
        f"{API_BASE}/{REPO_ID}.git/info/lfs/objects/batch",
        json={
            "operation": "upload",
            "transfers": ["basic"],
            "objects": [{"oid": sha256, "size": file_size}],
            "hash_algo": "sha256"
        },
        headers=HEADERS
    ).json()

    obj = batch_resp["objects"][0]

    # Check if file already exists
    if "actions" not in obj:
        print("File already exists in LFS storage (deduplication)")
        return {"oid": sha256, "size": file_size, "path": repo_path}

    upload_action = obj["actions"]["upload"]
    verify_action = obj["actions"]["verify"]

    # Check if multipart
    if "header" in upload_action and "chunk_size" in upload_action["header"]:
        # Multipart upload
        header = upload_action["header"]
        chunk_size = int(header["chunk_size"])
        upload_id = header["upload_id"]

        print(f"Multipart upload: chunk_size={chunk_size:,} bytes")

        # Calculate number of parts
        num_parts = math.ceil(file_size / chunk_size)
        print(f"Uploading {num_parts} parts in parallel...")

        # Upload parts in parallel
        parts = []

        def upload_part(part_number):
            """Upload a single part"""
            part_url = header[str(part_number)]

            # Read chunk
            with open(file_path, "rb") as f:
                f.seek((part_number - 1) * chunk_size)
                chunk = f.read(chunk_size)

            # Upload
            resp = requests.put(part_url, data=chunk)
            resp.raise_for_status()

            # Extract ETag (remove quotes if present)
            etag = resp.headers["ETag"].strip('"')

            print(f"  Part {part_number}/{num_parts} uploaded (ETag: {etag[:8]}...)")
            return {"PartNumber": part_number, "ETag": etag}

        # Upload parts concurrently (max 10 parallel)
        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [executor.submit(upload_part, i) for i in range(1, num_parts + 1)]

            for future in as_completed(futures):
                parts.append(future.result())

        # Sort parts by part number
        parts.sort(key=lambda p: p["PartNumber"])

        # Step 2: Complete multipart upload
        print("Completing multipart upload...")
        complete_resp = requests.post(
            f"{API_BASE}/{REPO_ID}.git/info/lfs/complete/{upload_id}",
            json={
                "oid": sha256,
                "size": file_size,
                "upload_id": upload_id,
                "parts": parts
            }
        )
        complete_resp.raise_for_status()
        print("Multipart upload completed")

        # Step 3: Verify with multipart info
        print("Verifying upload...")
        verify_resp = requests.post(
            verify_action["href"],
            json={
                "oid": sha256,
                "size": file_size,
                "upload_id": upload_id,
                "parts": parts
            }
        )
        verify_resp.raise_for_status()
        print("Upload verified")

    else:
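        # Single-part upload (< 100MB)
        print("Single-part upload...")
        with open(file_path, "rb") as f:
            upload_resp = requests.put(upload_action["href"], data=f)
        upload_resp.raise_for_status()  # Stop before committing if the upload failed

        # Verify
        verify_resp = requests.post(
            verify_action["href"],
            json={"oid": sha256, "size": file_size}
        )
        verify_resp.raise_for_status()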
        print("Upload complete")

    return {"oid": sha256, "size": file_size, "path": repo_path}

# Upload large model file
model_info = upload_large_file_multipart(
    "large_model.safetensors",  # Local file (e.g., 5GB)
    "model.safetensors"         # Path in repo
)

# Upload config (regular file): base64-encode the raw bytes unchanged
with open("config.json", "rb") as f:
    config_b64_str = base64.b64encode(f.read()).decode()

# Create commit
ndjson_lines = [
    json.dumps({"key": "header", "value": {"summary": "Upload large model", "description": "5GB model with config"}}),
    json.dumps({"key": "lfsFile", "value": {
        "path": model_info["path"],
        "oid": model_info["oid"],
        "size": model_info["size"],
        "algo": "sha256"
    }}),
    json.dumps({"key": "file", "value": {
        "path": "config.json",
        "content": config_b64_str,
        "encoding": "base64"
    }})
]

commit_resp = requests.post(
    f"{API_BASE}/models/{REPO_ID}/commit/main",
    data="\n".join(ndjson_lines),
    headers={**HEADERS, "Content-Type": "application/x-ndjson"}
).json()

print(f"Committed: {commit_resp['commitUrl']}")
```

**Key Points:**
- **Chunk size:** 50MB default (configurable)
- **Parallel uploads:** Up to 10 parts concurrently
- **Progress tracking:** Each part reports completion
- **Resume support:** Can retry failed parts without restarting
- **ETags:** Required for multipart completion

---

### Multipart Upload Details

**Thresholds (configurable via environment variables):**

```bash
# When to use multipart (default: 100MB)
KOHAKU_HUB_LFS_MULTIPART_THRESHOLD_BYTES=104857600

# Size of each part (default: 50MB, min: 5MB)
KOHAKU_HUB_LFS_MULTIPART_CHUNK_SIZE_BYTES=52428800
```
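
A short sketch of the arithmetic these two settings imply. It assumes multipart applies when the file size is at or above the threshold (as stated for Example 2) and that every part except the last has exactly the chunk size; `plan_parts` is a hypothetical helper, and the server's batch response remains the authority on the actual chunk size.

```python
import math

MULTIPART_THRESHOLD_BYTES = 104_857_600  # default from the settings above (100MB)
MULTIPART_CHUNK_SIZE_BYTES = 52_428_800  # default from the settings above (50MB)


def plan_parts(file_size,
               threshold=MULTIPART_THRESHOLD_BYTES,
               chunk_size=MULTIPART_CHUNK_SIZE_BYTES):
    """Return (part_number, offset, length) tuples, or [] for a single-part upload."""
    if file_size < threshold:
        return []
    num_parts = math.ceil(file_size / chunk_size)
    return [
        # Part numbers start at 1; the last part holds whatever bytes remain.
        (n, (n - 1) * chunk_size, min(chunk_size, file_size - (n - 1) * chunk_size))
        for n in range(1, num_parts + 1)
    ]
```

With the defaults, a 5368709120-byte file splits into 103 parts, the last of which is 20971520 bytes.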

**Multipart Flow:**

1. **Batch Request** → Server returns part URLs in `header` object
2. **Upload Parts** → PUT each part in parallel, collect ETags
3. **Complete** → POST to `/lfs/complete/{upload_id}` with ETags
4. **Verify** → POST to `/lfs/verify` with upload_id and parts

**Part URL Format:**
```json
{
  "header": {
    "chunk_size": "52428800",
    "upload_id": "s3_upload_id_xxx",
    "1": "https://s3.../uploadId=xxx&partNumber=1&...",
    "2": "https://s3.../uploadId=xxx&partNumber=2&...",
    "10": "https://s3.../uploadId=xxx&partNumber=10&..."
  }
}
```
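
Because the part URLs are stored under string keys next to `chunk_size` and `upload_id`, a client has to separate them out and order them numerically (a plain string sort would place `"10"` before `"2"`). A minimal sketch, with `extract_part_urls` as a hypothetical helper name:

```python
def extract_part_urls(header):
    """Return (part_number, url) pairs from a multipart header, in numeric order."""
    # Only purely numeric keys are part URLs; chunk_size and upload_id are skipped.
    numbered = {int(key): url for key, url in header.items() if key.isdigit()}
    return sorted(numbered.items())
```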

**ETag Collection:**
```python
# From each part upload response
etag = response.headers["ETag"].strip('"')
parts.append({"PartNumber": part_num, "ETag": etag})
```

**Complete Request:**
```json
{
  "oid": "sha256_hash",
  "size": 524288000,
  "upload_id": "s3_upload_id",
  "parts": [
    {"PartNumber": 1, "ETag": "etag1"},
    {"PartNumber": 2, "ETag": "etag2"}
  ]
}
```
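
The request body above could be assembled from the collected ETags as sketched below. `build_complete_request` is a hypothetical helper; the check that parts are numbered 1..N without gaps is a client-side sanity check added for illustration, not a documented server rule.

```python
def build_complete_request(oid, size, upload_id, parts):
    """Build the JSON body for the multipart complete call."""
    ordered = sorted(
        # Strip surrounding quotes from ETags and normalise part numbers to int.
        ({"PartNumber": int(p["PartNumber"]), "ETag": p["ETag"].strip('"')} for p in parts),
        key=lambda p: p["PartNumber"],
    )
    numbers = [p["PartNumber"] for p in ordered]
    if numbers != list(range(1, len(ordered) + 1)):
        raise ValueError("parts must be numbered 1..N with no gaps or duplicates")
    return {"oid": oid, "size": size, "upload_id": upload_id, "parts": ordered}
```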

---

## Browser File Upload

### Upload from Web Browser

**Pattern:** Same as above, but set `is_browser: true` in the LFS batch request

**Why?**
- Browser uploads need the `Content-Type` header included in the presigned URL
- The server includes it automatically when `is_browser: true`

**Example:**
```javascript
// LFS batch request from browser
const batchResp = await fetch('/api/models/user/repo.git/info/lfs/objects/batch', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${token}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    operation: 'upload',
    transfers: ['basic'],
    objects: [{oid: sha256, size: fileSize}],
    is_browser: true  // Important!
  })
});

const {objects} = await batchResp.json();
const uploadUrl = objects[0].actions.upload.href;

// Upload file (browser automatically adds Content-Type)
await fetch(uploadUrl, {
  method: 'PUT',
  body: file
});
```

---

## Deduplication

### Content-Based Deduplication

**How it works:**
1. Client provides SHA256 hash in preupload
2. Server checks if file with same hash already exists
3. If exists: `shouldIgnore: true` (skip upload)
4. If not exists: `shouldIgnore: false` (upload required)

**Benefits:**
- Saves bandwidth (no redundant uploads)
- Saves storage (same content = same S3 object)
- Faster uploads (skip unchanged files)

**Example:**
```json
{
  "files": [
    {
      "path": "config.json",
      "size": 512,
      "sha256": "abc123...",
      "sample": "eyJtb2RlbCI6..."
    }
  ]
}
```

**Response if duplicate:**
```json
{
  "files": [
    {
      "path": "config.json",
      "uploadMode": "regular",
      "shouldIgnore": true
    }
  ]
}
```
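
Since identical content always produces the same SHA256, a client can apply the same idea locally before calling preupload, so each distinct blob is hashed and uploaded once even when it appears under several paths. This is a sketch under that assumption; `group_by_content` is a hypothetical helper and takes in-memory bytes only to keep the example short.

```python
import hashlib
from collections import defaultdict


def group_by_content(files):
    """Group repository paths by the SHA256 of their content.

    `files` maps repo path -> bytes; the result maps sha256 -> list of paths.
    """
    groups = defaultdict(list)
    for path, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(path)
    return dict(groups)
```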

---

## Quota Management

### Quota Checks

**During preupload:**
- Total upload size calculated from all files
- Quota checked against namespace (user or org)
- The applicable quota depends on repository visibility (public vs private)

**Error if quota exceeded:**
```json
{
  "error": "Storage quota exceeded",
  "message": "You have used 9.5 GB of your 10 GB quota. This upload requires 2.5 GB."
}
```

**Status Code:** `413 Payload Too Large`

---

## Error Handling

### Common Errors

**400 Bad Request - File too large for inline:**
```json
{
  "error": "File should use LFS (size: 10000000 bytes, threshold: 5000000 bytes)",
  "suggested_operation": "lfsFile"
}
```

**400 Bad Request - LFS object not found:**
```json
{
  "error": "LFS object abc123... not found in storage. Upload to S3 may have failed."
}
```

**404 Not Found - Repository:**
```json
{
  "error": "Repository not found"
}
```

**403 Forbidden - Permission:**
```json
{
  "error": "You don't have write access to this repository"
}
```

**413 Payload Too Large - Quota:**
```json
{
  "error": "Storage quota exceeded",
  "message": "..."
}
```
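
One way a client might turn these responses into a next step is sketched below. The function name and the hint strings are hypothetical; only the status codes and the `suggested_operation` field come from the examples above.

```python
def classify_error(status_code, body):
    """Map an error response to a short client-side action hint."""
    # A 400 that suggests lfsFile means the file must be re-sent through LFS.
    if status_code == 400 and body.get("suggested_operation") == "lfsFile":
        return "resend_as_lfs"
    hints = {
        400: "fix_request",
        403: "check_write_access",
        404: "check_repository",
        413: "free_quota",
    }
    return hints.get(status_code, "unexpected")
```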

---

## Performance Tips

**For small repos (<100 files):**
- Use regular files when possible (< 5MB)
- Batch operations in single commit
- Enable deduplication with SHA256

**For large repos (100+ files):**
- Always use LFS for files >5MB
- Upload LFS files in parallel
- Use multipart for files >100MB
- Commit frequently (don't batch 1000+ files)

**For CI/CD:**
- Cache LFS objects locally
- Skip unchanged files with deduplication
- Use shallow clones
- Upload only changed files
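
To avoid oversized commits, a client could split a long list of operations into several commits, each sent with its own header line. The sketch below is illustrative: `chunk_operations` is a hypothetical helper, and the limit of 500 operations per commit is an arbitrary assumption, not a documented server limit.

```python
def chunk_operations(operations, max_per_commit=500):
    """Yield consecutive slices of `operations`, one slice per commit."""
    if max_per_commit < 1:
        raise ValueError("max_per_commit must be at least 1")
    for start in range(0, len(operations), max_per_commit):
        yield operations[start:start + max_per_commit]
```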

---

## Next Steps

- [Git LFS API](./git-lfs.md) - LFS batch protocol details
- [Branches API](./branches.md) - Branch/tag management
- [Commits API](./commits.md) - Commit history