# Git Support in KohakuHub
*Complete guide covering Git clone operations, LFS integration, and server implementation*
**Last Updated:** January 2025
**Status:** ✅ Clone/Pull Production Ready | ⚠️ Push In Development
---
## Table of Contents
### Part 1: User Guide
1. [Quick Start](#quick-start)
2. [Authentication](#authentication-guide)
3. [LFS Integration](#lfs-integration-guide)
4. [Cloudflare Setup](#cloudflare-setup)
5. [Troubleshooting](#troubleshooting-guide)
### Part 2: Developer Guide
6. [Implementation Overview](#implementation-overview)
7. [Git Protocol Fundamentals](#git-protocol-fundamentals)
8. [Packet-Line Format](#packet-line-format)
9. [Git Smart HTTP Protocol](#git-smart-http-protocol)
10. [Service Advertisement](#service-advertisement)
11. [Upload-Pack (Clone/Fetch/Pull)](#upload-pack-clonefetchpull)
12. [Receive-Pack (Push)](#receive-pack-push)
13. [Pack File Format](#pack-file-format)
14. [Authentication](#authentication)
15. [Implementation with FastAPI](#implementation-with-fastapi)
16. [Pure Python Implementation](#pure-python-implementation)
17. [Complete Code Examples](#complete-code-examples)
18. [Testing Your Implementation](#testing-your-implementation)
19. [Troubleshooting](#troubleshooting)
20. [Large File Handling with Git LFS](#large-file-handling-with-git-lfs)
21. [Performance Optimization](#performance-optimization)
22. [References](#references)
---
# Part 1: User Guide
## Quick Start
### Clone a Repository
```bash
# Public repository
git clone http://hub.example.com/namespace/repo-name.git
# Private repository (requires token)
git clone http://username:your-token@hub.example.com/namespace/private-repo.git
# Clone and download large files
cd repo-name
git lfs install
git lfs pull
```
### How LFS Works
KohakuHub automatically handles large files using Git LFS:
| File Size | In Clone | Download Method |
|-----------|----------|-----------------|
| < 1 MB | ✅ Full content | Included in pack |
| >= 1 MB | ✅ LFS pointer (~100 bytes) | `git lfs pull` |
**Example:**
```bash
$ git clone http://hub.example.com/org/large-model.git
Cloning... done. (Downloaded: 2 MB - only metadata!)
$ cd large-model
$ ls -lh model.safetensors
-rw-r--r-- 1 user user 132 Oct 9 14:30 model.safetensors # Pointer file
$ cat model.safetensors
version https://git-lfs.github.com/spec/v1
oid sha256:abc123...
size 10737418240
$ git lfs pull
Downloading model.safetensors (10 GB)... done.
$ ls -lh model.safetensors
-rw-r--r-- 1 user user 10G Oct 9 14:32 model.safetensors # Actual file
```
## Authentication Guide
### Using Access Tokens
**Generate Token:**
1. Login to KohakuHub web UI
2. Go to Settings → Access Tokens
3. Click "Create New Token"
4. Copy the token (you won't see it again!)
**Method 1: Credential Helper (Recommended)**
```bash
git clone http://hub.example.com/org/repo.git
# Git prompts for credentials:
# Username: your-username
# Password: paste-your-token-here
# Cache credentials for 1 hour
git config --global credential.helper 'cache --timeout=3600'
```
**Method 2: URL (Not Recommended - visible in history)**
```bash
git clone http://username:your-token@hub.example.com/org/repo.git
```
**Method 3: Environment Variable**
```bash
export GIT_USER=username
export GIT_TOKEN=your-token
git clone http://$GIT_USER:$GIT_TOKEN@hub.example.com/org/repo.git
```
## LFS Integration Guide
### Installation
```bash
# Install Git LFS (one-time)
git lfs install
```
### Download Large Files
```bash
# After cloning
git lfs pull
# Download specific files only
git lfs pull --include="models/*.safetensors"
# Skip LFS during clone (faster)
GIT_LFS_SKIP_SMUDGE=1 git clone http://hub.example.com/org/repo.git
cd repo
git lfs pull # Download later
```
### Check LFS Status
```bash
# List LFS-tracked files
git lfs ls-files
# Check LFS configuration
cat .lfsconfig
# Should show:
# [lfs]
# url = http://hub.example.com/namespace/repo.git/info/lfs
```
## Cloudflare Setup
If deploying behind Cloudflare, Git requests may be cached/modified. Fix:
### Create Page Rule
**Cloudflare Dashboard → Rules → Page Rules**
**URL Pattern:**
```
*yourdomain.com/*/*.git/*
```
**Settings:**
- ✅ Cache Level: **Bypass**
- ✅ Disable Performance
- ✅ Disable Apps
**Why:** Git protocol responses must not be cached or compressed.
### Alternative: Subdomain
Use a separate subdomain that bypasses Cloudflare:
```
git.hub.example.com → Direct to origin (DNS only)
hub.example.com → Through Cloudflare (for web UI)
```
```bash
git clone https://git.hub.example.com/org/repo.git
```
## Troubleshooting Guide
### Clone Hangs or Fails
**Problem:** `fatal: protocol error: bad pack header`
**Cause:** Old version with pkt-line chunking bug
**Solution:** Update to latest KohakuHub version
---
**Problem:** `fatal: repository not found`
**Cause:** Repository doesn't exist or no access
**Solution:** Check spelling, verify repo exists in web UI
---
**Problem:** Clone works but folders are missing
**Cause:** Old version with tree building bug
**Solution:** Update to latest KohakuHub version
### LFS Issues
**Problem:** `git lfs pull` does nothing
**Cause:** `.lfsconfig` missing or incorrect
**Solution:** Check/create `.lfsconfig`:
```ini
[lfs]
url = http://hub.example.com/namespace/repo.git/info/lfs
```
---
**Problem:** LFS files show as pointers after `git lfs pull`
**Cause:** LFS endpoint unreachable
**Solution:** Test LFS endpoint:
```bash
curl -v "http://hub.example.com/namespace/repo.git/info/lfs/objects/batch" \
-X POST -H "Content-Type: application/json" \
-d '{"operation":"download","objects":[{"oid":"abc","size":100}]}'
```
### Cloudflare Issues
**Problem:** `fatal: not a git repository`
**Cause:** Cloudflare caching Git responses
**Solution:** Create Cloudflare Page Rule (see above)
---
# Part 2: Developer Guide
## Implementation Overview
### What is a Git Server?
A Git server allows Git clients to clone, fetch, pull, and push repositories over the network. There are two main protocols:
- **Git Smart HTTP**: HTTP-based protocol (what we're implementing)
- **Git SSH**: SSH-based protocol (not covered here)
### Why Build Your Own?
In KohakuHub, we need to:
1. Provide native Git access to LakeFS-backed repositories
2. Integrate with existing authentication (tokens, sessions)
3. Maintain compatibility with HuggingFace Hub while adding Git support
4. Translate Git operations to LakeFS REST API calls
### Architecture Overview
```
Git Client (git clone/push)
        ↓ HTTPS Request
Nginx (Proxy)
        ↓
FastAPI (Git HTTP Endpoints)
        ↓
GitLakeFSBridge (Translation Layer)
        ↓
LakeFS REST API
        ↓
S3/MinIO Storage
```
---
## Git Protocol Fundamentals
### Git Object Model
Git stores data as a directed acyclic graph (DAG) of objects:
1. **Blob**: File content
2. **Tree**: Directory listing (maps names to blobs/trees)
3. **Commit**: Snapshot with metadata (author, message, tree, parents)
4. **Tag**: Named reference to a commit
Each object is identified by its SHA-1 hash.
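The hash is computed over a small header plus the raw content. A minimal sketch; the resulting digest matches what `git hash-object` prints for the same content:
```python
import hashlib

# Git hashes "<type> <size>\0<content>" - here, a blob holding "hello\n"
content = b"hello\n"
obj = b"blob " + str(len(content)).encode() + b"\0" + content
print(hashlib.sha1(obj).hexdigest())
# ce013625030ba8dba906f756967f9e9ca394464a - same as `git hash-object` on "hello\n"
```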
### Git References (Refs)
References are pointers to commits:
- `refs/heads/main` → Branch (e.g., main branch)
- `refs/tags/v1.0` → Tag
- `HEAD` → Current branch or commit
### Git Pack Files
To efficiently transfer objects, Git uses **pack files**:
- Compressed collection of objects
- Uses delta compression (stores differences between objects)
- Format: `PACK` header + objects + SHA-1 checksum
---
## Packet-Line Format
### What is Packet-Line (pkt-line)?
Git's wire protocol uses pkt-line format for framing data:
```
<4-byte hex length><payload>
```
**Examples:**
```
# Regular line (16 bytes = 4 (header) + 12 (payload "hello world\n"))
0010hello world\n
# Flush packet (signals end of stream)
0000
# Empty payload ("0004") - per protocol-common, implementations must not send it
0004
```
### Length Calculation
```python
# Formula: length_hex = hex(len(payload) + 4)
payload = b"hello\n"
length = len(payload) + 4 # 6 + 4 = 10 = 0x000a
pkt = b"000ahello\n"
```
### Special Packets
| Hex | Name | Purpose |
|------|-------|----------------------------|
| 0000 | Flush | End of command/data stream |
| 0001 | Delim | Delimiter (protocol v2) |
| 0002 | Response-end | Response end (protocol v2) |
### Implementation
```python
def pkt_line(data: bytes | str | None) -> bytes:
"""Encode data as a git pkt-line."""
if data is None:
return b"0000" # Flush packet
if isinstance(data, str):
data = data.encode("utf-8")
length = len(data) + 4
return f"{length:04x}".encode("ascii") + data
def parse_pkt_line(data: bytes) -> tuple[bytes | None, bytes]:
"""Parse a single pkt-line from data.
Returns:
(line_data, remaining_data)
"""
if len(data) < 4:
return None, data
try:
length = int(data[:4].decode("ascii"), 16)
except (ValueError, UnicodeDecodeError):
return None, data[4:]
if length == 0:
return None, data[4:] # Flush packet
if length < 4:
return None, data[4:] # Invalid
line_data = data[4:length]
remaining = data[length:]
return line_data, remaining
```
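The handlers in later sections also call `pkt_line_stream`, a helper not reproduced in this document. A minimal sketch consistent with how it is used (a list of payloads in, one byte stream out, with `None` encoding a flush packet):
```python
def pkt_line_stream(lines: list[bytes | str | None]) -> bytes:
    """Encode a sequence of payloads as pkt-lines; None becomes a flush packet."""
    return b"".join(pkt_line(line) for line in lines)

# Example: pkt_line_stream([b"NAK\n"]) == b"0008NAK\n"
```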
---
## Git Smart HTTP Protocol
### Protocol Flow
```
1. Client → Server: GET /info/refs?service=git-upload-pack
Server → Client: Service advertisement (refs + capabilities)
2. Client → Server: POST /git-upload-pack (wants/haves)
Server → Client: Pack file with requested objects
3. (For push) Client → Server: POST /git-receive-pack (updates + pack)
Server → Client: Status report
```
### HTTP Endpoints
| Method | Path | Purpose |
|--------|---------------------------------------------|----------------------|
| GET | `/{namespace}/{name}.git/info/refs` | Service advertisement|
| GET | `/{namespace}/{name}.git/HEAD` | Get HEAD reference |
| POST | `/{namespace}/{name}.git/git-upload-pack` | Clone/fetch/pull |
| POST | `/{namespace}/{name}.git/git-receive-pack` | Push |
### Content-Type Headers
```
Service advertisement:
application/x-{service}-advertisement
Upload-pack response:
application/x-git-upload-pack-result
Receive-pack response:
application/x-git-receive-pack-result
```
---
## Service Advertisement
### Purpose
When a Git client runs `git clone`, it first requests `/info/refs?service=git-upload-pack` to discover:
1. Available references (branches, tags)
2. Server capabilities (what features the server supports)
### Request
```http
GET /{namespace}/{name}.git/info/refs?service=git-upload-pack HTTP/1.1
Host: hub.example.com
```
### Response Format
```
# Service line
001e# service=git-upload-pack\n
0000
# First ref includes capabilities
00a1<commit-sha> <ref-name>\0<capabilities>\n
# Subsequent refs (no capabilities)
003f<commit-sha> <ref-name>\n
003f<commit-sha> <ref-name>\n
0000 # Flush
```
### Example Response
```python
# Actual bytes sent:
001e# service=git-upload-pack\n
0000
00a1deadbeef123... HEAD\0multi_ack side-band-64k thin-pack\n
003fdeadbeef123... refs/heads/main\n
0000
```
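You can check the length prefixes by hand; the service line is 26 bytes of payload plus the 4-byte header:
```python
line = b"# service=git-upload-pack\n"
print(f"{len(line) + 4:04x}")  # 001e - the prefix on the service line above
```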
### Implementation
```python
class GitServiceInfo:
def __init__(self, service: str, refs: dict[str, str], capabilities: list[str]):
self.service = service
self.refs = refs
self.capabilities = capabilities
def to_bytes(self) -> bytes:
lines = []
# Service header
lines.append(f"# service=git-{self.service}\n")
lines.append(None) # Flush
# Sort refs: HEAD first, then refs/heads/*, then refs/tags/*
sorted_refs = sorted(self.refs.items(), key=self._sort_key)
# First ref includes capabilities
first = True
for ref_name, commit_sha in sorted_refs:
if first:
caps = " ".join(self.capabilities)
lines.append(f"{commit_sha} {ref_name}\x00{caps}\n")
first = False
else:
lines.append(f"{commit_sha} {ref_name}\n")
# Empty repo: send capabilities with zero-id
if not self.refs:
caps = " ".join(self.capabilities)
lines.append(f"{'0' * 40} capabilities^{{}}\x00{caps}\n")
lines.append(None) # Flush
return pkt_line_stream(lines)
def _sort_key(self, item):
ref_name = item[0]
if ref_name == "HEAD":
return (0, ref_name)
elif ref_name.startswith("refs/heads/"):
return (1, ref_name)
elif ref_name.startswith("refs/tags/"):
return (2, ref_name)
else:
return (3, ref_name)
```
### Capabilities
Common capabilities:
| Capability | Description |
|------------------|-------------------------------------------|
| multi_ack | Client can negotiate common commits |
| side-band-64k | Multiplexed output (data/progress/errors) |
| thin-pack | Send pack with delta references |
| ofs-delta | Use offset delta encoding |
| agent | Identify server software |
| report-status | Server reports ref update status |
---
## Upload-Pack (Clone/Fetch/Pull)
### Purpose
Upload-pack handles **download operations**: clone, fetch, pull.
### Protocol Exchange
```
1. Client sends:
- List of commits it wants (want lines)
- List of commits it already has (have lines)
- "done" to finish negotiation
2. Server sends:
- NAK (no acknowledgment)
- Pack file containing requested objects
```
### Request Format
```
# Client wants this commit (the first want line carries the capability list)
0032want deadbeef123... multi_ack side-band-64k\n
# Client already has these commits (optional)
0032have cafebabe456...\n
0032have 12345678...\n
# Negotiation done
0009done\n
0000
```
### Response Format
```
# NAK response
0008NAK\n
# Pack data on side-band 1
<pkt-line>\x01<pack-file-data>
0000 # Flush
```
### Implementation
```python
class GitUploadPackHandler:
def __init__(self, repo_path: str, bridge=None):
self.repo_path = repo_path
self.bridge = bridge # GitLakeFSBridge for generating packs
self.capabilities = [
"multi_ack",
"side-band-64k",
"thin-pack",
"ofs-delta",
"agent=kohakuhub/0.0.1",
]
async def handle_upload_pack(self, request_body: bytes) -> bytes:
# Parse want/have lines
wants = []
haves = []
lines = parse_pkt_lines(request_body)
for line in lines:
if line is None:
continue
line_str = line.decode("utf-8").strip()
if line_str.startswith("want "):
want_sha = line_str.split()[1]
wants.append(want_sha)
elif line_str.startswith("have "):
have_sha = line_str.split()[1]
haves.append(have_sha)
elif line_str == "done":
break
        # Send NAK
        nak = pkt_line_stream([b"NAK\n"])
        # Generate pack file
        if self.bridge:
            pack_data = await self.bridge.build_pack_file(wants, haves)
        else:
            pack_data = self._create_empty_pack()
        # Side-band-64k: send the pack in chunks, each prefixed with \x01 (band 1 = data).
        # A pkt-line holds at most 65520 bytes: 4 header + 1 band byte + 65515 data bytes.
        response = nak
        chunk_size = 65515
        for i in range(0, len(pack_data), chunk_size):
            response += pkt_line(b"\x01" + pack_data[i : i + chunk_size])
        response += pkt_line(None)
        return response
```
---
## Receive-Pack (Push)
### Purpose
Receive-pack handles **upload operations**: push.
### Protocol Exchange
```
1. Client sends:
- Ref update commands (old-sha new-sha ref-name)
- Pack file with new objects
2. Server sends:
- Unpack status (ok/ng)
- Per-ref status (ok/ng)
```
### Request Format
```
# Ref update commands
<pkt-line>old-sha new-sha refs/heads/main\x00capabilities\n
<pkt-line>old-sha new-sha refs/heads/feature\n
0000
# Pack file follows (PACK header + objects + checksum)
PACK...
```
### Response Format
```
0000 # Flush
# Unpack status on side-band 1
\x01unpack ok\n
# Per-ref status
\x01ok refs/heads/main\n
\x01ok refs/heads/feature\n
0000 # Flush
```
### Implementation
```python
class GitReceivePackHandler:
def __init__(self, repo_path: str):
self.repo_path = repo_path
self.capabilities = [
"report-status",
"side-band-64k",
"delete-refs",
"ofs-delta",
"agent=kohakuhub/0.0.1",
]
async def handle_receive_pack(self, request_body: bytes) -> bytes:
# Parse ref updates
ref_updates = []
lines = parse_pkt_lines(request_body)
for line in lines:
if line is None:
break # Flush packet marks end of commands
line_str = line.decode("utf-8").strip()
# Format: old-sha new-sha ref-name
parts = line_str.split()
if len(parts) >= 3:
old_sha = parts[0]
new_sha = parts[1]
ref_name = parts[2]
ref_updates.append((old_sha, new_sha, ref_name))
# TODO: Process pack file and update refs
# Send success status
status_lines = [
None, # Flush
b"\x01unpack ok\n",
]
for old_sha, new_sha, ref_name in ref_updates:
status_lines.append(f"\x01ok {ref_name}\n".encode())
status_lines.append(None) # Flush
return pkt_line_stream(status_lines)
```
---
## Pack File Format
### Structure
```
+-----------------+
| PACK header | 12 bytes
+-----------------+
| Object 1 | Variable
+-----------------+
| Object 2 | Variable
+-----------------+
| ... |
+-----------------+
| SHA-1 checksum | 20 bytes
+-----------------+
```
### Header Format
```python
import struct
# Signature (4 bytes): "PACK"
# Version (4 bytes): 2 or 3 (network byte order)
# Count (4 bytes): Number of objects (network byte order)
header = b'PACK' + struct.pack('>I', 2) + struct.pack('>I', num_objects)
```
### Object Types
| Type | Code | Description |
|------|------|-----------------------|
| Commit | 1 | Commit object |
| Tree | 2 | Tree object |
| Blob | 3 | Blob (file content) |
| Tag | 4 | Tag object |
| OFS_DELTA | 6 | Offset delta |
| REF_DELTA | 7 | Reference delta |
### Creating Pack Files (Pure Python)
**KohakuHub uses a pure Python implementation - no native dependencies!**
```python
import hashlib
import struct
import zlib
def create_pack_file(objects: list[tuple[int, bytes]]) -> bytes:
"""Build pack file using pure Python.
Args:
objects: List of (type, object_data_with_header) tuples
Types: 1=commit, 2=tree, 3=blob
Returns:
Complete pack file bytes
"""
# Pack header
pack_data = b"PACK"
pack_data += struct.pack(">I", 2) # Version 2
pack_data += struct.pack(">I", len(objects)) # Object count
# Add each object
for obj_type, obj_data in objects:
# Extract content (remove "type size\0" header)
null_pos = obj_data.find(b"\0")
content = obj_data[null_pos + 1:] if null_pos > 0 else obj_data
# Encode object header (type + size in variable-length encoding)
header = encode_pack_object_header(obj_type, len(content))
# Compress with zlib
compressed = zlib.compress(content)
# Add to pack
pack_data += header + compressed
# Add pack checksum (SHA-1 of everything)
checksum = hashlib.sha1(pack_data).digest()
pack_data += checksum
return pack_data
# Complete example - no temp files!
async def build_pack(repo_id, branch):
# 1. Build blobs (LFS pointers for large files)
blobs = {} # path -> (sha1, data_with_header, mode)
for file in files:
if is_lfs(file):
pointer = create_lfs_pointer(file.sha256, file.size)
sha1, blob_data = create_blob_object(pointer)
blobs[file.path] = (sha1, blob_data, "100644")
else:
content = await download(file.path)
sha1, blob_data = create_blob_object(content)
blobs[file.path] = (sha1, blob_data, "100644")
# 2. Build trees (pure logic)
flat = [(mode, path, sha1) for path, (sha1, data, mode) in blobs.items()]
root_tree_sha1, tree_objects = build_nested_trees(flat)
# 3. Build commit
commit_sha1, commit_data = create_commit_object(...)
# 4. Build pack
pack_objects = [(1, commit_data)] # Commit
pack_objects.extend(tree_objects) # Trees
for path, (sha1, data, mode) in blobs.items():
pack_objects.append((3, data)) # Blobs
return create_pack_file(pack_objects)
```
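`create_pack_file` relies on `encode_pack_object_header`, which is not reproduced in this document. A minimal sketch of git's variable-length pack object header encoding:
```python
def encode_pack_object_header(obj_type: int, size: int) -> bytes:
    """Encode a pack object header (type + uncompressed size).

    First byte: MSB continuation bit, 3 type bits, 4 low size bits.
    Each following byte: MSB continuation bit, next 7 size bits.
    """
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # More size bits follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)
```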
**Benefits:**
- No native dependencies (easier deployment)
- Full control over memory usage
- No temporary files needed
- Easier debugging
- Better performance with LFS
### Empty Pack File
```python
def create_empty_pack() -> bytes:
"""Create empty pack file (0 objects)."""
import hashlib
import struct
header = b'PACK' + struct.pack('>I', 2) + struct.pack('>I', 0)
checksum = hashlib.sha1(header).digest()
return header + checksum
```
---
## Authentication
### Methods
KohakuHub supports two authentication methods for Git:
1. **Token-based (Bearer)**: For API clients
2. **Basic Auth**: For Git clients
### Git Basic Auth
Git clients send credentials via HTTP Basic Auth:
```http
GET /namespace/repo.git/info/refs?service=git-upload-pack HTTP/1.1
Authorization: Basic <base64(username:token)>
```
### Parsing Credentials
```python
import base64
def parse_git_credentials(authorization: str | None) -> tuple[str | None, str | None]:
"""Parse username and token from Basic Auth header."""
if not authorization or not authorization.startswith("Basic "):
return None, None
try:
encoded = authorization[6:] # Remove "Basic "
decoded = base64.b64decode(encoded).decode("utf-8")
if ":" in decoded:
username, token = decoded.split(":", 1)
return username, token
except Exception:
pass
return None, None
```
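A quick round-trip with a hypothetical username and token shows the expected decoding:
```python
import base64

auth = "Basic " + base64.b64encode(b"alice:hub_abc123").decode()
print(parse_git_credentials(auth))  # ('alice', 'hub_abc123')
```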
### Token Validation
```python
from datetime import datetime, timezone
from kohakuhub.auth.utils import hash_token
from kohakuhub.db import Token, User, db
async def get_user_from_git_auth(authorization: str | None) -> User | None:
"""Authenticate user from Git Basic Auth."""
username, token_str = parse_git_credentials(authorization)
if not username or not token_str:
return None
# Hash and lookup token
token_hash = hash_token(token_str)
# Database operations are synchronous with transactions
with db.atomic():
token = Token.get_or_none(Token.token_hash == token_hash)
if not token:
return None
# Get user
user = User.get_or_none(User.id == token.user_id)
if not user or not user.is_active:
return None
# Update last used
Token.update(last_used=datetime.now(timezone.utc)).where(
Token.id == token.id
).execute()
return user
```
### Permission Checks
```python
from kohakuhub.auth.permissions import check_repo_read_permission, check_repo_write_permission
# For clone/fetch/pull (upload-pack)
user = await get_user_from_git_auth(authorization)
check_repo_read_permission(repo, user) # Raises HTTPException if denied
# For push (receive-pack)
user = await get_user_from_git_auth(authorization)
if not user:
raise HTTPException(401, detail="Authentication required for push")
check_repo_write_permission(repo, user)
```
---
## Implementation with FastAPI
### Router Structure
```python
# src/kohakuhub/api/routers/git_http.py
from fastapi import APIRouter, Depends, HTTPException, Header, Request, Response
router = APIRouter()
@router.get("/{namespace}/{name}.git/info/refs")
async def git_info_refs(
namespace: str,
name: str,
service: str,
authorization: str | None = Header(None),
):
"""Service advertisement endpoint."""
# Implementation here...
pass
@router.post("/{namespace}/{name}.git/git-upload-pack")
async def git_upload_pack(
namespace: str,
name: str,
request: Request,
authorization: str | None = Header(None),
):
"""Upload-pack endpoint for clone/fetch/pull."""
# Implementation here...
pass
@router.post("/{namespace}/{name}.git/git-receive-pack")
async def git_receive_pack(
namespace: str,
name: str,
request: Request,
authorization: str | None = Header(None),
):
"""Receive-pack endpoint for push."""
# Implementation here...
pass
@router.get("/{namespace}/{name}.git/HEAD")
async def git_head(
namespace: str,
name: str,
authorization: str | None = Header(None),
):
"""HEAD endpoint."""
return Response(
content=b"ref: refs/heads/main\n",
media_type="text/plain",
)
```
### Dynamic Repository Type Detection
Since we don't know if a repo is a model/dataset/space from the URL alone:
```python
from kohakuhub.db import Repository, db
async def find_repository(namespace: str, name: str) -> Repository | None:
"""Find repository by trying all types."""
# Database operations are synchronous
with db.atomic():
for repo_type in ["model", "dataset", "space"]:
repo = Repository.get_or_none(
Repository.namespace == namespace,
Repository.name == name,
Repository.repo_type == repo_type,
)
if repo:
return repo
return None
```
### Registering the Router
```python
# src/kohakuhub/main.py
from kohakuhub.api.routers import git_http
app.include_router(git_http.router, tags=["git"])
```
---
## Pure Python Implementation
**KohakuHub uses pure Python for Git operations - NO pygit2, NO native dependencies!**
### Architecture
```python
# Pure Python - all in-memory, no temp files
class GitLakeFSBridge:
"""Git-LakeFS bridge using pure Python."""
async def get_refs(self, branch: str) -> dict[str, str]:
"""Get Git refs - pure in-memory."""
# 1. List files from LakeFS (metadata only)
# 2. Build blob SHA-1s (LFS pointers for large files)
# 3. Build tree SHA-1s (pure logic)
# 4. Build commit SHA-1
# 5. Return refs dict
async def build_pack_file(self, wants, haves, branch) -> bytes:
"""Build pack file - pure in-memory."""
# 1. Build blob objects (with LFS pointers)
# 2. Build tree objects using build_nested_trees()
# 3. Build commit object
# 4. Create pack file with create_pack_file()
# 5. Return pack bytes
```
### Key Components
**1. Git Object Construction** (`git_objects.py`):
```python
def create_blob_object(content: bytes) -> tuple[str, bytes]:
"""Create blob object and compute SHA-1."""
header = f"blob {len(content)}\0".encode()
obj_data = header + content
sha1 = hashlib.sha1(obj_data).hexdigest()
return sha1, obj_data
def create_tree_object(entries: list[tuple[str, str, str]]) -> tuple[str, bytes]:
"""Create tree object from entries.
Args:
entries: List of (mode, name, sha1_hex)
mode: "100644" (file), "40000" (dir)
"""
# Sort with directories treated as having "/" suffix
def sort_key(entry):
mode, name, sha1 = entry
return name + "/" if mode in ("40000", "040000") else name
sorted_entries = sorted(entries, key=sort_key)
# Build tree content
tree_content = b""
for mode, name, sha1_hex in sorted_entries:
sha1_bytes = bytes.fromhex(sha1_hex)
tree_content += f"{mode} {name}\0".encode() + sha1_bytes
header = f"tree {len(tree_content)}\0".encode()
obj_data = header + tree_content
sha1 = hashlib.sha1(obj_data).hexdigest()
return sha1, obj_data
def build_nested_trees(flat_entries: list[tuple[str, str, str]]) -> tuple[str, list]:
"""Build nested tree structure from flat file list.
Critical: Root directory MUST be built LAST!
"""
# Organize files by directory
dir_contents = {}
for mode, path, blob_sha1 in flat_entries:
# Add file to parent directory
parts = path.split("/")
if len(parts) == 1:
dir_path = ""
else:
dir_path = "/".join(parts[:-1])
dir_contents.setdefault(dir_path, []).append((mode, parts[-1], blob_sha1))
# Sort directories: deepest first, ROOT LAST
def sort_dirs(dir_path):
return (-999, "") if dir_path == "" else (dir_path.count("/"), dir_path)
sorted_dirs = sorted(dir_contents.keys(), key=sort_dirs, reverse=True)
# Build trees bottom-up
dir_sha1s = {}
tree_objects = []
for dir_path in sorted_dirs:
entries = list(dir_contents[dir_path])
# Add subdirectories
for child_dir, child_sha1 in dir_sha1s.items():
if is_direct_child(dir_path, child_dir):
entries.append(("40000", get_dirname(dir_path, child_dir), child_sha1))
tree_sha1, tree_data = create_tree_object(entries)
dir_sha1s[dir_path] = tree_sha1
tree_objects.append((2, tree_data))
return dir_sha1s[""], tree_objects
```
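`build_nested_trees` uses two small path helpers that are not shown above; a minimal sketch, assuming POSIX-style relative paths and `""` for the root directory:
```python
def is_direct_child(parent: str, child: str) -> bool:
    """True if child is an immediate subdirectory of parent ("" = root)."""
    if parent == "":
        return child != "" and "/" not in child
    return child.startswith(parent + "/") and "/" not in child[len(parent) + 1:]

def get_dirname(parent: str, child: str) -> str:
    """Name of child relative to its parent directory."""
    return child if parent == "" else child[len(parent) + 1:]
```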
**2. LFS Pointer Creation**:
```python
def create_lfs_pointer(sha256: str, size: int) -> bytes:
"""Create LFS pointer file (100 bytes instead of gigabytes!)."""
pointer = f"""version https://git-lfs.github.com/spec/v1
oid sha256:{sha256}
size {size}
"""
return pointer.encode("utf-8")
# Usage
if file_size >= 1_000_000: # 1MB threshold
pointer = create_lfs_pointer(file.sha256, file.size)
sha1, blob_data = create_blob_object(pointer)
# blob_data is only ~100 bytes, not gigabytes!
```
**3. Pack File Generation**:
```python
def create_pack_file(objects: list[tuple[int, bytes]]) -> bytes:
"""Build pack file using pure Python."""
pack_data = b"PACK"
pack_data += struct.pack(">I", 2) # Version
pack_data += struct.pack(">I", len(objects)) # Count
for obj_type, obj_data in objects:
# Extract content (remove header)
null_pos = obj_data.find(b"\0")
content = obj_data[null_pos + 1:]
# Encode object header
header = encode_pack_object_header(obj_type, len(content))
# Compress
compressed = zlib.compress(content)
pack_data += header + compressed
# Checksum
checksum = hashlib.sha1(pack_data).digest()
pack_data += checksum
return pack_data
```
### Benefits of Pure Python
| Aspect | pygit2 (Old) | Pure Python (Current) |
|--------|--------------|----------------------|
| Dependencies | pygit2 + libgit2 (C) | stdlib only |
| Installation | Can fail | Always works |
| Temp files | Creates temp git repo | None |
| Memory (10GB file) | 20GB | 100 bytes (LFS pointer) |
| Debugging | Black box | Full visibility |
| Deployment | Complex | Simple |
| Performance | Good | Better (with LFS) |
---
## Complete Code Examples
### 1. git_server.py (Protocol Utilities)
```python
"""Git protocol handler utilities."""
def pkt_line(data: bytes | str | None) -> bytes:
if data is None:
return b"0000"
if isinstance(data, str):
data = data.encode("utf-8")
length = len(data) + 4
return f"{length:04x}".encode("ascii") + data
def parse_pkt_lines(data: bytes) -> list[bytes | None]:
    lines = []
    remaining = data
    while remaining:
        line, rest = parse_pkt_line(remaining)
        if line is None and rest == remaining:
            break  # Trailing bytes too short to parse - avoid an infinite loop
        remaining = rest
        lines.append(line)
    return lines
class GitUploadPackHandler:
def __init__(self, repo_path: str, bridge=None):
self.repo_path = repo_path
self.bridge = bridge
self.capabilities = [
"multi_ack",
"side-band-64k",
"thin-pack",
"ofs-delta",
]
def get_service_info(self, refs: dict[str, str]) -> bytes:
info = GitServiceInfo("upload-pack", refs, self.capabilities)
return info.to_bytes()
async def handle_upload_pack(self, request_body: bytes) -> bytes:
# Parse wants/haves
wants, haves = self._parse_wants_haves(request_body)
# Build pack
if self.bridge:
pack_data = await self.bridge.build_pack_file(wants, haves)
else:
pack_data = self._create_empty_pack()
        # Send response: NAK, then pack data chunked on side-band 1
        nak = pkt_line_stream([b"NAK\n"])
        response = nak
        chunk_size = 65515  # Respect the 65520-byte pkt-line limit (1 band byte + data)
        for i in range(0, len(pack_data), chunk_size):
            response += pkt_line(b"\x01" + pack_data[i : i + chunk_size])
        return response + pkt_line(None)
```
### 2. git_http.py (FastAPI Router)
```python
"""Git Smart HTTP endpoints."""
from fastapi import APIRouter, Header, HTTPException, Request, Response
router = APIRouter()
@router.get("/{namespace}/{name}.git/info/refs")
async def git_info_refs(
namespace: str,
name: str,
service: str,
authorization: str | None = Header(None),
):
# Find repository
repo = await find_repository(namespace, name)
if not repo:
raise HTTPException(404, detail="Repository not found")
# Authenticate
user = await get_user_from_git_auth(authorization)
# Check permissions
if service == "git-upload-pack":
check_repo_read_permission(repo, user)
elif service == "git-receive-pack":
if not user:
raise HTTPException(401, detail="Authentication required")
check_repo_write_permission(repo, user)
# Get refs from LakeFS
bridge = GitLakeFSBridge(repo.repo_type, namespace, name)
refs = await bridge.get_refs(branch="main")
# Generate response
handler = GitUploadPackHandler(repo.full_id) if service == "git-upload-pack" else GitReceivePackHandler(repo.full_id)
response_data = handler.get_service_info(refs)
return Response(
content=response_data,
media_type=f"application/x-{service}-advertisement",
headers={"Cache-Control": "no-cache"},
)
```
---
## Testing Your Implementation
### Manual Testing
```bash
# 1. Test service advertisement
curl -i "http://localhost:28080/myorg/myrepo.git/info/refs?service=git-upload-pack"
# 2. Test clone
git clone http://localhost:28080/myorg/myrepo.git
# 3. Test with authentication
git clone http://username:token@localhost:28080/myorg/private-repo.git
```
### Automated Testing
```python
import httpx
async def test_git_info_refs():
async with httpx.AsyncClient() as client:
response = await client.get(
"http://localhost:48888/test/repo.git/info/refs",
params={"service": "git-upload-pack"},
)
assert response.status_code == 200
assert b"# service=git-upload-pack" in response.content
assert b"refs/heads/main" in response.content
```
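A similar test can exercise the upload-pack POST itself. This sketch assumes the placeholder SHA is replaced by one advertised in `info/refs`:
```python
import httpx

def pkt(payload: bytes) -> bytes:
    """Encode one pkt-line."""
    return f"{len(payload) + 4:04x}".encode() + payload

async def test_git_upload_pack():
    sha = "deadbeef" * 5  # Placeholder - use a SHA advertised by info/refs
    body = pkt(f"want {sha} side-band-64k\n".encode()) + b"0000" + pkt(b"done\n")
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:48888/test/repo.git/git-upload-pack",
            content=body,
            headers={"Content-Type": "application/x-git-upload-pack-request"},
        )
    assert response.status_code == 200
    assert b"NAK" in response.content
```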
---
## Troubleshooting
### Common Issues
**1. "Repository not found"**
- Check that repository exists in database
- Verify namespace and name spelling
- Ensure dynamic type detection is working
**2. "Authentication failed"**
- Verify token is valid and not expired
- Check token hash calculation
- Ensure Basic Auth encoding is correct
**3. "Empty pack file"**
- Check LakeFS has objects in the branch
- Verify bridge is building blobs and trees correctly
- Check File table has LFS flags set properly
**4. Clone hangs**
- Check for pack file generation errors
- Verify side-band encoding is correct
- Look for missing flush packets
---
## Large File Handling with Git LFS
### The Problem
**Naive approach downloads ALL files:**
```python
# BAD - Downloads 10GB file to memory!
for obj in objects:
content = await client.get_object(...) # 10GB download
blob = repo.create_blob(content) # 10GB in memory
# Pack file becomes 10GB → OOM crash
```
**Impact:**
- Repo with 10GB model → Downloads 10GB, uses 20GB memory
- Server crashes with Out of Memory
- Clone takes forever even for metadata-only changes
### Solution: Git LFS Pointers
**Instead of including large files, create LFS pointer files:**
```python
# GOOD - Only metadata for large files
if size >= cfg.lfs.threshold_bytes:
# Get metadata only (no content download!)
stat = await client.stat_object(...)
sha256 = stat["checksum"].replace("sha256:", "")
# Create tiny pointer file
pointer = f"""version https://git-lfs.github.com/spec/v1
oid sha256:{sha256}
size {size}
"""
blob = repo.create_blob(pointer.encode()) # Only 100 bytes!
```
**Memory usage:**
- Old: 10GB file → 20GB memory
- New: 10GB file → 100 bytes pointer
- **~200,000,000x reduction!**
### Implementation
```python
def create_lfs_pointer(sha256: str, size: int) -> bytes:
"""Create Git LFS pointer file."""
pointer = f"""version https://git-lfs.github.com/spec/v1
oid sha256:{sha256}
size {size}
"""
return pointer.encode("utf-8")
async def _build_tree_from_objects(repo, objects, branch):
# Separate small and large files
small_files = [obj for obj in objects if obj["size_bytes"] < threshold]
large_files = [obj for obj in objects if obj["size_bytes"] >= threshold]
# Process small files normally
async def process_small(obj):
content = await client.get_object(...)
return repo.create_blob(content)
# Process large files as pointers (metadata only!)
async def process_large(obj):
stat = await client.stat_object(...) # No content download
sha256 = stat["checksum"].replace("sha256:", "")
pointer = create_lfs_pointer(sha256, stat["size_bytes"])
return repo.create_blob(pointer)
# Process concurrently
small_blobs = await asyncio.gather(*[process_small(f) for f in small_files])
large_blobs = await asyncio.gather(*[process_large(f) for f in large_files])
```
### Client Usage
```bash
# 1. Clone repository (fast - only pointers!)
git clone https://hub.example.com/org/large-model.git
cd large-model
# 2. Install Git LFS
git lfs install
# 3. Pull large files via LFS protocol
git lfs pull
# Files are downloaded using existing HuggingFace LFS API
```
### Automatic .gitattributes
```python
def generate_gitattributes(lfs_paths: list[str]) -> bytes:
"""Generate .gitattributes for LFS files."""
extensions = set()
for path in lfs_paths:
if "." in path:
ext = path.rsplit(".", 1)[-1]
extensions.add(ext)
lines = ["# Git LFS tracking\n"]
for ext in sorted(extensions):
lines.append(f"*.{ext} filter=lfs diff=lfs merge=lfs -text\n")
return "".join(lines).encode("utf-8")
# Example output:
# # Git LFS tracking
# *.bin filter=lfs diff=lfs merge=lfs -text
# *.safetensors filter=lfs diff=lfs merge=lfs -text
```
## Performance Optimization
### 1. Caching
```python
# Cache refs for short periods
import time
from functools import lru_cache

@lru_cache(maxsize=128)
def get_cached_refs(repo_id: str, timestamp: int):
    # Caller rounds the timestamp to the minute, giving a ~60-second cache window
    return fetch_refs(repo_id)

# Usage: refs = get_cached_refs("org/repo", int(time.time() // 60))
```
### 2. Concurrent Processing
```python
# Process multiple files concurrently with asyncio.gather
results = await asyncio.gather(*[process_file(obj) for obj in objects])
```
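For very large repositories, unbounded `gather` can open too many connections at once. A hedged sketch of bounded concurrency; `gather_bounded` is a hypothetical helper, not part of the codebase:
```python
import asyncio

async def gather_bounded(coros, limit: int = 32):
    """Run coroutines concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

# results = await gather_bounded([process_file(obj) for obj in objects])
```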
### 3. Pagination
```python
# Process LakeFS objects in batches
async def list_all_objects(repo, ref):
objects = []
after = ""
while True:
result = await client.list_objects(
repository=repo,
ref=ref,
after=after,
amount=1000, # Batch size
)
objects.extend(result["results"])
if not result.get("pagination", {}).get("has_more"):
break
after = result["pagination"]["next_offset"]
return objects
```
### 4. Memory-Efficient Pack Generation
**Before optimization:**
- 100 files (including one 10 GB model) → 20 GB memory, ~5 minutes
- Sequential processing
**After optimization:**
- 100 files (including one 10 GB model) → 200 MB memory, ~30 seconds
- LFS pointers for large files
- Concurrent processing
- **10x faster, 100x less memory**
---
## References
### Official Documentation
- [Git Protocol Documentation](https://git-scm.com/docs/pack-protocol)
- [Git HTTP Protocol](https://git-scm.com/docs/http-protocol)
- [Pack Format](https://git-scm.com/docs/pack-format)
- [Packet-Line Format](https://git-scm.com/docs/protocol-common)
### Libraries
- [FastAPI](https://fastapi.tiangolo.com/) - Modern web framework
- [httpx](https://www.python-httpx.org/) - Async HTTP client
- Pure Python (stdlib only) - No native dependencies for Git operations
### Tutorials
- [Building a Git Server](https://git-scm.com/book/en/v2/Git-on-the-Server-The-Protocols)
- [Understanding Git Pack Files](https://git-scm.com/book/en/v2/Git-Internals-Packfiles)
---
## Conclusion
Building a Git-compatible server involves:
1. **Understanding the protocol**: pkt-line, service advertisement, upload/receive-pack
2. **Implementing core handlers**: Parsing requests, generating pack files
3. **Integrating with storage**: Translating Git operations to your backend (LakeFS)
4. **Adding authentication**: Token validation and permission checks
5. **Optimizing performance**: LFS pointers, concurrent processing, chunking
6. **Pure Python approach**: No native dependencies, full control, better debugging
**KohakuHub Implementation Highlights:**
- **Pure Python** - No pygit2, no libgit2, no native dependencies
- **In-memory** - No temporary directories or files
- **LFS integration** - Automatic LFS pointers for large files (>=1 MB)
- **Concurrent** - Parallel processing with asyncio.gather
- **Memory efficient** - Only downloads small files, pointers for large files
- **Production ready** - Handles repos of any size without OOM
This demonstrates how to build a complete Git server using only Python stdlib + FastAPI, with full Git LFS support for machine learning models and datasets.
---
**Last Updated:** January 2025
**Version:** 1.1
**Authors:** KohakuHub Team