mirror of https://github.com/KohakuBlueleaf/KohakuHub.git synced 2026-03-11 17:34:08 -05:00

Files

google-labs-jules[bot] 36b722607a Update and improve project documentation

This commit updates the project's documentation to be more consistent, accurate, and user-friendly.

Key changes include:
- Added a CODE_OF_CONDUCT.md file to foster a positive community.
- Updated CONTRIBUTING.md to link to the new Code of Conduct.
- Restructured and updated the `docs` directory, including:
  - Replacing ASCII and other diagrams with Mermaid charts for better visualization.
  - Adding a table of contents to each document for improved navigation.
  - Ensuring content is aligned with the latest implementation.
  - Adding more detailed descriptions to the documentation.

2025-10-11 15:49:08 +00:00

41 KiB

Raw Blame History

Git Support in KohakuHub

Complete guide covering Git clone operations, LFS integration, and server implementation

Last Updated: January 2025 Status: ✅ Clone/Pull Production Ready | ⚠️ Push In Development

Part 2: Developer Guide

Implementation Overview
Git Protocol Fundamentals
Packet-Line Format
Git Smart HTTP Protocol
Pack File Generation
Pure Python Implementation
LFS Pointer System
Tree Building Algorithm
Testing & Debugging
References

Part 1: User Guide

Quick Start

Clone a Repository

# Public repository
git clone http://hub.example.com/namespace/repo-name.git

# Private repository (requires token)
git clone http://username:your-token@hub.example.com/namespace/private-repo.git

# Clone and download large files
cd repo-name
git lfs install
git lfs pull

How LFS Works

KohakuHub automatically handles large files using Git LFS:

File Size	In Clone	Download Method
< 1 MB	✅ Full content	Included in pack
>= 1 MB	✅ LFS pointer (~100 bytes)	`git lfs pull`

Example:

$ git clone http://hub.example.com/org/large-model.git
Cloning... done. (Downloaded: 2 MB - only metadata!)

$ cd large-model
$ ls -lh model.safetensors
-rw-r--r-- 1 user user 132 Oct 9 14:30 model.safetensors  # Pointer file

$ cat model.safetensors
version https://git-lfs.github.com/spec/v1
oid sha256:abc123...
size 10737418240

$ git lfs pull
Downloading model.safetensors (10 GB)... done.

$ ls -lh model.safetensors
-rw-r--r-- 1 user user 10G Oct 9 14:32 model.safetensors  # Actual file

Authentication Guide

Using Access Tokens

Generate Token:

Login to KohakuHub web UI
Go to Settings → Access Tokens
Click "Create New Token"
Copy the token (you won't see it again!)

Method 1: Credential Helper (Recommended)

git clone http://hub.example.com/org/repo.git
# Git prompts for credentials:
# Username: your-username
# Password: paste-your-token-here

# Cache credentials for 1 hour
git config --global credential.helper 'cache --timeout=3600'

Method 2: URL (Not Recommended - visible in history)

git clone http://username:your-token@hub.example.com/org/repo.git

Method 3: Environment Variable

export GIT_USER=username
export GIT_TOKEN=your-token
git clone http://$GIT_USER:$GIT_TOKEN@hub.example.com/org/repo.git

LFS Integration Guide

Installation

# Install Git LFS (one-time)
git lfs install

Download Large Files

# After cloning
git lfs pull

# Download specific files only
git lfs pull --include="models/*.safetensors"

# Skip LFS during clone (faster)
GIT_LFS_SKIP_SMUDGE=1 git clone http://hub.example.com/org/repo.git
cd repo
git lfs pull  # Download later

Check LFS Status

# List LFS-tracked files
git lfs ls-files

# Check LFS configuration
cat .lfsconfig
# Should show:
# [lfs]
# 	url = http://hub.example.com/namespace/repo.git/info/lfs

Cloudflare Setup

If deploying behind Cloudflare, Git requests may be cached/modified. Fix:

Create Page Rule

Cloudflare Dashboard → Rules → Page Rules

URL Pattern:

*yourdomain.com/*/*.git/*

Settings:

✅ Cache Level: Bypass
✅ Disable Performance
✅ Disable Apps

Why: Git protocol responses must not be cached or compressed.

Alternative: Subdomain

Use a separate subdomain that bypasses Cloudflare:

git.hub.example.com → Direct to origin (DNS only)
hub.example.com → Through Cloudflare (for web UI)

git clone https://git.hub.example.com/org/repo.git

Troubleshooting Guide

Clone Hangs or Fails

Problem: fatal: protocol error: bad pack header Cause: Old version with pkt-line chunking bug Solution: Update to latest KohakuHub version

Problem: fatal: repository not found Cause: Repository doesn't exist or no access Solution: Check spelling, verify repo exists in web UI

Problem: Clone works but folders are missing Cause: Old version with tree building bug Solution: Update to latest KohakuHub version

LFS Issues

Problem: git lfs pull does nothing Cause: .lfsconfig missing or incorrect Solution: Check/create .lfsconfig:

[lfs]
	url = http://hub.example.com/namespace/repo.git/info/lfs

Problem: LFS files show as pointers after git lfs pull Cause: LFS endpoint unreachable Solution: Test LFS endpoint:

curl -v "http://hub.example.com/namespace/repo.git/info/lfs/objects/batch" \
  -X POST -H "Content-Type: application/json" \
  -d '{"operation":"download","objects":[{"oid":"abc","size":100}]}'

Cloudflare Issues

Problem: fatal: not a git repository Cause: Cloudflare caching Git responses Solution: Create Cloudflare Page Rule (see above)

Part 2: Developer Guide

Implementation Overview

What is a Git Server?

A Git server allows Git clients to clone, fetch, pull, and push repositories over the network. There are two main protocols:

Git Smart HTTP: HTTP-based protocol (what we're implementing)
Git SSH: SSH-based protocol (not covered here)

Why Build Your Own?

In KohakuHub, we need to:

Provide native Git access to LakeFS-backed repositories
Integrate with existing authentication (tokens, sessions)
Maintain compatibility with HuggingFace Hub while adding Git support
Translate Git operations to LakeFS REST API calls

Architecture Overview

The Git server is implemented as a set of FastAPI endpoints that handle Git's Smart HTTP protocol. The server uses a translation layer to convert Git operations into LakeFS REST API calls.

graph TD
    A[Git Client] --> B[FastAPI]
    B --> C[GitLakeFSBridge]
    C --> D[LakeFS REST API]
    D --> E[S3/MinIO Storage]

Git Protocol Fundamentals

Git Object Model

Git stores data as a directed acyclic graph (DAG) of objects:

Blob: File content
Tree: Directory listing (maps names to blobs/trees)
Commit: Snapshot with metadata (author, message, tree, parents)
Tag: Named reference to a commit

Each object is identified by its SHA-1 hash.

Git References (Refs)

References are pointers to commits:

refs/heads/main → Branch (e.g., main branch)
refs/tags/v1.0 → Tag
HEAD → Current branch or commit

Git Pack Files

To efficiently transfer objects, Git uses pack files:

Compressed collection of objects
Uses delta compression (stores differences between objects)
Format: PACK header + objects + SHA-1 checksum

Packet-Line Format

What is Packet-Line (pkt-line)?

Git's wire protocol uses pkt-line format for framing data:

<4-byte hex length><payload>

Examples:

# Regular line (19 bytes = 4 (header) + 15 (payload))
0013hello world\n

# Flush packet (signals end of stream)
0000

# Empty payload (still valid)
0004

Length Calculation

# Formula: length_hex = hex(len(payload) + 4)
payload = b"hello\n"
length = len(payload) + 4  # 6 + 4 = 10 = 0x000a
pkt = b"000ahello\n"

Special Packets

Hex	Name	Purpose
0000	Flush	End of command/data stream
0001	Delim	Delimiter (protocol v2)
0002	RSP	Response end (protocol v2)

Implementation

def pkt_line(data: bytes | str | None) -> bytes:
    """Encode data as a git pkt-line."""
    if data is None:
        return b"0000"  # Flush packet

    if isinstance(data, str):
        data = data.encode("utf-8")

    length = len(data) + 4
    return f"{length:04x}".encode("ascii") + data


def parse_pkt_line(data: bytes) -> tuple[bytes | None, bytes]:
    """Parse a single pkt-line from data.

    Returns:
        (line_data, remaining_data)
    """
    if len(data) < 4:
        return None, data

    try:
        length = int(data[:4].decode("ascii"), 16)
    except (ValueError, UnicodeDecodeError):
        return None, data[4:]

    if length == 0:
        return None, data[4:]  # Flush packet

    if length < 4:
        return None, data[4:]  # Invalid

    line_data = data[4:length]
    remaining = data[length:]

    return line_data, remaining

Git Smart HTTP Protocol

Protocol Flow

1. Client → Server: GET /info/refs?service=git-upload-pack
   Server → Client: Service advertisement (refs + capabilities)

2. Client → Server: POST /git-upload-pack (wants/haves)
   Server → Client: Pack file with requested objects

3. (For push) Client → Server: POST /git-receive-pack (updates + pack)
   Server → Client: Status report

HTTP Endpoints

Method	Path	Purpose
GET	`/{namespace}/{name}.git/info/refs`	Service advertisement
GET	`/{namespace}/{name}.git/HEAD`	Get HEAD reference
POST	`/{namespace}/{name}.git/git-upload-pack`	Clone/fetch/pull
POST	`/{namespace}/{name}.git/git-receive-pack`	Push

Content-Type Headers

Service advertisement:
  application/x-{service}-advertisement

Upload-pack response:
  application/x-git-upload-pack-result

Receive-pack response:
  application/x-git-receive-pack-result

Service Advertisement

Purpose

When a Git client runs git clone, it first requests /info/refs?service=git-upload-pack to discover:

Available references (branches, tags)
Server capabilities (what features the server supports)

Request

GET /{namespace}/{name}.git/info/refs?service=git-upload-pack HTTP/1.1
Host: hub.example.com

Response Format

# Service line
001e# service=git-upload-pack\n
0000

# First ref includes capabilities
00a1<commit-sha> <ref-name>\0<capabilities>\n

# Subsequent refs (no capabilities)
003f<commit-sha> <ref-name>\n
003f<commit-sha> <ref-name>\n

0000  # Flush

Example Response

# Actual bytes sent:
001e# service=git-upload-pack\n
0000
00a1deadbeef123... refs/heads/main\0multi_ack side-band-64k thin-pack\n
003fdeadbeef123... HEAD\n
0000

Implementation

class GitServiceInfo:
    def __init__(self, service: str, refs: dict[str, str], capabilities: list[str]):
        self.service = service
        self.refs = refs
        self.capabilities = capabilities

    def to_bytes(self) -> bytes:
        lines = []

        # Service header
        lines.append(f"# service=git-{self.service}\n")
        lines.append(None)  # Flush

        # Sort refs: HEAD first, then refs/heads/*, then refs/tags/*
        sorted_refs = sorted(self.refs.items(), key=self._sort_key)

        # First ref includes capabilities
        first = True
        for ref_name, commit_sha in sorted_refs:
            if first:
                caps = " ".join(self.capabilities)
                lines.append(f"{commit_sha} {ref_name}\x00{caps}\n")
                first = False
            else:
                lines.append(f"{commit_sha} {ref_name}\n")

        # Empty repo: send capabilities with zero-id
        if not self.refs:
            caps = " ".join(self.capabilities)
            lines.append(f"{'0' * 40} capabilities^{{}}\x00{caps}\n")

        lines.append(None)  # Flush

        return pkt_line_stream(lines)

    def _sort_key(self, item):
        ref_name = item[0]
        if ref_name == "HEAD":
            return (0, ref_name)
        elif ref_name.startswith("refs/heads/"):
            return (1, ref_name)
        elif ref_name.startswith("refs/tags/"):
            return (2, ref_name)
        else:
            return (3, ref_name)

Capabilities

Common capabilities:

Capability	Description
multi_ack	Client can negotiate common commits
side-band-64k	Multiplexed output (data/progress/errors)
thin-pack	Send pack with delta references
ofs-delta	Use offset delta encoding
agent	Identify server software
report-status	Server reports ref update status

Upload-Pack (Clone/Fetch/Pull)

Purpose

Upload-pack handles download operations: clone, fetch, pull.

Protocol Exchange

1. Client sends:
   - List of commits it wants (want lines)
   - List of commits it already has (have lines)
   - "done" to finish negotiation

2. Server sends:
   - NAK (no acknowledgment)
   - Pack file containing requested objects

Request Format

# Client wants this commit
0032want deadbeef123...\x00multi_ack side-band-64k\n

# Client already has these commits (optional)
0032have cafebabe456...\n
0032have 12345678...\n

# Negotiation done
0009done\n
0000

Response Format

# NAK response
0008NAK\n

# Pack data on side-band 1
<pkt-line>\x01<pack-file-data>

0000  # Flush

Implementation

class GitUploadPackHandler:
    def __init__(self, repo_path: str, bridge=None):
        self.repo_path = repo_path
        self.bridge = bridge  # GitLakeFSBridge for generating packs
        self.capabilities = [
            "multi_ack",
            "side-band-64k",
            "thin-pack",
            "ofs-delta",
            "agent=kohakuhub/0.0.1",
        ]

    async def handle_upload_pack(self, request_body: bytes) -> bytes:
        # Parse want/have lines
        wants = []
        haves = []

        lines = parse_pkt_lines(request_body)

        for line in lines:
            if line is None:
                continue

            line_str = line.decode("utf-8").strip()

            if line_str.startswith("want "):
                want_sha = line_str.split()[1]
                wants.append(want_sha)
            elif line_str.startswith("have "):
                have_sha = line_str.split()[1]
                haves.append(have_sha)
            elif line_str == "done":
                break

        # Send NAK
        nak = pkt_line_stream([b"NAK\n"])

        # Generate pack file
        if self.bridge:
            pack_data = await self.bridge.build_pack_file(wants, haves)
        else:
            pack_data = self._create_empty_pack()

        # Side-band protocol: prefix with \x01 (band 1 = data)
        pack_line = b'\x01' + pack_data

        response = nak + pkt_line(pack_line) + pkt_line(None)

        return response

Receive-Pack (Push)

Purpose

Receive-pack handles upload operations: push.

Protocol Exchange

1. Client sends:
   - Ref update commands (old-sha new-sha ref-name)
   - Pack file with new objects

2. Server sends:
   - Unpack status (ok/ng)
   - Per-ref status (ok/ng)

Request Format

# Ref update commands
<pkt-line>old-sha new-sha refs/heads/main\x00capabilities\n
<pkt-line>old-sha new-sha refs/heads/feature\n
0000

# Pack file follows (PACK header + objects + checksum)
PACK...

Response Format

0000  # Flush

# Unpack status on side-band 1
\x01unpack ok\n

# Per-ref status
\x01ok refs/heads/main\n
\x01ok refs/heads/feature\n

0000  # Flush

Implementation

class GitReceivePackHandler:
    def __init__(self, repo_path: str):
        self.repo_path = repo_path
        self.capabilities = [
            "report-status",
            "side-band-64k",
            "delete-refs",
            "ofs-delta",
            "agent=kohakuhub/0.0.1",
        ]

    async def handle_receive_pack(self, request_body: bytes) -> bytes:
        # Parse ref updates
        ref_updates = []

        lines = parse_pkt_lines(request_body)

        for line in lines:
            if line is None:
                break  # Flush packet marks end of commands

            line_str = line.decode("utf-8").strip()

            # Format: old-sha new-sha ref-name
            parts = line_str.split()
            if len(parts) >= 3:
                old_sha = parts[0]
                new_sha = parts[1]
                ref_name = parts[2]
                ref_updates.append((old_sha, new_sha, ref_name))

        # TODO: Process pack file and update refs

        # Send success status
        status_lines = [
            None,  # Flush
            b"\x01unpack ok\n",
        ]

        for old_sha, new_sha, ref_name in ref_updates:
            status_lines.append(f"\x01ok {ref_name}\n".encode())

        status_lines.append(None)  # Flush

        return pkt_line_stream(status_lines)

Pack File Format

Structure

+-----------------+
| PACK header     | 12 bytes
+-----------------+
| Object 1        | Variable
+-----------------+
| Object 2        | Variable
+-----------------+
| ...             |
+-----------------+
| SHA-1 checksum  | 20 bytes
+-----------------+

Header Format

import struct

# Signature (4 bytes): "PACK"
# Version (4 bytes): 2 or 3 (network byte order)
# Count (4 bytes): Number of objects (network byte order)

header = b'PACK' + struct.pack('>I', 2) + struct.pack('>I', num_objects)

Object Types

Type	Code	Description
Commit	1	Commit object
Tree	2	Tree object
Blob	3	Blob (file content)
Tag	4	Tag object
OFS_DELTA	6	Offset delta
REF_DELTA	7	Reference delta

Creating Pack Files (Pure Python)

KohakuHub uses a pure Python implementation - no native dependencies!

import hashlib
import struct
import zlib

def create_pack_file(objects: list[tuple[int, bytes]]) -> bytes:
    """Build pack file using pure Python.

    Args:
        objects: List of (type, object_data_with_header) tuples
                 Types: 1=commit, 2=tree, 3=blob

    Returns:
        Complete pack file bytes
    """
    # Pack header
    pack_data = b"PACK"
    pack_data += struct.pack(">I", 2)              # Version 2
    pack_data += struct.pack(">I", len(objects))   # Object count

    # Add each object
    for obj_type, obj_data in objects:
        # Extract content (remove "type size\0" header)
        null_pos = obj_data.find(b"\0")
        content = obj_data[null_pos + 1:] if null_pos > 0 else obj_data

        # Encode object header (type + size in variable-length encoding)
        header = encode_pack_object_header(obj_type, len(content))

        # Compress with zlib
        compressed = zlib.compress(content)

        # Add to pack
        pack_data += header + compressed

    # Add pack checksum (SHA-1 of everything)
    checksum = hashlib.sha1(pack_data).digest()
    pack_data += checksum

    return pack_data


# Complete example - no temp files!
async def build_pack(repo_id, branch):
    # 1. Build blobs (LFS pointers for large files)
    blobs = {}  # path -> (sha1, data_with_header, mode)

    for file in files:
        if is_lfs(file):
            pointer = create_lfs_pointer(file.sha256, file.size)
            sha1, blob_data = create_blob_object(pointer)
            blobs[file.path] = (sha1, blob_data, "100644")
        else:
            content = await download(file.path)
            sha1, blob_data = create_blob_object(content)
            blobs[file.path] = (sha1, blob_data, "100644")

    # 2. Build trees (pure logic)
    flat = [(mode, path, sha1) for path, (sha1, data, mode) in blobs.items()]
    root_tree_sha1, tree_objects = build_nested_trees(flat)

    # 3. Build commit
    commit_sha1, commit_data = create_commit_object(...)

    # 4. Build pack
    pack_objects = [(1, commit_data)]  # Commit
    pack_objects.extend(tree_objects)  # Trees
    for path, (sha1, data, mode) in blobs.items():
        pack_objects.append((3, data))  # Blobs

    return create_pack_file(pack_objects)

Benefits:

No native dependencies (easier deployment)
Full control over memory usage
No temporary files needed
Easier debugging
Better performance with LFS

Empty Pack File

def create_empty_pack() -> bytes:
    """Create empty pack file (0 objects)."""
    import hashlib
    import struct

    header = b'PACK' + struct.pack('>I', 2) + struct.pack('>I', 0)
    checksum = hashlib.sha1(header).digest()

    return header + checksum

Authentication

Methods

KohakuHub supports two authentication methods for Git:

Token-based (Bearer): For API clients
Basic Auth: For Git clients

Git Basic Auth

Git clients send credentials via HTTP Basic Auth:

GET /namespace/repo.git/info/refs?service=git-upload-pack HTTP/1.1
Authorization: Basic <base64(username:token)>

Parsing Credentials

import base64

def parse_git_credentials(authorization: str | None) -> tuple[str | None, str | None]:
    """Parse username and token from Basic Auth header."""
    if not authorization or not authorization.startswith("Basic "):
        return None, None

    try:
        encoded = authorization[6:]  # Remove "Basic "
        decoded = base64.b64decode(encoded).decode("utf-8")

        if ":" in decoded:
            username, token = decoded.split(":", 1)
            return username, token
    except Exception:
        pass

    return None, None

Token Validation

from datetime import datetime, timezone

from kohakuhub.auth.utils import hash_token
from kohakuhub.db import Token, User, db

async def get_user_from_git_auth(authorization: str | None) -> User | None:
    """Authenticate user from Git Basic Auth."""
    username, token_str = parse_git_credentials(authorization)
    if not username or not token_str:
        return None

    # Hash and lookup token
    token_hash = hash_token(token_str)

    # Database operations are synchronous with transactions
    with db.atomic():
        token = Token.get_or_none(Token.token_hash == token_hash)
        if not token:
            return None

        # Get user
        user = User.get_or_none(User.id == token.user_id)
        if not user or not user.is_active:
            return None

        # Update last used
        Token.update(last_used=datetime.now(timezone.utc)).where(
            Token.id == token.id
        ).execute()

    return user

Permission Checks

from kohakuhub.auth.permissions import check_repo_read_permission, check_repo_write_permission

# For clone/fetch/pull (upload-pack)
user = await get_user_from_git_auth(authorization)
check_repo_read_permission(repo, user)  # Raises HTTPException if denied

# For push (receive-pack)
user = await get_user_from_git_auth(authorization)
if not user:
    raise HTTPException(401, detail="Authentication required for push")
check_repo_write_permission(repo, user)

Implementation with FastAPI

Router Structure

# src/kohakuhub/api/routers/git_http.py

from fastapi import APIRouter, Depends, HTTPException, Header, Request, Response

router = APIRouter()

@router.get("/{namespace}/{name}.git/info/refs")
async def git_info_refs(
    namespace: str,
    name: str,
    service: str,
    authorization: str | None = Header(None),
):
    """Service advertisement endpoint."""
    # Implementation here...
    pass

@router.post("/{namespace}/{name}.git/git-upload-pack")
async def git_upload_pack(
    namespace: str,
    name: str,
    request: Request,
    authorization: str | None = Header(None),
):
    """Upload-pack endpoint for clone/fetch/pull."""
    # Implementation here...
    pass

@router.post("/{namespace}/{name}.git/git-receive-pack")
async def git_receive_pack(
    namespace: str,
    name: str,
    request: Request,
    authorization: str | None = Header(None),
):
    """Receive-pack endpoint for push."""
    # Implementation here...
    pass

@router.get("/{namespace}/{name}.git/HEAD")
async def git_head(
    namespace: str,
    name: str,
    authorization: str | None = Header(None),
):
    """HEAD endpoint."""
    return Response(
        content=b"ref: refs/heads/main\n",
        media_type="text/plain",
    )

Dynamic Repository Type Detection

Since we don't know if a repo is a model/dataset/space from the URL alone:

from kohakuhub.db import Repository, db

async def find_repository(namespace: str, name: str) -> Repository | None:
    """Find repository by trying all types."""
    # Database operations are synchronous
    with db.atomic():
        for repo_type in ["model", "dataset", "space"]:
            repo = Repository.get_or_none(
                Repository.namespace == namespace,
                Repository.name == name,
                Repository.repo_type == repo_type,
            )
            if repo:
                return repo
        return None

Registering the Router

# src/kohakuhub/main.py

from kohakuhub.api.routers import git_http

app.include_router(git_http.router, tags=["git"])

Pure Python Implementation

KohakuHub uses pure Python for Git operations - NO pygit2, NO native dependencies!

Architecture

# Pure Python - all in-memory, no temp files
class GitLakeFSBridge:
    """Git-LakeFS bridge using pure Python."""

    async def get_refs(self, branch: str) -> dict[str, str]:
        """Get Git refs - pure in-memory."""
        # 1. List files from LakeFS (metadata only)
        # 2. Build blob SHA-1s (LFS pointers for large files)
        # 3. Build tree SHA-1s (pure logic)
        # 4. Build commit SHA-1
        # 5. Return refs dict

    async def build_pack_file(self, wants, haves, branch) -> bytes:
        """Build pack file - pure in-memory."""
        # 1. Build blob objects (with LFS pointers)
        # 2. Build tree objects using build_nested_trees()
        # 3. Build commit object
        # 4. Create pack file with create_pack_file()
        # 5. Return pack bytes

Key Components

1. Git Object Construction (git_objects.py):

def create_blob_object(content: bytes) -> tuple[str, bytes]:
    """Create blob object and compute SHA-1."""
    header = f"blob {len(content)}\0".encode()
    obj_data = header + content
    sha1 = hashlib.sha1(obj_data).hexdigest()
    return sha1, obj_data

def create_tree_object(entries: list[tuple[str, str, str]]) -> tuple[str, bytes]:
    """Create tree object from entries.

    Args:
        entries: List of (mode, name, sha1_hex)
                 mode: "100644" (file), "40000" (dir)
    """
    # Sort with directories treated as having "/" suffix
    def sort_key(entry):
        mode, name, sha1 = entry
        return name + "/" if mode in ("40000", "040000") else name

    sorted_entries = sorted(entries, key=sort_key)

    # Build tree content
    tree_content = b""
    for mode, name, sha1_hex in sorted_entries:
        sha1_bytes = bytes.fromhex(sha1_hex)
        tree_content += f"{mode} {name}\0".encode() + sha1_bytes

    header = f"tree {len(tree_content)}\0".encode()
    obj_data = header + tree_content
    sha1 = hashlib.sha1(obj_data).hexdigest()

    return sha1, obj_data

def build_nested_trees(flat_entries: list[tuple[str, str, str]]) -> tuple[str, list]:
    """Build nested tree structure from flat file list.

    Critical: Root directory MUST be built LAST!
    """
    # Organize files by directory
    dir_contents = {}
    for mode, path, blob_sha1 in flat_entries:
        # Add file to parent directory
        parts = path.split("/")
        if len(parts) == 1:
            dir_path = ""
        else:
            dir_path = "/".join(parts[:-1])

        dir_contents.setdefault(dir_path, []).append((mode, parts[-1], blob_sha1))

    # Sort directories: deepest first, ROOT LAST
    def sort_dirs(dir_path):
        return (-999, "") if dir_path == "" else (dir_path.count("/"), dir_path)

    sorted_dirs = sorted(dir_contents.keys(), key=sort_dirs, reverse=True)

    # Build trees bottom-up
    dir_sha1s = {}
    tree_objects = []

    for dir_path in sorted_dirs:
        entries = list(dir_contents[dir_path])

        # Add subdirectories
        for child_dir, child_sha1 in dir_sha1s.items():
            if is_direct_child(dir_path, child_dir):
                entries.append(("40000", get_dirname(dir_path, child_dir), child_sha1))

        tree_sha1, tree_data = create_tree_object(entries)
        dir_sha1s[dir_path] = tree_sha1
        tree_objects.append((2, tree_data))

    return dir_sha1s[""], tree_objects

2. LFS Pointer Creation:

def create_lfs_pointer(sha256: str, size: int) -> bytes:
    """Create LFS pointer file (100 bytes instead of gigabytes!)."""
    pointer = f"""version https://git-lfs.github.com/spec/v1
oid sha256:{sha256}
size {size}
"""
    return pointer.encode("utf-8")

# Usage
if file_size >= 1_000_000:  # 1MB threshold
    pointer = create_lfs_pointer(file.sha256, file.size)
    sha1, blob_data = create_blob_object(pointer)
    # blob_data is only ~100 bytes, not gigabytes!

3. Pack File Generation:

def create_pack_file(objects: list[tuple[int, bytes]]) -> bytes:
    """Build pack file using pure Python."""
    pack_data = b"PACK"
    pack_data += struct.pack(">I", 2)              # Version
    pack_data += struct.pack(">I", len(objects))   # Count

    for obj_type, obj_data in objects:
        # Extract content (remove header)
        null_pos = obj_data.find(b"\0")
        content = obj_data[null_pos + 1:]

        # Encode object header
        header = encode_pack_object_header(obj_type, len(content))

        # Compress
        compressed = zlib.compress(content)

        pack_data += header + compressed

    # Checksum
    checksum = hashlib.sha1(pack_data).digest()
    pack_data += checksum

    return pack_data

Benefits of Pure Python

Aspect	pygit2 (Old)	Pure Python (Current)
Dependencies	pygit2 + libgit2 (C)	stdlib only
Installation	Can fail	Always works
Temp files	Creates temp git repo	None
Memory (10GB file)	20GB	100 bytes (LFS pointer)
Debugging	Black box	Full visibility
Deployment	Complex	Simple
Performance	Good	Better (with LFS)

Complete Code Examples

1. git_server.py (Protocol Utilities)

"""Git protocol handler utilities."""

def pkt_line(data: bytes | str | None) -> bytes:
    if data is None:
        return b"0000"
    if isinstance(data, str):
        data = data.encode("utf-8")
    length = len(data) + 4
    return f"{length:04x}".encode("ascii") + data

def parse_pkt_lines(data: bytes) -> list[bytes | None]:
    lines = []
    remaining = data
    while remaining:
        line, remaining = parse_pkt_line(remaining)
        if line is None and not remaining:
            break
        lines.append(line)
    return lines

class GitUploadPackHandler:
    def __init__(self, repo_path: str, bridge=None):
        self.repo_path = repo_path
        self.bridge = bridge
        self.capabilities = [
            "multi_ack",
            "side-band-64k",
            "thin-pack",
            "ofs-delta",
        ]

    def get_service_info(self, refs: dict[str, str]) -> bytes:
        info = GitServiceInfo("upload-pack", refs, self.capabilities)
        return info.to_bytes()

    async def handle_upload_pack(self, request_body: bytes) -> bytes:
        # Parse wants/haves
        wants, haves = self._parse_wants_haves(request_body)

        # Build pack
        if self.bridge:
            pack_data = await self.bridge.build_pack_file(wants, haves)
        else:
            pack_data = self._create_empty_pack()

        # Send response
        nak = pkt_line_stream([b"NAK\n"])
        pack_line = b'\x01' + pack_data
        return nak + pkt_line(pack_line) + pkt_line(None)

2. git_http.py (FastAPI Router)

"""Git Smart HTTP endpoints."""

from fastapi import APIRouter, Header, Request, Response

router = APIRouter()

@router.get("/{namespace}/{name}.git/info/refs")
async def git_info_refs(
    namespace: str,
    name: str,
    service: str,
    authorization: str | None = Header(None),
):
    # Find repository
    repo = await find_repository(namespace, name)
    if not repo:
        raise HTTPException(404, detail="Repository not found")

    # Authenticate
    user = await get_user_from_git_auth(authorization)

    # Check permissions
    if service == "git-upload-pack":
        check_repo_read_permission(repo, user)
    elif service == "git-receive-pack":
        if not user:
            raise HTTPException(401, detail="Authentication required")
        check_repo_write_permission(repo, user)

    # Get refs from LakeFS
    bridge = GitLakeFSBridge(repo.repo_type, namespace, name)
    refs = await bridge.get_refs(branch="main")

    # Generate response
    handler = GitUploadPackHandler(repo.full_id) if service == "git-upload-pack" else GitReceivePackHandler(repo.full_id)
    response_data = handler.get_service_info(refs)

    return Response(
        content=response_data,
        media_type=f"application/x-{service}-advertisement",
        headers={"Cache-Control": "no-cache"},
    )

Testing Your Implementation

Manual Testing

# 1. Test service advertisement
curl -i "http://localhost:28080/myorg/myrepo.git/info/refs?service=git-upload-pack"

# 2. Test clone
git clone http://localhost:28080/myorg/myrepo.git

# 3. Test with authentication
git clone http://username:token@localhost:28080/myorg/private-repo.git

Automated Testing

import httpx

async def test_git_info_refs():
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "http://localhost:48888/test/repo.git/info/refs",
            params={"service": "git-upload-pack"},
        )

        assert response.status_code == 200
        assert b"# service=git-upload-pack" in response.content
        assert b"refs/heads/main" in response.content

Troubleshooting

Common Issues

1. "Repository not found"

Check that repository exists in database
Verify namespace and name spelling
Ensure dynamic type detection is working

2. "Authentication failed"

Verify token is valid and not expired
Check token hash calculation
Ensure Basic Auth encoding is correct

3. "Empty pack file"

Check LakeFS has objects in the branch
Verify bridge is building blobs and trees correctly
Check File table has LFS flags set properly

4. Clone hangs

Check for pack file generation errors
Verify side-band encoding is correct
Look for missing flush packets

Large File Handling with Git LFS

The Problem

Naive approach downloads ALL files:

# BAD - Downloads 10GB file to memory!
for obj in objects:
    content = await client.get_object(...)  # 10GB download
    blob = repo.create_blob(content)        # 10GB in memory
    # Pack file becomes 10GB → OOM crash

Impact:

Repo with 10GB model → Downloads 10GB, uses 20GB memory
Server crashes with Out of Memory
Clone takes forever even for metadata-only changes

Solution: Git LFS Pointers

Instead of including large files, create LFS pointer files:

# GOOD - Only metadata for large files
if size >= cfg.lfs.threshold_bytes:
    # Get metadata only (no content download!)
    stat = await client.stat_object(...)
    sha256 = stat["checksum"].replace("sha256:", "")

    # Create tiny pointer file
    pointer = f"""version https://git-lfs.github.com/spec/v1
oid sha256:{sha256}
size {size}
"""
    blob = repo.create_blob(pointer.encode())  # Only 100 bytes!

Memory usage:

Old: 10GB file → 20GB memory
New: 10GB file → 100 bytes pointer
200,000x reduction!

Implementation

def create_lfs_pointer(sha256: str, size: int) -> bytes:
    """Create Git LFS pointer file."""
    pointer = f"""version https://git-lfs.github.com/spec/v1
oid sha256:{sha256}
size {size}
"""
    return pointer.encode("utf-8")


async def _build_tree_from_objects(repo, objects, branch):
    # Separate small and large files
    small_files = [obj for obj in objects if obj["size_bytes"] < threshold]
    large_files = [obj for obj in objects if obj["size_bytes"] >= threshold]

    # Process small files normally
    async def process_small(obj):
        content = await client.get_object(...)
        return repo.create_blob(content)

    # Process large files as pointers (metadata only!)
    async def process_large(obj):
        stat = await client.stat_object(...)  # No content download
        sha256 = stat["checksum"].replace("sha256:", "")
        pointer = create_lfs_pointer(sha256, stat["size_bytes"])
        return repo.create_blob(pointer)

    # Process concurrently
    small_blobs = await asyncio.gather(*[process_small(f) for f in small_files])
    large_blobs = await asyncio.gather(*[process_large(f) for f in large_files])

Client Usage

# 1. Clone repository (fast - only pointers!)
git clone https://hub.example.com/org/large-model.git
cd large-model

# 2. Install Git LFS
git lfs install

# 3. Pull large files via LFS protocol
git lfs pull

# Files are downloaded using existing HuggingFace LFS API

Automatic .gitattributes

def generate_gitattributes(lfs_paths: list[str]) -> bytes:
    """Generate .gitattributes for LFS files."""
    extensions = set()
    for path in lfs_paths:
        if "." in path:
            ext = path.rsplit(".", 1)[-1]
            extensions.add(ext)

    lines = ["# Git LFS tracking\n"]
    for ext in sorted(extensions):
        lines.append(f"*.{ext} filter=lfs diff=lfs merge=lfs -text\n")

    return "".join(lines).encode("utf-8")

# Example output:
# # Git LFS tracking
# *.bin filter=lfs diff=lfs merge=lfs -text
# *.safetensors filter=lfs diff=lfs merge=lfs -text

Performance Optimization

1. Caching

# Cache refs for short periods
from functools import lru_cache
from datetime import datetime, timedelta

@lru_cache(maxsize=128)
def get_cached_refs(repo_id: str, timestamp: int):
    # timestamp rounded to minute for 60s cache
    return fetch_refs(repo_id)

2. Concurrent Processing

# Process multiple files concurrently with asyncio.gather
results = await asyncio.gather(*[process_file(obj) for obj in objects])

3. Pagination

# Process LakeFS objects in batches
async def list_all_objects(repo, ref):
    objects = []
    after = ""
    while True:
        result = await client.list_objects(
            repository=repo,
            ref=ref,
            after=after,
            amount=1000,  # Batch size
        )
        objects.extend(result["results"])
        if not result.get("pagination", {}).get("has_more"):
            break
        after = result["pagination"]["next_offset"]
    return objects

4. Memory-Efficient Pack Generation

Before optimization:

100 files (1 x 10GB) → 20GB memory, 5 minutes
Sequential processing

After optimization:

100 files (1 x 10GB) → 200MB memory, 30 seconds
LFS pointers for large files
Concurrent processing
100x faster, 100x less memory

References

Official Documentation

Libraries

FastAPI - Modern web framework
httpx - Async HTTP client
Pure Python (stdlib only) - No native dependencies for Git operations

Tutorials

Conclusion

Building a Git-compatible server involves:

Understanding the protocol: pkt-line, service advertisement, upload/receive-pack
Implementing core handlers: Parsing requests, generating pack files
Integrating with storage: Translating Git operations to your backend (LakeFS)
Adding authentication: Token validation and permission checks
Optimizing performance: LFS pointers, concurrent processing, chunking
Pure Python approach: No native dependencies, full control, better debugging

KohakuHub Implementation Highlights:

✅ Pure Python - No pygit2, no libgit2, no native dependencies
✅ In-memory - No temporary directories or files
✅ LFS integration - Automatic LFS pointers for large files (>1MB)
✅ Concurrent - Parallel processing with asyncio.gather
✅ Memory efficient - Only downloads small files, pointers for large files
✅ Production ready - Handles repos of any size without OOM

This demonstrates how to build a complete Git server using only Python stdlib + FastAPI, with full Git LFS support for machine learning models and datasets.

Last Updated: January 2025 Version: 1.1 Authors: KohakuHub Team

41 KiB Raw Blame History

Git Support in KohakuHub

Table of Contents

Part 1: User Guide

Part 2: Developer Guide

Part 1: User Guide

Quick Start

Clone a Repository

How LFS Works

Authentication Guide

Using Access Tokens

LFS Integration Guide

Installation

Download Large Files

Check LFS Status

Cloudflare Setup

Create Page Rule

Alternative: Subdomain

Troubleshooting Guide

Clone Hangs or Fails

LFS Issues

Cloudflare Issues

Part 2: Developer Guide

Implementation Overview

What is a Git Server?

Why Build Your Own?

Architecture Overview

Git Protocol Fundamentals

Git Object Model

Git References (Refs)

Git Pack Files

Packet-Line Format

What is Packet-Line (pkt-line)?

Length Calculation

Special Packets

Implementation

Git Smart HTTP Protocol

Protocol Flow

HTTP Endpoints

Content-Type Headers

Service Advertisement

Purpose

Request

Response Format

Example Response

Implementation

Capabilities

Upload-Pack (Clone/Fetch/Pull)

Purpose

Protocol Exchange

Request Format

Response Format

Implementation

Receive-Pack (Push)

Purpose

Protocol Exchange

Request Format

Response Format

Implementation

Pack File Format

Structure

Header Format

Object Types

Creating Pack Files (Pure Python)

Empty Pack File

Authentication

Methods

Git Basic Auth

Parsing Credentials

Token Validation

Permission Checks

Implementation with FastAPI

Router Structure

Dynamic Repository Type Detection

Registering the Router

Pure Python Implementation

Architecture

Key Components

Benefits of Pure Python

Complete Code Examples

41 KiB

Raw Blame History