
KohakuHub Utility Scripts

This directory contains utility scripts for KohakuHub administration and maintenance.

Deployment Setup

Docker Compose Generator

Interactive tool to generate a customized docker-compose.yml based on your deployment preferences.

Features:

  • Choose between built-in or external PostgreSQL
  • Configure LakeFS to use PostgreSQL or SQLite
  • Choose between built-in MinIO or external S3 storage
  • Auto-generate secure secret keys
  • Comprehensive configuration validation

Usage:

# Interactive mode (asks questions)
python scripts/generate_docker_compose.py

# Generate a configuration template
python scripts/generate_docker_compose.py --generate-config

# Use configuration file (non-interactive)
python scripts/generate_docker_compose.py --config kohakuhub.conf

Configuration Options:

  1. PostgreSQL:

    • Built-in container (managed by docker-compose)
    • External PostgreSQL (specify host, port, credentials)
    • Default database name for hub-api: kohakuhub
  2. LakeFS Database:

    • Use PostgreSQL (recommended for production)
    • Use local SQLite (simpler for development)
    • Default database name: lakefs (separate from hub-api database)
    • Automatic database creation: both databases are created when the LakeFS container first starts
    • Works with both built-in and external PostgreSQL
  3. S3 Storage:

    • Built-in MinIO container (self-hosted)
    • External S3-compatible storage (AWS S3, Cloudflare R2, etc.)
  4. Security:

    • Auto-generated session secret key
    • Auto-generated admin secret token
    • Option to use custom secrets
  5. Network:

    • External Docker bridge network support
    • Allows cross-compose communication with external PostgreSQL/S3 services
    • Automatically added to hub-api and lakefs when using external services

Example: Interactive Mode

$ python scripts/generate_docker_compose.py

=============================================================
KohakuHub Docker Compose Generator
=============================================================

--- PostgreSQL Configuration ---
Use built-in PostgreSQL container? [Y/n]: y
PostgreSQL username [hub]:
PostgreSQL password [hubpass]:
PostgreSQL database name for hub-api [kohakuhub]:

--- LakeFS Database Configuration ---
Use PostgreSQL for LakeFS? (No = use local SQLite) [Y/n]: y
PostgreSQL database name for LakeFS [lakefs]:

--- S3 Storage Configuration ---
Use built-in MinIO container? [Y/n]: n
S3 endpoint URL: https://my-account.r2.cloudflarestorage.com
S3 access key: xxxxxxxxxxxx
S3 secret key: yyyyyyyyyyyy
S3 region [us-east-1]: auto

--- Security Configuration ---
Generated session secret: AbCdEf123456...
Use generated session secret? [Y/n]: y

Use same secret for admin token? [y/N]: n
Generated admin secret: XyZ789...
Use generated admin secret? [Y/n]: y

=============================================================
Generating docker-compose.yml...
=============================================================

✓ Successfully generated: docker-compose.yml
✓ Database initialization scripts will run automatically when LakeFS starts
  - scripts/init-databases.sh
  - scripts/lakefs-entrypoint.sh

Configuration Summary:
------------------------------------------------------------
PostgreSQL: Built-in
  Hub-API Database: kohakuhub
  LakeFS Database: lakefs
LakeFS Database Backend: PostgreSQL
S3 Storage: Custom S3
  Endpoint: https://my-account.r2.cloudflarestorage.com
Session Secret: AbCdEf123456...
Admin Secret: XyZ789...
------------------------------------------------------------

Next steps:
1. Review the generated docker-compose.yml
2. Build frontend: npm run build --prefix ./src/kohaku-hub-ui
3. Start services: docker-compose up -d

   Note: Databases will be created automatically on first startup:
   - kohakuhub (hub-api)
   - lakefs (LakeFS)

4. Access at: http://localhost:28080

Example: Using Configuration File

# Step 1: Generate template
$ python scripts/generate_docker_compose.py --generate-config
[OK] Generated configuration template: kohakuhub.conf

Edit this file with your settings, then run:
  python scripts/generate_docker_compose.py --config kohakuhub.conf

# Step 2: Edit kohakuhub.conf with your settings
$ nano kohakuhub.conf  # or use any text editor

# Step 3: Generate docker-compose.yml from config
$ python scripts/generate_docker_compose.py --config kohakuhub.conf

============================================================
KohakuHub Docker Compose Generator
============================================================

Loading configuration from: kohakuhub.conf

Loaded configuration:
  PostgreSQL: External
    Host: db.example.com:5432
    Database: kohakuhub
  LakeFS: PostgreSQL
    Database: lakefs
  S3: External S3
    Endpoint: https://s3.example.com

============================================================
Generating docker-compose.yml...
============================================================

[OK] Successfully generated: docker-compose.yml
[OK] Database initialization scripts will run automatically when LakeFS starts

Configuration Summary:
------------------------------------------------------------
PostgreSQL: Custom
  Host: db.example.com:5432
  Hub-API Database: kohakuhub
  LakeFS Database: lakefs
LakeFS Database Backend: PostgreSQL
S3 Storage: Custom S3
  Endpoint: https://s3.example.com
Session Secret: AbCdEf123456...
Admin Secret: XyZ789...
------------------------------------------------------------

Requirements:

  • Python 3.10+
  • No additional dependencies required

Configuration File Format:

The configuration file (kohakuhub.conf) uses INI format:

[postgresql]
builtin = true
user = hub
password = hubpass
database = kohakuhub

[lakefs]
use_postgres = true
database = lakefs

[s3]
builtin = true
access_key = minioadmin
secret_key = minioadmin

[security]
session_secret = your-secret-here
admin_secret = your-admin-secret

[network]
# Optional: for cross-compose communication
external_network = shared-network

For external PostgreSQL or S3:

[postgresql]
builtin = false
host = your-postgres-host.com
port = 5432
user = hub
password = your-password
database = kohakuhub

[s3]
builtin = false
endpoint = https://your-s3-endpoint.com
access_key = your-access-key
secret_key = your-secret-key
region = us-east-1

[network]
# Required if external services are in different Docker Compose
external_network = shared-network
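Since the file uses plain INI, it can be read with Python's standard library. The loader below is an illustrative sketch (not the generator's actual code); section and key names mirror the format shown above.

```python
# Sketch: reading kohakuhub.conf with the stdlib configparser.
# Keys and sections follow the INI format documented above.
from configparser import ConfigParser

def load_conf(path: str) -> dict:
    """Parse an INI config and return plain Python values."""
    cp = ConfigParser()
    cp.read(path)
    return {
        "pg_builtin": cp.getboolean("postgresql", "builtin", fallback=True),
        "pg_database": cp.get("postgresql", "database", fallback="kohakuhub"),
        "lakefs_use_postgres": cp.getboolean("lakefs", "use_postgres", fallback=True),
        "s3_builtin": cp.getboolean("s3", "builtin", fallback=True),
        # Optional section: only present for cross-compose setups.
        "external_network": cp.get("network", "external_network", fallback=None),
    }
```

The fallback values mean a minimal config file still yields a complete, built-in-everything deployment.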

Using External Docker Network:

If PostgreSQL or S3 are in separate Docker Compose setups, you need a shared network:

# Create the shared network first
docker network create shared-network

# Your PostgreSQL docker-compose.yml
services:
  postgres:
    # ... your config
    networks:
      - shared-network

networks:
  shared-network:
    external: true

# Generate KohakuHub with external network
python scripts/generate_docker_compose.py --config kohakuhub.conf

The generator will automatically:

  • Add the external network to hub-api and lakefs services
  • Configure them to use both default (hub-net) and the external network
  • Allow container name resolution across compose files
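Based on the description above, the relevant part of the generated compose file might look roughly like this (service and network names follow the text; the generator's exact output may differ):

```yaml
services:
  hub-api:
    # ... image, environment, etc.
    networks:
      - hub-net         # default internal network
      - shared-network  # external network for cross-compose name resolution

networks:
  hub-net:
  shared-network:
    external: true      # must already exist: docker network create shared-network
```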

Important Notes:

  • Shell scripts automatically use LF line endings (configured in .gitattributes)
  • Database initialization runs automatically on LakeFS startup
  • Works on both Windows and Linux development environments
  • Configuration file (kohakuhub.conf) is ignored by git (contains sensitive data)
  • Use --generate-config to create a template configuration file

Storage Management

Clear S3 Storage

For demo deployments with constrained S3 storage (such as the Cloudflare R2 free tier), use these scripts to manage storage.

These scripts accept S3 credentials via command-line arguments or environment variables, making them standalone tools that work independently of KohakuHub's configuration.

Features:

  • Rich progress display with progress bars
  • Detailed statistics (object count, total size)
  • Dry run mode
  • Batch deletion (handles large buckets efficiently)
  • Prefix filtering
  • Error reporting

Usage:

# Clear all content (interactive confirmation required)
python scripts/clear_s3_storage.py \
    --endpoint https://s3.amazonaws.com \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY \
    --bucket my-bucket

# Clear only LFS files (large files)
python scripts/clear_s3_storage.py \
    --endpoint https://s3.amazonaws.com \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY \
    --bucket my-bucket \
    --prefix lfs/

# Clear specific repository type
python scripts/clear_s3_storage.py \
    --endpoint https://s3.amazonaws.com \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY \
    --bucket my-bucket \
    --prefix hf-model-

# Clear multiple prefixes
python scripts/clear_s3_storage.py \
    --endpoint https://s3.amazonaws.com \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY \
    --bucket my-bucket \
    --prefix lfs/ --prefix hf-model-

# Dry run (show what would be deleted without deleting)
python scripts/clear_s3_storage.py \
    --endpoint https://s3.amazonaws.com \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY \
    --bucket my-bucket \
    --dry-run

# Force delete without confirmation (dangerous!)
python scripts/clear_s3_storage.py \
    --endpoint https://s3.amazonaws.com \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY \
    --bucket my-bucket \
    --force

# Use environment variables for credentials
export S3_ENDPOINT=https://s3.amazonaws.com
export S3_ACCESS_KEY=YOUR_ACCESS_KEY
export S3_SECRET_KEY=YOUR_SECRET_KEY
export S3_BUCKET=my-bucket
python scripts/clear_s3_storage.py --prefix lfs/

# Limit number of objects (for testing)
python scripts/clear_s3_storage.py \
    --endpoint https://s3.amazonaws.com \
    --access-key YOUR_ACCESS_KEY \
    --secret-key YOUR_SECRET_KEY \
    --bucket my-bucket \
    --max-objects 100

Requirements:

  • Python 3.10+
  • boto3 and rich packages (pip install boto3 rich)
  • S3 credentials with delete permissions
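Batch deletion works because S3's DeleteObjects API accepts up to 1000 keys per request. The sketch below shows that pattern; the chunking helper is pure stdlib, while the delete loop assumes boto3 and valid credentials, and is illustrative rather than the script's actual code.

```python
# Sketch: batch deletion of S3 objects under a prefix.
# DeleteObjects accepts at most 1000 keys per request, so keys
# are grouped into chunks before deleting.
from itertools import islice

def chunked(keys, size=1000):
    """Yield lists of up to `size` keys (S3 DeleteObjects limit)."""
    it = iter(keys)
    while batch := list(islice(it, size)):
        yield batch

def delete_prefix(endpoint, access_key, secret_key, bucket, prefix):
    import boto3  # external dependency: pip install boto3
    s3 = boto3.client("s3", endpoint_url=endpoint,
                      aws_access_key_id=access_key,
                      aws_secret_access_key=secret_key)
    paginator = s3.get_paginator("list_objects_v2")
    keys = (obj["Key"]
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for obj in page.get("Contents", []))
    for batch in chunked(keys):
        s3.delete_objects(Bucket=bucket,
                          Delete={"Objects": [{"Key": k} for k in batch]})
```

Listing is paginated as a generator, so a bucket with millions of objects never has to fit in memory.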

Shell Script (Alternative)

Features:

  • No Python dependencies (uses AWS CLI or MinIO client)
  • Simple and portable
  • Supports both aws CLI and mc (MinIO client)

Usage:

# Make script executable
chmod +x scripts/clear_s3_storage.sh

# Clear all content
./scripts/clear_s3_storage.sh

# Clear with prefix
./scripts/clear_s3_storage.sh --prefix lfs/

# Dry run
./scripts/clear_s3_storage.sh --dry-run

# Force delete
./scripts/clear_s3_storage.sh --force

Requirements:

  • AWS CLI (aws) or MinIO Client (mc)
  • KohakuHub environment variables configured

Install AWS CLI:

# Linux/macOS
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Or via package manager
sudo apt install awscli  # Ubuntu/Debian
brew install awscli      # macOS

Install MinIO Client:

# Linux
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/

# macOS
brew install minio/stable/mc

# Windows
choco install minio-client

Common Prefixes in KohakuHub

Understanding KohakuHub's S3 storage structure:

Prefix         Description                               Example
lfs/           Large File Storage objects (files >5MB)   lfs/ab/cd/abcd1234...
hf-model-*     Model repository data                     hf-model-myuser-mymodel/
hf-dataset-*   Dataset repository data                   hf-dataset-myuser-mydataset/
hf-space-*     Space repository data                     hf-space-myuser-myspace/

Note: LFS objects are deduplicated by SHA-256 hash. Deleting under the lfs/ prefix therefore removes large files for all repositories at once.
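Based on the example key above (lfs/ab/cd/abcd1234...), an LFS key appears to be derived from the content's SHA-256 digest, with the first two byte-pairs used as directory fan-out. A sketch of that mapping (the exact layout is an assumption, not taken from the source code):

```python
import hashlib

def lfs_key(data: bytes) -> str:
    """Map file content to a hypothetical lfs/ object key.

    Uses the first two hex-pairs of the SHA-256 digest as
    directory fan-out, mirroring the example in the table above.
    """
    digest = hashlib.sha256(data).hexdigest()
    return f"lfs/{digest[:2]}/{digest[2:4]}/{digest}"
```

Because the key depends only on content, two repositories uploading the same file share one object, which is exactly why clearing lfs/ affects every repository.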


Database Migration Scripts

Add Storage Quota Fields

# Add quota tracking to users/organizations
python scripts/migrate_add_storage_quota.py
python scripts/migrate_separate_quotas.py
python scripts/migrate_add_quotas_final.py

Update Repository Schema

python scripts/migrate_repository_schema.py

Testing Scripts

Generate Test Files

# Generate test files for upload testing
python scripts/generate_test_files.py

Test Authentication

# Test authentication flow
python scripts/test_auth.py

Security

Generate Secret Keys

# Generate secure random secret for session/admin tokens
python scripts/generate_secret.py

Output:

Generated secure random secret (64 characters):
a1b2c3d4e5f6...

Add this to your docker-compose.yml:
  KOHAKU_HUB_SESSION_SECRET=a1b2c3d4e5f6...
  KOHAKU_HUB_ADMIN_SECRET_TOKEN=a1b2c3d4e5f6...
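The output above shows a 64-character secret. The stdlib secrets module can produce one like this (a sketch of the approach, not necessarily the script's exact implementation):

```python
import secrets

def generate_secret(n_bytes: int = 32) -> str:
    """Return a cryptographically secure hex secret.

    32 random bytes encode to 64 hex characters, matching the
    length shown in the script output above.
    """
    return secrets.token_hex(n_bytes)
```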

Environment Variables

S3 Storage Scripts (clear_s3_storage.py, show_s3_usage.py)

These scripts accept credentials via command-line arguments or environment variables:

# Option 1: Command-line arguments
python scripts/clear_s3_storage.py \
    --endpoint https://s3.amazonaws.com \
    --access-key YOUR_KEY \
    --secret-key YOUR_SECRET \
    --bucket my-bucket

# Option 2: Environment variables
export S3_ENDPOINT=https://s3.amazonaws.com
export S3_ACCESS_KEY=YOUR_ACCESS_KEY
export S3_SECRET_KEY=YOUR_SECRET_KEY
export S3_BUCKET=my-bucket
export S3_REGION=us-east-1  # Optional, default: us-east-1

python scripts/clear_s3_storage.py --prefix lfs/
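One common way to support both options is to let environment variables supply argparse defaults, with explicit flags taking precedence. This is a sketch of that pattern; the real scripts' flag handling may differ.

```python
import argparse
import os

def build_parser() -> argparse.ArgumentParser:
    """CLI flags override the S3_* environment variables."""
    p = argparse.ArgumentParser(description="S3 storage tool (sketch)")
    p.add_argument("--endpoint", default=os.environ.get("S3_ENDPOINT"))
    p.add_argument("--access-key", default=os.environ.get("S3_ACCESS_KEY"))
    p.add_argument("--secret-key", default=os.environ.get("S3_SECRET_KEY"))
    p.add_argument("--bucket", default=os.environ.get("S3_BUCKET"))
    # Region is optional in both modes, defaulting to us-east-1.
    p.add_argument("--region", default=os.environ.get("S3_REGION", "us-east-1"))
    return p
```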

Migration Scripts

Migration scripts require full KohakuHub configuration:

# S3 Storage
export KOHAKU_HUB_S3_ENDPOINT=http://localhost:29001
export KOHAKU_HUB_S3_PUBLIC_ENDPOINT=http://localhost:29001
export KOHAKU_HUB_S3_ACCESS_KEY=minioadmin
export KOHAKU_HUB_S3_SECRET_KEY=minioadmin
export KOHAKU_HUB_S3_BUCKET=hub-storage

# LakeFS
export KOHAKU_HUB_LAKEFS_ENDPOINT=http://localhost:28000
export KOHAKU_HUB_LAKEFS_ACCESS_KEY=...
export KOHAKU_HUB_LAKEFS_SECRET_KEY=...

# Database
export KOHAKU_HUB_DATABASE_URL=postgresql://user:pass@localhost:5432/kohakuhub

# Application
export KOHAKU_HUB_BASE_URL=http://localhost:28080

Tips for Demo Deployments

Cloudflare R2 Free Tier

  • Limit: 10GB storage
  • Strategy: Regularly clear LFS files and old repository data
# Weekly cleanup schedule
# 1. Clear old LFS files
python scripts/clear_s3_storage.py --prefix lfs/ --force

# 2. Clear test repositories
python scripts/clear_s3_storage.py --prefix hf-space-test --force
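To actually run this on a schedule, a crontab entry could look like the following (the path is hypothetical; adjust schedule and location for your deployment, and test with --dry-run before enabling --force):

```
# Hypothetical crontab entry: LFS cleanup every Sunday at 03:00.
# --force skips the interactive confirmation, so dry-run first.
0 3 * * 0 cd /opt/kohakuhub && python scripts/clear_s3_storage.py --prefix lfs/ --force
```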

MinIO Development

  • No hard limits: storage is bounded only by local disk space
  • Cleanup: only needed for testing or to reset to a clean state
# Reset entire storage for clean slate
python scripts/clear_s3_storage.py --force

Safety Features

All deletion scripts include:

  1. Dry run mode: See what would be deleted before committing
  2. Interactive confirmation: Requires explicit "yes" to proceed
  3. Double confirmation: Extra prompt for full bucket deletion
  4. Progress display: See deletion progress in real-time
  5. Error reporting: Track and report any deletion failures
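The confirmation flow described in points 2 and 3 can be sketched as a pure function, which keeps it easy to test (illustrative; the scripts' actual prompts may differ):

```python
def confirm_deletion(answers, full_bucket: bool) -> bool:
    """Return True only if every required prompt was answered 'yes'.

    `answers` is an iterable of user responses. A full-bucket wipe
    requires a second explicit confirmation; anything other than an
    exact 'yes' aborts.
    """
    it = iter(answers)
    required = 2 if full_bucket else 1
    for _ in range(required):
        if next(it, "").strip().lower() != "yes":
            return False
    return True
```

Requiring the literal word "yes" (rather than accepting "y") is a deliberate friction point for destructive operations.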

Troubleshooting

"NoSuchBucket" Error

Problem: Bucket doesn't exist

Solution:

# Create bucket using AWS CLI
aws s3 mb "s3://${KOHAKU_HUB_S3_BUCKET}" --endpoint-url="${KOHAKU_HUB_S3_ENDPOINT}"

# Or using MinIO client (configure an alias for your endpoint first)
mc alias set kohakuhub-temp "${KOHAKU_HUB_S3_ENDPOINT}" "${KOHAKU_HUB_S3_ACCESS_KEY}" "${KOHAKU_HUB_S3_SECRET_KEY}"
mc mb "kohakuhub-temp/${KOHAKU_HUB_S3_BUCKET}"

"Access Denied" Error

Problem: S3 credentials lack delete permissions

Solution: Verify that your S3 access key has the s3:ListBucket and s3:DeleteObject permissions (listing is needed to enumerate objects before deletion)

Script Hangs

Problem: Large bucket (millions of objects)

Solution: Use --max-objects to test first:

python scripts/clear_s3_storage.py --max-objects 1000 --dry-run

Contributing

When adding new scripts:

  1. Add docstring with usage examples
  2. Include error handling
  3. Support dry-run mode if applicable
  4. Update this README
  5. Test with both SQLite and PostgreSQL backends