KohakuHub Utility Scripts
This directory contains utility scripts for KohakuHub administration and maintenance.
Deployment Setup
Docker Compose Generator
Interactive tool to generate a customized docker-compose.yml based on your deployment preferences.
Features:
- ✅ Choose between built-in or external PostgreSQL
- ✅ Configure LakeFS to use PostgreSQL or SQLite
- ✅ Choose between built-in MinIO or external S3 storage
- ✅ Auto-generate secure secret keys
- ✅ Comprehensive configuration validation
Usage:
# Interactive mode (asks questions)
python scripts/generate_docker_compose.py
# Generate a configuration template
python scripts/generate_docker_compose.py --generate-config
# Use configuration file (non-interactive)
python scripts/generate_docker_compose.py --config kohakuhub.conf
Configuration Options:
- PostgreSQL:
  - Built-in container (managed by docker-compose)
  - External PostgreSQL (specify host, port, credentials)
  - Default database name for hub-api: kohakuhub
- LakeFS Database:
  - Use PostgreSQL (recommended for production)
  - Use local SQLite (simpler for development)
  - Default database name: lakefs (separate from the hub-api database)
  - Automatic database creation: both databases are created automatically when LakeFS starts (see the verification sketch after this list)
  - Works with both built-in and external PostgreSQL
- S3 Storage:
  - Built-in MinIO container (self-hosted)
  - External S3-compatible storage (AWS S3, CloudFlare R2, etc.)
- Security:
  - Auto-generated session secret key
  - Auto-generated admin secret token
  - Option to use custom secrets
- Network:
  - External Docker bridge network support
  - Allows cross-compose communication with external PostgreSQL/S3 services
  - Automatically added to hub-api and lakefs when using external services
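As a quick sanity check that automatic database creation worked, you can list the databases inside the built-in PostgreSQL container. This is a minimal sketch; the service name postgres and the user hub are assumptions based on the defaults above:
# List databases in the built-in PostgreSQL container
# ("postgres" service name and "hub" user are assumed defaults;
# adjust to match your generated docker-compose.yml)
docker-compose exec postgres psql -U hub -l
# Both "kohakuhub" and "lakefs" should appear in the output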
Example: Interactive Mode
$ python scripts/generate_docker_compose.py
=============================================================
KohakuHub Docker Compose Generator
=============================================================
--- PostgreSQL Configuration ---
Use built-in PostgreSQL container? [Y/n]: y
PostgreSQL username [hub]:
PostgreSQL password [hubpass]:
PostgreSQL database name for hub-api [kohakuhub]:
--- LakeFS Database Configuration ---
Use PostgreSQL for LakeFS? (No = use local SQLite) [Y/n]: y
PostgreSQL database name for LakeFS [lakefs]:
--- S3 Storage Configuration ---
Use built-in MinIO container? [Y/n]: n
S3 endpoint URL: https://my-account.r2.cloudflarestorage.com
S3 access key: xxxxxxxxxxxx
S3 secret key: yyyyyyyyyyyy
S3 region [us-east-1]: auto
--- Security Configuration ---
Generated session secret: AbCdEf123456...
Use generated session secret? [Y/n]: y
Use same secret for admin token? [y/N]: n
Generated admin secret: XyZ789...
Use generated admin secret? [Y/n]: y
=============================================================
Generating docker-compose.yml...
=============================================================
✓ Successfully generated: docker-compose.yml
✓ Database initialization scripts will run automatically when LakeFS starts
- scripts/init-databases.sh
- scripts/lakefs-entrypoint.sh
Configuration Summary:
------------------------------------------------------------
PostgreSQL: Built-in
Hub-API Database: kohakuhub
LakeFS Database: lakefs
LakeFS Database Backend: PostgreSQL
S3 Storage: Custom S3
Endpoint: https://my-account.r2.cloudflarestorage.com
Session Secret: AbCdEf123456...
Admin Secret: XyZ789...
------------------------------------------------------------
Next steps:
1. Review the generated docker-compose.yml
2. Build frontend: npm run build --prefix ./src/kohaku-hub-ui
3. Start services: docker-compose up -d
Note: Databases will be created automatically on first startup:
- kohakuhub (hub-api)
- lakefs (LakeFS)
4. Access at: http://localhost:28080
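Once the services are up, a quick health check can confirm the stack started correctly (a sketch; the port is the default shown above):
# Check that all containers are running
docker-compose ps
# Confirm the web UI responds on the default port
curl -I http://localhost:28080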
Example: Using Configuration File
# Step 1: Generate template
$ python scripts/generate_docker_compose.py --generate-config
[OK] Generated configuration template: kohakuhub.conf
Edit this file with your settings, then run:
python scripts/generate_docker_compose.py --config kohakuhub.conf
# Step 2: Edit kohakuhub.conf with your settings
$ nano kohakuhub.conf # or use any text editor
# Step 3: Generate docker-compose.yml from config
$ python scripts/generate_docker_compose.py --config kohakuhub.conf
============================================================
KohakuHub Docker Compose Generator
============================================================
Loading configuration from: kohakuhub.conf
Loaded configuration:
PostgreSQL: External
Host: db.example.com:5432
Database: kohakuhub
LakeFS: PostgreSQL
Database: lakefs
S3: External S3
Endpoint: https://s3.example.com
============================================================
Generating docker-compose.yml...
============================================================
[OK] Successfully generated: docker-compose.yml
[OK] Database initialization scripts will run automatically when LakeFS starts
Configuration Summary:
------------------------------------------------------------
PostgreSQL: Custom
Host: db.example.com:5432
Hub-API Database: kohakuhub
LakeFS Database: lakefs
LakeFS Database Backend: PostgreSQL
S3 Storage: Custom S3
Endpoint: https://s3.example.com
Session Secret: AbCdEf123456...
Admin Secret: XyZ789...
------------------------------------------------------------
Requirements:
- Python 3.10+
- No additional dependencies required
Configuration File Format:
The configuration file (kohakuhub.conf) uses INI format:
[postgresql]
builtin = true
user = hub
password = hubpass
database = kohakuhub
[lakefs]
use_postgres = true
database = lakefs
[s3]
builtin = true
access_key = minioadmin
secret_key = minioadmin
[security]
session_secret = your-secret-here
admin_secret = your-admin-secret
[network]
# Optional: for cross-compose communication
external_network = shared-network
For external PostgreSQL or S3:
[postgresql]
builtin = false
host = your-postgres-host.com
port = 5432
user = hub
password = your-password
database = kohakuhub
[s3]
builtin = false
endpoint = https://your-s3-endpoint.com
access_key = your-access-key
secret_key = your-secret-key
region = us-east-1
[network]
# Required if external services are in different Docker Compose
external_network = shared-network
Using External Docker Network:
If PostgreSQL or S3 are in separate Docker Compose setups, you need a shared network:
# Create the shared network first
docker network create shared-network
# Your PostgreSQL docker-compose.yml
services:
  postgres:
    # ... your config
    networks:
      - shared-network
networks:
  shared-network:
    external: true
# Generate KohakuHub with external network
python scripts/generate_docker_compose.py --config kohakuhub.conf
The generator will automatically:
- Add the external network to the hub-api and lakefs services
- Configure them to use both the default network (hub-net) and the external network
- Allow container name resolution across compose files
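For illustration, the relevant excerpt of the generated docker-compose.yml might look like the following (service and network names are assumptions based on the defaults described above):
# Hypothetical excerpt of a generated docker-compose.yml
services:
  hub-api:
    networks:
      - hub-net          # default internal network
      - shared-network   # external network for cross-compose traffic
  lakefs:
    networks:
      - hub-net
      - shared-network
networks:
  hub-net:
  shared-network:
    external: true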
Important Notes:
- Shell scripts automatically use LF line endings (configured in .gitattributes)
- Database initialization runs automatically on LakeFS startup
- Works on both Windows and Linux development environments
- The configuration file (kohakuhub.conf) is ignored by git because it contains sensitive data
- Use --generate-config to create a template configuration file
Storage Management
Clear S3 Storage
For demo deployments with constrained S3 storage (such as the CloudFlare R2 free tier), use these scripts to manage storage.
All scripts accept S3 credentials via command-line arguments or environment variables, making them standalone tools independent of the KohakuHub configuration.
Python Script (Recommended)
Features:
- ✅ Rich progress display with progress bars
- ✅ Detailed statistics (object count, total size)
- ✅ Dry run mode
- ✅ Batch deletion (handles large buckets efficiently)
- ✅ Prefix filtering
- ✅ Error reporting
Usage:
# Clear all content (interactive confirmation required)
python scripts/clear_s3_storage.py \
--endpoint https://s3.amazonaws.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--bucket my-bucket
# Clear only LFS files (large files)
python scripts/clear_s3_storage.py \
--endpoint https://s3.amazonaws.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--bucket my-bucket \
--prefix lfs/
# Clear specific repository type
python scripts/clear_s3_storage.py \
--endpoint https://s3.amazonaws.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--bucket my-bucket \
--prefix hf-model-
# Clear multiple prefixes
python scripts/clear_s3_storage.py \
--endpoint https://s3.amazonaws.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--bucket my-bucket \
--prefix lfs/ --prefix hf-model-
# Dry run (show what would be deleted without deleting)
python scripts/clear_s3_storage.py \
--endpoint https://s3.amazonaws.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--bucket my-bucket \
--dry-run
# Force delete without confirmation (dangerous!)
python scripts/clear_s3_storage.py \
--endpoint https://s3.amazonaws.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--bucket my-bucket \
--force
# Use environment variables for credentials
export S3_ENDPOINT=https://s3.amazonaws.com
export S3_ACCESS_KEY=YOUR_ACCESS_KEY
export S3_SECRET_KEY=YOUR_SECRET_KEY
export S3_BUCKET=my-bucket
python scripts/clear_s3_storage.py --prefix lfs/
# Limit number of objects (for testing)
python scripts/clear_s3_storage.py \
--endpoint https://s3.amazonaws.com \
--access-key YOUR_ACCESS_KEY \
--secret-key YOUR_SECRET_KEY \
--bucket my-bucket \
--max-objects 100
Requirements:
- Python 3.10+
- boto3 and rich packages (pip install boto3 rich)
- S3 credentials with delete permissions
Shell Script (Alternative)
Features:
- ✅ No Python dependencies (uses AWS CLI or MinIO client)
- ✅ Simple and portable
- ✅ Supports both the aws CLI and mc (MinIO client)
Usage:
# Make script executable
chmod +x scripts/clear_s3_storage.sh
# Clear all content
./scripts/clear_s3_storage.sh
# Clear with prefix
./scripts/clear_s3_storage.sh --prefix lfs/
# Dry run
./scripts/clear_s3_storage.sh --dry-run
# Force delete
./scripts/clear_s3_storage.sh --force
Requirements:
- AWS CLI (aws) or MinIO Client (mc)
- KohakuHub environment variables configured
Install AWS CLI:
# Linux/macOS
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Or via package manager
sudo apt install awscli # Ubuntu/Debian
brew install awscli # macOS
Install MinIO Client:
# Linux
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/
# macOS
brew install minio/stable/mc
# Windows
choco install minio-client
Common Prefixes in KohakuHub
Understanding KohakuHub's S3 storage structure:
| Prefix | Description | Example |
|---|---|---|
| lfs/ | Large File Storage objects (files >5MB) | lfs/ab/cd/abcd1234... |
| hf-model-* | Model repository data | hf-model-myuser-mymodel/ |
| hf-dataset-* | Dataset repository data | hf-dataset-myuser-mydataset/ |
| hf-space-* | Space repository data | hf-space-myuser-myspace/ |
Note: LFS objects are deduplicated by SHA256 hash. Deleting from lfs/ prefix removes all large files across all repositories.
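For reference, the lfs/ key for a file can be derived from its SHA256 hash like this (a sketch assuming the two-level fan-out implied by the Example column above):
# Compute the S3 key for an LFS object from its SHA256 hash
# (fan-out layout assumed from the table above)
sha="$(sha256sum model.bin | cut -d' ' -f1)"
echo "lfs/${sha:0:2}/${sha:2:2}/${sha}"
# -> lfs/ab/cd/abcd1234...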
Database Migration Scripts
Add Storage Quota Fields
# Add quota tracking to users/organizations
python scripts/migrate_add_storage_quota.py
python scripts/migrate_separate_quotas.py
python scripts/migrate_add_quotas_final.py
Update Repository Schema
python scripts/migrate_repository_schema.py
Testing Scripts
Generate Test Files
# Generate test files for upload testing
python scripts/generate_test_files.py
Test Authentication
# Test authentication flow
python scripts/test_auth.py
Security
Generate Secret Keys
# Generate secure random secret for session/admin tokens
python scripts/generate_secret.py
Output:
Generated secure random secret (64 characters):
a1b2c3d4e5f6...
Add this to your docker-compose.yml:
KOHAKU_HUB_SESSION_SECRET=a1b2c3d4e5f6...
KOHAKU_HUB_ADMIN_SECRET_TOKEN=a1b2c3d4e5f6...
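If you need an equivalent one-off secret without the script, Python's standard secrets module produces one (a sketch, not necessarily the script's exact implementation):
# 48 random bytes encode to 64 URL-safe characters
python -c "import secrets; print(secrets.token_urlsafe(48))"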
Environment Variables
S3 Storage Scripts (clear_s3_storage.py, show_s3_usage.py)
These scripts accept credentials via command-line arguments or environment variables:
# Option 1: Command-line arguments
python scripts/clear_s3_storage.py \
--endpoint https://s3.amazonaws.com \
--access-key YOUR_KEY \
--secret-key YOUR_SECRET \
--bucket my-bucket
# Option 2: Environment variables
export S3_ENDPOINT=https://s3.amazonaws.com
export S3_ACCESS_KEY=YOUR_ACCESS_KEY
export S3_SECRET_KEY=YOUR_SECRET_KEY
export S3_BUCKET=my-bucket
export S3_REGION=us-east-1 # Optional, default: us-east-1
python scripts/clear_s3_storage.py --prefix lfs/
Migration Scripts
Migration scripts require full KohakuHub configuration:
# S3 Storage
export KOHAKU_HUB_S3_ENDPOINT=http://localhost:29001
export KOHAKU_HUB_S3_PUBLIC_ENDPOINT=http://localhost:29001
export KOHAKU_HUB_S3_ACCESS_KEY=minioadmin
export KOHAKU_HUB_S3_SECRET_KEY=minioadmin
export KOHAKU_HUB_S3_BUCKET=hub-storage
# LakeFS
export KOHAKU_HUB_LAKEFS_ENDPOINT=http://localhost:28000
export KOHAKU_HUB_LAKEFS_ACCESS_KEY=...
export KOHAKU_HUB_LAKEFS_SECRET_KEY=...
# Database
export KOHAKU_HUB_DATABASE_URL=postgresql://user:pass@localhost:5432/kohakuhub
# Application
export KOHAKU_HUB_BASE_URL=http://localhost:28080
JSON Schema Generator
Generate JSON Schema
python scripts/generate_json_schema.py
This script generates a JSON schema for the KohakuHub types and writes it to __generated__/schemas/<filename>.json.
Currently supported files:
config.json
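The generated schema can be used to validate a configuration file before deployment. A sketch using the third-party check-jsonschema tool (not bundled with KohakuHub; the config.json path is an assumption):
pip install check-jsonschema
# Validate a config file against the generated schema
check-jsonschema --schemafile __generated__/schemas/config.json config.json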
Tips for Demo Deployments
CloudFlare R2 Free Tier
- Limit: 10GB storage
- Strategy: Regularly clear LFS files and old repository data
# Weekly cleanup schedule
# 1. Clear old LFS files
python scripts/clear_s3_storage.py --prefix lfs/ --force
# 2. Clear test repositories
python scripts/clear_s3_storage.py --prefix hf-space-test --force
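To schedule this, crontab entries could look like the following (paths and credential handling are assumptions; adapt them to your deployment and avoid placing secrets directly in the crontab):
# Hypothetical crontab entries: run cleanup every Sunday starting at 03:00
# (assumes the S3_* environment variables are provided to cron, e.g. via a wrapper script)
0 3 * * 0 cd /opt/kohakuhub && python scripts/clear_s3_storage.py --prefix lfs/ --force
5 3 * * 0 cd /opt/kohakuhub && python scripts/clear_s3_storage.py --prefix hf-space-test --force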
MinIO Development
- No limits: MinIO itself imposes no storage limits (bounded only by available disk space)
- Cleanup: Only needed for testing or reset
# Reset entire storage for clean slate
python scripts/clear_s3_storage.py --force
Safety Features
All deletion scripts include:
- Dry run mode: See what would be deleted before committing
- Interactive confirmation: Requires explicit "yes" to proceed
- Double confirmation: Extra prompt for full bucket deletion
- Progress display: See deletion progress in real-time
- Error reporting: Track and report any deletion failures
Troubleshooting
"NoSuchBucket" Error
Problem: Bucket doesn't exist
Solution:
# Create bucket using AWS CLI
aws s3 mb "s3://${KOHAKU_HUB_S3_BUCKET}" --endpoint-url="${KOHAKU_HUB_S3_ENDPOINT}"
# Or using MinIO client
mc mb kohakuhub-temp/${KOHAKU_HUB_S3_BUCKET}
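Note that the mc command above assumes an alias named kohakuhub-temp has already been configured; if it has not, set it first using the same environment variables:
mc alias set kohakuhub-temp "${KOHAKU_HUB_S3_ENDPOINT}" \
  "${KOHAKU_HUB_S3_ACCESS_KEY}" "${KOHAKU_HUB_S3_SECRET_KEY}"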
"Access Denied" Error
Problem: S3 credentials lack delete permissions
Solution: Check that your S3 access key has the s3:DeleteObject permission (listing objects also requires s3:ListBucket)
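For AWS S3, a minimal IAM-style policy granting the required permissions might look like this (the bucket name is a placeholder):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-bucket/*"
    }
  ]
}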
Script Hangs
Problem: Large bucket (millions of objects)
Solution: Use --max-objects to test first:
python scripts/clear_s3_storage.py --max-objects 1000 --dry-run
Contributing
When adding new scripts:
- Add docstring with usage examples
- Include error handling
- Support dry-run mode if applicable
- Update this README
- Test with both SQLite and PostgreSQL backends