update doc and Docker related utils

Kohaku-Blueleaf
2025-10-22 23:25:41 +08:00
parent e33eee9f17
commit edaee890db
4 changed files with 473 additions and 2 deletions

@@ -9,12 +9,14 @@ RUN pip install --no-cache-dir uv
WORKDIR /app
COPY ./pyproject.toml .
RUN mkdir -p /app/src/kohakuhub
RUN echo "" > /app/src/kohakuhub/__init__.py
RUN uv pip install --system -e .
COPY ./src/kohakuhub ./src/kohakuhub
COPY ./scripts ./scripts
COPY ./docker/startup.py /app/startup.py
RUN chmod +x /app/startup.py
RUN uv pip install --system -e .
EXPOSE 48888
CMD ["/app/startup.py"]

@@ -27,6 +27,7 @@ Self-hosted HuggingFace alternative with Git-like versioning for AI models and d
- **HuggingFace Compatible** - Drop-in replacement for `huggingface_hub`, `hfutils`, `transformers`, `diffusers`
- **External Source Fallback** - Browse HuggingFace (or other KohakuHub instances) when repos are not found locally
- **User External Tokens** - Configure your own tokens for external sources (HuggingFace, etc.) with encrypted storage
- **Native Git Clone** - Standard Git operations (clone) with Git LFS support
- **Git-Like Versioning** - Branches, commits, tags via LakeFS
- **S3 Storage** - Works with MinIO, AWS S3, Cloudflare R2, etc.
@@ -199,10 +200,39 @@ KOHAKU_HUB_REQUIRE_EMAIL_VERIFICATION=false
# Admin Portal
KOHAKU_HUB_ADMIN_ENABLED=true
KOHAKU_HUB_ADMIN_SECRET_TOKEN=change-me-in-production
# External Tokens (for user-specific fallback tokens)
KOHAKU_HUB_DATABASE_KEY=$(openssl rand -hex 32) # Required for encryption
```
See [config-example.toml](./config-example.toml) for all options.
### External Fallback Tokens

Users can provide their own tokens for external sources (e.g., HuggingFace) to access private repositories:

**Via Web UI:**

1. Go to Settings → External Tokens
2. Add your HuggingFace token
3. Tokens are encrypted and stored securely

**Via CLI:**

```bash
kohub-cli settings user external-tokens add --url https://huggingface.co --token hf_abc123
```

**Via Authorization Header (API/programmatic):**

```bash
curl -H "Authorization: Bearer my_token|https://huggingface.co,hf_abc123" \
  http://localhost:28080/api/models/org/model
```

**How it works:**

- User tokens override admin-configured tokens
- Tokens are encrypted at rest using AES-256
- Works with session auth, API tokens, and anonymous requests
- Used automatically when a repo is not found locally
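
The header value from the curl example can also be assembled in code. This is a minimal sketch: `build_auth_header` is a hypothetical helper, not part of KohakuHub or `kohub-cli`, and it assumes the `<auth_token>|<url>,<token>` layout shown above, with an empty auth token for anonymous requests.

```python
def build_auth_header(auth_token: str, external_tokens: dict[str, str]) -> str:
    """Return an Authorization header value that carries external tokens.

    Hypothetical helper: the layout is inferred from the curl example only.
    """
    # First segment is the KohakuHub token (may be empty for anonymous use).
    parts = [auth_token]
    # Each further segment is "<source url>,<token for that source>".
    for url, token in external_tokens.items():
        parts.append(f"{url},{token}")
    return "Bearer " + "|".join(parts)


header = build_auth_header("my_token", {"https://huggingface.co": "hf_abc123"})
print(header)  # Bearer my_token|https://huggingface.co,hf_abc123
```

The sketch does no escaping, so a URL containing `,` or a token containing `|` would produce an ambiguous header.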
## Development
**Backend:**

@@ -513,6 +513,16 @@ erDiagram
| `/api/auth/tokens/create` | POST | ✓ | Create new API token |
| `/api/auth/tokens/{token_id}` | DELETE | ✓ | Revoke API token |
### External Token Operations (Fallback System)
| Endpoint | Method | Auth | Description |
|----------|--------|------|-------------|
| `/api/fallback-sources/available` | GET | ✗ | List available fallback sources |
| `/api/users/{username}/external-tokens` | GET | ✓ | List user's external tokens (masked) |
| `/api/users/{username}/external-tokens` | POST | ✓ | Add/update external token |
| `/api/users/{username}/external-tokens/{url}` | DELETE | ✓ | Delete external token |
| `/api/users/{username}/external-tokens/bulk` | PUT | ✓ | Bulk update external tokens |
### Organization Operations
| Endpoint | Method | Auth | Description |
@@ -1011,3 +1021,160 @@ KohakuHub implements smart download tracking:
- **Development**: MinIO (included in docker-compose)
- **Public Hub**: Cloudflare R2 (free egress saves costs)
- **Private/Enterprise**: Self-hosted MinIO or AWS S3 with VPC endpoints
---
## External Token API (User Fallback Tokens)
Users can configure their own tokens for external fallback sources to access private repositories.
### List Available Sources
**Public endpoint - no authentication required**
```bash
GET /api/fallback-sources/available
```
**Response:**
```json
[
{
"url": "https://huggingface.co",
"name": "HuggingFace",
"source_type": "huggingface",
"priority": 1
}
]
```
### List User's External Tokens
```bash
GET /api/users/{username}/external-tokens
Authorization: Bearer YOUR_TOKEN
```
**Response (tokens are masked):**
```json
[
{
"url": "https://huggingface.co",
"token_preview": "hf_a***",
"created_at": "2025-01-22T10:30:00Z",
"updated_at": "2025-01-22T10:30:00Z"
}
]
```
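
The response above shows only a short prefix of each token. The exact masking rule is not stated in the source; the sketch below assumes it keeps the first four characters and appends `***`, which reproduces the `hf_a***` preview. `mask_token` is a hypothetical name.

```python
def mask_token(token: str, visible: int = 4) -> str:
    """Return a preview that reveals at most `visible` leading characters.

    Assumption: mirrors the "hf_a***" example; the server's rule may differ.
    """
    if len(token) <= visible:
        # Never reveal a short token in full.
        return "***"
    return token[:visible] + "***"


print(mask_token("hf_abc123xyz"))  # hf_a***
```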
### Add/Update External Token
```bash
POST /api/users/{username}/external-tokens
Authorization: Bearer YOUR_TOKEN
Content-Type: application/json
{
"url": "https://huggingface.co",
"token": "hf_abc123xyz"
}
```
**Response:**
```json
{
"success": true,
"message": "External token saved"
}
```
**Notes:**
- If a token already exists for this URL, it is updated
- The token is encrypted with AES-256 before storage
- Users can only manage their own tokens
### Delete External Token
```bash
DELETE /api/users/{username}/external-tokens/https%3A%2F%2Fhuggingface.co
Authorization: Bearer YOUR_TOKEN
```
**Response:**
```json
{
"success": true,
"message": "External token deleted"
}
```
**Note:** The source URL must be percent-encoded (URL-encoded) when it is used as a path segment.
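
As an illustration of the encoding requirement, the path for the DELETE request can be built with the standard library. `external_token_path` is a hypothetical helper; only the endpoint layout comes from the table above.

```python
from urllib.parse import quote


def external_token_path(username: str, source_url: str) -> str:
    """Build the DELETE path for one external token (hypothetical helper)."""
    # safe="" forces ":" and "/" to be percent-encoded as well, so the
    # source URL stays a single path segment.
    encoded_user = quote(username, safe="")
    encoded_url = quote(source_url, safe="")
    return f"/api/users/{encoded_user}/external-tokens/{encoded_url}"


print(external_token_path("alice", "https://huggingface.co"))
# /api/users/alice/external-tokens/https%3A%2F%2Fhuggingface.co
```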
### Bulk Update External Tokens
Replace all external tokens at once:
```bash
PUT /api/users/{username}/external-tokens/bulk
Authorization: Bearer YOUR_TOKEN
Content-Type: application/json
{
"tokens": [
{"url": "https://huggingface.co", "token": "hf_abc123"},
{"url": "https://other-hub.com", "token": "token456"}
]
}
```
**Response:**
```json
{
"success": true,
"message": "Updated 2 external tokens"
}
```
**Notes:**
- Tokens that are not in the new list are deleted
- The operation is atomic: either every change is applied or none is
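
The replace-all semantics can be pictured as computing the desired state and the set of URLs to drop. This is only an illustration of the documented behaviour, not the server code; `plan_bulk_update` and the example URLs in `existing` are hypothetical, and how the server achieves atomicity is not shown in the source.

```python
def plan_bulk_update(
    existing: dict[str, str], requested: list[dict[str, str]]
) -> tuple[dict[str, str], list[str]]:
    """Return (tokens to store, URLs to delete) for a replace-all update."""
    # The request body lists every token that should exist afterwards.
    desired = {item["url"]: item["token"] for item in requested}
    # Anything stored now but absent from the request is removed.
    removed = sorted(set(existing) - set(desired))
    return desired, removed


existing = {
    "https://huggingface.co": "hf_old",
    "https://legacy-hub.example": "tok_old",
}
requested = [
    {"url": "https://huggingface.co", "token": "hf_abc123"},
    {"url": "https://other-hub.com", "token": "token456"},
]
desired, removed = plan_bulk_update(existing, requested)
print(removed)  # ['https://legacy-hub.example']
```

Working out the full plan before touching storage is one common way to keep such an update all-or-nothing.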
### Using External Tokens in Requests
**Authorization Header Format:**
```
Bearer <auth_token>|<url1>,<token1>|<url2>,<token2>...
```
**Examples:**
1. **API token + external token:**
```bash
curl -H "Authorization: Bearer my_api_token|https://huggingface.co,hf_abc123" \
http://localhost:28080/api/models/org/model
```
2. **Session auth + external token:**
```bash
# Frontend automatically sends: "Bearer |https://huggingface.co,hf_abc123"
```
3. **Anonymous + external token:**
```bash
curl -H "Authorization: Bearer |https://huggingface.co,hf_abc123" \
http://localhost:28080/api/models/facebook/gpt2
```
**Token Priority:**
1. Authorization header tokens (highest - per-request override)
2. Database tokens (medium - user preferences)
3. Admin tokens (lowest - server defaults)
**Configuration:**
```bash
# Required: Encryption key
export KOHAKU_HUB_DATABASE_KEY="$(openssl rand -hex 32)"
# Optional: Require auth for fallback
export KOHAKU_HUB_FALLBACK_REQUIRE_AUTH=false # Default: false
```
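
Where `openssl` is not available, an equivalent key can be produced with Python's standard library. This sketch only mirrors `openssl rand -hex 32` (32 random bytes as 64 hex characters); how the server turns the value into an AES-256 key is not described in the source, and `generate_database_key` is a hypothetical name.

```python
import secrets


def generate_database_key() -> str:
    """Return 32 random bytes as 64 hex characters, like `openssl rand -hex 32`."""
    return secrets.token_hex(32)


key = generate_database_key()
# Print a line that can be pasted into a shell profile or .env file.
print(f'export KOHAKU_HUB_DATABASE_KEY="{key}"')
```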

@@ -0,0 +1,272 @@
#!/usr/bin/env python3
"""
Sync LakeFS Credentials to config.toml

This script reads LakeFS credentials from credentials.env (auto-generated by Docker)
and updates config.toml with the correct values.

Usage:
    python scripts/sync_lakefs_credentials.py
    python scripts/sync_lakefs_credentials.py --credentials-path ./custom/path/credentials.env
    python scripts/sync_lakefs_credentials.py --config ./custom-config.toml
"""

import argparse
import re
import sys
import tomllib
from pathlib import Path


def find_credentials_path_from_docker_compose(docker_compose_path: Path) -> Path | None:
    """Find credentials.env path by parsing docker-compose.yml.

    Args:
        docker_compose_path: Path to docker-compose.yml

    Returns:
        Path to credentials.env or None if not found
    """
    if not docker_compose_path.exists():
        return None

    try:
        with open(docker_compose_path, "r", encoding="utf-8") as f:
            content = f.read()

        # Look for volume mount pattern: ./path/to/dir:/hub-api-creds
        # Example: - ./hub-meta/hub-api:/hub-api-creds
        match = re.search(r"- (\.[\w\-/\\]+):/hub-api-creds", content)
        if match:
            host_path = match.group(1)
            # Resolve relative path
            base_dir = docker_compose_path.parent
            full_path = (base_dir / host_path / "credentials.env").resolve()
            return full_path

        return None

    except Exception as e:
        print(f"⚠ Failed to parse docker-compose.yml: {e}")
        return None


def read_credentials_env(filepath: Path) -> dict[str, str]:
    """Read credentials from credentials.env file.

    Args:
        filepath: Path to credentials.env

    Returns:
        Dict of {key: value}
    """
    if not filepath.exists():
        raise FileNotFoundError(f"Credentials file not found: {filepath}")

    credentials = {}

    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue

            # Parse KEY=value; digits are allowed after the first character
            # so that keys containing numbers are not silently skipped
            match = re.match(r"^([A-Z_][A-Z0-9_]*)=(.+)$", line)
            if match:
                key, value = match.groups()
                credentials[key] = value.strip()

    return credentials


def update_config_toml(
    config_path: Path, lakefs_access_key: str, lakefs_secret_key: str
):
    """Update config.toml with LakeFS credentials.

    Note: the file is regenerated from the parsed data, so comments, top-level
    keys, and any section not listed below are not preserved.

    Args:
        config_path: Path to config.toml
        lakefs_access_key: LakeFS access key
        lakefs_secret_key: LakeFS secret key
    """
    if not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")

    # Read existing config
    try:
        with open(config_path, "rb") as f:
            config = tomllib.load(f)
    except Exception as e:
        raise ValueError(f"Failed to parse config.toml: {e}") from e

    # Update lakefs section
    if "lakefs" not in config:
        config["lakefs"] = {}

    config["lakefs"]["access_key"] = lakefs_access_key
    config["lakefs"]["secret_key"] = lakefs_secret_key

    def toml_str(value) -> str:
        # Escape backslashes and double quotes so the result stays a valid
        # TOML basic string (e.g. for passwords containing these characters)
        text = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{text}"'

    # Write back (only the known sections, in a fixed order)
    lines = []
    for section in [
        "s3",
        "lakefs",
        "smtp",
        "auth",
        "admin",
        "app",
        "quota",
        "fallback",
    ]:
        if section not in config:
            continue

        lines.append(f"[{section}]")
        for key, val in config[section].items():
            # bool must be checked before int: bool is a subclass of int
            if isinstance(val, bool):
                lines.append(f"{key} = {str(val).lower()}")
            elif isinstance(val, int):
                if val >= 1000000:
                    # Write large integers with underscore separators for readability
                    val_str = f"{val:_}"
                    lines.append(f"{key} = {val_str}")
                else:
                    lines.append(f"{key} = {val}")
            elif isinstance(val, float):
                lines.append(f"{key} = {val}")
            elif isinstance(val, str):
                lines.append(f"{key} = {toml_str(val)}")
            elif isinstance(val, list):
                # Format list
                items = ", ".join(
                    toml_str(item) if isinstance(item, str) else str(item)
                    for item in val
                )
                lines.append(f"{key} = [{items}]")
            else:
                lines.append(f"{key} = {toml_str(val)}")

        lines.append("")  # Blank line after section

    with open(config_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

    print(f"✓ Updated {config_path}")


def main():
    parser = argparse.ArgumentParser(
        description="Sync LakeFS credentials from credentials.env to config.toml"
    )
    parser.add_argument(
        "--credentials-path",
        type=Path,
        help="Path to credentials.env (default: auto-detect from docker-compose.yml)",
    )
    parser.add_argument(
        "--config",
        type=Path,
        default=Path("config.toml"),
        help="Path to config.toml (default: config.toml)",
    )
    parser.add_argument(
        "--docker-compose",
        type=Path,
        default=Path("docker-compose.yml"),
        help="Path to docker-compose.yml (default: docker-compose.yml)",
    )

    args = parser.parse_args()

    print("=" * 60)
    print("LakeFS Credentials Sync Tool")
    print("=" * 60)
    print()

    # Determine credentials path
    credentials_path = args.credentials_path

    if not credentials_path:
        # Auto-detect from docker-compose.yml
        print(f"Auto-detecting credentials path from {args.docker_compose}...")
        credentials_path = find_credentials_path_from_docker_compose(
            args.docker_compose
        )

        if not credentials_path:
            print("\n✗ Could not auto-detect credentials path from docker-compose.yml")
            print("\n💡 Trying default path: ./hub-meta/hub-api/credentials.env")
            credentials_path = Path("hub-meta/hub-api/credentials.env")

    print(f"Credentials file: {credentials_path}")
    print(f"Config file: {args.config}")
    print()

    # Check if files exist
    if not credentials_path.exists():
        print(f"✗ Credentials file not found: {credentials_path}")
        print("\n💡 Make sure docker-compose is running and LakeFS has initialized:")
        print(" docker-compose up -d")
        print(" # Wait for LakeFS to start and create credentials.env")
        sys.exit(1)

    if not args.config.exists():
        print(f"✗ Config file not found: {args.config}")
        print("\n💡 Generate config.toml first:")
        print(" python scripts/generate_docker_compose.py")
        sys.exit(1)

    # Read credentials
    print("Reading LakeFS credentials...")
    try:
        credentials = read_credentials_env(credentials_path)

        lakefs_access_key = credentials.get("KOHAKU_HUB_LAKEFS_ACCESS_KEY")
        lakefs_secret_key = credentials.get("KOHAKU_HUB_LAKEFS_SECRET_KEY")

        if not lakefs_access_key or not lakefs_secret_key:
            print("✗ Missing LakeFS credentials in credentials.env")
            print(f" Found keys: {list(credentials.keys())}")
            sys.exit(1)

        print(f" ✓ Access Key: {lakefs_access_key}")
        print(f" ✓ Secret Key: {lakefs_secret_key[:8]}..." + "*" * 20)
        print()

    except FileNotFoundError as e:
        print(f"{e}")
        sys.exit(1)
    except Exception as e:
        print(f"✗ Failed to read credentials: {e}")
        sys.exit(1)

    # Update config.toml
    print(f"Updating {args.config}...")
    try:
        update_config_toml(args.config, lakefs_access_key, lakefs_secret_key)

        print()
        print("=" * 60)
        print("✓ Sync Complete!")
        print("=" * 60)
        print()
        print("📋 Updated fields:")
        print(f" • lakefs.access_key = {lakefs_access_key}")
        print(f" • lakefs.secret_key = {lakefs_secret_key[:8]}***")
        print()
        print("💡 Next steps:")
        print(" 1. Restart dev server if running")
        print(" 2. Test LakeFS connection: curl http://localhost:28000/_health")
        print()

    except Exception as e:
        print(f"✗ Failed to update config.toml: {e}")
        import traceback

        traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()