mirror of
https://github.com/KohakuBlueleaf/KohakuHub.git
synced 2026-05-07 20:38:08 -05:00
Implements issue #27 v4: file-level HF-compatible metadata preview computed entirely in the browser via HTTP Range reads against the existing /resolve/ 302 → presigned S3/MinIO URL. Zero new backend preview code, zero LRU, zero precomputation, zero new DB state.

Backend (minimal CORS plumbing only):
- main.py CORSMiddleware: add `expose_headers` so browsers can read Content-Range / X-Linked-* / X-Repo-Commit / ETag / Location off the final 206 response that follows the /resolve/ 302.
- docker-compose.example.yml + scripts/dev/up_infra.sh: wire `MINIO_API_CORS_ALLOW_ORIGIN` so the SPA can cross-origin Range-read presigned targets. Configurable via `DEV_MINIO_CORS_ALLOW_ORIGIN`.
- docs/development/local-dev.md: MinIO CORS section explaining the hard prerequisite + smoke-test probe + how to recreate the container.

Frontend:
- utils/safetensors.js (~190 LOC): pure-JS parser mirroring huggingface_hub.parse_safetensors_file_metadata byte-for-byte (speculative 100 KB first read, two-read fallback for fat headers, SAFETENSORS_MAX_HEADER_LENGTH guard). Exposes parseSafetensorsMetadata + summarizeSafetensors.
- utils/parquet.js: thin wrapper over hyparquet's asyncBufferFromUrl + parquetMetadataAsync with mode:"cors" + credentials:"omit" so cookies never leak onto presigned URLs. Normalizes BigInt row counts.
- components/repo/preview/FilePreviewDialog.vue: ElDialog with per-phase spinner text (range-head → parsing → done for safetensors, head → footer → parsing → done for parquet), dtype/row-group tables, and an explicit "CORS likely misconfigured" placeholder on failure.
- RepoViewer.vue: HF-style chart-line-data icon next to .safetensors and .parquet rows; click opens the modal with the resolved /resolve/ URL for the current branch.

Tests + fixtures:
- test_files.py::test_resolve_get_302_exposes_cors_headers_for_browser_preview pins the `Access-Control-Expose-Headers` list against regressions.
- test/kohaku-hub-ui/utils/test_safetensors.test.js: 6 cases covering the real-HF-format fixture, dtype summary, progress phases, fat-header fallback, oversized-header guard, and non-206 error paths.
- test/kohaku-hub-ui/utils/test_parquet.test.js: footer parse + progress phase assertions.
- test/kohaku-hub-ui/fixtures/previews/{tiny.safetensors,tiny.parquet}: byte-identical-to-HF fixtures produced by the real safetensors / pyarrow libs via scripts/dev/generate_preview_test_fixtures.py (committed so tests stay offline per AGENTS.md §5.2).

Seed:
- seed_demo_data.py: add two RemoteAsset entries for real HF-hosted small fixtures pinned by sha256, and wire them into visible paths (open-media-lab/vision-language-assistant-3b/fixtures/hf-tiny-random-bert.safetensors, open-media-lab/multimodal-benchmark-suite/fixtures/hf-no-robots-test.parquet) so the preview can be exercised against files that actually came off huggingface.co rather than purely local pyarrow/safetensors output. SEED_VERSION bumped to local-dev-demo-v4.

Verified end-to-end against the dev stack: safetensors parser output on the seeded fixtures matches huggingface_hub.parse_safetensors_file_metadata byte-for-byte on the same file (100 tensors, 126,851 params, I64=512 / F32=126,339, metadata `{format: pt, ...}`). Browser preview modal renders both file kinds correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
80 lines
2.7 KiB
Python
#!/usr/bin/env python3
"""Regenerate tiny real-format fixtures for frontend preview unit tests.

The fixtures are checked into the repo so tests do not require network or
live infrastructure (per AGENTS.md §5.2). This script is the source of
truth: anyone needing to refresh or re-verify the fixtures can run it and
diff the output.

Output:
- ``test/kohaku-hub-ui/fixtures/previews/tiny.safetensors``: valid
  safetensors file with three small tensors in three dtypes and a
  non-empty ``__metadata__`` block. Produced via ``safetensors.numpy``
  so the wire format is byte-identical to what HuggingFace emits.
- ``test/kohaku-hub-ui/fixtures/previews/tiny.parquet``: valid parquet
  file with 100 rows and four columns (string, int64, float32, bool).
  Produced via ``pyarrow.parquet`` so the footer/schema shape matches
  anything the HuggingFace datasets-server would serve for a comparable
  upload.
"""

from __future__ import annotations

import io
from pathlib import Path

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
from safetensors.numpy import save as save_safetensors

REPO_ROOT = Path(__file__).resolve().parents[2]
OUT_DIR = REPO_ROOT / "test" / "kohaku-hub-ui" / "fixtures" / "previews"


def build_safetensors() -> bytes:
    # Fixed seed so regenerating the fixture yields the same tensor values.
    rng = np.random.default_rng(seed=0)
    tensors = {
        "encoder.embed.weight": rng.standard_normal((32, 8)).astype(np.float32),
        "encoder.layer0.attn.q_proj.weight": rng.standard_normal((16, 16)).astype(np.float16),
        "encoder.layer0.ln.bias": np.arange(16, dtype=np.int64),
    }
    metadata = {
        "format": "pt",
        "framework": "kohakuhub-fixture",
        "seed": "0",
    }
    return save_safetensors(tensors, metadata=metadata)


def build_parquet() -> bytes:
    row_count = 100
    table = pa.table(
        {
            "id": pa.array([f"row-{i:03d}" for i in range(row_count)], type=pa.string()),
            "score": pa.array(np.arange(row_count, dtype=np.int64)),
            "ratio": pa.array(np.linspace(0.0, 1.0, row_count, dtype=np.float32)),
            "flag": pa.array([i % 2 == 0 for i in range(row_count)], type=pa.bool_()),
        }
    )

    # Write to an in-memory buffer so the caller decides where the bytes go.
    sink = io.BytesIO()
    pq.write_table(table, sink, compression="snappy")
    return sink.getvalue()


def main() -> None:
    OUT_DIR.mkdir(parents=True, exist_ok=True)

    safetensors_bytes = build_safetensors()
    (OUT_DIR / "tiny.safetensors").write_bytes(safetensors_bytes)
    print(f"wrote tiny.safetensors ({len(safetensors_bytes)} bytes)")

    parquet_bytes = build_parquet()
    (OUT_DIR / "tiny.parquet").write_bytes(parquet_bytes)
    print(f"wrote tiny.parquet ({len(parquet_bytes)} bytes)")


if __name__ == "__main__":
    main()
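
After regenerating the fixtures, the parquet framing that a footer-first reader depends on can be checked with the standard library alone. The sketch below assumes only the Parquet file layout (a trailing 4-byte little-endian footer length followed by the `PAR1` magic); the helper name `parquet_footer_span` is hypothetical and not part of this script, and the demo bytes are synthetic rather than real pyarrow output.

```python
import struct
from typing import Tuple

PARQUET_MAGIC = b"PAR1"


def parquet_footer_span(tail: bytes, file_size: int) -> Tuple[int, int]:
    """Return (start, end_exclusive) of the footer metadata within the file.

    `tail` must hold at least the last 8 bytes of the file: a 4-byte
    little-endian footer length followed by the 4-byte magic.
    """
    if len(tail) < 8 or tail[-4:] != PARQUET_MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    start = file_size - 8 - footer_len
    if start < len(PARQUET_MAGIC):
        raise ValueError("footer length does not fit inside the file")
    return start, file_size - 8


# Synthetic layout: magic, 10 bytes of "data", a 5-byte stand-in footer,
# the footer length, and the closing magic.
demo_footer = b"FOOTR"
demo_file = (
    PARQUET_MAGIC
    + b"x" * 10
    + demo_footer
    + struct.pack("<I", len(demo_footer))
    + PARQUET_MAGIC
)
span = parquet_footer_span(demo_file[-8:], len(demo_file))
```

Against a real fixture, the same call would take the last 8 bytes of `tiny.parquet` and its size on disk; the returned span is the byte range a second Range read would fetch before decoding the metadata.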