mirror of
https://github.com/KohakuBlueleaf/KohakuHub.git
synced 2026-05-08 12:57:36 -05:00
The tree endpoint with `expand=true` was reproducing `git log --follow` for each path on the page by walking the LakeFS commit graph manually: one unfiltered `log_commits` call followed by per-commit `diff_refs` calls, with diff entries matched client-side against unresolved targets. Latency was O(commits walked from HEAD until the deepest target resolved). On a 100-commit / 50-path page this scaled to ~7 s on loopback and ~25 s on WAN-deployed instances; on the new 280-commit churn fixture it hit ~20 s locally and matched the user-reported 30 s+ stalls on hub.deepghs.org (see issue #59).

The replacement (issue #59 Plan E):

* Per file target → `logCommits(objects=[path], amount=1, limit=true)`.
* Per directory target → `logCommits(prefixes=[path/], amount=1, limit=true)`.
* Both fanned out under `Semaphore(LAST_COMMIT_LOOKUP_CONCURRENCY=16)`.
* Drops `_apply_changed_path`, `TREE_DIFF_PAGE_SIZE`, and `TREE_COMMIT_SCAN_PAGE_SIZE`: there is no client-side commit walk anymore.

LakeFS implements the path filter via its content-addressed metarange tree (`pkg/catalog/catalog.go:checkPathListInCommit`): when a key's containing range hash matches between two commits, the key didn't change, so there is no diff fetch and no value comparison, just two range-ID equality checks. Each call is single-digit milliseconds regardless of how deep the path sits in history.

Behaviour preserved (live-checked on the new stress fixture):

* `lastCommit` payload shape unchanged: `{id, title, date}`.
* Identical `id` values across all entries vs. the old algorithm on a 48-entry recursive page (every (path, commit_id) tuple matches).
* Per-target failures stay non-fatal: the affected entry resolves to `null` and the rest of the page still surfaces, matching the previous diff-walk's log-and-continue behaviour.
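The per-target lookup and bounded fan-out described above can be sketched roughly as follows. This is an illustration only, not the code in `tree.resolve_last_commits_for_paths`: the names `build_log_commits_params`, `resolve_last_commits`, and the `lookup` callable are hypothetical, and `lookup` is assumed to return an already-shaped `{id, title, date}` dict (or `None`) for one filtered `logCommits` call.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Value quoted in the commit message above.
LAST_COMMIT_LOOKUP_CONCURRENCY = 16

Params = list[tuple[str, str]]


def build_log_commits_params(path: str, is_dir: bool) -> Params:
    """Build query params for one filtered lookup (hypothetical helper).

    Files are matched with `objects`, directories with `prefixes`.
    A list of tuples is used so a key could be repeated if needed.
    """
    if is_dir:
        prefix = path if path.endswith("/") else path + "/"
        target = ("prefixes", prefix)
    else:
        target = ("objects", path)
    # amount=1 + limit=true: ask only for the newest matching commit.
    return [target, ("amount", "1"), ("limit", "true")]


async def resolve_last_commits(
    targets: list[tuple[str, bool]],
    lookup: Callable[[Params], Awaitable[Optional[dict]]],
) -> dict[str, Optional[dict]]:
    """Resolve the last commit for each (path, is_dir) target concurrently."""
    semaphore = asyncio.Semaphore(LAST_COMMIT_LOOKUP_CONCURRENCY)

    async def resolve_one(path: str, is_dir: bool) -> tuple[str, Optional[dict]]:
        async with semaphore:  # caps the number of in-flight lookups
            try:
                commit = await lookup(build_log_commits_params(path, is_dir))
            except Exception:
                # Log-and-continue: one failed target must not sink the page.
                commit = None
        return path, commit

    pairs = await asyncio.gather(*(resolve_one(p, d) for p, d in targets))
    return dict(pairs)
```

Catching the exception inside each task, rather than passing `return_exceptions=True` to `asyncio.gather`, keeps the result map uniform: every path maps either to a commit dict or to `None`.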
Measured on the planted `tree-expand-stress-bench` (280 chaotic commits, ~94 surviving paths through add / modify / delete / restore / folder-delete):

| page                         | before  | after  | speedup |
|------------------------------|---------|--------|---------|
| root (2 entries: README+dir) | 19.76 s | 0.40 s | 49×     |
| /shard/group_00 (10 files)   | 1.73 s  | 0.72 s | 2.4×    |
| /shard recursive (48 files)  | 17.66 s | 3.55 s | 5×      |

Connection pooling for `LakeFSRestClient` is intentionally NOT bundled here (also called out as a follow-up in #59); WAN benefit will widen further once it lands.

LakeFS version requirement: `objects` / `prefixes` / `limit` parameters on `logCommits` were introduced in LakeFS v0.54.0 (2021-11-08). Anything from v0.54 onward works; pre-v0.54 servers ignore the params and return the unfiltered log. KohakuHub's docker bundle pins `treeverse/lakefs:latest` so default deployments are always compatible. Documented in `lakefs_rest_client.log_commits` and `tree.resolve_last_commits_for_paths` docstrings.

Tests:

* `test_tree_unit.py`: three new tests cover the call shape (objects= vs prefixes= dispatch, amount=1+limit=true contract, fan-out result map), per-target failure isolation, and the LAST_COMMIT_LOOKUP_CONCURRENCY semaphore cap. Old diff-walk tests for `_apply_changed_path` / paginated diff are removed.
* `test_lakefs_rest_client.py`: new test pins the params shape (list-of-tuples for repeated `objects` / `prefixes`, `"true"`/`"false"` serialisation for `limit` / `first_parent`). One existing assertion updated for the params-list shape.
* Full backend suite green: 626 passed in 542 s (was 624 before; +2 new tests).

Acceptance fixture: a new local-only seed `mai_lin/tree-expand-stress-bench` (280 commits / ~94 surviving files) exercises the chaotic-history pattern behind issue #59: heavy modify / delete / restore cycles biased to a hot 12-path tier, plus periodic `deletedFolder` ops.
The seed is byte-deterministic (hash-based byte stream, no `random` module) per AGENTS §2. Bumps SEED_VERSION to `local-dev-demo-v8`. `commit_files` is extended to dispatch the new `DeletedFileSeed` / `DeletedFolderSeed` / `CopyFileSeed` op shapes. (`CopyFileSeed` is wired through but unused for now: KohakuHub's `process_copy_file` re-links the source's internal LakeFS physical address, which LakeFS 1.80 rejects with "address is not signed: link address invalid" for non-LFS sources. Filed as a separate issue.)

Refs: #59 (the original perf report and updated Plan E follow-up).
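The "hash-based byte stream, no `random` module" idea mentioned for the seed can be illustrated with a small sketch. This is one possible construction (SHA-256 in counter mode), not necessarily the one the seed tooling uses; the function name `deterministic_bytes` and the label format are hypothetical.

```python
import hashlib


def deterministic_bytes(label: str, size: int) -> bytes:
    """Return `size` bytes derived only from `label`.

    Hypothetical helper: hashes "label:counter" with SHA-256 for
    counter = 0, 1, 2, ... and concatenates the digests, so the same
    label always yields the same bytes on every run and platform.
    """
    out = bytearray()
    counter = 0
    while len(out) < size:
        block = hashlib.sha256(f"{label}:{counter}".encode("utf-8")).digest()
        out.extend(block)
        counter += 1
    return bytes(out[:size])


# Example: content for a given path at a given revision of the fixture.
payload = deterministic_bytes("shard/group_00/file_03.bin@rev7", 1024)
```

Because the output depends only on the label, re-running the seed reproduces byte-identical files, which keeps object hashes and commit contents stable across runs.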
4 lines
88 B
Python
"""Shared constants for local demo seed tooling."""

SEED_VERSION = "local-dev-demo-v8"