KohakuHub/scripts/dev/seed_shared.py
narugo1992 598927010e perf(tree): use LakeFS path-filtered logCommits to resolve lastCommit per page
The tree endpoint with `expand=true` was reproducing `git log --follow` for
each path on the page by walking the LakeFS commit graph manually: one
unfiltered `log_commits` call followed by per-commit `diff_refs` calls,
with diff entries matched client-side against unresolved targets. Latency was
O(commits-walked-from-HEAD-until-the-deepest-target-resolved). On a
100-commit / 50-path page this scaled to ~7 s on loopback and ~25 s on
WAN-deployed instances; on the new 280-commit churn fixture it hit ~20 s
locally and matched the user-reported 30 s+ stalls on hub.deepghs.org
(see issue #59).

The replacement (issue #59 Plan E):

  * Per file target → `logCommits(objects=[path], amount=1, limit=true)`.
  * Per directory target → `logCommits(prefixes=[path/], amount=1, limit=true)`.
  * Both fanned out under `Semaphore(LAST_COMMIT_LOOKUP_CONCURRENCY=16)`.
  * Drops `_apply_changed_path`, `TREE_DIFF_PAGE_SIZE`, and
    `TREE_COMMIT_SCAN_PAGE_SIZE` — no client-side commit walk anymore.
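
The fan-out above can be sketched roughly as follows. This is a minimal
illustration, not the code in this commit: `FakeLakeFSClient`, its in-memory
`history`, and `resolve_last_commits` are hypothetical names. Only the
`objects` / `prefixes` / `amount` / `limit` call shape and the concurrency
cap of 16 come from the description above.

```python
import asyncio

LAST_COMMIT_LOOKUP_CONCURRENCY = 16  # value quoted in the commit message


class FakeLakeFSClient:
    """Hypothetical in-memory stand-in for the REST client (assumption)."""

    def __init__(self, history):
        # history: newest-first list of (commit_id, set_of_changed_paths)
        self.history = history

    async def log_commits(self, *, objects=None, prefixes=None, amount=1, limit=True):
        await asyncio.sleep(0)  # yield control, as a network call would
        hits = []
        for commit_id, changed in self.history:
            by_object = any(p in changed for p in (objects or []))
            by_prefix = any(
                c.startswith(pre) for pre in (prefixes or []) for c in changed
            )
            if by_object or by_prefix:
                hits.append({"id": commit_id})
                if limit and len(hits) >= amount:
                    break  # stop at the first matching commit(s)
        return hits


async def resolve_last_commits(client, targets):
    """targets: list of (path, is_dir). Returns {path: commit dict or None}."""
    sem = asyncio.Semaphore(LAST_COMMIT_LOOKUP_CONCURRENCY)

    async def lookup(path, is_dir):
        if is_dir:
            kwargs = {"prefixes": [path.rstrip("/") + "/"]}
        else:
            kwargs = {"objects": [path]}
        async with sem:  # at most 16 lookups in flight at once
            try:
                commits = await client.log_commits(amount=1, limit=True, **kwargs)
            except Exception:
                return path, None  # a failed target does not fail the page
        return path, (commits[0] if commits else None)

    pairs = await asyncio.gather(*(lookup(p, d) for p, d in targets))
    return dict(pairs)


history = [
    ("c3", {"shard/group_00/a.bin"}),
    ("c2", {"README.md"}),
    ("c1", {"README.md", "shard/group_00/b.bin"}),
]
targets = [("README.md", False), ("shard", True), ("missing.txt", False)]
result = asyncio.run(resolve_last_commits(FakeLakeFSClient(history), targets))
```

With this toy history, `README.md` resolves to `c2`, the `shard` directory
to `c3`, and the unknown path to `None`.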

LakeFS implements the path filter via its content-addressed metarange tree
(`pkg/catalog/catalog.go:checkPathListInCommit`): when a key's containing
range hash matches between two commits, the key didn't change — no diff
fetch, no value comparison, just two range-ID equality checks. Each call is
single-digit milliseconds regardless of how deep the path sits in history.
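
A toy model of the range-ID shortcut, for intuition only. It does not
reproduce LakeFS's actual data structures or `checkPathListInCommit`; the
fixed `range_size`, the SHA-256 range IDs, and all function names here are
assumptions made for the sketch.

```python
import hashlib


def range_id(entries):
    """Content-address a range: hash of its sorted (key, value_hash) pairs."""
    h = hashlib.sha256()
    for key, value_hash in sorted(entries.items()):
        h.update(f"{key}\0{value_hash}\n".encode())
    return h.hexdigest()


def build_metarange(entries, range_size=2):
    """Split sorted keys into fixed-size ranges: (min_key, max_key, id)."""
    keys = sorted(entries)
    ranges = []
    for i in range(0, len(keys), range_size):
        chunk_keys = keys[i : i + range_size]
        chunk = {k: entries[k] for k in chunk_keys}
        ranges.append((chunk_keys[0], chunk_keys[-1], range_id(chunk)))
    return ranges


def containing_range_id(metarange, key):
    """Return the ID of the range whose key span covers `key`, if any."""
    for lo, hi, rid in metarange:
        if lo <= key <= hi:
            return rid
    return None


def definitely_unchanged(mr_a, mr_b, key):
    """True when both commits hold `key` in a range with the same ID."""
    a = containing_range_id(mr_a, key)
    b = containing_range_id(mr_b, key)
    return a is not None and a == b


parent = build_metarange({"a": "1", "b": "2", "c": "3", "d": "4"})
child = build_metarange({"a": "1", "b": "2", "c": "3", "d": "5"})

unchanged_a = definitely_unchanged(parent, child, "a")
unchanged_c = definitely_unchanged(parent, child, "c")
```

Here `a` is proven unchanged by ID equality alone. `c` shares a range with
the modified `d`, so the IDs differ and the shortcut cannot decide; a real
implementation would need a closer look at that range.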

Behaviour preserved (live-checked on the new stress fixture):

  * `lastCommit` payload shape unchanged: `{id, title, date}`.
  * Identical `id` values across all entries vs. the old algorithm on a
    48-entry recursive page (every (path, commit_id) tuple matches).
  * Per-target failures stay non-fatal — the affected entry resolves to
    `null` and the rest of the page still surfaces, matching the previous
    diff-walk's log-and-continue behaviour.

Measured on the planted `tree-expand-stress-bench` (280 chaotic commits,
~94 surviving paths through add / modify / delete / restore / folder-delete):

| page                          | before    | after     | speedup |
|-------------------------------|-----------|-----------|---------|
| root (2 entries: README+dir)  | 19.76 s   | 0.40 s    | 49×     |
| /shard/group_00 (10 files)    |  1.73 s   | 0.72 s    | 2.4×    |
| /shard recursive (48 files)   | 17.66 s   | 3.55 s    | 5×      |

Connection pooling for `LakeFSRestClient` is intentionally NOT bundled here
(also called out as a follow-up in #59); WAN benefit will widen further
once it lands.

LakeFS version requirement: `objects` / `prefixes` / `limit` parameters on
`logCommits` were introduced in LakeFS v0.54.0 (2021-11-08). Anything from
v0.54 onward works; pre-v0.54 servers ignore the params and return the
unfiltered log. KohakuHub's docker bundle tracks `treeverse/lakefs:latest`, so
default deployments are always compatible. Documented in
`lakefs_rest_client.log_commits` and `tree.resolve_last_commits_for_paths`
docstrings.

Tests:

  * `test_tree_unit.py` — three new tests cover the call shape (objects=
    vs prefixes= dispatch, amount=1+limit=true contract, fan-out result
    map), per-target failure isolation, and the
    LAST_COMMIT_LOOKUP_CONCURRENCY semaphore cap. Old diff-walk tests for
    `_apply_changed_path` / paginated diff are removed.
  * `test_lakefs_rest_client.py` — new test pins the params shape
    (list-of-tuples for repeated `objects` / `prefixes`, `"true"`/`"false"`
    serialisation for `limit` / `first_parent`). One existing assertion
    updated for the params-list shape.
  * Full backend suite green: 626 passed in 542 s (was 624 before; net +2
    once the removed diff-walk tests are accounted for).
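
The params shape pinned by the REST client test can be illustrated like
this. `build_log_commits_params` is a hypothetical helper written for this
sketch, not the function in `lakefs_rest_client`; it only shows why a list
of tuples is needed for repeated keys and how booleans are serialised.

```python
from urllib.parse import urlencode


def build_log_commits_params(
    objects=None, prefixes=None, amount=None, limit=None, first_parent=None
):
    """Build query params as a list of tuples so keys can repeat."""
    params = []
    if amount is not None:
        params.append(("amount", str(amount)))
    for obj in objects or []:
        params.append(("objects", obj))  # one tuple per object path
    for prefix in prefixes or []:
        params.append(("prefixes", prefix))  # one tuple per prefix
    if limit is not None:
        params.append(("limit", "true" if limit else "false"))
    if first_parent is not None:
        params.append(("first_parent", "true" if first_parent else "false"))
    return params


params = build_log_commits_params(
    objects=["README.md", "a.txt"], amount=1, limit=True, first_parent=False
)
query = urlencode(params)
```

A plain dict would collapse the two `objects` entries into one, which is
why the list-of-tuples form matters.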

Acceptance fixture: a new local-only seed `mai_lin/tree-expand-stress-bench`
(280 commits / ~94 surviving files) exercises the chaotic-history pattern
behind issue #59 — heavy modify / delete / restore cycles biased to a hot
12-path tier, plus periodic `deletedFolder` ops. The seed is byte-deterministic
(hash-based byte stream, no `random` module) per AGENTS §2. Bumps SEED_VERSION
to `local-dev-demo-v8`. `commit_files` extended to dispatch the new
`DeletedFileSeed` / `DeletedFolderSeed` / `CopyFileSeed` op shapes.
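
One way to get a byte-deterministic stream without the `random` module is
to chain hash digests over a label and a counter. The sketch below is an
assumption about the general technique, not the seed script's actual code;
`deterministic_bytes` and the label format are made up for illustration.

```python
import hashlib


def deterministic_bytes(label: str, size: int) -> bytes:
    """Return `size` bytes derived only from `label` (no RNG state)."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        # Each block is SHA-256 of "label:counter", so output is reproducible.
        out += hashlib.sha256(f"{label}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


first = deterministic_bytes("shard/group_00/a.bin@commit-17", 100)
again = deterministic_bytes("shard/group_00/a.bin@commit-17", 100)
other = deterministic_bytes("shard/group_00/b.bin@commit-17", 100)
```

The same label always yields the same bytes across runs and machines, and
different labels yield different content.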

(`CopyFileSeed` is wired through but unused for now — KohakuHub's
`process_copy_file` re-links the source's internal LakeFS physical address,
which LakeFS 1.80 rejects with "address is not signed: link address invalid"
for non-LFS sources. Filed as a separate issue.)

Refs: #59 (the original perf report and updated Plan E follow-up).
2026-04-29 13:22:36 +08:00


"""Shared constants for local demo seed tooling."""
SEED_VERSION = "local-dev-demo-v8"