[PR #1619] [MERGED] refactor: prevent further clone-size bloat (Phase 1, #1393, #1175) #9225

Closed
opened 2026-05-03 01:29:24 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/harvard-edge/cs249r_book/pull/1619
Author: @profvjreddi
Created: 4/30/2026
Status: Merged
Merged: 4/30/2026
Merged by: @profvjreddi

Base: dev ← Head: refactor/clone-size


📝 Commits (4)

  • 2ab05b8 chore(repo): add .gitattributes with going-forward Git LFS tracking
  • ea3087a chore(gitignore): exclude generated bundle.js drops going forward
  • f9c7a24 docs(contributing): add Common gotchas section for first-time contributors
  • 5b997c5 merge dev into refactor/clone-size, keep both ignore additions

📊 Changes

3 files changed (+135 additions, -0 deletions)

.gitattributes (+83 -0)
📝 .gitignore (+16 -0)
📝 CONTRIBUTING.md (+36 -0)

📄 Description

Summary

This PR is Phase 1 of the response to issues #1393 and #1175: pure
prevention, no history rewrite. It stops new binaries and bundles from
compounding the existing 2.6 GB .git, and adds a Common gotchas
section to root CONTRIBUTING.md so first-time contributors find the
TinyTorch / Co-Labs / large-binary guidance the issue called out.

What's in this PR (three atomic commits, plus a merge of dev into the branch):

  1. chore(repo): add .gitattributes with going-forward Git LFS tracking:
    marks *.epub, *.pdf, *.mp3, *.wav, *.m4a, *.mp4, *.mov,
    *.webm, *.wasm for Git LFS. Mixed-size patterns (*.png, *.jpg,
    *.gif) are deliberately omitted; see the deferred decisions below.
  2. chore(gitignore): exclude generated bundle.js drops going forward:
    adds book/quarto/tools/scripts/socratiQ/bundle.js, the symlinked
    mirror at book/tools/scripts/socratiQ/bundle.js, and the legacy
    scripts/ai_menu/dist/*.bundle.js paths.
  3. docs(contributing): add Common gotchas section for first-time
    contributors: adds the tito CLI, tito src export, Pyodide /
    Marimo cell-return, and large-binary callouts to root
    CONTRIBUTING.md, linking to canonical per-area docs rather than
    duplicating them.

Important: adding LFS tracking to .gitattributes does NOT migrate
existing history. The big binaries already in .git stay there. That's
Phase 2 (below), and out of scope for this PR.
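
For reference, each tracked pattern normally becomes one attribute line of the form that `git lfs track` writes. The sketch below prints those lines for the nine patterns named in commit 1. `lfs_attr_lines` is a hypothetical helper used only for illustration, and the actual 83-line .gitattributes is not reproduced in this description, so its exact contents may differ.

```shell
# Hypothetical helper: print one LFS attribute line per pattern,
# in the form `git lfs track "<pattern>"` appends to .gitattributes.
lfs_attr_lines() {
  for pattern in "$@"; do
    # Quoting keeps the shell from expanding the glob.
    printf '%s filter=lfs diff=lfs merge=lfs -text\n' "$pattern"
  done
}

# The nine patterns listed for commit 1.
lfs_attr_lines '*.epub' '*.pdf' '*.mp3' '*.wav' '*.m4a' \
  '*.mp4' '*.mov' '*.webm' '*.wasm'
```

These lines only affect files added after the attributes are in place, which is why blobs already in history are untouched.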

Audit numbers (snapshot taken on refactor/clone-size, 2026-04-30 UTC)

Metric Value
.git size on disk (common dir) 2.6 GB
Total reachable commits 13,660
Distinct blobs > 1 MB anywhere in history 797
Cumulative bytes of blobs > 1 MB 5.31 GB across history (pre-pack)

Top-10 paths by cumulative bytes across history

Cumulative bytes Versions Path
952,363,735 15 assets/downloads/Machine-Learning-Systems.epub
595,473,502 16 assets/downloads/Machine-Learning-Systems.pdf
581,652,800 32 interviews/staffml/src/data/corpus.json (already gitignored)
466,267,746 21 interviews/vault/corpus.json (already gitignored)
444,089,970 117 search.json (Quarto search index, regenerated each build)
285,573,256 25 interviews/corpus.json
158,585,757 5 book/assets/downloads/Machine-Learning-Systems.epub
156,938,369 7 kits/assets/downloads/Hardware-Kits.pdf
113,190,282 20 interviews/staffml/src/data/corpus-summary.json
94,026,126 3 assets/Machine-Learning-Systems.pdf
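
A table like the one above could be produced by summing blob sizes per path over all reachable history. This is a sketch of one way to do it, not necessarily how the audit was generated. `sum_by_path` is a hypothetical helper, and it assumes paths contain no spaces.

```shell
# Hypothetical helper: read "<type> <sha> <size> <path>" lines on stdin and
# print "<cumulative bytes> <versions> <path>", largest first.
# Assumption: paths contain no spaces (only the first word of the path is used).
sum_by_path() {
  awk '$1 == "blob" { bytes[$4] += $3; versions[$4]++ }
       END { for (p in bytes) printf "%.0f %d %s\n", bytes[p], versions[p], p }' |
    sort -rn
}

# Usage inside a full clone (requires git):
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     sum_by_path | head -n 10
```

Because `git rev-list --objects` emits each distinct object once, the versions column counts distinct blobs seen at a path rather than commits touching it.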

The full top-50 list and the >500 KB tracked-in-HEAD list are in the
audit document saved at /tmp/clone-size-audit-2026-04-30.md (the
intended .claude/_reviews/ path was blocked by the local Claude Code
sandbox in this run; the file should be moved to that location for the
permanent evidence trail).

3-phase migration plan

Phase 1 — this PR (non-destructive prevention)

  • .gitattributes for LFS tracking of binary patterns going forward.
  • .gitignore updates for build artefacts (bundle.js).
  • CONTRIBUTING.md Common gotchas section.
  • Audit document saved as evidence.
  • No commits are rewritten. Existing forks stay valid.

Phase 2 — future, requires VJ approval and team coordination

The actual migration of existing history out of pack files. Concrete
steps:

  1. Announce to all active contributors a 24-48h freeze window for
    PRs to dev and main.
  2. Take a fresh, full clone on a maintainer machine (not a worktree;
    git lfs migrate rewrites history):
    git clone https://github.com/harvard-edge/cs249r_book.git cs249r_book-migrate
    cd cs249r_book-migrate
    git fetch --all --tags
    
  3. Install git-lfs and git-filter-repo (brew install git-lfs git-filter-repo).
  4. Run the migration for the patterns we know are safe (matches
    .gitattributes from Phase 1):
    git lfs migrate import \
      --everything \
      --include="*.epub,*.pdf,*.mp3,*.wav,*.m4a,*.mp4,*.mov,*.webm,*.wasm"
    
    This rewrites history. To verify, count the reachable blobs larger
    than 1 MB; the count should drop sharply:
    git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | awk '$1=="blob" && $3>1000000' | wc -l
  5. git lfs push --all origin to upload the LFS objects.
  6. Force-push main and dev (and any active long-lived branches):
    git push --force-with-lease origin main dev
    
  7. Notify all contributors to either:
    • re-clone the repo, OR
    • run git fetch && git reset --hard origin/<branch> on every active
      branch (warn them this discards local commits and uncommitted
      changes on that branch).
  8. Update CI to call git lfs install (and git lfs pull where the
    binaries are needed for the build — book PDF rendering, podcast
    embedding, etc.).
  9. Optional follow-up with git-filter-repo to drop paths that no
    longer exist in HEAD and aren't needed historically (e.g., the old
    assets/downloads/Machine-Learning-Systems.{epub,pdf} already
    removed from working tree). This shrinks .git further but is more
    destructive.

Estimated outcome (matching #1393's expectation): fresh clone drops from
~2 GB to under 200 MB, plus a one-time LFS-object pull on first
checkout.
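
The verification in step 4 can be wrapped so the same count is taken before and after the migration and the two numbers compared. A minimal sketch, assuming the 1,000,000-byte threshold from the step 4 command; `count_large_blobs` is a hypothetical name, not an existing script in the repo.

```shell
# Hypothetical helper: read "<type> <sha> <size> <path>" lines on stdin and
# print how many blobs are larger than THRESHOLD bytes (default 1000000).
count_large_blobs() {
  threshold=${1:-1000000}
  awk -v t="$threshold" '$1 == "blob" && $3 > t { n++ } END { print n + 0 }'
}

# Usage in the migration clone (requires git), run once before and once
# after `git lfs migrate import`:
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     count_large_blobs
```

Recording the "before" number in the freeze announcement would give contributors a concrete figure to check the result against.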

Phase 3 — optional, longer-term

Consider splitting heavyweight historical content into release-asset
repos:

  • Old book distribution PDFs / EPUBs onto GitHub Releases (or a
    dedicated cs249r_book-releases repo).
  • Podcast MP3s onto a separate cs249r_book-podcasts repo or a CDN.
  • Keep the main repo for source: .qmd, .py, .bib, SVGs.

This requires URL-rewriting in book content and has CDN cost
implications; defer until Phase 2 is settled and cost / DX is measured.

Decisions deferred to VJ (honest list)

  1. PNG LFS coverage. I did NOT add *.png to .gitattributes.
    The repo mixes 1-5 MB cover art / kit photos with thousands of small
    icon PNGs and a blanket pattern would LFS-track the small ones too.
    Options: (a) path-scoped patterns
    (book/quarto/assets/images/covers/**/*.png,
    kits/contents/**/images/png/*.png); (b) *.png only after small
    icons are moved to SVG; (c) leave PNG out of LFS entirely. Please
    pick one.
  2. *.gif. Same dilemma — _alphafold.gif is 3 MB but most other
    GIFs are small. Deferred.
  3. Phase-2 timing. The migration plan above is a proposal, not a
    schedule.
  4. Whether to also untrack the existing bundle.js (separate change)
    and the existing 00_tinytorch.pdf, podcast MP3s, etc. tracked in
    HEAD. Phase 1 only stops new drops.
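
To weigh the PNG options in item 1, it may help to see which PNGs in the working tree actually exceed a size cutoff. A minimal sketch, assuming a 1 MiB cutoff chosen purely for illustration. `large_pngs` is a hypothetical helper, and it inspects only the current checkout, not history.

```shell
# Hypothetical helper: list PNG files under DIR larger than LIMIT_KB KiB.
# Defaults: current directory, 1024 KiB (an illustrative cutoff, not a repo policy).
large_pngs() {
  dir=${1:-.}
  limit_kb=${2:-1024}
  find "$dir" -type f -name '*.png' -size +"${limit_kb}"k | sort
}

# Usage from the repo root:
#   large_pngs book/quarto/assets/images
```

If the large files cluster under a few directories, that would favour the path-scoped patterns in option (a).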

Test plan

  • Confirm .gitattributes does not break existing git status /
    git diff for non-LFS files.
  • Spot-check that adding a fake test.epub to a scratch branch
    results in an LFS pointer (git check-attr filter test.epub reports
    filter: lfs, and after git add, git cat-file -p :test.epub prints
    the pointer text rather than the file contents).
  • Verify the markdown link checker (pre-commit
    Repo: Validate internal markdown links + anchors) passes —
    the CONTRIBUTING.md update added internal anchors and external
    issue links.
  • Confirm bundle.js paths in .gitignore don't accidentally
    ignore non-bundle files.
  • Manual: rebuild the SocratiQ bundle (cd socratiq && npm run build:vite) and confirm git status ignores the regenerated bundle.js.
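
Two of the checks above can be scripted with stock git commands. The sketch below assumes it is run from the repo root; `is_lfs_tracked` and `is_ignored` are hypothetical helper names, and the check only confirms that the attribute and ignore rules match, not that git-lfs itself is installed.

```shell
# Hypothetical helper: succeed only if .gitattributes assigns the LFS filter to PATH.
is_lfs_tracked() {
  [ "$(git check-attr filter -- "$1")" = "$1: filter: lfs" ]
}

# Hypothetical helper: succeed only if PATH matches an ignore rule.
is_ignored() {
  git check-ignore -q -- "$1"
}

# Usage from the repo root:
#   is_lfs_tracked test.epub && echo "test.epub goes through LFS"
#   is_ignored book/quarto/tools/scripts/socratiQ/bundle.js && echo "bundle ignored"
```

Both commands evaluate the rules against a path name, so the file does not need to exist for the check to run.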

Relates to #1393
Relates to #1175


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-03 01:29:24 -05:00
Reference: github-starred/cs249r_book#9225