mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-06 17:49:07 -05:00
📋 Pull Request Information
Original PR: https://github.com/harvard-edge/cs249r_book/pull/1619
Author: @profvjreddi
Created: 4/30/2026
Status: ✅ Merged
Merged: 4/30/2026
Merged by: @profvjreddi
Base: dev ← Head: refactor/clone-size
📝 Commits (4)
- 2ab05b8 chore(repo): add .gitattributes with going-forward Git LFS tracking
- ea3087a chore(gitignore): exclude generated bundle.js drops going forward
- f9c7a24 docs(contributing): add Common gotchas section for first-time contributors
- 5b997c5 merge dev into refactor/clone-size, keep both ignore additions
📊 Changes
3 files changed (+135 additions, -0 deletions)
➕ .gitattributes (+83 -0)
📝 .gitignore (+16 -0)
📝 CONTRIBUTING.md (+36 -0)
📄 Description
Summary
This PR is Phase 1 of the response to issues #1393 and #1175: pure
prevention, no history rewrite. It stops new binaries and bundles from
compounding the existing 2.6 GB .git, and adds a Common gotchas section
to root CONTRIBUTING.md so first-time contributors find the TinyTorch /
Co-Labs / large-binary guidance the issue called out.
What's in this PR (three atomic commits):
- chore(repo): add .gitattributes with going-forward Git LFS tracking
  Marks *.epub, *.pdf, *.mp3, *.wav, *.m4a, *.mp4, *.mov, *.webm, *.wasm
  for Git LFS. Mixed-size patterns (*.png, *.jpg, *.gif) are deliberately
  omitted; see deferred decisions below.
- chore(gitignore): exclude generated bundle.js drops going forward
  Adds book/quarto/tools/scripts/socratiQ/bundle.js, the symlinked mirror
  at book/tools/scripts/socratiQ/bundle.js, and the legacy
  scripts/ai_menu/dist/*.bundle.js paths.
- docs(contributing): add Common gotchas section for first-time contributors
  Adds the tito CLI, tito src export, Pyodide / Marimo cell-return, and
  large-binary callouts to root CONTRIBUTING.md, linking to canonical
  per-area docs rather than duplicating.
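For orientation, the LFS patterns named in the first commit would typically appear in .gitattributes as one attribute line per pattern. The sketch below prints lines in the form that git lfs track writes. It is an illustration only, not the contents of the 83-line file added by this PR, and the helper name lfs_attr_lines is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): print one .gitattributes line
# per extension named in the first commit, in the form `git lfs track` writes.
lfs_attr_lines() {
  for ext in epub pdf mp3 wav m4a mp4 mov webm wasm; do
    printf '*.%s filter=lfs diff=lfs merge=lfs -text\n' "$ext"
  done
}

# Print the nine lines to stdout; nothing is written to disk.
lfs_attr_lines
```

Because these lines only affect files added after they land, they match the "going forward" framing of the commit: existing blobs are untouched.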
Important: adding LFS tracking to .gitattributes does NOT migrate
existing history. The big binaries already in .git stay there. That's
Phase 2 (below), and out of scope for this PR.
Audit numbers (snapshot taken on refactor/clone-size, 2026-04-30 UTC)
.git size on disk (common dir)
Top-10 paths by cumulative bytes across history
1. assets/downloads/Machine-Learning-Systems.epub
2. assets/downloads/Machine-Learning-Systems.pdf
3. interviews/staffml/src/data/corpus.json (already gitignored)
4. interviews/vault/corpus.json (already gitignored)
5. search.json (Quarto search index, regenerated each build)
6. interviews/corpus.json
7. book/assets/downloads/Machine-Learning-Systems.epub
8. kits/assets/downloads/Hardware-Kits.pdf
9. interviews/staffml/src/data/corpus-summary.json
10. assets/Machine-Learning-Systems.pdf
The full top-50 list and the >500 KB tracked-in-HEAD list are in the
audit document saved at /tmp/clone-size-audit-2026-04-30.md (the
intended .claude/_reviews/ path was blocked by the local Claude Code
sandbox in this run; the file should be moved to that location for the
permanent evidence trail).
3-phase migration plan
Phase 1: this PR (non-destructive prevention)
- .gitattributes for LFS tracking of binary patterns going forward.
- .gitignore updates for build artefacts (bundle.js).
- CONTRIBUTING.md Common gotchas section.
Phase 2: future, requires VJ approval and team coordination
The actual migration of existing history out of pack files. Concrete
steps:
- […] PRs to dev and main.
- […] git lfs migrate rewrites history): git-lfs and git-filter-repo
  (brew install git-lfs git-filter-repo).
- […] .gitattributes from Phase 1):
- git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | awk '$1=="blob" && $3>1000000' | wc -l
  (should drop sharply).
- git lfs push --all origin to upload the LFS objects.
- […] main and dev (and any active long-lived branches):
- git fetch && git reset --hard origin/<branch> on every active branch
  (warn them this discards local history).
- […] git lfs install (and git lfs pull where the binaries are needed
  for the build: book PDF rendering, podcast embedding, etc.).
- […] git-filter-repo to drop paths that no longer exist in HEAD and
  aren't needed historically (e.g., the old
  assets/downloads/Machine-Learning-Systems.{epub,pdf} already removed
  from working tree). This shrinks .git further but is more destructive.
Estimated outcome (matching #1393's expectation): fresh clone drops from
~2 GB to under 200 MB, plus a one-time LFS-object pull on first
checkout.
Phase 3: optional, longer-term
Consider splitting heavyweight historical content into release-asset
repos:
- […] dedicated cs249r_book-releases repo).
- […] cs249r_book-podcasts repo or a CDN.
- […] .qmd, .py, .bib, SVGs.
This requires URL-rewriting in book content and has CDN cost
implications; defer until Phase 2 is settled and cost / DX is measured.
Decisions deferred to VJ (honest list)
- […] *.png to .gitattributes. The repo mixes 1-5 MB cover art / kit
  photos with thousands of small icon PNGs, and a blanket pattern would
  LFS-track the small ones too. Options: (a) path-scoped patterns
  (book/quarto/assets/images/covers/**/*.png,
  kits/contents/**/images/png/*.png); (b) *.png only after small icons
  are moved to SVG; (c) leave PNG out of LFS entirely. Please pick one.
- […] *.gif. Same dilemma: _alphafold.gif is 3 MB but most other GIFs
  are small. Deferred.
- […] schedule.
- […] bundle.js (separate change) and the existing 00_tinytorch.pdf,
  podcast MP3s, etc. tracked in HEAD. Phase 1 only stops new drops.
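To help weigh option (a) for PNGs, the effect of path-scoped patterns can be previewed in a scratch repository with git check-attr, which reads .gitattributes without needing git-lfs installed. This is a sketch under assumptions: the two patterns are copied from option (a), while the helper filter_for and the file names passed to it are hypothetical.

```shell
#!/bin/sh
# Scratch repository; nothing here touches the real checkout.
scratch=$(mktemp -d)
git init -q "$scratch"

# Option (a): path-scoped LFS patterns, copied from the list above.
cat > "$scratch/.gitattributes" <<'EOF'
book/quarto/assets/images/covers/**/*.png filter=lfs diff=lfs merge=lfs -text
kits/contents/**/images/png/*.png filter=lfs diff=lfs merge=lfs -text
EOF

# Print the value of the `filter` attribute for a path
# ("lfs" when a pattern matches, "unspecified" otherwise).
filter_for() {
  git -C "$scratch" check-attr filter -- "$1" | sed 's/.*: filter: //'
}

# Hypothetical file names, used only to exercise the patterns.
filter_for book/quarto/assets/images/covers/vol1/front.png
filter_for book/quarto/assets/images/icons/star.png
```

With these patterns, a PNG under the covers directory resolves to the lfs filter while an icon elsewhere stays unspecified, which is the behaviour option (a) is after.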
Test plan
- […] .gitattributes does not break existing git status / git diff for
  non-LFS files.
- […] test.epub to a scratch branch results in an LFS pointer
  (git ls-files :test.epub shows lfs filter after git add).
- […] Repo: Validate internal markdown links + anchors) passes; the
  CONTRIBUTING.md update added internal anchors and external issue links.
- […] bundle.js paths in .gitignore don't accidentally ignore non-bundle
  files.
- […] cd socratiq && npm run build:vite) and confirm git status
  correctly ignores the regen.
Relates to #1393
Relates to #1175
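The .gitignore item in the test plan can be spot-checked with git check-ignore in a scratch repository. A minimal sketch, assuming the three entries appear as described in this PR (the full 16-line .gitignore diff is not reproduced here). The helper is_ignored and the non-bundle file names index.js, menu.bundle.js, and menu.js are hypothetical.

```shell
#!/bin/sh
# Scratch repository; nothing here touches the real checkout.
scratch=$(mktemp -d)
git init -q "$scratch"

# The three bundle.js entries named in the gitignore commit.
cat > "$scratch/.gitignore" <<'EOF'
book/quarto/tools/scripts/socratiQ/bundle.js
book/tools/scripts/socratiQ/bundle.js
scripts/ai_menu/dist/*.bundle.js
EOF

# Exit status 0 if the path is ignored, 1 if it is not.
is_ignored() {
  git -C "$scratch" check-ignore -q -- "$1"
}

# Report each path; the files do not need to exist for check-ignore to work.
for path in \
  book/quarto/tools/scripts/socratiQ/bundle.js \
  book/quarto/tools/scripts/socratiQ/index.js \
  scripts/ai_menu/dist/menu.bundle.js \
  scripts/ai_menu/dist/menu.js
do
  if is_ignored "$path"; then
    echo "ignored:     $path"
  else
    echo "not ignored: $path"
  fi
done
```

Running the same is_ignored check from the real checkout, after a bundle rebuild, would cover the last test-plan item as well.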
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.