Merge pull request #1619 from harvard-edge/refactor/clone-size

refactor: prevent further clone-size bloat (Phase 1, #1393, #1175)
2026-05-22 14:03:46 -05:00 · 2026-04-30 19:05:35 -04:00
parent f1d5a755f2 5b997c5abe
commit d5c91ac94d
3 changed files with 135 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,83 @@
+# =============================================================================
+# .gitattributes — going-forward Git LFS tracking and text/binary handling
+# =============================================================================
+#
+# IMPORTANT: this file affects ONLY future `git add` operations. Existing
+# blobs in history are NOT migrated by these patterns. A separate, coordinated
+# `git lfs migrate import` (Phase 2) is required to actually relocate the
+# ~2 GB of binaries already in `.git`. See PR #(this PR) and issues #1393,
+# #1175 for the migration plan.
+#
+# `.gitignore` takes precedence over LFS tracking: an ignored file is not
+# staged by a plain `git add`, so these patterns never see it. A file that
+# `.gitignore` exempts (e.g., callout-icon PDFs excluded from the global
+# *.pdf rule) is LFS-tracked as soon as it is staged.
+
+# -----------------------------------------------------------------------------
+# Distribution / publish artefacts (large, infrequently changing, binary)
+# -----------------------------------------------------------------------------
+# EPUB: zero currently tracked in HEAD; ~952 MB across 15 historical versions
+# in `assets/downloads/Machine-Learning-Systems.epub`. Mark for LFS so any
+# future re-add does not bloat .git.
+*.epub filter=lfs diff=lfs merge=lfs -text
+
+# PDF: covers TinyTorch-Guide.pdf, 00_tinytorch.pdf, distribution PDFs.
+# Note: `.gitignore` excludes most PDFs by default but explicitly allows
+# callout-icon PDFs, mlsysim docs, paper figures, etc. Those exempted PDFs
+# WILL be LFS-tracked under this pattern when newly added — that's the
+# intended behaviour: small icon PDFs are still small as LFS pointers, and
+# they are infrequently changed.
+*.pdf filter=lfs diff=lfs merge=lfs -text
+
+# -----------------------------------------------------------------------------
+# Audio / video (always binary, never delta-compresses well)
+# -----------------------------------------------------------------------------
+# Two MP3 podcasts are currently tracked (~16 MB combined). Multiple sites
+# (book quarto, socratiQ, kits) may add more in future.
+*.mp3 filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
+*.m4a filter=lfs diff=lfs merge=lfs -text
+*.mp4 filter=lfs diff=lfs merge=lfs -text
+*.mov filter=lfs diff=lfs merge=lfs -text
+*.webm filter=lfs diff=lfs merge=lfs -text
+
+# -----------------------------------------------------------------------------
+# Bundled JS / WASM artefacts (if ever tracked)
+# -----------------------------------------------------------------------------
+# These are typically build outputs and SHOULD be ignored via .gitignore
+# rather than tracked (see `book/quarto/tools/scripts/socratiQ/bundle.js`,
+# the historical `scripts/ai_menu/dist/bundle.js`, and the Next.js
+# `staffml/_next/static/chunks/*.js` blobs). Only `*.wasm` is routed to LFS
+# here; if a JS bundle ever must be tracked (e.g., a vendored, externally
+# published artefact), add a path-scoped pattern so it is not diffed.
+*.wasm filter=lfs diff=lfs merge=lfs -text
+
+# -----------------------------------------------------------------------------
+# NOT added to LFS (uncertainty / mixed-size patterns) — defer to VJ
+# -----------------------------------------------------------------------------
+# *.png  — repo mixes 1-5 MB cover art / kit photos with thousands of small
+#          icon PNGs. A blanket pattern would LFS-track the small ones too.
+#          Recommend either path-scoped patterns
+#          (e.g. `book/quarto/assets/images/covers/**/*.png filter=lfs ...`)
+#          or rasterizing big PNGs to a single canonical location first.
+# *.jpg / *.jpeg / *.gif — same mixed-size issue. The single biggest GIF is
+#          `book/quarto/contents/vol1/introduction/images/gif/_alphafold.gif`
+#          at 3 MB; most others are small.
+# *.json — `corpus.json`, `corpus-summary.json`, `search.json` are big but
+#          they are build artefacts and already in `.gitignore`. JSON in
+#          general should NOT be LFS-tracked (it's text and diffs well).
+
+# -----------------------------------------------------------------------------
+# Text-handling normalization
+# -----------------------------------------------------------------------------
+# Auto-normalize line endings on text files. NOTE: later lines win per
+# attribute, so this `*` rule overrides the `-text` set on patterns above.
+* text=auto eol=lf
+
+# Shell scripts and Makefiles must keep LF on Windows checkouts.
+*.sh text eol=lf
+Makefile text eol=lf
+
+# Windows-native batch files must check out with CRLF line endings.
+*.bat text eol=crlf
+*.cmd text eol=crlf
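
The ordering of the `.gitattributes` rules above matters, because Git applies every matching line from top to bottom and a later line overrides an earlier one attribute by attribute. The sketch below is a deliberately simplified model of that rule, not Git's implementation: it uses Python's `fnmatch` on the basename only, and the helper names `parse_rules` and `resolve` and the path `docs/guide.pdf` are hypothetical. It compares the order used in this file with a variant that places the catch-all first.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def parse_rules(text):
    """Parse attribute lines into (pattern, {attr: state}) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *tokens = line.split()
        attrs = {}
        for tok in tokens:
            if tok.startswith("-"):
                attrs[tok[1:]] = "unset"      # e.g. `-text`
            elif "=" in tok:
                key, value = tok.split("=", 1)
                attrs[key] = value            # e.g. `filter=lfs`
            else:
                attrs[tok] = "set"            # e.g. `text`
        rules.append((pattern, attrs))
    return rules


def resolve(rules, path):
    """Apply rules top to bottom; a later match overrides per attribute."""
    # Simplification: slash-free patterns are matched against the basename only.
    name = PurePosixPath(path).name
    result = {}
    for pattern, attrs in rules:
        if fnmatch(name, pattern):
            result.update(attrs)
    return result


# Order used in this file: binary pattern first, catch-all last.
AS_COMMITTED = """
*.pdf filter=lfs diff=lfs merge=lfs -text
* text=auto eol=lf
"""

# Assumed alternative: catch-all first, specific patterns after it.
CATCH_ALL_FIRST = """
* text=auto eol=lf
*.pdf filter=lfs diff=lfs merge=lfs -text
"""

print(resolve(parse_rules(AS_COMMITTED), "docs/guide.pdf")["text"])     # auto
print(resolve(parse_rules(CATCH_ALL_FIRST), "docs/guide.pdf")["text"])  # unset
```

In a real checkout, `git check-attr text filter -- <path>` reports the effective values and is the authoritative check. With `text=auto`, Git's content detection should still treat PDFs and media as binary, so the practical impact is likely small, but placing the catch-all first would make the `-text` opt-out explicit.
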
--- a/.gitignore
+++ b/.gitignore
@@ -343,5 +343,21 @@ interviews/paper/svg_structure.txt
 interviews/vault/corpus.json
 interviews/staffml/src/data/corpus.json

+# SocratiQ bundle — build artefact produced by `socratiq/` Vite build
+# (`npm --prefix socratiq run build:vite`), which copies the output into
+# `book/quarto/tools/scripts/socratiQ/bundle.js`. The mirror at
+# `book/tools/scripts/socratiQ/bundle.js` is a symlink to that file. The
+# bundle is ~7 MB and has accumulated ~57 MB across 7 historical versions.
+# Going forward, the bundle should be regenerated rather than re-committed.
+# (This rule does NOT untrack the existing tracked bundle.js, and git keeps
+# staging changes to a tracked file despite an ignore rule. It takes full
+# effect only after the separate, coordinated cleanup untracks the file.)
+book/quarto/tools/scripts/socratiQ/bundle.js
+book/tools/scripts/socratiQ/bundle.js
+# bundle.js and *.bundle.js directly under scripts/ai_menu/dist/ (legacy
+# build path from issue #1175, no longer in HEAD but defensively excluded).
+scripts/ai_menu/dist/bundle.js
+scripts/ai_menu/dist/*.bundle.js
+
 # vault-cli check output — dump from `vault check` runs, regenerated on demand
 interviews/vault-cli/check_results.json
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -29,6 +29,42 @@ Not sure which one applies? Open a
 [Discussion](https://github.com/harvard-edge/cs249r_book/discussions) and we'll
 help route it.

+## Common gotchas first-time contributors hit
+
+These are the things that aren't obvious from reading any single sub-project's
+README. Each links to the canonical doc rather than restating it.
+
+* **TinyTorch uses the `tito` CLI for everything.** Module status, tests,
+  exports, environment health all flow through `tito` (`tito --version`,
+  `tito system health`, `tito module status`, `tito module test NN`). See
+  [`tinytorch/CONTRIBUTING.md`](tinytorch/CONTRIBUTING.md) for the full
+  command list. If `tito` isn't on your PATH after
+  `pip install -e tinytorch/`, re-activate your venv.
+* **TinyTorch source edits need an export step.** When you change a file
+  under `tinytorch/src/`, the in-package version under `tinytorch/tinytorch/`
+  is regenerated by `tito src export` (see
+  [`tinytorch/CONTRIBUTING.md`](tinytorch/CONTRIBUTING.md), "Module
+  Development"). `tinytorch/tinytorch/*` is gitignored — the source of
+  truth is `src/`.
+* **Co-Labs run in the browser via Pyodide / WebAssembly.** Imports must be
+  Pyodide-compatible (no compiled-only packages without a wheel) and every
+  Marimo cell that produces a UI element must `return` it so the dataflow
+  routes it onward — that's release invariant #4 in
+  [`labs/PROTOCOL.md`](labs/PROTOCOL.md). The lab test suite enforces both.
+* **Don't commit large binaries.** Distribution PDFs, EPUBs, podcast MP3s,
+  and JS bundles balloon `.git` (see issues
+  [#1393](https://github.com/harvard-edge/cs249r_book/issues/1393) and
+  [#1175](https://github.com/harvard-edge/cs249r_book/issues/1175)). The
+  root `.gitattributes` routes future EPUB / PDF / audio / video / WASM
+  additions to Git LFS, provided `git lfs install` has been run. Generated
+  artefacts (`bundle.js`, `corpus.json`, search indexes) are gitignored;
+  regenerate them locally rather than committing.
+* **Where each area lives** — the table above is the authoritative map.
+  At a glance: `book/` for the textbook, `tinytorch/` for the framework,
+  `labs/` for browser labs, `kits/` for hardware recipes, `mlsysim/` for
+  the simulator, `instructors/` for teaching materials, `slides/` for
+  per-chapter decks, `interviews/` for StaffML.
+
 ## Universal policies (apply to every project)

 These conventions hold across the whole monorepo. The per-project guides