Merge pull request #1619 from harvard-edge/refactor/clone-size

refactor: prevent further clone-size bloat (Phase 1, #1393, #1175)
Vijay Janapa Reddi
2026-04-30 19:05:35 -04:00
committed by GitHub
3 changed files with 135 additions and 0 deletions

.gitattributes (new file, 83 lines added)

@@ -0,0 +1,83 @@
# =============================================================================
# .gitattributes — going-forward Git LFS tracking and text/binary handling
# =============================================================================
#
# IMPORTANT: this file affects ONLY future `git add` operations. Existing
# blobs in history are NOT migrated by these patterns. A separate, coordinated
# `git lfs migrate import` (Phase 2) is required to actually relocate the
# ~2 GB of binaries already in `.git`. See PR #(this PR) and issues #1393,
# #1175 for the migration plan.
#
# `.gitignore` takes precedence over LFS tracking: an ignored file is never
# staged, so the LFS patterns below never see it. They apply only to files
# that are actually staged, such as the callout-icon PDFs that `.gitignore`
# exempts from its global *.pdf rule.
# -----------------------------------------------------------------------------
# Distribution / publish artefacts (large, infrequently changing, binary)
# -----------------------------------------------------------------------------
# EPUB: zero currently tracked in HEAD; ~952 MB across 15 historical versions
# in `assets/downloads/Machine-Learning-Systems.epub`. Mark for LFS so any
# future re-add does not bloat .git.
*.epub filter=lfs diff=lfs merge=lfs -text
# PDF: covers TinyTorch-Guide.pdf, 00_tinytorch.pdf, distribution PDFs.
# Note: `.gitignore` excludes most PDFs by default but explicitly allows
# callout-icon PDFs, mlsysim docs, paper figures, etc. Those exempted PDFs
# WILL be LFS-tracked under this pattern when newly added — that's the
# intended behaviour: small icon PDFs are still small as LFS pointers, and
# they are infrequently changed.
*.pdf filter=lfs diff=lfs merge=lfs -text
# -----------------------------------------------------------------------------
# Audio / video (always binary, never delta-compresses well)
# -----------------------------------------------------------------------------
# Two MP3 podcasts are currently tracked (~16 MB combined). Multiple sites
# (book quarto, socratiQ, kits) may add more in future.
*.mp3 filter=lfs diff=lfs merge=lfs -text
*.wav filter=lfs diff=lfs merge=lfs -text
*.m4a filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.mov filter=lfs diff=lfs merge=lfs -text
*.webm filter=lfs diff=lfs merge=lfs -text
# -----------------------------------------------------------------------------
# Bundled JS / WASM artefacts (if ever tracked)
# -----------------------------------------------------------------------------
# These are typically build outputs and SHOULD be ignored via .gitignore
# rather than tracked (see `book/quarto/tools/scripts/socratiQ/bundle.js`,
# the historical `scripts/ai_menu/dist/bundle.js`, and the Next.js
# `staffml/_next/static/chunks/*.js` blobs). However, if a bundle ever does
# need to be tracked (e.g., a vendored externally-published artefact),
# treat it as binary so we don't burn diff cycles.
*.wasm filter=lfs diff=lfs merge=lfs -text
# -----------------------------------------------------------------------------
# NOT added to LFS (uncertainty / mixed-size patterns) — defer to VJ
# -----------------------------------------------------------------------------
# *.png — repo mixes 1-5 MB cover art / kit photos with thousands of small
# icon PNGs. A blanket pattern would LFS-track the small ones too.
# Recommend either path-scoped patterns
# (e.g. `book/quarto/assets/images/covers/**/*.png filter=lfs ...`)
# or consolidating big PNGs into a single canonical location first.
# *.jpg / *.jpeg / *.gif — same mixed-size issue. The single biggest GIF is
# `book/quarto/contents/vol1/introduction/images/gif/_alphafold.gif`
# at 3 MB; most others are small.
# *.json — `corpus.json`, `corpus-summary.json`, `search.json` are big but
# they are build artefacts and already in `.gitignore`. JSON in
# general should NOT be LFS-tracked (it's text and diffs well).
# -----------------------------------------------------------------------------
# Text-handling normalization
# -----------------------------------------------------------------------------
# Tell git to auto-normalize line endings on text files. Caveat: a later line
# overrides earlier ones per attribute, so this resets `-text` above to `auto`.
* text=auto eol=lf
# Shell scripts and Makefiles must keep LF on Windows checkouts.
*.sh text eol=lf
Makefile text eol=lf
# Keep CRLF line endings for Windows-native batch files.
*.bat text eol=crlf
*.cmd text eol=crlf
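
One thing worth checking in a layout like the one above is how git combines several matching lines. The gitattributes documentation states that when more than one pattern matches a path, a later line overrides an earlier one, per attribute. The Python sketch below is a deliberately simplified model of that rule, not git's implementation: it assumes slash-free patterns matched against the basename only, and the helper names (`parse_attrs`, `resolve`) and the two example paths are hypothetical. `git check-attr -a -- <path>` remains the authoritative check.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_attrs(tokens):
    """Turn tokens such as 'filter=lfs', '-text' or 'text' into a dict."""
    attrs = {}
    for tok in tokens:
        if tok.startswith("-"):
            attrs[tok[1:]] = False          # attribute explicitly unset
        elif "=" in tok:
            key, value = tok.split("=", 1)
            attrs[key] = value              # attribute set to a value
        else:
            attrs[tok] = True               # attribute set
    return attrs


def resolve(path, lines):
    """Apply matching lines top to bottom; later lines win per attribute."""
    name = PurePosixPath(path).name
    result = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *tokens = line.split()
        # Simplification: only basename matching, no '/' or '**' handling.
        if fnmatchcase(name, pattern):
            result.update(parse_attrs(tokens))
    return result


# Three lines taken from the file above, in the same relative order.
RULES = """
*.pdf filter=lfs diff=lfs merge=lfs -text
* text=auto eol=lf
*.sh text eol=lf
""".splitlines()

pdf = resolve("docs/guide.pdf", RULES)    # hypothetical path
sh = resolve("tools/build.sh", RULES)     # hypothetical path
print(pdf)
print(sh)
```

In this model the PDF keeps `filter=lfs` but ends up with `text=auto` rather than an unset `text`, because the catch-all line comes last. Placing the catch-all before the binary patterns would let their `-text` win instead.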

.gitignore (16 lines added)

@@ -343,5 +343,21 @@ interviews/paper/svg_structure.txt
interviews/vault/corpus.json
interviews/staffml/src/data/corpus.json
# SocratiQ bundle — build artefact produced by `socratiq/` Vite build
# (`npm --prefix socratiq run build:vite`), which copies the output into
# `book/quarto/tools/scripts/socratiQ/bundle.js`. The mirror at
# `book/tools/scripts/socratiQ/bundle.js` is a symlink to that file. The
# bundle is ~7 MB and has accumulated ~57 MB across 7 historical versions.
# Going forward, the bundle should be regenerated rather than re-committed.
# (This rule does NOT untrack the existing tracked bundle.js — that cleanup
# is a separate, coordinated change. It only stops *new* drops from being
# added by accident.)
book/quarto/tools/scripts/socratiQ/bundle.js
book/tools/scripts/socratiQ/bundle.js
# bundle.js and *.bundle.js directly inside scripts/ai_menu/dist/ (legacy build
# path from issue #1175; no longer in HEAD but defensively excluded).
scripts/ai_menu/dist/bundle.js
scripts/ai_menu/dist/*.bundle.js
# vault-cli check output — dump from `vault check` runs, regenerated on demand
interviews/vault-cli/check_results.json
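
The reach of the new ignore rules can be sketched with a small matcher. In gitignore syntax a pattern containing a slash is anchored to the directory of the `.gitignore` file, and `*` does not match across `/`. The Python below is a rough approximation under those two rules only; it ignores `**`, negation, character ranges and directory-only patterns. The function names and the `vendor.bundle.js` paths are hypothetical. `git check-ignore -v <path>` gives the real answer.

```python
import re

# The four bundle patterns added in this hunk.
PATTERNS = [
    "book/quarto/tools/scripts/socratiQ/bundle.js",
    "book/tools/scripts/socratiQ/bundle.js",
    "scripts/ai_menu/dist/bundle.js",
    "scripts/ai_menu/dist/*.bundle.js",
]


def anchored_pattern_to_regex(pattern):
    """Translate an anchored pattern; '*' and '?' never match a '/'."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")
        elif ch == "?":
            parts.append("[^/]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def is_ignored(path, patterns=PATTERNS):
    """True if any pattern matches the whole repository-relative path."""
    return any(anchored_pattern_to_regex(p).fullmatch(path) for p in patterns)


print(is_ignored("scripts/ai_menu/dist/bundle.js"))
print(is_ignored("scripts/ai_menu/dist/vendor.bundle.js"))         # hypothetical file
print(is_ignored("scripts/ai_menu/dist/nested/vendor.bundle.js"))  # hypothetical file
```

Under this approximation, files sitting directly in `scripts/ai_menu/dist/` match, while the same filename one directory deeper does not. Neither the sketch nor the real rules affect a file that is already tracked.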


@@ -29,6 +29,42 @@ Not sure which one applies? Open a
[Discussion](https://github.com/harvard-edge/cs249r_book/discussions) and we'll
help route it.
## Common gotchas first-time contributors hit
These are the things that aren't obvious from reading any single sub-project's
README. Each links to the canonical doc rather than restating it.
* **TinyTorch uses the `tito` CLI for everything.** Module status, tests,
exports, environment health all flow through `tito` (`tito --version`,
`tito system health`, `tito module status`, `tito module test NN`). See
[`tinytorch/CONTRIBUTING.md`](tinytorch/CONTRIBUTING.md) for the full
command list. If `tito` isn't on your PATH after
`pip install -e tinytorch/`, re-activate your venv.
* **TinyTorch source edits need an export step.** When you change a file
under `tinytorch/src/`, the in-package version under `tinytorch/tinytorch/`
is regenerated by `tito src export` (see
[`tinytorch/CONTRIBUTING.md`](tinytorch/CONTRIBUTING.md), "Module
Development"). `tinytorch/tinytorch/*` is gitignored — the source of
truth is `src/`.
* **Co-Labs run in the browser via Pyodide / WebAssembly.** Imports must be
Pyodide-compatible (no compiled-only packages without a wheel) and every
Marimo cell that produces a UI element must `return` it so the dataflow
routes it onward — that's release invariant #4 in
[`labs/PROTOCOL.md`](labs/PROTOCOL.md). The lab test suite enforces both.
* **Don't commit large binaries.** Distribution PDFs, EPUBs, podcast MP3s,
and JS bundles balloon `.git` (see issues
[#1393](https://github.com/harvard-edge/cs249r_book/issues/1393) and
[#1175](https://github.com/harvard-edge/cs249r_book/issues/1175)). The
root `.gitattributes` routes future EPUB / PDF / MP3 / MP4 / WAV / WASM
additions to Git LFS (with Git LFS installed locally). Generated artefacts
(`bundle.js`, `corpus.json`, search indexes) are gitignored — regenerate
them locally rather than committing.
* **Where each area lives** — the table above is the authoritative map.
At a glance: `book/` for the textbook, `tinytorch/` for the framework,
`labs/` for browser labs, `kits/` for hardware recipes, `mlsysim/` for
the simulator, `instructors/` for teaching materials, `slides/` for
per-chapter decks, `interviews/` for StaffML.
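
As an illustration of the large-binary point in the list above, here is a minimal Python sketch of a local check that flags big files whose extension is not one of those routed to Git LFS. It is not part of the repository's tooling. The 1 MB threshold, the function name `oversized_non_lfs`, and the demo filenames are all assumptions made for this example.

```python
import os
import tempfile

# Extensions the root .gitattributes routes to Git LFS.
LFS_SUFFIXES = {".epub", ".pdf", ".mp3", ".wav", ".m4a", ".mp4", ".mov", ".webm", ".wasm"}

# Hypothetical threshold for this sketch; the repository does not define one.
MAX_BYTES = 1_000_000


def oversized_non_lfs(paths, max_bytes=MAX_BYTES):
    """Return paths larger than max_bytes that no LFS suffix covers."""
    flagged = []
    for path in paths:
        suffix = os.path.splitext(path)[1]
        if suffix in LFS_SUFFIXES:
            continue  # would be committed as a small LFS pointer
        if os.path.getsize(path) > max_bytes:
            flagged.append(path)
    return flagged


# Demo with throwaway files (names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    big_png = os.path.join(tmp, "cover.png")
    big_pdf = os.path.join(tmp, "guide.pdf")
    small_txt = os.path.join(tmp, "notes.txt")
    for path, size in [(big_png, 2_000_000), (big_pdf, 2_000_000), (small_txt, 10)]:
        with open(path, "wb") as fh:
            fh.write(b"\0" * size)
    flagged = oversized_non_lfs([big_png, big_pdf, small_txt])
    print([os.path.basename(p) for p in flagged])
```

The large PNG is the only file flagged here, which mirrors the mixed-size image case that the `.gitattributes` comments leave unresolved. In practice the list of paths could come from `git diff --cached --name-only`.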
## Universal policies (apply to every project)
These conventions hold across the whole monorepo. The per-project guides