[GH-ISSUE #1393] proposal: reduce clone size and fix contributor onboarding gaps #4401

Open
opened 2026-04-19 12:25:16 -05:00 by GiteaMirror · 1 comment
Originally created by @Shashank-Tripathi-07 on GitHub (Apr 18, 2026).
Original GitHub issue: https://github.com/harvard-edge/cs249r_book/issues/1393

Background

I've been contributing to this repo for a few weeks now and noticed two friction points that compound each other: the repo is slow to clone, and once you have it, it's not obvious where to start.

This issue proposes concrete, non-destructive fixes for both.


Problem 1: Clone size (2 GB .git)

A fresh clone transfers roughly 2 GB. The top offenders in git history:

| File | Size in history |
|---|---|
| assets/downloads/Machine-Learning-Systems.epub | 64 MB |
| assets/downloads/Machine-Learning-Systems.pdf | 39 MB (multiple versions) |
| interviews/vault/corpus.json | 27 MB |
| tools/scripts/socratiQ/bundle.js | 18 MB |

These are binary or generated files. Versioning them in git means every contributor and every CI run pays the full cost on every clone.
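
A ranking like the one above can be reproduced with standard git plumbing. The following is a minimal sketch, not the script used for this issue: the helper names (`parse_batch_check`, `largest_blobs`, `scan_repo`) are hypothetical, and the sizes it reports are uncompressed object sizes, which may differ from the on-disk pack size.

```python
import subprocess
from typing import Iterable, List, NamedTuple


class Blob(NamedTuple):
    size: int  # uncompressed size in bytes, as reported by git
    path: str


def parse_batch_check(lines: Iterable[str]) -> List[Blob]:
    """Parse lines of the form '<type> <sha> <size> <path>' and keep blobs only."""
    blobs = []
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append(Blob(int(parts[2]), parts[3]))
    return blobs


def largest_blobs(blobs: Iterable[Blob], top: int = 10) -> List[Blob]:
    """Return the `top` largest blobs, biggest first."""
    return sorted(blobs, key=lambda b: b.size, reverse=True)[:top]


def scan_repo(repo: str = ".") -> List[Blob]:
    """List every blob reachable from any ref, with its size and path."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    sized = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectname) %(objectsize) %(rest)"],
        input=objects, capture_output=True, text=True, check=True,
    ).stdout
    return parse_batch_check(sized.splitlines())
```

Calling `largest_blobs(scan_repo("."), top=10)` inside a full clone should list the heaviest files across all of history, including versions that no longer exist on the current branch.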

Impact: slow CI checkout, slow onboarding, frustration for first-time contributors on slower connections.


Problem 2: No contributor map at the repo root

The repo has three distinct worlds inside it: the TinyTorch framework, the marimo labs, and the Quarto book content. Each has different tooling, different contribution patterns, and different gotchas.

tinytorch/CONTRIBUTING.md exists and is detailed, but a new contributor landing on the repo root has no idea:

  • that tito is the CLI they need
  • that labs run in-browser via Pyodide and cell return tuples are critical
  • that src changes need tito dev export before they show up in the package
  • which area maps to which folder

The result: contributors either give up or submit PRs that break CI in ways they don't understand.


Proposed solution

Part 1: Git LFS for large binaries

Migrate assets/downloads/*.pdf, assets/downloads/*.epub to Git LFS via .gitattributes. This is non-destructive: existing forks stay intact, history is not rewritten, and LFS pointers replace the blobs going forward. CI just needs git lfs pull added where the files are actually needed.
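
For reference, the `.gitattributes` entries involved are the same ones `git lfs track` writes. Below is a small sketch of adding them idempotently. The helper names are hypothetical, and it assumes the two patterns named above are the only ones being migrated. As the proposal says, the rule only affects commits made after it is added.

```python
from pathlib import Path
from typing import Iterable, List

# Attribute set that `git lfs track` writes for each tracked pattern.
LFS_ATTRS = "filter=lfs diff=lfs merge=lfs -text"

# Patterns taken from the proposal above.
PATTERNS = ("assets/downloads/*.pdf", "assets/downloads/*.epub")


def lfs_lines(patterns: Iterable[str]) -> List[str]:
    """Build one .gitattributes line per pattern."""
    return [f"{pattern} {LFS_ATTRS}" for pattern in patterns]


def ensure_tracked(gitattributes: Path, patterns: Iterable[str] = PATTERNS) -> List[str]:
    """Append any missing LFS lines to .gitattributes; return the lines added."""
    existing = gitattributes.read_text().splitlines() if gitattributes.exists() else []
    missing = [line for line in lfs_lines(patterns) if line not in existing]
    if missing:
        gitattributes.write_text("\n".join(existing + missing) + "\n")
    return missing
```

Running `ensure_tracked(Path(".gitattributes"))` twice leaves the file unchanged the second time, so the step is safe to repeat in a setup script.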

For corpus.json and bundle.js: add to .gitignore and generate them in CI. Neither file should be hand-edited, so there is no reason to track them.
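
One detail worth keeping in mind: `.gitignore` does not untrack files that are already committed, so each file also needs a one-time `git rm --cached`. A CI guard along the following lines could then catch accidental re-adds. This is a sketch under that assumption, with hypothetical helper names and the two paths taken from the table in Problem 1.

```python
import subprocess
from typing import Iterable, List

# Generated artifacts named in this issue; they should be built in CI, not committed.
GENERATED = (
    "interviews/vault/corpus.json",
    "tools/scripts/socratiQ/bundle.js",
)


def tracked_generated(tracked_paths: Iterable[str],
                      generated: Iterable[str] = GENERATED) -> List[str]:
    """Return the generated files that are still tracked, in input order."""
    wanted = set(generated)
    return [path for path in tracked_paths if path in wanted]


def list_tracked(repo: str = ".") -> List[str]:
    """Return the paths git currently tracks, via `git ls-files`."""
    out = subprocess.run(
        ["git", "-C", repo, "ls-files"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()
```

A CI step could call `tracked_generated(list_tracked())` and fail the job if the result is non-empty.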

Expected outcome: fresh clone drops from ~2 GB to under 200 MB.

Part 2: Root-level CONTRIBUTING.md

A single file at the repo root that gives contributors a map:

  • What lives where (tinytorch / labs / book content / tools)
  • Which tooling each area uses
  • How the 7-stage CI pipeline works at a high level
  • Common gotchas (tito export, cell return tuples, large files)
  • Where to find good first issues

This file does not replace tinytorch/CONTRIBUTING.md. It sits one level above it and routes people to the right place.


What I can do

I can implement both parts: the LFS migration with updated CI steps, and the root CONTRIBUTING.md. Both are ready to go as separate PRs whenever you want them.

I have been contributing to this repo over the past few weeks across TinyTorch, the labs, and the test suite. I would love to take on a maintainer role for this repo if you are open to it. Happy to discuss what that looks like.

GiteaMirror added the area: book and type: bug labels 2026-04-19 12:25:16 -05:00

@Shashank-Tripathi-07 commented on GitHub (Apr 18, 2026):

@profvjreddi, I need your help with this because it is a repo-wide change. I would like to join as a maintainer/collaborator, since that would let me contribute at the systems level rather than only through individual PRs. I want to level up my work on this project. Kindly consider this 😄

Reference: github-starred/cs249r_book#4401