mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-06 09:38:33 -05:00
Fix/Review slow cloning issue #503
Originally created by @harishb00 on GitHub (Feb 17, 2026).
When I cloned the repo, it was very slow. I wondered why and used some AI assistance to get some insight. Here is its response, highlighted for quick review.
Reason for slow clone
Your repo is 2.8 GB (with 1.1 GB in .git alone and 9,492 commits).
Here's where the bloat comes from:
The Big Offenders
Category: Binary/media files (PNG, JPEG, PDF, MP3, EPUB)
Size in Git History: ~1.08 GB
Examples: alphafold.png (13 MB), sdcard.png (10 MB), podcast .mp3 (12 MB), .epub files
────────────────────────────────────────
Category: JavaScript bundles
Size in Git History: ~202 MB
Examples: bundle.js files (8-18 MB each), TensorFlow vendor bundle
────────────────────────────────────────
Category: Other
Size in Git History: ~1.2 GB
Examples: Multiple revisions of the above, other content
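A breakdown like the one above can be approximated from any full clone. The sketch below is one common way to list the largest blobs in history. It is not necessarily how the figures in this report were produced, and the function name largest_blobs is made up for illustration.

```shell
# Hypothetical helper: print the N largest blobs reachable from any ref.
# Output columns: size in bytes, object type, path.
# Run inside a clone that has full history (not a shallow clone).
largest_blobs() {
  n="${1:-20}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objectsize) %(objecttype) %(rest)' |
    awk '$2 == "blob"' |
    sort -rn |
    head -n "$n"
}
```

Calling largest_blobs 10 from the repository root prints the ten biggest files ever committed, including ones since deleted from the working tree. The sizes are uncompressed object sizes, so their sum will not match the packed size of .git.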
Root Causes
No .gitattributes and no LFS tracking. Every binary file (images, PDFs, MP3s, EPUBs) is stored directly in Git history. Even if you delete them from the working tree, they remain in .git/objects unless the history itself is rewritten.
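For reference, LFS tracking is configured through rules in .gitattributes. The sketch below writes rules for the binary types named above. The helper name and the exact extension list are assumptions for illustration, not something the repository currently uses.

```shell
# Hypothetical helper: append Git LFS tracking rules for common binary types
# to .gitattributes in the current directory. Each rule has the same form
# that `git lfs track "*.ext"` writes.
write_lfs_attributes() {
  # Extension list is an assumption based on the file types mentioned above.
  for ext in png jpg jpeg pdf mp3 epub; do
    printf '*.%s filter=lfs diff=lfs merge=lfs -text\n' "$ext" >> .gitattributes
  done
}
```

Run it once from the repository root (running it again appends duplicate rules), with Git LFS installed. These rules only affect files committed afterwards. Blobs already in history stay there until the history is rewritten.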
Recommendations
Quick win (for you right now):
Shallow clone, which gets only the latest state and no history:
git clone --depth 1 <repo-url>
Or limit the history depth:
git clone --depth 50 <repo-url>
For the repo maintainers (longer-term):
My Take
git clone --depth 1 <repo-url>. With just this command I was able to clone the entire repo much faster, with all the projects included.
@profvjreddi commented on GitHub (Feb 17, 2026):
Hey @harishb00! Thank you so much for your note. I completely agree with you. This repository is honestly a bit of a mess, and I have to admit that I’m responsible for it. I never anticipated that so many people would take an interest in the repo, which is fantastic, but it has made me realize that I need to tidy things up.
Currently, I’m in the midst of a major refactor because we’re splitting the book into two parts: an introduction to machine learning systems and a new version focused on advanced machine learning systems. As part of this, I’m considering whether I should rewrite the history by removing some of the large files to keep the repository clean.
The repo really shouldn’t be as large as it is since, as you pointed out, every clone includes the full history of these large binaries. I was hoping to conduct some sanity checks around this time, but that won’t happen for at least another two to three months. In the meantime, using your approach of doing a shallow clone might be the best solution.
By the way, this sparse checkout process is actually how the bash script for the TinyTorch installation works.
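For readers unfamiliar with sparse checkout, here is a minimal sketch of the general technique. It is not taken from the TinyTorch installer. The function name and arguments are hypothetical, and it assumes Git 2.25 or newer.

```shell
# Hypothetical helper: clone a repository but check out only one directory.
# $1 = repository URL, $2 = target directory, $3 = path to check out.
sparse_clone() {
  # --depth 1 skips older history, --filter=blob:none defers downloading
  # file contents (if the server supports it), and --sparse starts with
  # only the top-level files checked out.
  git clone --depth 1 --filter=blob:none --sparse "$1" "$2" &&
    git -C "$2" sparse-checkout set "$3"
}
```

For example, sparse_clone <repo-url> book-subset <some-directory> would leave only the top-level files and the chosen directory in the working tree.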