[GH-ISSUE #1175] Fix/Review slow cloning issue #4339

Open
opened 2026-04-19 12:21:01 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @harishb00 on GitHub (Feb 17, 2026).
Original GitHub issue: https://github.com/harvard-edge/cs249r_book/issues/1175

When I cloned the repo, it was very slow. I wondered why this happens and used some AI assistance to get some insights. Here's its response, highlighted for quick review.

Reason for slow clone

Your repo is 2.8 GB (with 1.1 GB in .git alone and 9,492 commits).
Here's where the bloat comes from:

The Big Offenders

| Category | Size in Git History | Examples |
| --- | --- | --- |
| Binary/media files (PNG, JPEG, PDF, MP3, EPUB) | ~1.08 GB | alphafold.png (13 MB), sdcard.png (10 MB), podcast .mp3 (12 MB), .epub files |
| JavaScript bundles | ~202 MB | bundle.js files (8-18 MB each), TensorFlow vendor bundle |
| Other | ~1.2 GB | Multiple revisions of the above, other content |
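
If you want to reproduce these numbers on your own clone, here is a minimal sketch using standard git plumbing (no repo-specific assumptions):

```
# Total size of the object database, human-readable
git count-objects -vH

# Ten largest blobs anywhere in history, with their path where known
git rev-list --objects --all \
  | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
  | sed -n 's/^blob //p' \
  | sort -k2 -rn \
  | head -10
```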

Root Causes

  1. No Git LFS — There's no .gitattributes and no LFS tracking, so every binary file (images, PDFs, MP3s, EPUBs) is stored directly in Git history. Even if you delete them from the working tree, they stay in .git/objects forever (see the quick check after this list).
  2. Huge uncompressed PNGs — Many images are 6-13 MB each, and PNG data is already compressed, so it delta-compresses poorly in Git packfiles.
  3. Committed JS bundles — Generated bundle.js files (8-18 MB) are tracked in Git. These are build artifacts that shouldn't be in version control.
  4. ~9,500 commits of history — Every clone downloads the full history, including all of those large binaries.
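
A quick way to confirm root cause 1 locally (this assumes git-lfs is installed; if it isn't, the second command simply fails):

```
# No .gitattributes at the repo root means no LFS tracking rules exist
ls .gitattributes 2>/dev/null || echo "no .gitattributes"

# Prints nothing when no files are tracked by LFS
git lfs ls-files | head
```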

Recommendations

Quick win (for you right now):

Shallow clone — only gets the latest state, no history:

`git clone --depth 1 <repo-url>`

Or limit the history depth:

`git clone --depth 50 <repo-url>`
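
One caveat worth noting: starting shallow costs nothing permanent, since a shallow clone can be converted into a full one later:

```
# Fetch the remaining history into an existing shallow clone
git fetch --unshallow
```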

For the repo maintainers (longer-term):

  1. Set up Git LFS for binary files (.png, .jpg, .pdf, .mp3, .epub). LFS stores large files on a separate server and downloads them only on demand (see the sketch after this list).
  2. Add JS bundles to .gitignore — scripts/ai_menu/dist/, tools/scripts/socratiQ/bundle.js, etc. should be built in CI, not committed.
  3. Compress images — Many PNGs could be reduced 50-80% with tools like pngquant, or by converting to WebP/JPEG where lossless quality isn't needed.
  4. Use git-filter-repo to remove large files from history (requires a force-push and coordination with all contributors).
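
A minimal sketch of what steps 1 and 4 might look like. The file patterns and the 5M threshold are illustrative assumptions, not measured cutoffs, and the history rewrite is destructive, so it should only be run on a fresh clone after coordinating with all contributors:

```
# Step 1: track binary types with LFS going forward
git lfs install
git lfs track "*.png" "*.jpg" "*.pdf" "*.mp3" "*.epub"
git add .gitattributes
git commit -m "Track large binary types with Git LFS"

# Step 4: rewrite history to drop blobs above a size threshold
# (run on a fresh clone, then force-push and have everyone re-clone)
git filter-repo --strip-blobs-bigger-than 5M
```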

My Take

  • I think it would be good to discuss with the collaborators which path we should take.
  • I also found another git strategy, partial clone, which combined with sparse checkout lets users download only the folder they need (say, tinytorch) and work on it. This saves a huge amount of download time. Depending on the project (labs, kits, book, tinytorch), users can clone just that part and work on it; I believe the projects are independent of each other. Below are the commands I tried, for your reference (see also the note after the commands):
```
git clone --filter=blob:none --sparse https://github.com/harvard-edge/cs249r_book.git
cd cs249r_book
git sparse-checkout init --cone
git sparse-checkout set tinytorch
```
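
For what it's worth, the checkout can be widened later without re-cloning; the folder names here are examples, assuming each project lives in its own top-level directory:

```
# Pull additional project folders into the sparse checkout on demand
git sparse-checkout add labs kits
```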
  • I also tried the suggestion from the AI's response: `git clone --depth 1 <repo-url>`. With just this command, I was able to clone the entire repo much faster, with all the projects included.
  • Let's decide what optimizations and guideline improvements we can make to keep this repo super clean and cute :)
GiteaMirror added the area: website, type: improvement labels 2026-04-19 12:21:01 -05:00
Author
Owner

@profvjreddi commented on GitHub (Feb 17, 2026):

Hey @harishb00! Thank you so much for your note. I completely agree with you. This repository is honestly a bit of a mess, and I have to admit that I’m responsible for it. I never anticipated that so many people would take an interest in the repo, which is fantastic, but it has made me realize that I need to tidy things up.

Currently, I’m in the midst of a major refactor because we’re splitting the book into two parts: an introduction to machine learning systems and a new version focused on advanced machine learning systems. As part of this, I’m considering whether I should rewrite the history by removing some of the large files to keep the repository clean.

The repo really shouldn’t be as large as it is since, as you pointed out, every clone includes the full history of these large binaries. I was hoping to conduct some sanity checks around this time, but that won’t happen for at least another two to three months. In the meantime, using your approach of doing a partial clone might be the best solution.

By the way, this sparse checkout process is actually how the bash script for the TinyTorch installation works.

Author
Owner

@Shashank-Tripathi-07 commented on GitHub (Apr 16, 2026):

Agreed, the clone size is pretty big. From a systems perspective, we could clean things up and organize them into folders, then have AI tools clone only the portions of the repo a learner is interested in. That would allow selective cloning, reduce clone time, and make better use of AI tools, contributor time, and other resources.

For the current stage, selective cloning would be better, and we can then spend time cleaning things up to allow a better systems design. I am on the same page as prof. on this :)

Author
Owner

@profvjreddi commented on GitHub (Apr 16, 2026):

Thanks for the patience on this. The repo bloat is real, and I've acknowledged it. Cleaning it up properly is on the roadmap for upcoming releases once the current curriculum work settles. Leaving the issue open so we can reference it when that work starts.

Reference: github-starred/cs249r_book#4339