[PR #1631] fix(labs): correct cold start time in lab 13 Part D #9237

Open
opened 2026-05-03 01:29:44 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/harvard-edge/cs249r_book/pull/1631
Author: @Shashank-Tripathi-07
Created: 5/3/2026
Status: 🔄 Open

Base: dev ← Head: fix/lab13-cold-start-answer


📝 Commits (1)

  • f763d28 fix(labs): correct cold start answer in lab 13 Part D

📊 Changes

1 file changed (+4 additions, -3 deletions)

View changed files

📝 labs/vol1/lab_13_model_serving.py (+4 -3)

📄 Description

Summary

Part D asks: "Auto-scaling Llama-2 70B during traffic spike. First-user wait?"

Answer option C read "~15 seconds (loading 140 GB over PCIe Gen5)", but that is wrong: the code uses NVMe sequential read (7 GB/s) as the storage source, not PCIe Gen5 (64 GB/s), so the actual bottleneck is the NVMe read:

Disk read:       130 GB / 7 GB/s  ≈ 19 s   (NVMe, not PCIe Gen5)
Deserialization: 130 GB / 20 GB/s ≈  7 s
CUDA init:                           0.8 s
Warmup:                              0.5 s
Total:                             ~26 s

The dynamic display (_t_total) already showed ~26s, but the answer option and the "Correct" callout both said "~15 seconds" -- a visible contradiction.
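For context, here is a minimal sketch of the cold start estimate the breakdown above describes. The constants and names (`MODEL_GB`, `NVME_READ_GBPS`, `estimate_cold_start_s`) are illustrative assumptions, not the lab's actual identifiers; the lab itself surfaces the result through `_t_total`.

```python
# Illustrative cold start estimate matching the breakdown above.
# Constants are assumptions for this sketch, not the lab's exact values.
MODEL_GB = 130.0          # Llama-2 70B weights on disk
NVME_READ_GBPS = 7.0      # NVMe sequential read (the real bottleneck)
DESERIALIZE_GBPS = 20.0   # tensor deserialization throughput
CUDA_INIT_S = 0.8         # CUDA context / driver init
WARMUP_S = 0.5            # first-inference warmup

def estimate_cold_start_s(model_gb: float = MODEL_GB) -> float:
    """Return the estimated first-user wait in seconds."""
    disk_read_s = model_gb / NVME_READ_GBPS        # ≈ 19 s
    deserialize_s = model_gb / DESERIALIZE_GBPS    # ≈ 7 s
    return disk_read_s + deserialize_s + CUDA_INIT_S + WARMUP_S

print(f"~{estimate_cold_start_s():.0f} s")  # ~26 s
```

The ~26 s result agrees with the live `_t_total` display, which is why option C and the answer key are updated to 26 seconds.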

Changes

  • Option C text: "~15 seconds (loading 140 GB over PCIe Gen5)" → "~26 seconds (NVMe read bottleneck + deserialization)"
  • Answer key: "15s" → "26s"
  • Correct callout: updated to show the NVMe breakdown

Test plan

  • Open lab 13 Part D with defaults (70B, NVMe SSD, PCIe Gen5)
  • Confirm the live cold start display shows ~26s
  • Select "C) ~26 seconds" and confirm the "Correct" callout shows the NVMe breakdown

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-03 01:29:44 -05:00

Reference: github-starred/cs249r_book#9237