📐 Mission Plan: 04_data_storage (Volume 2: Fleet Scale)

1. Chapter Context

  • Chapter Title: Data Storage: Feeding the Machine Learning Fleet.
  • Core Invariant: The Sequential Invariant (Random I/O is the enemy of throughput) and the I/O Wall.
  • The Struggle: Understanding that at scale, storage is about IOPS and Bandwidth, not just capacity. Students must navigate the trade-off between Data Locality (Local NVMe) and Shared Scalability (Object Stores/S3), specifically focusing on how random shuffling kills training performance.
  • Target Duration: 45 Minutes.

2. The 4-Track Storyboard (Storage Missions)

| Track | Persona | Fixed North Star | Mission | The "Storage" Crisis |
|-------|---------|------------------|---------|----------------------|
| Cloud Titan | LLM Architect | Maximize Llama-3-70B serving. | The Checkpoint Storm | All 1024 nodes of your cluster are trying to save a 350GB checkpoint simultaneously. The shared storage has collapsed under the write pressure. |
| Edge Guardian | AV Systems Lead | Deterministic 10ms safety loop. | The Black-Box Log | Your 10,000-vehicle fleet is generating 5TB/hour of Lidar data. You must decide what to log locally vs. what to upload to the Cloud. |
| Mobile Nomad | AR Glasses Dev | 60FPS AR translation. | The App Cache Wall | The 8GB glasses RAM is full. You must stream model weights from flash memory without causing a 50ms frame skip. |
| Tiny Pioneer | Hearable Lead | Neural isolation in <10ms under 1mW. | The Circular Buffer | You have only 64KB of audio buffer. If your Flash-read latency is inconsistent, the audio glitches for the user. |

3. The 3-Part Mission (The KATs)

Part 1: The Access Pattern Audit (Exploration - 15 Mins)

  • Objective: Quantify the 100x performance difference between Sequential and Random I/O (the sketch after this list shows one way to measure it).
  • The "Lock" (Prediction): "If you randomly shuffle a 10TB dataset during each epoch, will your training throughput be limited by your GPU or your Storage IOPS?"
  • The Workbench:
    • Action: Toggle between Sequential Reading and Stochastic Shuffling. Adjust File Format (Raw Files vs. TFRecord/WebDataset).
    • Observation: The I/O Waterfall (Wait-Time vs. Load-Time). Watch the "I/O Wait" bar explode during random shuffling.
  • Reflect: "Patterson asks: 'Why is Sequential access the only way to hit the Machine Peak?' (Reference the disk-head and block-prefetching physics.)"
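
One way to run the audit outside the simulator is a quick microbenchmark, sketched below in Python. The file name, file size, and block sizes are arbitrary illustration choices, and a warm OS page cache will narrow the gap (serious benchmarks drop caches between runs), but on a cold read the sequential path should win by an order of magnitude or more on HDDs, less dramatically on SSDs.

```python
import os
import random
import time

PATH = "io_audit.bin"       # scratch file; name and sizes are arbitrary
FILE_SIZE = 256 * 2**20     # 256 MiB test file (small enough for a demo)
SEQ_CHUNK = 4 * 2**20       # 4 MiB sequential reads
RAND_BLOCK = 4 * 2**10      # 4 KiB random reads

def make_test_file() -> None:
    # Incompressible bytes, so a clever storage layer cannot cheat.
    with open(PATH, "wb") as f:
        for _ in range(FILE_SIZE // SEQ_CHUNK):
            f.write(os.urandom(SEQ_CHUNK))

def sequential_mbps() -> float:
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        while f.read(SEQ_CHUNK):
            pass
    return FILE_SIZE / (time.perf_counter() - start) / 1e6

def random_mbps(n_reads: int = 4096) -> float:
    # Seek to random offsets and read one small block each time.
    offsets = [random.randrange(FILE_SIZE - RAND_BLOCK) for _ in range(n_reads)]
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(RAND_BLOCK)
    return n_reads * RAND_BLOCK / (time.perf_counter() - start) / 1e6

if __name__ == "__main__":
    make_test_file()
    print(f"sequential: {sequential_mbps():8.1f} MB/s")
    print(f"random 4K:  {random_mbps():8.1f} MB/s")
    os.remove(PATH)
```

The same contrast explains the File Format toggle: TFRecord/WebDataset win not by being smarter formats but by packing many small samples into large files that can only be read sequentially.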

Part 2: Sizing the Pipeline (Trade-off - 15 Mins)

  • Objective: Dimension a tiered storage hierarchy (S3 -> NVMe -> DRAM) to hit a specific throughput target.
  • The "Lock" (Prediction): "Will adding more Local NVMe cache improve training speed if the bottleneck is the initial S3-to-Node network link?"
  • The Workbench:
    • Sliders: Buffer Size (GB), Download BW (Gbps), Local NVMe BW (GB/s).
    • Instruments: Data Flow Gauge. Pipeline Saturation Plot.
    • The 10-Iteration Rule: Students must find the "Balanced Tiering" that keeps the GPU 95% utilized for their track's specific dataset size.
  • Reflect: "Jeff Dean observes: 'Your storage system is 50% idle while your GPUs are starving.' Identify the 'Impedance Mismatch' in your pipeline."

Part 3: The Checkpoint Wall (Synthesis - 15 Mins)

  • Objective: Optimize the Checkpoint Interval to minimize the "Reliability Tax."
  • The "Lock" (Prediction): "Does saving a checkpoint every 10 minutes increase or decrease the total time to finish a 1-month training run?"
  • The Workbench:
    • Interaction: Checkpoint Frequency Slider. Write-Bandwidth Selector. MTBF (Mean Time Between Failures) Scrubber.
    • The "Stakeholder" Challenge: The Ops Lead warns that the MTBF of the cluster has dropped. You must use the Young-Daly Plot to find the optimal checkpoint frequency that minimizes "Wasted Work" without crashing the storage.
  • Reflect (The Ledger): "Defend your final 'Storage Strategy.' Did you choose 'Local-First' or 'Cloud-Native'? Justify how you solved the 'Feeding Problem' for your fleet."
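
The Young-Daly first-order optimum gives students a baseline to check their slider settings against: with delta the time to write one checkpoint and M the MTBF, the interval that minimizes expected lost work is tau = sqrt(2 * delta * M). A minimal sketch, using hypothetical Cloud Titan numbers:

```python
import math

def checkpoint_write_secs(ckpt_gb: float, write_gbps: float) -> float:
    # delta: time to flush one checkpoint at the sustained write bandwidth.
    return ckpt_gb / write_gbps

def young_daly_interval(delta_secs: float, mtbf_secs: float) -> float:
    # First-order Young-Daly optimum: tau = sqrt(2 * delta * MTBF).
    return math.sqrt(2.0 * delta_secs * mtbf_secs)

# Hypothetical Cloud Titan numbers: a 350GB checkpoint, 20 GB/s aggregate
# write bandwidth, and one failure somewhere in the cluster every 6 hours.
delta = checkpoint_write_secs(350.0, 20.0)            # 17.5 s per checkpoint
tau = young_daly_interval(delta, mtbf_secs=6 * 3600)  # ~870 s
print(f"checkpoint every {tau / 60:.1f} min (write cost {delta:.1f} s)")
```

Because tau grows with the square root of MTBF, a 4x drop in MTBF only halves the optimal interval; that square root is what keeps the "Reliability Tax" sublinear as the fleet degrades.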

4. Visual Layout Specification

  • Primary: DataFlowSankey (Visualizing bits moving from Cloud -> Disk -> GPU).
  • Secondary: IOPS_vs_Throughput_Curve (Showing the saturation point of different disk types).
  • Math Peek: Toggle for the Data Pipeline Equation and Young-Daly Checkpoint Interval (one plausible form of both is sketched below).
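
One plausible rendering of the two Math Peek equations; the symbols here are assumptions chosen to match the sliders above (B for tier bandwidths, delta for checkpoint write time, M for MTBF), not notation fixed by the chapter.

```latex
% Data Pipeline Equation: steady-state delivered bandwidth is capped by
% the slowest tier, and GPU utilization follows from supply vs. demand.
\[
  B_{\mathrm{eff}} = \min\bigl(B_{\mathrm{S3}},\, B_{\mathrm{NVMe}},\, B_{\mathrm{DRAM}}\bigr),
  \qquad
  U_{\mathrm{GPU}} = \min\!\left(1,\; \frac{B_{\mathrm{eff}}}{B_{\mathrm{demand}}}\right)
\]

% Young-Daly optimal checkpoint interval (\delta = checkpoint write time,
% M = mean time between failures).
\[
  \tau_{\mathrm{opt}} = \sqrt{2\,\delta M}
\]
```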