# 📐 Mission Plan: 04_data_storage (Volume 2: Fleet Scale) ## 1. Chapter Context * **Chapter Title:** Data Storage: Feeding the Machine Learning Fleet. * **Core Invariant:** The Sequential Invariant (Random I/O is the enemy of throughput) and the **I/O Wall**. * **The Struggle:** Understanding that at scale, storage is about **IOPS** and **Bandwidth**, not just capacity. Students must navigate the trade-off between **Data Locality** (Local NVMe) and **Shared Scalability** (Object Stores/S3), specifically focusing on how random shuffling kills training performance. * **Target Duration:** 45 Minutes. --- ## 2. The 4-Track Storyboard (Storage Missions) | Track | Persona | Fixed North Star Mission | The "Storage" Crisis | | :--- | :--- | :--- | :--- | | **Cloud Titan** | LLM Architect | Maximize Llama-3-70B serving. | **The Checkpoint Storm.** Your 1024-node cluster is trying to save a 350GB checkpoint simultaneously. The shared storage has collapsed under the write-pressure. | | **Edge Guardian** | AV Systems Lead | Deterministic 10ms safety loop. | **The Black-Box Log.** Your 10,000-vehicle fleet is generating 5TB/hour of Lidar data. You must decide what to log locally vs. what to upload to the Cloud. | | **Mobile Nomad** | AR Glasses Dev | 60FPS AR translation. | **The App Cache Wall.** The 8GB glasses RAM is full. You must stream model weights from flash memory without causing a 50ms frame-skip. | | **Tiny Pioneer** | Hearable Lead | Neural isolation in <10ms under 1mW. | **The Circular Buffer.** You have only 64KB of audio buffer. If your Flash-read latency is inconsistent, the audio 'glitches' for the user. | --- ## 3. The 3-Part Mission (The KATs) ### Part 1: The Access Pattern Audit (Exploration - 15 Mins) * **Objective:** Quantify the 100x performance difference between Sequential and Random I/O. * **The "Lock" (Prediction):** "If you randomly shuffle a 10TB dataset during each epoch, will your training throughput be limited by your GPU or your Storage IOPS?" * **The Workbench:** * **Action:** Toggle between **Sequential Reading** and **Stochastic Shuffling**. Adjust **File Format** (Raw Files vs. TFRecord/WebDataset). * **Observation:** The **I/O Waterfall** (Wait-Time vs. Load-Time). Watch the "I/O Wait" bar explode during random shuffling. * **Reflect:** "Patterson asks: 'Why is Sequential access the only way to hit the 'Machine' peak?' (Reference the disk-head/block-prefetching physics)." ### Part 2: Sizing the Pipeline (Trade-off - 15 Mins) * **Objective:** Dimension a tiered storage hierarchy (S3 -> NVMe -> DRAM) to hit a specific throughput target. * **The "Lock" (Prediction):** "Will adding more Local NVMe cache improve training speed if the bottleneck is the initial S3-to-Node network link?" * **The Workbench:** * **Sliders:** Buffer Size (GB), Download BW (Gbps), Local NVMe BW (GB/s). * **Instruments:** **Data Flow Gauge**. **Pipeline Saturation Plot**. * **The 10-Iteration Rule:** Students must find the "Balanced Tiering" that keeps the GPU 95% utilized for their track's specific dataset size. * **Reflect:** "Jeff Dean observes: 'Your storage system is 50% idle while your GPUs are starving.' Identify the 'Impedance Mismatch' in your pipeline." ### Part 3: The Checkpoint Wall (Synthesis - 15 Mins) * **Objective:** Optimize the Checkpoint Interval to minimize the "Reliability Tax." * **The "Lock" (Prediction):** "Does saving a checkpoint every 10 minutes increase or decrease the total time to finish a 1-month training run?" * **The Workbench:** * **Interaction:** **Checkpoint Frequency Slider**. **Write-Bandwidth Selector**. **MTBF (Mean Time Between Failures) Scrubber**. * **The "Stakeholder" Challenge:** The **Ops Lead** warns that the MTBF of the cluster has dropped. You must use the **Young-Daly Plot** to find the optimal checkpoint frequency that minimizes "Wasted Work" without crashing the storage. * **Reflect (The Ledger):** "Defend your final 'Storage Strategy.' Did you choose 'Local-First' or 'Cloud-Native'? Justify how you solved the 'Feeding Problem' for your fleet." --- ## 4. Visual Layout Specification * **Primary:** `DataFlowSankey` (Visualizing bits moving from Cloud -> Disk -> GPU). * **Secondary:** `IOPS_vs_Throughput_Curve` (Showing the saturation point of different disk types). * **Math Peek:** Toggle for the `Data Pipeline Equation` and `Young-Daly Checkpoint Interval`.