cs249r_book/labs/plans/vol1/lab_04_data_engr.md
Vijay Janapa Reddi fdd90ce139 Stabilize dev pre-commit workflow
Add fallback hook dependencies in validate-dev and apply trailing-whitespace fixes to lab plan files so pre-commit no longer fails on auto-modifications.
2026-03-02 10:22:41 -05:00

📐 Mission Plan: 04_data_engr (Data Gravity & Drift)

1. Chapter Context

  • Topic: Dataset Compilation, Data Gravity, and Signal-to-Noise Engineering.
  • Core Invariant: The Energy-Movement Invariant (moving data costs >100x more energy than computing on it).
  • The Struggle: Managing the "Feeding Tax." Students must keep the GPU ALUs busy despite low-bandwidth storage pipelines.

2. The 4-Zone Dashboard Anatomy

Zone 1: Command Header

  • Title: Lab 04: The Data Factory
  • Persona Identity: Current Role (e.g., Tiny Pioneer) and Scale.
  • Constraint Badges:
    • Egress < $100k (Red/Green)
    • GPU Hunger (GPU idle time) < 10% (Red/Green)
    • Drift Detected (Alert Badge)

Zone 2: Engineering Levers (Inputs)

  • Storage Tier: NVMe (Hot), S3 (Warm), Glacier (Cold).
  • Transfer Method: 10Gbps Fiber vs. AWS Snowball (physically shipped storage device).
  • Deduplication Scrubber: 0% to 50% removal of redundant data.
  • Drift Sensitivity: Significance level (alpha) for the K-S test.

Zone 3: Telemetry Center (Visuals)

  • The System Ledger: 4 Cards (Ingestion Speed, Egress Cost, Data Entropy, Pipeline Health).
  • The Plot: The Data Gravity Waterfall. Shows the time breakdown of a training epoch (Disk IO vs. Network Transfer vs. GPU Math).

Zone 4: Audit Trail & Justification

  • Consequence Log: "Alert: Egress fees for 1PB transfer reach $90,000 at $0.09/GB. Move compute to data?"
  • Rationale Box: Defend your storage tier choice using the Feeding Tax math.
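
The Consequence Log figure can be checked with a short calculation. This is a minimal sketch, assuming a flat per-GB rate (the $0.09/GB standard egress fee listed under Real-World Data Sources) and decimal units (1 PB = 1,000,000 GB). The function names are hypothetical and are not part of the lab's code.

```python
# Illustrative sketch: flat-rate egress cost in decimal units.
# Assumption: a single flat rate, with no volume discounts or tiering.
EGRESS_USD_PER_GB = 0.09   # "Standard egress fees" from the plan
BUDGET_USD = 100_000       # "Egress < $100k" constraint badge
GB_PER_PB = 1_000_000      # decimal petabyte


def egress_cost_usd(size_gb: float, rate: float = EGRESS_USD_PER_GB) -> float:
    """Cost of moving size_gb gigabytes out of the cloud at a flat rate."""
    return size_gb * rate


def egress_badge(size_gb: float, budget: float = BUDGET_USD) -> str:
    """Green while the transfer stays under budget, Red otherwise."""
    return "Green" if egress_cost_usd(size_gb) < budget else "Red"


cost_1pb = egress_cost_usd(1 * GB_PER_PB)
print(f"1 PB egress: ${cost_1pb:,.0f} -> badge {egress_badge(1 * GB_PER_PB)}")
```

Under these assumptions a 1 PB transfer costs about $90,000, which stays inside the $100k badge limit but leaves little headroom for a second copy.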

3. The 3-Act Narrative (The Lab Journey)

Act I: The Physics of Data Gravity (15m)

  • Scenario: You have 1 Petabyte of raw video in a warehouse. Your cluster is 3,000 miles away.
  • Crisis: Project deadline is in 10 days.
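  • Task: Calculate the transfer time over 1Gbps fiber. Realize it will take about 93 days at full line rate, far past the deadline; even the 10Gbps fiber lever needs about 9.3 days with zero protocol overhead. Toggle to "Snowball" (Sneakernet) and see the physical delivery time beat the fiber.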
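
The Act I arithmetic can be sketched as follows. This is a back-of-the-envelope model, assuming a decimal petabyte (10^15 bytes) and a link running at its full nominal rate unless an `efficiency` factor is supplied. The `shipping_days` value is a placeholder for illustration only, not a quoted Snowball delivery time.

```python
# Illustrative sketch: network transfer time vs. a shipped device.
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15  # bytes, decimal units


def transfer_days(size_bytes: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Days needed to push size_bytes through a link of link_bps bits per second."""
    seconds = size_bytes * 8 / (link_bps * efficiency)
    return seconds / SECONDS_PER_DAY


def sneakernet_wins(fiber_days: float, shipping_days: float) -> bool:
    """True when physically shipping the data is faster than the network."""
    return shipping_days < fiber_days


days_1g = transfer_days(PETABYTE, 1e9)     # 1 Gbps fiber
days_10g = transfer_days(PETABYTE, 10e9)   # 10 Gbps fiber

shipping_days = 7  # hypothetical placeholder, not a real shipping quote
print(f"1 Gbps:  {days_1g:.1f} days")
print(f"10 Gbps: {days_10g:.1f} days")
print(f"Ship instead of 1 Gbps fiber? {sneakernet_wins(days_1g, shipping_days)}")
```

Lowering `efficiency` below 1.0 models protocol overhead and link sharing, which only widens the gap in favor of physical shipment.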

Act II: The Feeding Tax (15m)

  • Scenario: Training ResNet-50 on a Cloud Titan cluster.
  • Crisis: GPU utilization is stuck at 15% (85% idle).
  • Task: Identify the bottleneck: the standard cloud disk (250 MB/s) cannot deliver data as fast as the GPUs consume it. Upgrade to local NVMe and watch the "Feeding Tax" drop to zero.
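
One way to reason about the Feeding Tax is a simple throughput ratio. This is a rough sketch, assuming GPU utilization scales linearly with storage bandwidth until the GPUs are fully fed. The required bandwidth is back-derived from the scenario (250 MB/s yielding 15% utilization), and the NVMe figure is an assumed placeholder, not a measured value.

```python
# Illustrative sketch: a linear "feeding tax" model.
# Assumption: utilization = supplied bandwidth / required bandwidth, capped at 1.

def gpu_utilization(storage_mb_s: float, required_mb_s: float) -> float:
    """Fraction of time the GPUs have data to work on."""
    return min(1.0, storage_mb_s / required_mb_s)


def feeding_tax(storage_mb_s: float, required_mb_s: float) -> float:
    """Fraction of GPU time lost waiting on the storage pipeline."""
    return 1.0 - gpu_utilization(storage_mb_s, required_mb_s)


CLOUD_DISK_MB_S = 250   # standard cloud disk from the scenario
OBSERVED_UTIL = 0.15    # GPU utilization from the scenario

# Bandwidth the GPUs would need to stay busy, implied by the two numbers above.
required_mb_s = CLOUD_DISK_MB_S / OBSERVED_UTIL

NVME_MB_S = 3_000       # hypothetical local NVMe throughput (assumption)

tax_cloud = feeding_tax(CLOUD_DISK_MB_S, required_mb_s)
tax_nvme = feeding_tax(NVME_MB_S, required_mb_s)
print(f"Required bandwidth: {required_mb_s:.0f} MB/s")
print(f"Feeding tax on cloud disk: {tax_cloud:.0%}")
print(f"Feeding tax on local NVMe: {tax_nvme:.0%}")
```

In this model the tax reaches zero once storage bandwidth meets or exceeds the implied requirement of roughly 1,667 MB/s. Real pipelines also pay for decoding and augmentation, which the sketch ignores.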

Act III: The Drift Detector (15m)

  • Scenario: Smart Doorbell deployment.
  • Crisis: Detection accuracy is dropping in one city.
  • Task: Run a Kolmogorov-Smirnov (K-S) test comparing incoming image distributions against the training baseline. Discover that a firmware update changed the white balance, creating "Semantic Noise."
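
The drift check can be sketched with a two-sample K-S test. The K-S test compares one-dimensional samples, so this sketch assumes each image has already been reduced to a scalar feature (here a hypothetical "mean channel intensity"). The data is synthetic, and the decision rule uses the standard large-sample critical value, with `alpha` playing the role of the Drift Sensitivity lever. It is a standard-library illustration, not the lab's implementation.

```python
# Illustrative sketch: two-sample Kolmogorov-Smirnov drift check (stdlib only).
import math
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:  # the gap can only change at observed sample values
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n: int, m: int, alpha: float) -> float:
    """Large-sample threshold: reject 'same distribution' when D exceeds it."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


def drift_detected(reference, incoming, alpha: float = 0.05) -> bool:
    d = ks_statistic(reference, incoming)
    return d > ks_critical_value(len(reference), len(incoming), alpha)


# Synthetic stand-ins for a per-image feature before and after a firmware update.
random.seed(0)
baseline = [random.gauss(0.50, 0.10) for _ in range(500)]
after_update = [random.gauss(0.60, 0.10) for _ in range(500)]  # shifted white balance

print("D statistic:", round(ks_statistic(baseline, after_update), 3))
print("Drift detected:", drift_detected(baseline, after_update, alpha=0.05))
```

A smaller `alpha` raises the threshold and makes the detector less sensitive, trading missed drift for fewer false alarms.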

4. Real-World Data Sources

  • Storage: AWS S3, EBS, and Glacier pricing (2024).
  • Logistics: FedEx/AWS Snowball shipping durations and base fees.
  • Bandwidth: Standard egress fees ($0.09/GB).