cs249r_book/labs/plans/vol1/lab_08_model_train.md
Vijay Janapa Reddi 69736d3bdb updates
2026-02-28 18:20:47 -05:00


Mission Plan: lab_08_model_train

1. Chapter Alignment

  • Chapter: Model Training (@sec-model-training)
  • Core Invariant: The Iron Law of Training Performance: T_{train} = O_{total} / (N \cdot R_{peak} \cdot \eta) — where the hardware peak (R_{peak}) is fixed and the only levers are total operations (O_{total}) and utilization (\eta). This lab is strictly single-machine scope: one accelerator (N = 1), one training loop, no network bandwidth.
  • Central Tension: Students believe that buying a faster GPU is the primary lever for training speed. The chapter's data reveals the opposite: real systems operate at 45–55% MFU, not 90–100%. The bottleneck is almost never peak FLOPS — it is the memory hierarchy (optimizer state, activation storage, pipeline stalls). Optimization means eliminating waste in \eta, not upgrading R_{peak}.
  • Target Duration: 35–40 minutes (2 acts)
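The Iron Law can be sketched directly. The 312 TFLOP/s peak and the 1e21-FLOP workload below are illustrative assumptions, not chapter numbers; only the formula itself comes from the lab.

```python
# Sketch of the Iron Law of Training Performance (single machine, N = 1).
# Peak FLOP/s and total-work numbers are illustrative assumptions.

def train_time_seconds(total_ops, n_accel, peak_flops, mfu):
    """T_train = O_total / (N * R_peak * eta)."""
    return total_ops / (n_accel * peak_flops * mfu)

# Example: 1e21 FLOPs of total work on one 312-TFLOP/s accelerator.
baseline  = train_time_seconds(1e21, 1, 312e12, 0.45)  # eta = 45% MFU
optimized = train_time_seconds(1e21, 1, 312e12, 0.85)  # eta = 85% MFU
print(f"baseline:  {baseline / 3600:.0f} h")
print(f"optimized: {optimized / 3600:.0f} h")
```

Note that the same workload finishes roughly 1.9× faster purely by raising \eta, with R_{peak} untouched — the lab's central point.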

2. The Two-Act Structure Overview

Act 1 (Calibration, 12 min): The student has just read that Adam is the standard optimizer. They believe the cost is negligible — it is "just an optimizer." This act forces them to compute the actual memory breakdown for GPT-2 training: weights + gradients + Adam state = 4× the inference footprint, before activations even appear. The prediction question exploits the specific wrong prior that Adam costs "about the same" as SGD.

Act 2 (Design Challenge, 22 min): The student believes that saturating a GPU means high TFLOPS utilization. This act shows the GPT-2 training walkthrough from the chapter: the baseline system achieves only 45% MFU, with 40% of iteration time consumed by data loading — not by gradient computation. Students must diagnose the bottleneck using the pipeline breakdown, then apply optimizations (mixed precision, gradient checkpointing, prefetching) to hit a target MFU of 80%. Each optimization has a quantified cost and benefit.


3. Act 1: The Optimizer Memory Tax (Calibration — 12 minutes)

Pedagogical Goal

Students treat optimizer choice as a convergence decision ("Adam converges faster than SGD"), not a memory decision. The chapter's claim is precise: SGD requires 1× model memory, Momentum 2×, Adam 3× — all before activations are counted. For GPT-2 XL (1.5B parameters), switching from SGD to Adam adds 12 GB of GPU memory. This act makes students predict that number and then confront it.

The Lock (Structured Prediction)

Present a multiple-choice prediction before any instruments unlock:

"You are training GPT-2 XL (1.5 billion parameters) in FP32. You switch from SGD to Adam. How much additional GPU memory does Adam require compared to SGD?"

Options:

  • A) About the same — the optimizer doesn't store model data (~0 GB extra)
  • B) About 3 GB extra — it stores one extra vector
  • C) About 12 GB extra — it stores two extra vectors (momentum + velocity) ← correct
  • D) About 24 GB extra — it replicates the full model for safety

The correct answer requires knowing: Adam stores m_t and v_t, each the same size as the gradient vector; at FP32 and 1.5B parameters, each vector = 6 GB → 12 GB total additional.
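The reveal's arithmetic can be reproduced in a few lines, using only the parameter count and FP32 width stated above:

```python
# Adam's extra memory vs. SGD for GPT-2 XL, following the act's numbers.
PARAMS = 1.5e9      # GPT-2 XL parameter count
BYTES_FP32 = 4      # bytes per FP32 value

def vector_gb(n_params, bytes_per_param):
    # Size of one optimizer-state vector (same shape as the gradients).
    return n_params * bytes_per_param / 1e9

m_t = vector_gb(PARAMS, BYTES_FP32)  # first-moment estimate: 6 GB
v_t = vector_gb(PARAMS, BYTES_FP32)  # second-moment estimate: 6 GB
print(f"Adam extra memory vs. SGD: {m_t + v_t:.0f} GB")
```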

The Instrument: Training Memory Ledger

A stacked bar chart for GPT-2 XL training memory decomposition. Four components, each toggleable:

| Component | Formula | GPT-2 XL (FP32) |
|---|---|---|
| Parameters | params × bytes_per_param | 6 GB |
| Gradients | same as parameters | 6 GB |
| Adam state (m_t) | same as parameters | 6 GB |
| Adam state (v_t) | same as parameters | 6 GB |
| Activations | batch × sum_of_layer_outputs × bytes | varies |

Controls:

  • Optimizer selector: SGD (params + gradients only) → Momentum (+ 1 vector) → Adam (+ 2 vectors). Each selection adds/removes bars live.
  • Precision toggle: FP32 (4 bytes) / FP16 (2 bytes) / Mixed (FP16 compute + FP32 master copy). Mixed precision halves the active compute buffers but keeps an FP32 master copy — students discover this is not simply "2× better."
  • Batch size slider: 1, 8, 16, 32, 64 — only activations scale with batch size; all other bars are constant.

A red threshold line marks the device memory budget. Two contexts selectable:

  • Training Node: H100 (80 GB)
  • Laptop GPU: 8 GB

When total memory exceeds threshold: bars turn red, banner reads "OOM — Optimizer state alone exceeds device memory."

The Reveal

After exploration, overlay the prediction:

"You predicted [X] GB of additional memory. Adam actually adds 12 GB for GPT-2 XL in FP32 — about [Y]× your estimate. This is why the chapter states Adam requires 3× SGD's memory: parameters (6 GB) + gradients (6 GB) + Adam state (12 GB) = 24 GB static memory before a single activation is stored."

Then surface the convergence trade-off:

"GPT-2 converges in ~50,000 steps with Adam vs. ~150,000+ steps with SGD+Momentum. Is 12 GB of extra memory worth 3× fewer training steps? This is an engineering decision, not a preference."

Reflection (Structured)

Students complete:

"Mixed precision reduces total training memory by approximately ___× because ___."

Dropdown options for blank 1: 1.3× / 1.5× / 2× / 4×

Note: 2× is the common wrong answer. The actual reduction is ~1.5–1.6× because active buffers (params + grads) halve (from 2W to W in FP16), but the FP32 master copy (W) is retained. So total goes from 4W (FP32 Adam) to ~2.5W, not 2W.

Dropdown options for blank 2:

  • "active compute buffers (parameters, gradients) halve from FP32 to FP16, while the FP32 master copy is retained for numerically stable optimizer updates — so memory reduction is ~1.5×, not 2×" ← correct
  • "the GPU compresses all tensors automatically"
  • "optimizer state is not needed in FP16 mode"
  • "activation memory dominates and activations are always stored in INT8"

Math Peek (collapsible):

\text{Training Memory} = \underbrace{W}_{\text{params}} + \underbrace{W}_{\text{grads}} + \underbrace{2W}_{\text{Adam}} + \underbrace{\sum_l B \cdot n_l \cdot \text{bytes}}_{\text{activations}}

where W = \text{params} \times \text{bytes\_per\_param}, B = batch size, n_l = layer width.
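The Math Peek equation maps directly to a small ledger function. The layer widths and batch size in the example call are placeholder assumptions; the static terms reproduce the act's 24 GB figure for FP32 Adam:

```python
# Minimal model of the Act 1 memory ledger.
OPT_VECTORS = {"sgd": 0, "momentum": 1, "adam": 2}  # extra state vectors

def training_memory_gb(n_params, bytes_per_param, optimizer,
                       batch, layer_widths, act_bytes):
    w = n_params * bytes_per_param                       # W: parameters
    grads = w                                            # gradients, same size as W
    opt = OPT_VECTORS[optimizer] * w                     # optimizer state (0W/1W/2W)
    acts = sum(batch * n * act_bytes for n in layer_widths)  # sum_l B * n_l * bytes
    return (w + grads + opt + acts) / 1e9

# GPT-2 XL, FP32, Adam, batch 8, hypothetical layer widths:
total = training_memory_gb(1.5e9, 4, "adam", 8, [1600] * 48, 4)
print(f"{total:.1f} GB")
```

Toggling the `optimizer` argument reproduces the 1×/2×/3× progression; only the `acts` term moves with the batch slider.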


4. Act 2: The MFU Gap (Design Challenge — 22 minutes)

Pedagogical Goal

Students expect that a well-configured training run uses 80–95% of peak GPU FLOPS. This act distinguishes two metrics students conflate:

  • Pipeline Efficiency = fraction of wall-clock time the GPU is performing any tensor operations (not idle waiting for data or transfers). Baseline: 35%.
  • MFU (Model FLOPs Utilization) = actual model FLOPs completed / (wall time × hardware peak FLOPs). Baseline: 45%.

These are different because a GPU can be "busy" executing memory-bound ops (high pipeline efficiency) while achieving low MFU — the hardware's compute units are starved of data even when nominally active. Fixing data loading raises pipeline efficiency; fixing arithmetic intensity (kernel fusion from Lab 07) raises MFU. Both are needed to reach the chapter's optimized target of 85% MFU.

Students diagnose the baseline state, then apply optimizations in sequence to reach 85% MFU, observing that each fix raises a different metric.

The Lock (Structured Prediction)

Before instruments unlock — multiple choice:

"A GPT-2 XL training loop on an A100 is fully configured — correct batch size, no bugs, no data augmentation. What is the approximate MFU (fraction of peak hardware FLOPs actually used for model computation)?"

Options:

  • A) About 85–95% — a well-configured loop should be near peak
  • B) About 60–70% — some overhead is expected
  • C) About 45% — data loading, memory transfers, and kernel overhead together consume over half of potential compute ← correct
  • D) About 10–20% — GPUs are almost never well-utilized in practice

The system records the prediction. After Act 2, it shows: "You predicted [X]%. Baseline MFU = 45% (from chapter's GPT-2 walkthrough). Optimized MFU = 85%."

The Instrument: Pipeline Breakdown Analyzer

A stacked horizontal bar showing one training iteration divided into three phases:

| Phase | Baseline | Optimized |
|---|---|---|
| Data Loading | 40% of iteration | 5% (overlapped prefetch) |
| Memory Transfers | 25% of iteration | 20% (mixed precision reduces volume) |
| Compute | 35% of iteration | 75% |

Controls:

  • Prefetching toggle (Off / On): When On, data loading drops from 40% → 5% (overlapped with prior step's compute). This is the highest-leverage single change.
  • Mixed precision toggle (FP32 / Mixed): Memory transfer bar shrinks because FP16 activations are half the volume; compute bar grows.
  • Gradient checkpointing toggle (Off / On): Activation memory bar in the memory panel shrinks 4×; a new "Recompute Overhead" segment appears in the compute bar (+33% of compute time, but allows 4× larger batch or deeper network).
  • Batch size slider (16, 32, 64, 128, 256): Shows the 60–70% → 90%+ utilization transition the chapter describes.

Two separate gauges update live — students must watch both:

  1. Pipeline Efficiency gauge (0–100%): fraction of wall time the GPU is executing kernels.
\text{Pipeline Efficiency} = \frac{T_{compute}}{T_{wall}} = \frac{T_{compute}}{T_{data} + T_{transfer} + T_{compute}}

Baseline: 35%. After prefetching: ~75%. This measures busyness, not quality of work.

  2. MFU gauge (0–100%): fraction of peak hardware FLOPs used for actual model operations.
\text{MFU} = \frac{O_{total}}{T_{wall} \cdot R_{peak}}

Baseline: 45%. Target: ≥ 80%. This measures efficiency of work. MFU > Pipeline Efficiency because the GPU is partially busy with useful ops even during "memory transfer" phases.

Key insight: prefetching raises Pipeline Efficiency most; mixed precision + kernel fusion (from Lab 07) raises MFU most. Students must apply both classes of optimization to reach 85% MFU.
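Both gauges can be sketched from the formulas above. The phase shares are the baseline and optimized numbers from the pipeline table; the per-step model-FLOP count and 312 TFLOP/s peak are illustrative assumptions chosen to land near the act's 45% baseline MFU:

```python
# Sketch of the two gauges students must watch simultaneously.

def pipeline_efficiency(t_data, t_transfer, t_compute):
    # Fraction of wall time the GPU is executing kernels ("busyness").
    return t_compute / (t_data + t_transfer + t_compute)

def mfu(model_flops, wall_time_s, peak_flops):
    # Fraction of peak hardware FLOPs doing useful model work.
    return model_flops / (wall_time_s * peak_flops)

pe_baseline  = pipeline_efficiency(0.40, 0.25, 0.35)  # baseline phase shares
pe_optimized = pipeline_efficiency(0.05, 0.20, 0.75)  # optimized phase shares
print(f"pipeline efficiency: {pe_baseline:.0%} -> {pe_optimized:.0%}")

# Illustrative: ~1.4e14 useful model FLOPs per second of wall time on a
# 312 TFLOP/s part gives roughly the act's baseline MFU.
print(f"baseline MFU: {mfu(1.4e14, 1.0, 312e12):.0%}")
```

The two functions share no terms except wall time, which is why a fix can move one gauge without moving the other.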

Secondary instrument: The Optimizer Walkthrough Summary — a side panel showing the chapter's GPT-2 scenario numbers in a table:

| Metric | Baseline | After Optimization |
|---|---|---|
| Memory | 89 GB | 32 GB |
| MFU | 45% | 85% |
| Throughput | 1,200 tokens/sec | |
| Data loading share | 40% | 5% |

Students can compare their optimized configuration against this reference.

The Scaling Challenge

A second panel: "Fit GPT-2 XL training on a Laptop GPU (8 GB)."

The student starts with the baseline configuration (89 GB total) and must apply a combination of:

  • FP16 mixed precision (halves active buffers)
  • Switch optimizer from Adam to SGD (saves 12 GB, costs 3× more iterations)
  • Gradient checkpointing (saves ~4× activation memory, adds 33% compute overhead)
  • Reduce batch size to 1

They must find a valid configuration that fits within 8 GB. The system tracks their move sequence.

Key discovery: even with every optimization applied, GPT-2 XL cannot train in 8 GB. The student must quantify the minimum memory floor from first principles:

Minimum static memory floor derivation (SGD + FP16 with FP32 master copy):

| Component | Calculation | Size |
|---|---|---|
| Parameters (FP16 active) | 1.5B × 2 bytes | 3 GB |
| Gradients (FP16) | 1.5B × 2 bytes | 3 GB |
| FP32 master copy (required for SGD updates) | 1.5B × 4 bytes | 6 GB |
| Static floor total | | 12 GB |

At batch=1, activation memory for GPT-2 XL ≈ 0.5–1 GB, making total minimum ≈ 12–13 GB. This already exceeds the Laptop GPU's 8 GB. Adam adds 12 GB of optimizer state on top, making the FP32 Adam floor 24 GB (as shown in Act 1).
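The floor derivation above is three multiplications; writing it out makes the failure state checkable rather than asserted:

```python
# Laptop-GPU floor from first principles (SGD + FP16 active + FP32 master).
PARAMS = 1.5e9            # GPT-2 XL parameter count
LAPTOP_BUDGET_GB = 8      # RTX 4060-class device memory

params_fp16 = PARAMS * 2 / 1e9   # 3 GB: active FP16 weights
grads_fp16  = PARAMS * 2 / 1e9   # 3 GB: FP16 gradients
master_fp32 = PARAMS * 4 / 1e9   # 6 GB: FP32 master copy for stable updates

floor = params_fp16 + grads_fp16 + master_fp32
print(f"static floor: {floor:.0f} GB (budget: {LAPTOP_BUDGET_GB} GB)")
assert floor > LAPTOP_BUDGET_GB  # no single-node configuration fits
```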

To train GPT-2 XL on a laptop GPU, you would need parameter sharding — which requires multiple machines and is out of scope for Volume 1.

Failure state: When the student has applied all available optimizations and still cannot fit:

"🔴 Physical Limit Reached. Static memory floor = 12 GB (SGD + FP16 active weights + FP32 master copy). This is 4 GB above the Laptop GPU's budget before a single activation is stored. GPT-2 XL cannot train on a Laptop GPU through single-node optimization alone. This boundary is the entry condition for Volume 2: parameter sharding across nodes."

Structured Reflection

Four-option multiple choice:

"Your pipeline analysis shows 40% of time is spent on data loading. The most effective single fix is:"

  • A) Buy a GPU with 2× more TFLOPS
  • B) Switch from Adam to SGD to reduce memory pressure
  • C) Enable prefetching so data loading overlaps with the previous step's compute ← correct
  • D) Reduce batch size to minimize the memory transfer volume

Then complete the sentence:

"Gradient checkpointing reduces activation memory by ___× at the cost of ___% additional compute, because ___."

Expected fill-in: 4×, 33%, "the backward pass must recompute intermediate activations that were discarded rather than stored."
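The expected trade-off can be sketched numerically; the baseline activation size and step time below are placeholder assumptions, with only the 4× and 33% factors coming from the chapter:

```python
# Quantifying the gradient-checkpointing trade-off with the act's factors.
def checkpointed(act_gb, compute_s, mem_factor=4, overhead=0.33):
    """Return (activation memory, step time) after checkpointing:
    activations shrink by mem_factor; the backward pass pays `overhead`
    extra compute to recompute the discarded intermediates."""
    return act_gb / mem_factor, compute_s * (1 + overhead)

# Illustrative baseline: 24 GB of activations, 1.0 s per step.
act_gb, step_s = checkpointed(24.0, 1.0)
print(f"activations: {act_gb:.1f} GB, step time: {step_s:.2f} s")
```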

Math Peek:

\text{MFU} = \frac{O_{total}}{T_{wall} \cdot N \cdot R_{peak}} \qquad \eta \approx 0.45 \text{ (GPT-3 baseline)}, \quad \eta > 0.55 \text{ (current target)}

5. Visual Layout Specification

Act 1: Optimizer Memory Tax

  • Primary: Stacked vertical bar chart — 4–6 components (params, grads, Adam m, Adam v, activations), toggleable per component. Y-axis: GB (0–100). Device threshold line in red.
  • Secondary: Optimizer comparison table (SGD / Momentum / Adam) with memory multiplier column (1× / 2× / 3×) and convergence steps column (150K+ / 90K / 50K).
  • Prediction overlay: Student's selected option highlighted; correct bar annotated with "12 GB gap."
  • Failure state: All bars turn RedLine (#CB202D) when memory > device budget; banner appears.

Act 2: MFU Gap

  • Primary: Stacked horizontal bar (one iteration = 100%): Data Loading / Memory Transfers / Compute. Three segments, colors: OrangeLine / BlueLine / GreenLine.
  • Secondary: MFU gauge (semicircle, 0–100%, threshold line at 80%).
  • Tertiary: Optimizer Walkthrough Summary table (baseline vs. optimized reference numbers from chapter).
  • Laptop GPU scaling panel: Memory bar starting at 89 GB, must reach ≤ 8 GB; failure state when all optimizations exhausted above 8 GB.

6. Deployment Context Definitions

| Context | Device | Memory | Power Budget | Key Constraint |
|---|---|---|---|---|
| Training Node | H100 (80 GB HBM3) | 80 GB | 700 W TDP | Maximize MFU; optimizer state is a manageable fraction |
| Laptop GPU | RTX 4060 (8 GB GDDR6) | 8 GB | 115 W TDP | Static memory floor of GPT-2 XL exceeds budget even at minimum configuration |

The two contexts isolate the chapter's key insight: single-node optimization can recover 40 percentage points of MFU, but cannot overcome the memory floor for large models. That boundary is a precise, computable number — not a vague "scale limit."


7. Design Ledger Output

```json
{
  "chapter": 8,
  "optimizer_chosen": "adam | sgd | momentum",
  "precision_chosen": "fp32 | mixed | fp16",
  "gradient_checkpointing": true,
  "final_mfu_pct": 82,
  "baseline_mfu_estimate_pct": 75,
  "laptop_fit_achieved": false,
  "laptop_minimum_memory_gb": 12,
  "laptop_floor_derivation": "SGD+FP16: params_fp16(3GB) + grads_fp16(3GB) + master_fp32(6GB) = 12GB static floor"
}
```

The optimizer_chosen and precision_chosen fields feed forward to:

  • Lab 10 (Compression): The precision choice affects the quantization baseline comparison
  • Lab 11 (HW Acceleration): The MFU value becomes the starting point for roofline analysis

8. Traceability Table

| Lab Element | Chapter Section | Exact Claim Being Tested |
|---|---|---|
| SGD = 1×, Momentum = 2×, Adam = 3× memory | Line 992 | "Memory requirements increase progressively from SGD (1× model size) through Momentum (2×) to Adam (3×)" |
| GPT-2 XL = 1.5B params, 6 GB FP32 | Lighthouse spec, line 327 | "~6 GB (FP32) for weights alone" |
| Adam adds 12 GB for GPT-2 XL | Lines 1084–1085 | "1.5B × 8 bytes = [Adam state]"; each vector = 6 GB, two vectors = 12 GB |
| 50K steps (Adam) vs 150K steps (SGD) | Line 1095 | "GPT-2 converges in ~50K steps…~150K+ steps with SGD+Momentum" |
| Mixed precision saves ~50% of active compute buffers (not total training memory) | Line 3463 | "memory consumption decreases by approximately 50%" — applies to active FP16 buffers only; FP32 master copy is retained, making total reduction ~1.5× not 2× |
| Baseline MFU = 45% | Line 4612 | "accelerator utilization: 45%" |
| Data loading = 40% of iteration time (baseline) | Line 4613 | "Data loading: 40% of iteration time" |
| After optimization: compute = 75%, data loading = 5% | Lines 4628–4629 | "Data loading: 5%…Compute: 75% of iteration time (overlapped)" |
| Optimized MFU = 85% | Line 4626 | "accelerator utilization: 85%" |
| Memory reduction: 89 GB → 32 GB | Line 4636 | "Naive: 89 GB…Optimized: 32 GB…2.8× reduction" |
| Gradient checkpointing: 4× activation reduction | Line 4604 | "reduces activations by 4×" |
| Gradient checkpointing: +33% compute overhead | Line 4604 | "33% more compute for activation recomputation" |
| Batch ≥ 256 → >90% utilization | Line 908 | "batch sizes of 256 or higher typically achieve over 90% hardware utilization" |
| Batch 16–32 → 60–70% utilization | Line 908 | "smaller batches of 16–32 may only achieve 60–70% utilization" |
| GPT-3 baseline η ≈ 0.45 | Line 223 | "GPT-3 training achieved η ≈ 0.45" |
| Current systems target η > 0.55 | Line 223 | "current systems target η > 0.55" |