Mirror of https://github.com/harvard-edge/cs249r_book.git, synced 2026-05-10 15:49:25 -05:00
Apply targeted fixes from the remaining high-confidence-major fix queue across the cloud, edge, mobile, and tinyml tracks. Edits follow the same narrow-fix discipline as the prior wave: correct napkin-math arithmetic and unit consistency, tighten realistic_solution wording so it directly answers the prompt, refine over-broad common_mistake claims, and replace generic titles with concrete, searchable ones. Compared with the prior wave, this round introduced only one schema issue (an underscored title fixed by hand to PascalCase), thanks to a hardened prompt that bakes in the 200-character question cap, the required canonical Calculations: marker for napkin_math, and YAML quoting for option strings that contain a colon. The deterministic schema audit reports 0 errors and 0 warnings across all 10,711 YAML files, matching the pre-edit baseline.
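The commit message names the constraints the hardened prompt enforces but does not show the audit itself. Purely as an illustration, two of those checks (the 200-character question cap and the canonical Calculations: marker for napkin_math) could be expressed as a short script along these lines; the questions/**/*.yaml glob is an assumption, while the field names match the item file shown below.

```python
import glob
import sys

import yaml  # PyYAML

errors = 0
# The directory layout is a guess; point the glob at wherever the item files live.
for path in sorted(glob.glob("questions/**/*.yaml", recursive=True)):
    with open(path, encoding="utf-8") as fh:
        item = yaml.safe_load(fh)
    question = item.get("question") or ""
    napkin = item.get("napkin_math") or ""
    if len(question) > 200:  # the 200-character question cap
        print(f"{path}: question exceeds 200 characters")
        errors += 1
    if napkin and "Calculations:" not in napkin:  # canonical napkin_math marker
        print(f"{path}: napkin_math is missing the 'Calculations:' marker")
        errors += 1
print(f"{errors} errors")
sys.exit(1 if errors else 0)
```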
46 lines · 2.5 KiB · YAML
schema_version: '1.0'
id: cloud-2152
track: cloud
level: L3
zone: diagnosis
topic: pruning-sparsity
competency_area: optimization
bloom_level: apply
phase: training
title: The FP16 Loss Scaling Dance
scenario: You switch a training run from BF16 to FP16 to use Tensor Core optimizations on older V100 GPUs. Training appears to work for 1,000 steps, then the loss flatlines and stops decreasing. Gradient norms show zero.
question: What happened, and what mechanism should have prevented this?
details:
realistic_solution: 'FP16 has a much smaller dynamic range than BF16: the smallest normal value is ~6×10⁻⁵, while subnormals extend down to ~6×10⁻⁸. Small gradients common in later training may become subnormal and can be flushed to zero by hardware or kernels, especially on older V100 FP16 paths. Dynamic loss scaling should prevent this: it multiplies the loss by a large factor (e.g., 2¹⁶ = 65536) before the backward pass, keeping gradients in FP16''s normal range. After backward, gradients are unscaled before the optimizer step. Check whether the loss scaler became too small after repeated Inf/NaN detection or whether the kernel path flushes FP16 subnormals.'
common_mistake: |
  **The Pitfall:** Assuming gradients are actually zero because the model converged.
  **The Rationale:** Candidates see zero gradients and assume the training is complete or stuck in a local minimum.
  **The Consequence:** They fail to realize that small gradients underflow to zero in FP16's limited dynamic range.
napkin_math: |
  1. **Assumptions & Constraints:**
     - FP16 smallest normal value: ~6e-5; smallest subnormal value: ~6e-8.
     - Typical late-training gradient: ~1 * 10^-7.
     - Loss scaling factor: 65,536 (2^16).
  2. **Calculations:**
     - Without scaling: 1e-7 is representable only as a subnormal, so precision is poor and some FP16 hardware/kernel paths may flush it to 0.
     - With scaling: 1 * 10^-7 * 65,536 = 0.0065536 (~6.55 * 10^-3).
     - Value is safely within FP16 normal range.
  3. **Conclusion & Interpretation:** **FP16 underflow or flush-to-zero risk (Precision-Bound).**
status: published
provenance: imported
requires_explanation: false
expected_time_minutes: 6
validated: true
validation_status: OK
validation_date: '2026-04-01'
validation_model: gemini-2.5-flash
math_verified: true
math_status: CORRECT
math_date: '2026-04-03'
math_model: gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
  notes: null
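For reviewers of this item, the dynamic loss scaling mechanism that realistic_solution describes can be sketched in a few lines of PyTorch. This is a minimal illustration, not part of the question bank: the model, optimizer, and data are placeholders, the 2**16 initial scale mirrors the napkin_math assumption, and it needs a CUDA GPU to run.

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

# Placeholder model, optimizer, and data -- only the scaling mechanics matter here.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = GradScaler(init_scale=2**16)  # 65,536, the factor assumed in napkin_math

for step in range(1_000):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):  # FP16 compute path (V100 Tensor Cores)
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # backward on loss * scale keeps grads in FP16's normal range
    scaler.step(optimizer)         # unscales grads first; skips the step if Inf/NaN is found
    scaler.update()                # grows the scale on success, shrinks it after Inf/NaN
    if step % 100 == 0:
        # A scale that keeps halving toward 1 means repeated Inf/NaN hits --
        # the "loss scaler became too small" condition the solution asks candidates to check.
        print(step, scaler.get_scale())
```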
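The napkin_math figures can also be checked directly against FP16's limits. The values in the comments are what recent PyTorch builds print; the 1e-7 gradient and 65,536 scale are the item's own assumptions.

```python
import torch

f16 = torch.finfo(torch.float16)
print(f16.tiny)    # smallest normal FP16 value, ~6.10e-5
print(2.0 ** -24)  # smallest FP16 subnormal, ~5.96e-8

grad = 1e-7  # the "typical late-training gradient" assumed by the item
print(torch.tensor(grad).half())           # representable only as a subnormal (~1.19e-7)
print(torch.tensor(1e-8).half())           # below half the smallest subnormal: rounds to 0.0
print(torch.tensor(grad * 65_536).half())  # scaled by 2**16: ~6.55e-3, safely in the normal range
```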