Vijay Janapa Reddi 30e93af5b6 fix(interviews): wave-4 semantic-audit corrections across 1857 question YAMLs
Apply targeted fixes from the remaining high-confidence-major fix queue
across cloud, edge, mobile, and tinyml tracks. Edits follow the same
narrow-fix discipline as the prior wave: correct napkin-math arithmetic,
enforce unit consistency, tighten realistic_solution wording so it
directly answers the prompt, refine over-broad common_mistake claims,
and replace generic titles with concrete, searchable ones.

Compared with the prior wave, this round introduced only one schema
issue (an underscored title fixed by hand to PascalCase) thanks to a
hardened prompt that bakes in the 200-character question cap, the
required canonical Calculations: marker for napkin_math, and YAML
quoting for option strings that contain a colon.

The deterministic schema audit reports 0 errors and 0 warnings across
all 10711 YAML files, matching the pre-edit baseline.
2026-05-05 00:24:15 -04:00


schema_version: '1.0'
id: cloud-2152
track: cloud
level: L3
zone: diagnosis
topic: pruning-sparsity
competency_area: optimization
bloom_level: apply
phase: training
title: The FP16 Loss Scaling Dance
scenario: You switch a training run from BF16 to FP16 to use Tensor Core optimizations on older V100 GPUs. Training appears to work for 1,000 steps, then the loss flatlines and stops decreasing. Gradient norms show zero.
question: What happened, and what mechanism should have prevented this?
details:
  realistic_solution: 'FP16 has a much smaller dynamic range than BF16: the smallest normal value is ~6×10⁻⁵, while subnormals extend down to ~6×10⁻⁸. Small gradients common in later training may become subnormal and can be flushed to zero by hardware or kernels, especially on older V100 FP16 paths. Dynamic loss scaling should prevent this: it multiplies the loss by a large factor (e.g., 2¹⁶ = 65536) before the backward pass, keeping gradients in FP16''s normal range. After backward, gradients are unscaled before the optimizer step. Check whether the loss scaler became too small after repeated Inf/NaN detection or whether the kernel path flushes FP16 subnormals.'
  common_mistake: |
    **The Pitfall:** Assuming gradients are actually zero because the model converged.
    **The Rationale:** Candidates see zero gradients and assume the training is complete or stuck in a local minimum.
    **The Consequence:** They fail to realize that small gradients underflow to zero in FP16's limited dynamic range.
  napkin_math: |
    1. **Assumptions & Constraints:**
       - FP16 smallest normal value: ~6e-5; smallest subnormal value: ~6e-8.
       - Typical late-training gradient: ~1e-7.
       - Loss scaling factor: 65,536 (2^16).
    2. **Calculations:**
       - Without scaling: 1e-7 is representable only as a subnormal, so precision is poor and some FP16 hardware/kernel paths may flush it to 0.
       - With scaling: 1e-7 * 65,536 = 0.0065536 (~6.6e-3).
       - The scaled value is safely within FP16's normal range.
    3. **Conclusion & Interpretation:** **FP16 underflow or flush-to-zero risk (Precision-Bound).**
status: published
provenance: imported
requires_explanation: false
expected_time_minutes: 6
validated: true
validation_status: OK
validation_date: '2026-04-01'
validation_model: gemini-2.5-flash
math_verified: true
math_status: CORRECT
math_date: '2026-04-03'
math_model: gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
  notes: null
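
For context on the mechanism this question tests, here is a minimal PyTorch sketch illustrating both the ~1e-7 underflow from the napkin math and where dynamic loss scaling intervenes. It is an illustration under assumptions, not the graded solution: `model`, `optimizer`, `loss_fn`, `inputs`, and `targets` are hypothetical placeholders, and a CUDA device is assumed.

```python
import torch

# Part 1: the napkin math, executed.
# A typical late-training gradient (~1e-7) lies below FP16's smallest
# normal value (~6e-5); as a subnormal it keeps very little precision,
# and some kernels/hardware paths flush subnormals to zero outright.
g = torch.tensor(1e-7)
print(g.to(torch.float16))            # ~1.19e-07: nearest representable subnormal (~19% error)
print((g * 2**16).to(torch.float16))  # ~6.55e-03: scaled by 65,536, safely in the normal range

# Part 2: where dynamic loss scaling fits in a training step.
# All names below (model, optimizer, loss_fn, inputs, targets) are
# assumed placeholders, not part of this repository.
scaler = torch.cuda.amp.GradScaler(init_scale=2**16)

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # backward on (loss * scale) keeps small grads representable
    scaler.step(optimizer)         # unscales grads; skips the step if Inf/NaN is detected
    scaler.update()                # shrinks the scale after overflow, grows it after clean steps
```

If repeated overflows keep halving the scale, it can eventually become too small to lift late-training gradients out of the subnormal range, which is exactly the failure mode the realistic_solution asks candidates to check.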