Vijay Janapa Reddi dc72ab3700 fix(interviews): semantic-audit corrections across 1748 question YAMLs
Apply targeted fixes from the semantic-review fix queue across cloud, edge,
mobile, and tinyml tracks. Most edits correct napkin-math arithmetic, enforce
unit consistency, tighten realistic_solution wording so it directly answers
the prompt, narrow over-broad common_mistake claims, and replace generic
titles with concrete, searchable ones.

Per-track changes: cloud 573, edge 400, mobile 389, tinyml 386.

Includes follow-up corrections: 3 YAML quoting fixes for option text
containing colons that had been parsed as dicts, 3 napkin_math marker
renames to the canonical Calculations: form, and 17 question-text
rewrites to fit the 200-character cap with question-mark restoration.

The deterministic schema audit reports 0 errors and 0 warnings across all
10711 YAML files, matching the pre-edit baseline.
2026-05-04 21:00:10 -04:00

schema_version: '1.0'
id: cloud-1013
track: cloud
level: L5
zone: evaluation
topic: pruning-sparsity
competency_area: optimization
bloom_level: evaluate
phase: training
title: Elastic Scale-Down with Constant Global Batch Size
scenario: You are orchestrating an elastic training job for a 7B parameter LLM on an autoscaling cluster of p4d.24xlarge instances (8x A100 40GB GPUs per node). The fleet dynamically resizes between 16 and 64 nodes based on spot instance availability. To preserve strict convergence guarantees, the global batch size is locked at 2048. Evaluate the architectural trade-offs and necessary configuration adjustments when the cluster abruptly scales down from 64 to 16 nodes.
question: When the fleet scales from 64 to 16 nodes with GBS locked at 2048, how should you adjust micro-batch and accumulation under 40GB VRAM?
details:
  realistic_solution: |
    Maintain the global batch size of 2048 by setting GPUs * micro-batch * accumulation steps = 2048. After scaling down to 16 nodes, there are 128 GPUs, so each GPU must contribute an effective batch of 16 samples per optimizer step. Because a single per-GPU micro-batch of 16 exceeds the 40GB VRAM limit, choose a micro-batch that fits, such as 4, and use 4 gradient accumulation steps: 128 * 4 * 4 = 2048. This preserves optimizer-step equivalence without triggering OOM errors.
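    A minimal sketch of the same arithmetic, assuming the launcher recomputes the pair on every resize; the helper name and the micro-batch cap of 4 are illustrative assumptions, not from any particular framework:

    ```python
    def batch_config(num_gpus: int, global_batch: int, max_micro_batch: int):
        """Return (micro_batch, accum_steps) with num_gpus * micro_batch * accum_steps == global_batch."""
        per_gpu = global_batch // num_gpus
        assert per_gpu * num_gpus == global_batch, "global batch must divide evenly across GPUs"
        micro_batch = min(per_gpu, max_micro_batch)
        while per_gpu % micro_batch:  # shrink until it divides the per-GPU share evenly
            micro_batch -= 1
        return micro_batch, per_gpu // micro_batch

    # 16 nodes x 8 GPUs = 128 GPUs, GBS locked at 2048, micro-batch capped at 4 by 40GB VRAM
    print(batch_config(num_gpus=128, global_batch=2048, max_micro_batch=4))  # -> (4, 4)
    ```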
  common_mistake: |
    **The Pitfall:** Scaling the learning rate with the linear scaling rule instead of adjusting the per-GPU micro-batch and gradient accumulation steps.
    **The Rationale:** Changing the learning rate alters the optimization dynamics that the locked global batch size is meant to preserve, and candidates who instead raise the local batch size without gradient accumulation overlook the 40GB VRAM constraint.
    **The Consequence:** The training job either diverges from the incorrect learning rate adjustment or crashes outright with out-of-memory errors on the 40GB GPUs.
  napkin_math: |
    **Assumptions & Constraints:**
    - 64 nodes (512 GPUs) scaling down to 16 nodes (128 GPUs).
    - Global Batch Size (GBS) = 2048.
    - A100 40GB OOMs at a per-GPU batch > 8 for the 7B model.
    **Calculations:**
    - At 64 nodes: Per-GPU Batch = 2048 / 512 = 4.
    - At 16 nodes: Required Per-GPU Batch = 2048 / 128 = 16 samples per optimizer step.
    - A single micro-batch of 16 exceeds the VRAM limit, so choose Micro-Batch = 4 and Gradient Accumulation Steps = 16 / 4 = 4.
    **Conclusion & Interpretation:**
    - **Result: Total Global Batch = 128 GPUs * 4 (Micro-Batch) * 4 (Accumulation) = 2048.** The system maintains the global batch size without exceeding memory limits.
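    A minimal, runnable sketch of the accumulation mechanics, assuming a PyTorch-style API; the toy linear model and random tensors are placeholders, not part of the reference setup:

    ```python
    import torch

    model = torch.nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    micro_batch, accum_steps = 4, 4  # per-GPU values from the calculation above

    optimizer.zero_grad()
    for _ in range(accum_steps):
        x = torch.randn(micro_batch, 16)   # placeholder micro-batch of 4 samples
        y = torch.randn(micro_batch, 1)
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()    # scale so gradients average over all 16 samples
    optimizer.step()  # one optimizer step covers micro_batch * accum_steps = 16 samples per GPU
    ```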
status: published
provenance: imported
requires_explanation: false
expected_time_minutes: 12
validated: true
validation_status: OK
validation_date: '2026-04-01'
validation_model: gemini-2.5-flash
math_verified: true
math_status: CORRECT
math_date: '2026-04-03'
math_model: gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
  notes: null
created_at: '2026-03-23T21:28:38.390855'