mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-10 15:49:25 -05:00
Apply targeted fixes from the semantic-review fix queue across cloud, edge, mobile, and tinyml tracks. Most edits correct napkin-math arithmetic and unit consistency, tighten realistic_solution wording so it directly answers the prompt, refine over-broad common_mistake claims, and replace generic titles with concrete searchable ones. Per-track changes: cloud 573, edge 400, mobile 389, tinyml 386. Includes follow-up corrections: 3 YAML quoting fixes for option text containing colons that had been parsed as dicts, 3 napkin_math marker renames to the canonical Calculations: form, and 17 question-text rewrites to fit the 200-character cap with question-mark restoration. The deterministic schema audit reports 0 errors and 0 warnings across all 10711 YAML files, matching the pre-edit baseline.
52 lines
3.0 KiB
YAML
schema_version: '1.0'
id: cloud-1013
track: cloud
level: L5
zone: evaluation
topic: pruning-sparsity
competency_area: optimization
bloom_level: evaluate
phase: training
title: Elastic Scale-Down with Constant Global Batch Size
scenario: You are orchestrating an elastic training job for a 7B parameter LLM on an autoscaling cluster of p4d.24xlarge instances (8x A100 40GB GPUs per node). The fleet dynamically resizes between 16 and 64 nodes based on spot instance availability. To preserve strict convergence guarantees, the global batch size is locked at 2048. Evaluate the architectural trade-offs and necessary configuration adjustments when the cluster abruptly scales down from 64 to 16 nodes.
question: When the fleet scales from 64 to 16 nodes with GBS locked at 2048, how should you adjust micro-batch and accumulation under 40GB VRAM?
details:
  realistic_solution: |
    Maintain the global batch size of 2048 by setting GPUs * micro-batch * accumulation steps = 2048. After scaling down to 16 nodes, there are 128 GPUs, so each GPU must contribute an effective batch of 16 samples per optimizer step. Because a single per-GPU micro-batch of 16 exceeds the 40GB VRAM limit, choose a micro-batch that fits, such as 4, and use 4 gradient accumulation steps: 128 * 4 * 4 = 2048. This preserves optimizer-step equivalence without triggering OOM errors.
  common_mistake: |
    **The Pitfall:** Scaling the learning rate using the linear scaling rule instead of adjusting per-worker batch sizes.

    **The Rationale:** Altering the learning rate changes the optimization dynamics, and candidates who instead raise the local batch size without gradient accumulation overlook the OOM constraint.

    **The Consequence:** The training job either diverges due to incorrect learning rate adjustments or crashes entirely from out-of-memory errors on the 40GB GPUs.
  napkin_math: |
    **Assumptions & Constraints:**

    - 64 nodes (512 GPUs) scaling to 16 nodes (128 GPUs).
    - Global Batch Size = 2048.
    - A100 40GB OOMs at BS > 8 for 7B model.

    **Calculations:**

    - At 64 nodes: Per-GPU Batch = 2048 / 512 = 4 samples per optimizer step.
    - At 16 nodes: Required Per-GPU Batch = 2048 / 128 = 16 samples per optimizer step.
    - A single micro-batch of 16 exceeds the VRAM limit, so choose Micro-Batch = 4 and Gradient Accumulation Steps = 16 / 4 = 4.

    **Conclusion & Interpretation:**

    - **Result: Total Global Batch = 128 GPUs * 4 (Micro-Batch) * 4 (Accumulation) = 2048.** The system correctly maintains the global batch size without exceeding memory limits.
status: published
provenance: imported
requires_explanation: false
expected_time_minutes: 12
validated: true
validation_status: OK
validation_date: '2026-04-01'
validation_model: gemini-2.5-flash
math_verified: true
math_status: CORRECT
math_date: '2026-04-03'
math_model: gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
  notes: null
created_at: '2026-03-23T21:28:38.390855'