mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-10 15:49:25 -05:00
Apply targeted fixes from the semantic-review fix queue across cloud, edge, mobile, and tinyml tracks. Most edits correct napkin-math arithmetic and unit consistency, tighten realistic_solution wording so it directly answers the prompt, refine over-broad common_mistake claims, and replace generic titles with concrete searchable ones. Per-track changes: cloud 573, edge 400, mobile 389, tinyml 386. Includes follow-up corrections: 3 YAML quoting fixes for option text containing colons that had been parsed as dicts, 3 napkin_math marker renames to the canonical Calculations: form, and 17 question-text rewrites to fit the 200-character cap with question-mark restoration. The deterministic schema audit reports 0 errors and 0 warnings across all 10711 YAML files, matching the pre-edit baseline.
52 lines
3.0 KiB
YAML
schema_version: '1.0'
id: cloud-1013
track: cloud
level: L5
zone: evaluation
topic: pruning-sparsity
competency_area: optimization
bloom_level: evaluate
phase: training
title: Elastic Scale-Down with Constant Global Batch Size
scenario: You are orchestrating an elastic training job for a 7B parameter LLM on an autoscaling cluster of p4d.24xlarge instances (8x A100 40GB GPUs per node). The fleet dynamically resizes between 16 and 64 nodes based on spot instance availability. To preserve strict convergence guarantees, the global batch size is locked at 2048. Evaluate the architectural trade-offs and necessary configuration adjustments when the cluster abruptly scales down from 64 to 16 nodes.
question: When the fleet scales from 64 to 16 nodes with GBS locked at 2048, how should you adjust micro-batch and accumulation under 40GB VRAM?
details:
  realistic_solution: |
    Maintain the global batch size of 2048 by setting GPUs * micro-batch * accumulation steps = 2048. After scaling down to 16 nodes, there are 128 GPUs, so each GPU must contribute an effective batch of 16 samples per optimizer step. Because a single per-GPU micro-batch of 16 exceeds the 40GB VRAM limit, choose a micro-batch that fits, such as 4, and use 4 gradient accumulation steps: 128 * 4 * 4 = 2048. This preserves optimizer-step equivalence without triggering OOM errors.
  common_mistake: |
    **The Pitfall:** Scaling the learning rate using the linear scaling rule instead of adjusting per-worker batch sizes.

    **The Rationale:** Altering the learning rate changes the optimization dynamics, and candidates who instead raise the local batch size without gradient accumulation overlook the OOM constraint.

    **The Consequence:** The training job either diverges due to incorrect learning rate adjustments or crashes entirely from out-of-memory errors on the 40GB GPUs.
  napkin_math: |
    **Assumptions & Constraints:**

    - 64 nodes (512 GPUs) scaling to 16 nodes (128 GPUs).
    - Global Batch Size = 2048.
    - A100 40GB OOMs at BS > 8 for 7B model.

    **Calculations:**

    - At 64 nodes: Per-GPU Batch = 2048 / 512 = 4 samples per optimizer step.
    - At 16 nodes: Required Per-GPU Batch = 2048 / 128 = 16 samples per optimizer step.
    - A single micro-batch of 16 exceeds the VRAM limit, so choose Micro-Batch = 4 and Gradient Accumulation Steps = 16 / 4 = 4.

    **Conclusion & Interpretation:**

    - **Result: Total Global Batch = 128 GPUs * 4 (Micro-Batch) * 4 (Accumulation) = 2048.** The system correctly maintains the global batch size without exceeding memory limits.
status: published
provenance: imported
requires_explanation: false
expected_time_minutes: 12
validated: true
validation_status: OK
validation_date: '2026-04-01'
validation_model: gemini-2.5-flash
math_verified: true
math_status: CORRECT
math_date: '2026-04-03'
math_model: gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
  notes: null
created_at: '2026-03-23T21:28:38.390855'