# StaffML

### ML Systems Interview Playbook

**9,000+ physics-grounded systems design questions across Cloud, Edge, Mobile & TinyML.**

*You can generate the code, but you cannot prompt your way out of a silicon bottleneck.*
Launch StaffML

> [!NOTE]
> **📌 Early release (2026)**
>
> StaffML shipped with the **2026** MLSysBook refresh. The vault, web apps, and question flows are **actively iterated** as we tune for real interviews—expect meaningful updates to content, UX, and scoring.
>
> **Feedback** — [GitHub issues](https://github.com/harvard-edge/cs249r_book/issues) or pull requests.

---

StaffML is a free, open-source interview prep platform for ML systems engineers. Browse a curated vault of questions organized by competency area, difficulty level (Bloom's Taxonomy L1–L6+), and deployment track. Built by [Prof. Vijay Janapa Reddi](https://github.com/profvjreddi), Harvard University.
| Feature | Description |
|---|---|
| Vault | Browse questions by area, topic, and difficulty |
| Practice | Drill with spaced repetition and daily challenges |
| Gauntlet | Timed mock interview sessions with self-assessment |
| Progress | Track coverage across competency areas and tracks |
| Chains | Deepening sequences from L1 Recall to L6+ Architect |
> [!TIP]
> If StaffML helps your prep, **[give us a star](https://github.com/harvard-edge/cs249r_book)** — it helps others find this resource.

**Data:** `vault/corpus.json` (generated by `vault build`, not in git) · [`vault/taxonomy.json`](vault/taxonomy.json) · **App source:** [`staffml/`](staffml/)

---

## Deployment Tracks

Each track targets a different deployment regime — different physics, different constraints, different interview questions.
| Track | Focus | Primary constraint |
|---|---|---|
| ☁️ Cloud | Data center training & serving | Memory bandwidth / network |
| 🤖 Edge | Autonomous vehicles, robotics | Thermal envelope / real-time |
| 📱 Mobile | On-device AI for smartphones | Battery life / shared resources |
| 🔬 TinyML | Microcontroller & ultra-low-power | SRAM capacity / hard real-time |
---

## Mastery Levels

Every question is tagged with a mastery level mapped to [Bloom's taxonomy](https://en.wikipedia.org/wiki/Bloom%27s_taxonomy):
| Level | Name | Scope | What the interviewer hears |
|---|---|---|---|
| 🔵 L1 | Recall | Own a task | "HBM is 300x slower than L1 cache." |
| 🟢 L2 | Understand | Own a task | "The Roofline model relates compute to memory bandwidth." |
| 🟡 L3 | Apply | Own a component | "This workload is memory-bound because its arithmetic intensity is below the ridge point." |
| 🟠 L4 | Analyze | Own a system | "Switching from A100 to H100 won't help because the ridge point shifts." |
| 🔴 L5 | Evaluate | Own the architecture | "Let me derive the optimal parallelism from the NVLink topology." |
| 🟣 L6+ | Architect | Own the org | "Here's a fault-tolerant training architecture for 1T params across 3 data centers." |
---

## Depth Chains

Questions are organized into **chains** — sequences that deepen understanding of a single topic from recall to architecture. Each chain walks you through the Bloom levels, building on the previous question.

**Example: GPU Memory Hierarchy Chain (6 questions)**
| Step | Level | Question |
|---|---|---|
| 1 | 🔵 L1 | The HBM vs L1 Latency Gap |
| 2 | 🟢 L2 | The FP16 Model Footprint |
| 3 | 🟡 L3 | KV Cache Memory for 7B Model Serving |
| 4 | 🟠 L4 | OOM at Step 500 but Not Step 1 |
| 5 | 🔴 L5 | CPU Offloading vs Activation Recomputation |
| 6 | 🟣 L6+ | Memory Budget for High-Concurrency LLM Serving |
> [!NOTE]
> The vault contains **1,000+ chains** across all tracks. In the app, chains appear after you answer a question — click "Next in chain" to go deeper.

---

## Vault Stats
| Metric | Count |
|---|---|
| Questions | 9,000+ |
| Chains | 1,000+ |
| Taxonomy concepts | 650+ |
| Competency areas | 12 |
| Deployment tracks | 4 + Global |
| Mastery levels | L1–L6+ |
---

## Sample Questions

A taste of what's inside. Click any question to reveal the model answer with napkin math.

### ☁️ Cloud
🟒 L2   Physical Limits on Training Cluster Scale
Explain why you cannot simply double the number of GPUs indefinitely to halve training time, and identify the three physical ceilings that bound cluster scaling.
Three physical ceilings prevent infinite scaling: (1) **Communication bottleneck** — synchronous training requires AllReduce to average gradients across all GPUs every step. With N GPUs, AllReduce latency grows as O(log N) per step. At 10,000+ GPUs, communication time can exceed computation time. (2) **Power and cooling** — each GPU draws 300–700W. A 10K GPU cluster requires 4+ MW just for GPUs. (3) **Critical batch size** — beyond the critical batch size, larger batches yield diminishing returns because the gradient estimate is already low-noise. For GPT-3, this is ~3.2M tokens.

```text
10K GPUs × 400W = 4MW
AllReduce at 10K nodes: ~10ms overhead vs ~50ms compute = 17% communication tax
Critical batch size for GPT-3: ~3.2M tokens
```
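The napkin math above can be sketched as a quick script. The 400W draw and the 10ms/50ms step-time split are the illustrative figures from the answer, not measured values:

```python
def cluster_scaling_check(n_gpus: int, watts_per_gpu: float,
                          allreduce_ms: float, compute_ms: float) -> dict:
    """Back-of-envelope ceilings for a synchronous training cluster."""
    power_mw = n_gpus * watts_per_gpu / 1e6                # total GPU power, MW
    comm_tax = allreduce_ms / (allreduce_ms + compute_ms)  # step time lost to comms
    return {"power_mw": power_mw, "comm_tax": comm_tax}

# Figures from the answer: 10K GPUs at 400W, ~10ms AllReduce vs ~50ms compute
est = cluster_scaling_check(10_000, 400, allreduce_ms=10, compute_ms=50)
print(f"{est['power_mw']:.1f} MW, {est['comm_tax']:.0%} communication tax")
# → 4.0 MW, 17% communication tax
```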
🟠 L4   The Half-Baked Speedup
You converted most of your LLM to BF16 but only see 1.4x speedup instead of the expected 2x. What is happening?
Training involves a mix of compute-bound and memory-bound operations — only some benefit from BF16. Large GEMMs (attention, FFN) see ~2x speedup. But optimizer steps (Adam maintains FP32 master weights), normalization layers, and loss computation remain in FP32. By Amdahl's law, 70% of time at 2x and 30% at 1x gives 1 / (0.7/2 + 0.3/1) ≈ 1.54x, and the memory-bound BF16 ops see less than the full 2x, pulling the observed speedup down to ~1.4x.

```text
Forward GEMMs:      40% of time → BF16 → ~2x speedup
Backward GEMMs:     30% of time → BF16 → ~2x speedup
Optimizer (Adam):   15% of time → FP32 → 1x
Other (norm, loss): 15% of time → FP32 → 1x
Amdahl: 1 / (0.7/2 + 0.3/1) ≈ 1.54x
Memory-bound ops don't see the full 2x → ~1.4x observed
```
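A minimal sketch of the Amdahl's-law arithmetic. The 70/30 time split is from the answer; the ~1.5x effective speedup for the memory-bound half of BF16 time is an assumed figure chosen to show how the observed number lands near 1.4x:

```python
def amdahl_speedup(fractions_and_speedups):
    """Overall speedup when each fraction of runtime gets its own local speedup."""
    return 1.0 / sum(frac / s for frac, s in fractions_and_speedups)

# Ideal: 70% of time in BF16 GEMMs at 2x, 30% stuck at FP32 (1x)
ideal = amdahl_speedup([(0.7, 2.0), (0.3, 1.0)])  # ≈ 1.54x, not 2x

# Assumed: half the BF16 time is memory-bound and only reaches ~1.5x
realistic = amdahl_speedup([(0.35, 2.0), (0.35, 1.5), (0.3, 1.0)])  # ≈ 1.41x
```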
🟣 L6+   The Exploding Data Lake Bill
Your data lake on S3 has grown to 500 PB. Design a tiering strategy to cut the monthly storage bill by 60%+.
Intelligent data tiering based on access frequency. Classify data, apply lifecycle policies: hot data (30%) stays in S3 Standard, cold data (70%) moves to Glacier Deep Archive, with S3 Standard-IA as an optional warm tier for data in between.

```text
S3 Standard: $0.023/GB/month
S3 Glacier Deep Archive: $0.00099/GB/month
Current: $0.023 × 500 PB = $11.5M/month
Optimized: 150 PB × $0.023 + 350 PB × $0.00099 ≈ $3.8M/month
Savings: ~67%
```
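The bill math in script form. Prices are the per-GB-month figures quoted in the answer; retrieval and transition fees are not modeled:

```python
def monthly_bill(tiers):
    """Monthly storage cost for a list of (petabytes, dollars_per_gb_month) tiers."""
    GB_PER_PB = 1_000_000  # decimal units, matching S3 billing
    return sum(pb * GB_PER_PB * price for pb, price in tiers)

current = monthly_bill([(500, 0.023)])                     # all S3 Standard
optimized = monthly_bill([(150, 0.023), (350, 0.00099)])   # hot + Deep Archive
savings = 1 - optimized / current
print(f"${current/1e6:.1f}M → ${optimized/1e6:.1f}M ({savings:.0%} savings)")
# → $11.5M → $3.8M (67% savings)
```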
### πŸ€– Edge
πŸ”΅ L1   The Fleet's Cellular Bill
You have 1M autonomous vehicles. Compare the daily data cost of centralized retraining (10 MB upload/vehicle) vs. federated learning (50 MB gradient upload, 10% participation).
Centralized: 1M × 10 MB = 10 TB/day. Federated (10% participate): 100K × 50 MB = 5 TB/day. At $2/GB cellular cost: centralized = $20,000/day, federated = $10,000/day. Annual savings: $3.65M. But federated also avoids the regulatory risk of centralizing raw sensor data.

```text
Centralized: 1M × 10 MB = 10 TB/day × $2/GB = $20,000/day
Federated: 100K × 50 MB = 5 TB/day × $2/GB = $10,000/day
Annual savings: $3.65M + regulatory risk reduction
```
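The same comparison as a small function. The $2/GB rate and the 10% participation figure are the answer's assumptions:

```python
def daily_upload_cost(vehicles: int, mb_per_vehicle: float,
                      participation: float, dollars_per_gb: float = 2.0) -> float:
    """Daily cellular cost of uploading data from a participating fleet fraction."""
    gb_per_day = vehicles * participation * mb_per_vehicle / 1000  # MB → GB
    return gb_per_day * dollars_per_gb

centralized = daily_upload_cost(1_000_000, 10, participation=1.0)   # $20,000/day
federated = daily_upload_cost(1_000_000, 50, participation=0.10)    # $10,000/day
annual_savings = (centralized - federated) * 365                    # $3.65M/year
```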
🟠 L4   The Phantom Sensor Attack
Your autonomous vehicle uses GPS, IMU, and wheel encoder. An attacker spoofs GPS signals. How do you detect and mitigate this?
Multi-layered defense using sensor fusion consistency checks. The IMU and wheel encoder provide *relative* motion — if GPS reports a 50m jump in 1 second while the IMU shows 0.5m movement, the innovation (residual) is 49.5m, far exceeding normal GPS noise (~3m). The state estimator (EKF/UKF) should reject GPS measurements with innovations exceeding a threshold, fall back to dead reckoning, and alert the operator.

```text
IMU drift: ~1m/minute without GPS correction
GPS accuracy: 1-3m
Spoof detection threshold: innovation > 5× expected noise = 15m
At 50m jump vs 0.5m IMU: innovation = 49.5m → reject with 99.99% confidence
```
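A minimal innovation gate using the answer's numbers. A production estimator would gate against the EKF's innovation covariance rather than a fixed noise figure, so treat this as a sketch:

```python
def accept_gps(gps_delta_m: float, imu_delta_m: float,
               gps_noise_m: float = 3.0, k: float = 5.0) -> bool:
    """Reject a GPS fix whose position jump disagrees with dead-reckoned motion."""
    innovation = abs(gps_delta_m - imu_delta_m)  # residual vs relative sensors
    return innovation <= k * gps_noise_m         # gate at k × expected noise = 15m

assert accept_gps(gps_delta_m=2.5, imu_delta_m=0.5)       # normal drive: accept
assert not accept_gps(gps_delta_m=50.0, imu_delta_m=0.5)  # spoofed 50m jump: reject
```

On rejection, the estimator would propagate on IMU + wheel-encoder dead reckoning and raise an alert.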
### πŸ“± Mobile
🟒 L2   Background Inference Limits
You want to run an LLM to summarize audio while your iOS app is in the background. What is the primary risk?
The iOS Watchdog Timer. iOS aggressively monitors background apps for memory and CPU usage. Background execution limits: ~30 seconds for most tasks, ~3 minutes for audio processing. A 7B LLM at INT4 = 3.5 GB weights + 0.5 GB KV-cache = 4 GB. iPhone 16 Pro has 8 GB total, ~5 GB available. In foreground: fits. In background: iOS reclaims memory aggressively, and sustained 3W inference drains 20% battery per hour.

```text
On-device LLM: ~3W sustained on A17 Pro
Battery: 4,000 mAh × 3.7V = 14.8 Wh
Drain at 3W: 20% per hour
iOS background limit: ~30 seconds → 0.025 Wh per cycle
```
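The battery arithmetic, scriptable for any power draw. The 4,000 mAh / 3.7 V pack is the answer's assumed battery:

```python
def battery_drain_per_hour(power_w: float, battery_mah: float = 4000,
                           voltage_v: float = 3.7) -> float:
    """Fraction of battery consumed per hour at a sustained power draw."""
    capacity_wh = battery_mah / 1000 * voltage_v  # 4,000 mAh × 3.7 V = 14.8 Wh
    return power_w / capacity_wh

drain = battery_drain_per_hour(3.0)
print(f"{drain:.0%} per hour")  # → 20% per hour at 3 W sustained
```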
🟑 L3   The Trivial Model Paradox
A single 100-neuron dense layer runs faster on CPU than NPU. Why?
NPUs have significant startup and data transfer overheads that overshadow benefits for tiny models. Driver initialization (~100μs), data transfer to NPU memory (~20μs), and NPU compute (~5μs) total ~125μs. The CPU does the same computation in ~50μs with no transfer overhead. The crossover point is typically around 10K parameters — below that, CPU wins.

```text
CPU: 50μs compute
NPU: 100μs startup + 20μs transfer + 5μs compute = 125μs
NPU is 2.5× slower for trivial models
Crossover: ~10K parameters
```
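A toy latency model of the crossover. The per-kiloparameter rates here are hypothetical constants, tuned only so the break-even lands near the ~10K-parameter figure in the answer; real crossovers depend on the SoC and runtime:

```python
def faster_on_npu(params: int,
                  npu_overhead_us: float = 120.0,    # startup + transfer (fixed)
                  cpu_us_per_kparam: float = 12.5,   # hypothetical CPU rate
                  npu_us_per_kparam: float = 0.5) -> bool:  # hypothetical NPU rate
    """Fixed NPU offload cost vs per-parameter compute: who finishes first?"""
    cpu_us = params / 1000 * cpu_us_per_kparam
    npu_us = npu_overhead_us + params / 1000 * npu_us_per_kparam
    return npu_us < cpu_us

assert not faster_on_npu(1_000)  # tiny model: fixed overhead dominates, CPU wins
assert faster_on_npu(100_000)    # large model: NPU throughput wins
```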
### πŸ”¬ TinyML
🟒 L2   Microcontroller Arithmetic Intensity
Calculate the Ridge Point for a Cortex-M4 microcontroller. Is it compute-bound or memory-bound?
Ridge Point = Peak Compute / Peak Memory Bandwidth. Cortex-M4 at 168 MHz: ~168 MFLOPS (1 FP op/cycle). Memory: 32-bit bus at 168 MHz = 672 MB/s. Ridge Point = 0.168 GFLOPS / 0.672 GB/s = 0.25 FLOPS/byte. Most neural network layers have arithmetic intensity of 10-100 — far above the ridge point. MCUs are almost always **compute-bound**, the opposite of GPUs.

```text
Cortex-M4: 168 MFLOPS / 672 MB/s = 0.25 FLOPS/byte
Conv2D AI: ~50 FLOPS/byte → compute-bound
GPU (H100): 989 TFLOPS / 3.35 TB/s = 295 FLOPS/byte → memory-bound
MCUs are the mirror image of GPUs on the roofline
```
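The roofline comparison as a two-liner, using the peak figures from the answer:

```python
def ridge_point(peak_flops: float, peak_bytes_per_s: float) -> float:
    """Roofline ridge point (FLOPs/byte): kernels below it are memory-bound,
    kernels above it are compute-bound."""
    return peak_flops / peak_bytes_per_s

m4 = ridge_point(168e6, 672e6)       # Cortex-M4: 0.25 FLOPs/byte
h100 = ridge_point(989e12, 3.35e12)  # H100: ~295 FLOPs/byte

conv_ai = 50  # typical Conv2D arithmetic intensity from the answer
assert conv_ai > m4    # on the MCU, the kernel is compute-bound
assert conv_ai < h100  # on the GPU, the same kernel is memory-bound
```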
🟣 L6+   The Ghost in the Dashboard
100,000 vehicles with Cortex-M4 voice assistants. After a year, humid-climate devices activate randomly. Design an OTA fix within 20% free Flash/SRAM.
Three components within the resource budget: (1) Lightweight drift detector — running mean/variance on audio energy, 12 bytes SRAM. (2) Circuit breaker — if drift exceeds threshold, suppress activations and log diagnostics. (3) Diagnostic reporter — 16-bin histogram per event, store 100 records in Flash. Total: ~10KB Flash (5% of budget), <1KB SRAM.

```text
Free Flash: 20% of 1MB = 205KB
Free SRAM: 20% of 256KB = 51KB
Drift detector: 12B SRAM, 2KB Flash
Circuit breaker: 32B Flash
Diagnostics: 76B × 100 records = 7.6KB Flash
Total: ~10KB Flash, <1KB SRAM — well within budget
```
### 🌐 Global
🟒 L2   InfiniBand vs Ethernet for Training
Why do large-scale LLM training clusters prefer InfiniBand over Ethernet?
Three properties beyond raw bandwidth: (1) RDMA — GPU memory is read/written directly over the network, bypassing the CPU and OS kernel. Latency drops from ~50μs (TCP/IP) to ~1-2μs. (2) Lossless fabric — credit-based flow control guarantees zero packet loss, critical for AllReduce correctness. (3) Adaptive routing — hardware-level load balancing across multiple paths reduces congestion.

```text
AllReduce for 1GB gradient buffer:
InfiniBand RDMA: 1GB/(50 GB/s) + 2μs × log₂(1024) = ~20ms
Ethernet TCP: 1GB/(50 GB/s) + 50μs × log₂(1024) + retransmit risk = 30-150ms
InfiniBand: 2-7× lower tail latency
```
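A simple alpha-beta (latency + bandwidth) estimate of the AllReduce times above. Note this model deliberately omits retransmits and congestion, which are what push lossy Ethernet into the 30-150ms tail the answer cites; it only captures the per-hop latency gap:

```python
import math

def allreduce_ms(buffer_gb: float, bw_gb_per_s: float,
                 hop_latency_us: float, nodes: int) -> float:
    """Alpha-beta model: bandwidth term plus per-hop latency over a tree of nodes."""
    bandwidth_ms = buffer_gb / bw_gb_per_s * 1000        # serialization time
    latency_ms = hop_latency_us * math.log2(nodes) / 1000  # log₂(N) hops
    return bandwidth_ms + latency_ms

ib = allreduce_ms(1, 50, hop_latency_us=2, nodes=1024)    # ≈ 20.0 ms
eth = allreduce_ms(1, 50, hop_latency_us=50, nodes=1024)  # ≈ 20.5 ms, before losses
```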
🟠 L4   Mysterious 15% Throughput Drop at Noon
Your 64-GPU cluster shows 15% lower throughput between 11 AM and 3 PM. GPU utilization stays at 98%. No other jobs running. What is happening?
Thermal throttling. The data center's cooling struggles during peak afternoon heat. When GPU junction temperature exceeds 83°C (A100 throttle point), the GPU reduces clock frequency. Clock drops from 1410 MHz to 1200 MHz = exactly 15% reduction. The GPU reports 98% utilization because it's still busy — just at a lower clock.

```text
GPU clock: 1410 MHz → 1200 MHz = 15% reduction
Night: junction 75°C, 8°C below throttle
Afternoon: ambient +8°C → junction hits 83°C → throttle
Fix: lower power limit from 400W to 350W → 5°C drop → no throttle
Net result: +3% vs current daytime (lose 12.5% power but gain back 15% clock)
```
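Throughput tracks clock roughly linearly for compute-bound kernels, so the 15% drop falls straight out of the clock ratio; a one-line check on the answer's numbers:

```python
def throughput_factor(clock_mhz: float, base_mhz: float = 1410.0) -> float:
    """Relative throughput of a compute-bound kernel at a throttled SM clock."""
    return clock_mhz / base_mhz

day = throughput_factor(1200)  # throttled afternoon clock
drop = 1 - day                 # ≈ 0.15 → the observed 15% throughput loss
```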
🟣 L6+   The Agentic Memory Architecture
Design a memory system for a coding agent that maintains context across a multi-hour session with 500K tokens of history.
Three-tier memory: (1) Working memory (<8K tokens) — current file, last 2-3 tool results, current plan. Managed programmatically, not by the LLM. (2) Episodic memory (vector DB) — summarized past interactions, indexed by embedding. Retrieved via semantic search when relevant. (3) Persistent memory (key-value store) — facts, decisions, file states. Never evicted, always available.

```text
Raw: 500K tokens × $0.003/1K = $1.50/turn (and wouldn't fit in context)
Tiered: 8K tokens/turn × $0.003/1K = $0.024/turn
Compression ratio: 62.5×
Cost reduction: 98.4%
```
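The cost comparison in script form. The $0.003/1K-token price is the answer's illustrative rate, not any particular provider's:

```python
def per_turn_cost(tokens: int, dollars_per_1k_tokens: float = 0.003) -> float:
    """Input-token cost of one agent turn at a given context size."""
    return tokens / 1000 * dollars_per_1k_tokens

raw = per_turn_cost(500_000)   # $1.50/turn if the full history were sent
tiered = per_turn_cost(8_000)  # $0.024/turn with the three-tier memory
compression = 500_000 / 8_000  # 62.5× fewer tokens per turn
reduction = 1 - tiered / raw   # 98.4% cost reduction
```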
---

These are 12 of 9,000+ questions.
Explore the full vault β†’

---

## Development

```bash
# Run the StaffML app locally
cd interviews/staffml
npm install
npm run dev   # → http://localhost:3000

# Regenerate vault manifest after corpus updates
python3 scripts/generate-manifest.py
```

**CI/CD:** Pushes to `dev` auto-build and deploy via [GitHub Actions](https://github.com/harvard-edge/cs249r_book/actions/workflows/staffml-preview-dev.yml).

---

## Contributors

Thanks to these wonderful people who have helped build StaffML!

**Legend:** 🪲 Bug Hunter · ⚡ Code Warrior · 📚 Documentation Hero · 🎨 Design Artist · 🧠 Idea Generator · 🔎 Code Reviewer · 🧪 Test Engineer · 🛠️ Tool Builder
- **Vijay Janapa Reddi** · 🎨 ✍️ 🧠
- **Rocky** · 🪲 🧑‍💻
- **Farhan Asghar** · 🧑‍💻
**Recognize a contributor:** Comment on any issue or PR:

```text
@all-contributors please add @username for code, doc, ideas, or design
```

---

Wishing you all the best in your interviews and your engineering journey.
β€” Vijay Janapa Reddi