| Feature | Description |
|---|---|
| Vault | Browse questions by area, topic, and difficulty |
| Practice | Drill with spaced repetition and daily challenges |
| Gauntlet | Timed mock interview sessions with self-assessment |
| Progress | Track coverage across competency areas and tracks |
| Chains | Deepening sequences from L1 Recall to L6+ Architect |

| Track | Focus | Primary constraint |
|---|---|---|
| ☁️ Cloud | Data center training & serving | Memory bandwidth / network |
| 🤖 Edge | Autonomous vehicles, robotics | Thermal envelope / real-time |
| 📱 Mobile | On-device AI for smartphones | Battery life / shared resources |
| 🔬 TinyML | Microcontrollers & ultra-low-power | SRAM capacity / hard real-time |

| Level | Name | Scope | What the interviewer hears |
|---|---|---|---|
| 🔵 L1 | Recall | Own a task | "HBM is 300x slower than L1 cache." |
| 🟢 L2 | Understand | Own a task | "The Roofline model relates compute to memory bandwidth." |
| 🟡 L3 | Apply | Own a component | "This workload is memory-bound because its arithmetic intensity is below the ridge point." |
| 🟠 L4 | Analyze | Own a system | "Switching from A100 to H100 won't help because the ridge point shifts." |
| 🔴 L5 | Evaluate | Own the architecture | "Let me derive the optimal parallelism from the NVLink topology." |
| 🟣 L6+ | Architect | Own the org | "Here's a fault-tolerant training architecture for 1T params across 3 data centers." |

| Step | Level | Question |
|---|---|---|
| 1 | 🔵 L1 | The HBM vs L1 Latency Gap |
| 2 | 🟢 L2 | The FP16 Model Footprint |
| 3 | 🟡 L3 | KV Cache Memory for 7B Model Serving |
| 4 | 🟠 L4 | OOM at Step 500 but Not Step 1 |
| 5 | 🔴 L5 | CPU Offloading vs Activation Recomputation |
| 6 | 🟣 L6+ | Memory Budget for High-Concurrency LLM Serving |

| Metric | Count |
|---|---|
| Questions | 9,000+ |
| Chains | 1,000+ |
| Taxonomy concepts | 650+ |
| Competency areas | 12 |
| Deployment tracks | 4 + Global |
| Mastery levels | L1–L6+ |

**Explain why you cannot simply double the number of GPUs indefinitely to halve training time, and identify the three physical ceilings that bound cluster scaling.**

Three physical ceilings prevent infinite scaling. (1) **Communication bottleneck**: synchronous training requires an AllReduce to average gradients across all GPUs every step, and with N GPUs the AllReduce latency grows as O(log N); at 10,000+ GPUs, communication time can rival or exceed computation time. (2) **Power and cooling**: each GPU draws 300–700W, so a 10K-GPU cluster requires 4+ MW for the GPUs alone. (3) **Critical batch size**: doubling GPUs typically means doubling the global batch, but beyond the critical batch size the gradient estimate is already low-noise, so extra samples yield diminishing returns. For GPT-3, this is ~3.2M tokens.

```text
10K GPUs × 400W = 4 MW
AllReduce at 10K GPUs: ~10ms overhead vs ~50ms compute = 17% communication tax
Critical batch size for GPT-3: ~3.2M tokens
```
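
A minimal sketch of this ceiling, assuming a fixed 50 ms compute step and a per-hop AllReduce latency that accumulates as log₂(N); the 0.75 ms/hop constant is chosen to reproduce the illustrative ~10 ms overhead above, not a measured value:

```python
import math

def step_time_ms(n_gpus: int, compute_ms: float = 50.0, hop_ms: float = 0.75) -> float:
    """Per-step wall time: fixed compute plus an AllReduce term that grows as log2(N)."""
    comm_ms = hop_ms * math.log2(n_gpus) if n_gpus > 1 else 0.0
    return compute_ms + comm_ms

for n in (8, 64, 1024, 10_000):
    t = step_time_ms(n)
    print(f"{n:>6} GPUs: {t:5.1f} ms/step, communication tax {1 - 50.0 / t:5.1%}")
```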

**You converted most of your LLM to BF16 but only see 1.4x speedup instead of the expected 2x. What is happening?**

Training mixes compute-bound and memory-bound operations, and only some benefit from BF16. Large GEMMs (attention, FFN) see ~2x speedup, but optimizer steps (Adam maintains FP32 master weights), normalization layers, and loss computation remain in FP32. By Amdahl's law, accelerating the 70% of step time spent in GEMMs by 2x caps the overall speedup at 1/(0.30 + 0.70/2) ≈ 1.54x, and because the memory-bound portion of those ops sees less than the full 2x, the observed speedup lands around 1.4x.

```text
Forward GEMMs:      40% of time → BF16 → ~2x
Backward GEMMs:     30% of time → BF16 → ~2x
Optimizer (Adam):   15% of time → FP32 → 1x
Other (norm, loss): 15% of time → FP32 → 1x
Amdahl ceiling: 1 / (0.30 + 0.70/2) ≈ 1.54x; memory-bound ops see <2x → ~1.4x observed
```
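
The same arithmetic via Amdahl's law, as a minimal sketch; the 1.7x effective GEMM speedup in the second call is an assumed value that reproduces the observed ~1.4x:

```python
def mixed_precision_speedup(frac_accelerated: float, op_speedup: float) -> float:
    """Amdahl's law: overall speedup when only a fraction of runtime is accelerated."""
    return 1.0 / ((1.0 - frac_accelerated) + frac_accelerated / op_speedup)

print(mixed_precision_speedup(0.70, 2.0))  # ~1.54x: the ideal ceiling
print(mixed_precision_speedup(0.70, 1.7))  # ~1.41x: memory-bound GEMMs average <2x
```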

**Your data lake on S3 has grown to 500 PB. Design a tiering strategy to cut the monthly storage bill by 60%+.**

Apply intelligent tiering based on access frequency: classify data by last access and attach lifecycle policies. Hot data (~30%) stays in S3 Standard, warm data moves to S3 Standard-IA, and cold data (~70%) goes to Glacier Deep Archive. Even the two-tier arithmetic below (ignoring the warm tier and retrieval fees) clears the 60% target.

```text
S3 Standard:             $0.023/GB/month
S3 Glacier Deep Archive: $0.00099/GB/month
Current:   $0.023 × 500 PB = $11.5M/month
Optimized: 150 PB × $0.023 + 350 PB × $0.00099 ≈ $3.8M/month
Savings:   ~67%
```
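
A sketch of the cost model using the list prices quoted above (retrieval and lifecycle-transition fees are deliberately ignored):

```python
# Illustrative per-GB monthly prices from the worked example above.
PRICE_PER_GB = {"standard": 0.023, "deep_archive": 0.00099}
GB_PER_PB = 1_000_000  # decimal units, matching the $11.5M figure

def monthly_cost(tiers_pb: dict[str, float]) -> float:
    """Total monthly storage cost given PB stored per tier."""
    return sum(pb * GB_PER_PB * PRICE_PER_GB[tier] for tier, pb in tiers_pb.items())

before = monthly_cost({"standard": 500})
after = monthly_cost({"standard": 150, "deep_archive": 350})
print(f"${before / 1e6:.1f}M → ${after / 1e6:.1f}M ({1 - after / before:.0%} savings)")
```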

**You have 1M autonomous vehicles. Compare the daily data cost of centralized retraining (10 MB upload/vehicle) vs. federated learning (50 MB gradient upload, 10% participation).**

Centralized: 1M × 10 MB = 10 TB/day. Federated (10% participation): 100K × 50 MB = 5 TB/day. At $2/GB cellular cost: centralized = $20,000/day, federated = $10,000/day. Annual savings: $3.65M. Federated learning also avoids the regulatory risk of centralizing raw sensor data.

```text
Centralized: 1M × 10 MB = 10 TB/day × $2/GB = $20,000/day
Federated:   100K × 50 MB = 5 TB/day × $2/GB = $10,000/day
Annual savings: $3.65M + regulatory risk reduction
```
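
The same comparison as a small helper, assuming decimal MB→GB conversion and the flat $2/GB cellular rate above:

```python
def daily_upload_cost(fleet: int, mb_per_vehicle: float, participation: float = 1.0,
                      usd_per_gb: float = 2.0) -> float:
    """Daily upload cost in USD for a fleet-wide data-collection scheme."""
    gb_per_day = fleet * participation * mb_per_vehicle / 1000
    return gb_per_day * usd_per_gb

centralized = daily_upload_cost(1_000_000, 10)                    # $20,000/day
federated = daily_upload_cost(1_000_000, 50, participation=0.10)  # $10,000/day
print(f"Annual savings: ${(centralized - federated) * 365 / 1e6:.2f}M")  # $3.65M
```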

**Your autonomous vehicle uses GPS, an IMU, and wheel encoders. An attacker spoofs GPS signals. How do you detect and mitigate this?**

Mount a multi-layered defense built on sensor-fusion consistency checks. The IMU and wheel encoders provide *relative* motion: if GPS reports a 50m jump in 1 second while the IMU shows 0.5m of movement, the innovation (residual) is 49.5m, far exceeding normal GPS noise (~3m). The state estimator (EKF/UKF) should reject GPS measurements whose innovations exceed a threshold, fall back to dead reckoning, and alert the operator.

```text
IMU drift: ~1m/minute without GPS correction
GPS accuracy: 1–3m
Spoof detection threshold: innovation > 5× expected noise = 15m
A 50m jump vs 0.5m IMU motion: innovation = 49.5m → reject with 99.99% confidence
```
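
A minimal sketch of the innovation gate, assuming a 2D position and the 3m / 5-sigma numbers above; a production EKF would gate on Mahalanobis distance against the full innovation covariance rather than raw Euclidean distance:

```python
import math

def gate_gps_fix(dead_reckoned_xy, gps_xy, sigma_gps_m: float = 3.0, k: float = 5.0):
    """Accept a GPS fix only if its innovation against the IMU/wheel-odometry
    dead-reckoned position is within k standard deviations of GPS noise."""
    innovation_m = math.dist(dead_reckoned_xy, gps_xy)
    return innovation_m <= k * sigma_gps_m, innovation_m

ok, innovation = gate_gps_fix((0.0, 0.5), (0.0, 50.0))
print(ok, innovation)  # False, 49.5 → reject the fix, dead-reckon, alert the operator
```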

**You want to run an LLM to summarize audio while your iOS app is in the background. What is the primary risk?**

The iOS watchdog. iOS aggressively monitors background apps for memory and CPU usage, limiting background execution to roughly 30 seconds for most tasks and about 3 minutes for audio processing. A 7B LLM at INT4 needs 3.5 GB of weights + 0.5 GB of KV cache = 4 GB. An iPhone 16 Pro has 8 GB total, ~5 GB available. In the foreground it fits; in the background, iOS reclaims memory aggressively, and sustained 3W inference drains 20% of the battery per hour.

```text
On-device LLM: ~3W sustained on A17 Pro
Battery: 4,000 mAh × 3.7V = 14.8 Wh
Drain at 3W: 20% per hour
iOS background limit: ~30 seconds → 0.025 Wh per cycle
```
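
A back-of-envelope feasibility check; the defaults are the illustrative iPhone figures from the answer above, not measured device specs:

```python
def background_llm_check(weights_gb: float = 3.5, kv_cache_gb: float = 0.5,
                         available_gb: float = 5.0, power_w: float = 3.0,
                         battery_wh: float = 14.8) -> dict:
    """Does the model fit in available RAM, and how fast does sustained
    inference drain the battery?"""
    return {
        "fits_in_memory": weights_gb + kv_cache_gb <= available_gb,
        "battery_drain_pct_per_hour": round(100 * power_w / battery_wh, 1),
    }

print(background_llm_check())
# {'fits_in_memory': True, 'battery_drain_pct_per_hour': 20.3}
# It fits in the foreground; background memory reclamation is the real killer.
```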

**A single 100-neuron dense layer runs faster on CPU than NPU. Why?**

NPUs carry fixed startup and data-transfer overheads that dwarf the benefit for tiny models. Driver initialization (~100μs), data transfer to NPU memory (~20μs), and NPU compute (~5μs) total ~125μs, while the CPU does the same computation in ~50μs with no transfer overhead. The crossover point is typically around 10K parameters; below that, the CPU wins.

```text
CPU: 50μs compute
NPU: 100μs startup + 20μs transfer + 5μs compute = 125μs
NPU is 2.5× slower for trivial models
Crossover: ~10K parameters
```
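
A toy latency model with assumed per-kiloparameter rates (5μs/kparam CPU, 0.5μs/kparam NPU, 120μs fixed NPU overhead); with these constants the crossover lands in the tens of kiloparameters, the same order of magnitude as the rule of thumb above:

```python
def faster_device(n_params: int, cpu_us_per_kp: float = 5.0,
                  npu_overhead_us: float = 120.0, npu_us_per_kp: float = 0.5) -> str:
    """Fixed NPU overhead vs. faster NPU compute: which device wins at this size?"""
    cpu_us = cpu_us_per_kp * n_params / 1000
    npu_us = npu_overhead_us + npu_us_per_kp * n_params / 1000
    return "NPU" if npu_us < cpu_us else "CPU"

for n in (10_000, 30_000, 1_000_000):
    print(f"{n:>9} params → {faster_device(n)}")  # CPU, NPU, NPU
```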

**Calculate the ridge point for a Cortex-M4 microcontroller. Is it compute-bound or memory-bound?**

Ridge point = peak compute / peak memory bandwidth. A Cortex-M4 at 168 MHz delivers ~168 MFLOPS (1 FP op/cycle), and its 32-bit bus at 168 MHz moves 672 MB/s. Ridge point = 0.168 GFLOPS / 0.672 GB/s = 0.25 FLOPS/byte. Most neural-network layers have an arithmetic intensity of 10–100 FLOPS/byte, far above the ridge point, so MCUs are almost always **compute-bound**, the opposite of GPUs.

```text
Cortex-M4:  168 MFLOPS / 672 MB/s = 0.25 FLOPS/byte → Conv2D at ~50 FLOPS/byte is compute-bound
GPU (H100): 989 TFLOPS / 3.35 TB/s = 295 FLOPS/byte → the same workload is memory-bound
MCUs are the mirror image of GPUs on the roofline
```
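
The ridge-point check as a helper, using the numbers from the worked example above:

```python
def roofline(peak_flops: float, peak_bw_bytes_s: float, arithmetic_intensity: float):
    """Classify a workload against a device's ridge point and cap its attainable FLOPS."""
    ridge = peak_flops / peak_bw_bytes_s
    bound = "compute-bound" if arithmetic_intensity > ridge else "memory-bound"
    attainable = min(peak_flops, arithmetic_intensity * peak_bw_bytes_s)
    return ridge, bound, attainable

print(roofline(168e6, 672e6, 50))     # ridge 0.25 → compute-bound, capped at 168 MFLOPS
print(roofline(989e12, 3.35e12, 50))  # ridge ~295 → memory-bound, capped at ~167 TFLOPS
```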

**100,000 vehicles with Cortex-M4 voice assistants. After a year, humid-climate devices activate randomly. Design an OTA fix within 20% free Flash/SRAM.**

Ship three components within the resource budget, as sketched below: (1) a lightweight drift detector, a running mean/variance on audio energy in 12 bytes of SRAM; (2) a circuit breaker that suppresses activations and logs diagnostics when drift exceeds a threshold; (3) a diagnostic reporter that stores a 16-bin histogram per event, 100 records in Flash. Total: ~10KB Flash (5% of the free budget) and <1KB SRAM.

```text
Free Flash: 20% of 1MB = 205KB
Free SRAM:  20% of 256KB = 51KB
Drift detector:  12B SRAM, 2KB Flash
Circuit breaker: 32B Flash
Diagnostics:     76B × 100 records = 7.6KB Flash
Total: ~10KB Flash, <1KB SRAM → well within budget
```
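
A sketch of component (1), the drift detector, using Welford's online mean/variance; the on-device version would be fixed-point C on the M4, and the 5-sigma threshold and 100-frame warm-up are assumptions:

```python
class DriftDetector:
    """Running mean/variance of audio energy in O(1) memory (Welford's algorithm).
    The three accumulators are the ~12 bytes of SRAM state cited above."""

    def __init__(self, threshold_sigmas: float = 5.0, warmup_frames: int = 100):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold_sigmas
        self.warmup = warmup_frames

    def update(self, energy: float) -> bool:
        """Fold in one frame's energy; return True when it looks like drift,
        which is the cue for the circuit breaker to suppress activations."""
        self.n += 1
        delta = energy - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (energy - self.mean)
        if self.n < self.warmup:
            return False
        std = (self.m2 / (self.n - 1)) ** 0.5
        return abs(energy - self.mean) > self.threshold * max(std, 1e-9)
```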

**Why do large-scale LLM training clusters prefer InfiniBand over Ethernet?**

Three properties beyond raw bandwidth: (1) **RDMA**: GPU memory is read and written directly over the network, bypassing the CPU and OS kernel, so latency drops from ~50μs (TCP/IP) to ~1–2μs. (2) **Lossless fabric**: credit-based flow control guarantees zero packet loss, which matters because a single dropped packet in an AllReduce stalls every participating rank. (3) **Adaptive routing**: hardware-level load balancing across multiple paths reduces congestion.

```text
AllReduce for a 1GB gradient buffer:
InfiniBand RDMA: 1GB/(50 GB/s) + 2μs × log₂(1024) ≈ 20ms
Ethernet TCP:    1GB/(50 GB/s) + 50μs × log₂(1024) + retransmit risk = 30–150ms
InfiniBand: 2–7× lower tail latency
```
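
A toy cost model behind those numbers: a bandwidth term plus per-hop latency × log₂(N), ignoring retransmits, which are exactly what blows up Ethernet's tail:

```python
import math

def allreduce_ms(buffer_gb: float, bw_gb_s: float, hop_latency_us: float,
                 n_ranks: int) -> float:
    """Idealized AllReduce: serialize the buffer once, pay per-hop latency log2(N) times."""
    bandwidth_ms = buffer_gb / bw_gb_s * 1000
    latency_ms = hop_latency_us * math.log2(n_ranks) / 1000
    return bandwidth_ms + latency_ms

print(allreduce_ms(1, 50, 2, 1024))   # InfiniBand RDMA: ~20.02 ms
print(allreduce_ms(1, 50, 50, 1024))  # Ethernet TCP floor: ~20.5 ms, before any packet loss
```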

**Your 64-GPU cluster shows 15% lower throughput between 11 AM and 3 PM. GPU utilization stays at 98%. No other jobs are running. What is happening?**

Thermal throttling. The data center's cooling struggles during peak afternoon heat, and when the GPU junction temperature exceeds 83°C (the A100 throttle point), the GPU reduces its clock frequency. The drop from 1410 MHz to 1200 MHz is exactly a 15% reduction. The GPU still reports 98% utilization because it is just as busy, only at a lower clock.

```text
GPU clock: 1410 MHz → 1200 MHz = 15% reduction
Night: junction at 75°C, 8°C below the throttle point
Afternoon: ambient +8°C → junction hits 83°C → throttle
Fix: lower the power limit from 400W to 350W → 5°C drop → no throttling
Net: +3% vs. current daytime (lose 12.5% power but regain the 15% clock)
```
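
A first-order sanity check, assuming a compute-bound job's throughput scales linearly with SM clock:

```python
def throttled_throughput_fraction(clock_mhz: float, base_clock_mhz: float = 1410.0) -> float:
    """Fraction of baseline throughput at a reduced clock (compute-bound assumption)."""
    return clock_mhz / base_clock_mhz

print(f"{1 - throttled_throughput_fraction(1200.0):.1%} loss")  # 14.9% ≈ the observed 15%
```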

**Design a memory system for a coding agent that maintains context across a multi-hour session with 500K tokens of history.**

Use a three-tier memory, sketched below: (1) **Working memory** (<8K tokens): the current file, the last 2-3 tool results, and the current plan, managed programmatically rather than by the LLM. (2) **Episodic memory** (vector DB): summarized past interactions, indexed by embedding and retrieved via semantic search when relevant. (3) **Persistent memory** (key-value store): facts, decisions, and file states, never evicted and always available.

```text
Raw:    500K tokens × $0.003/1K = $1.50/turn (and wouldn't fit in context anyway)
Tiered: 8K tokens/turn × $0.003/1K = $0.024/turn
Compression ratio: 62.5×
Cost reduction: 98.4%
```
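
A skeleton of the three tiers; the episodic tier is stubbed with substring matching where a real build would use embedding search, and the chars/4 token estimate is a rough assumption:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    working: list[str] = field(default_factory=list)          # current file, plan, tool results
    episodic: list[str] = field(default_factory=list)         # summaries of past interactions
    persistent: dict[str, str] = field(default_factory=dict)  # facts, decisions, file states

    def build_context(self, query: str, budget_tokens: int = 8000) -> str:
        """Assemble one turn's context: persistent facts, relevant episodes,
        then working memory, filling the token budget from the newest items back."""
        recalled = [s for s in self.episodic if query.lower() in s.lower()][:3]
        candidates = list(self.persistent.values()) + recalled + self.working
        kept, used = [], 0
        for item in reversed(candidates):
            cost = len(item) // 4  # rough estimate: ~4 characters per token
            if used + cost > budget_tokens:
                break
            kept.append(item)
            used += cost
        return "\n".join(reversed(kept))
```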

These are 12 of 9,000+ questions.
Explore the full vault →

Vijay Janapa Reddi · Rocky · Farhan Asghar
Wishing you all the best in your interviews and your engineering journey.

– Vijay Janapa Reddi