--- title: "Full-Stack Audit: LLaMA-70B Training" subtitle: "One model, six domains, twelve walls --- a complete systems analysis in 60 seconds." description: "Compose 6+ solvers across all six taxonomy domains to produce a holistic training analysis. Discover that the binding constraint is compute, but checkpoint overhead is the hidden cost." categories: ["capstone", "advanced"] --- ## The Question What does a **complete** systems analysis look like? No single solver captures the full picture. Training a 70B-parameter model on 512 H100 GPUs involves compute walls, memory walls, communication overhead, checkpoint I/O, energy costs, and carbon emissions --- simultaneously. This tutorial traces all six taxonomy domains and exercises 12 of the 22 systems walls through a single workload. ::: {.callout-note} ## Prerequisites Complete [Tutorial 0: Hello, Roofline](00_hello_roofline.qmd), [Tutorial 1: The Memory Wall](01_memory_wall.qmd), [Tutorial 6: Scaling to 1000 GPUs](06_scaling_1000_gpus.qmd), and [Tutorial 9: Sensitivity Analysis](09_sensitivity.qmd). You should understand roofline analysis, distributed training, and binding constraint identification. ::: ::: {.callout-note} ## What You Will Learn - **Compose** six solver families across all taxonomy domains into a holistic analysis - **Identify** which of the 22 systems walls bind for a real training workload - **Quantify** the hidden costs: checkpoint overhead, carbon, water, and TCO - **Produce** a summary table mapping domain -> solver -> binding wall ::: ::: {.callout-tip} ## Solver Quick Reference This capstone uses solvers from all six domains. If you arrived via an accelerated learning path, here is what each solver does: | Solver | Domain | What It Computes | |:-------|:-------|:-----------------| | `SingleNodeModel` | Node | Roofline bottleneck, latency, throughput | | `DataModel` | Data | Whether the data pipeline can sustain GPU demand | | `ScalingModel` | Algorithm | Compute-optimal training budget (Chinchilla) | | `DistributedModel` | Fleet | Communication overhead and scaling efficiency | | `ReliabilityModel` | Fleet | Cluster MTBF and optimal checkpoint intervals | | `EconomicsModel` | Ops | CapEx, OpEx, and total cost of ownership (TCO) | | `SustainabilityModel` | Ops | Energy, carbon footprint, and water usage | | `SensitivitySolver` | Analysis | Partial derivatives identifying the binding constraint | | `SynthesisSolver` | Analysis | Minimum hardware specs from a latency target | ::: ::: {.callout-tip} ## Background: The Six Taxonomy Domains The MLSys wall taxonomy organizes 22 systems walls into six domains: | Domain | Walls | What It Covers | |:-------|:------|:---------------| | Node | 1--3 | Compute, memory capacity, memory bandwidth | | Data | 8--10 | Storage throughput, data pipeline stalls | | Algorithm | 11--13 | Scaling laws, compute-optimal training | | Fleet | 14--16 | Communication, synchronization, reliability | | Ops | 17--20 | TCO, energy, carbon, water, safety | | Analysis | 21--22 | Sensitivity, inverse synthesis | No single solver spans all six. The insight emerges from **composition**. ::: --- ## 1. Setup: Build the Fleet We construct a 512-GPU training cluster: 64 DGX H100 nodes, 8 GPUs per node, NVLink intra-node, InfiniBand NDR inter-node, powered by Quebec's hydroelectric grid. 
```{python}
#| echo: false
#| output: false
import mlsysim  # installed via `pip install mlsysim` (see workflow)
```

```{python}
import mlsysim
from mlsysim.systems.types import Fleet, Node, NetworkFabric
from mlsysim.infra.registry import Grids
from mlsysim.core.constants import Q_, NVLINK_H100_BW, INFINIBAND_NDR_BW

model = mlsysim.Models.Language.Llama3_70B
h100 = mlsysim.Hardware.Cloud.H100

# Build the DGX H100 node: 8 GPUs connected by NVLink 4.0
node = Node(
    name="DGX H100",
    accelerator=h100,
    accelerators_per_node=8,
    intra_node_bw=NVLINK_H100_BW
)

# Build the cluster fabric: InfiniBand NDR (400 Gbps)
fabric = NetworkFabric(
    name="InfiniBand NDR",
    topology="fat-tree",
    bandwidth=INFINIBAND_NDR_BW
)

# Build the fleet: 64 nodes = 512 GPUs, Quebec grid
fleet = Fleet(
    name="Training Cluster",
    node=node,
    count=64,
    fabric=fabric,
    region=Grids.Quebec
)

from mlsysim.show import table, info, banner

info("Fleet Configuration",
     Model=f"{model.name} ({model.parameters.to('Bparam'):.1f~})",
     Fleet=f"{fleet.count} nodes x {node.accelerators_per_node} GPUs = {fleet.total_accelerators} GPUs",
     Intra_node=f"NVLink 4.0 ({NVLINK_H100_BW.to('GB/s'):.0f~})",
     Inter_node=f"IB NDR ({INFINIBAND_NDR_BW.to('Gbps'):.0f~})",
     Region=Grids.Quebec.name)
```

---

## 2. Node (Walls 1--3): Single-GPU Roofline

First, classify the per-GPU forward-backward pass. Is each GPU compute-bound or memory-bound during training?

```{python}
from mlsysim import SingleNodeModel

node_solver = SingleNodeModel()
node_result = node_solver.solve(
    model=model,
    hardware=h100,
    batch_size=4,
    precision="fp16"
)

banner("Domain: Node (Walls 1-3)")
info(Bottleneck=node_result.bottleneck,
     Per_GPU_latency=node_result.latency.to('ms'),
     Throughput=f"{node_result.throughput:.0f} samples/s")
```

Training at batch size 4 per GPU puts us in the compute-bound regime --- unlike inference, training has high arithmetic intensity due to the backward pass. Wall 1 (Compute) is the binding constraint at the node level.

Compute-bound is good news --- it means the GPU is doing useful work, not waiting for data. But can the data pipeline actually keep up with 512 GPUs demanding training samples?

---

## 3. Data (Walls 8--10): Can the Pipeline Keep Up?

The roofline tells us each GPU can consume data at a certain rate. But can the storage and preprocessing pipeline actually deliver data that fast? If not, the GPUs stall --- and "compute-bound" becomes a meaningless label.

```{python}
from mlsysim import DataModel

# Estimate data demand per step: 4 samples/GPU * 512 GPUs * 2048 tokens * 2 bytes ≈ 8 MB/step
# At ~1 step/sec, this is ~8 MB/s — tokenized text is compact
data_demand = Q_("8 MB/s")

data_solver = DataModel()
data_result = data_solver.solve(
    workload_data_rate=data_demand,
    hardware=h100
)

banner("Domain: Data (Walls 8-10)")
info(Data_demand=data_result.demand_bw,
     Data_supply=data_result.supply_bw,
     Utilization=f"{data_result.utilization:.1%}",
     Stalled=data_result.is_stalled,
     Bottleneck=data_result.bottleneck)
```

For text-based training, the data pipeline is rarely the bottleneck --- tokenized text is compact. But for image or video training, this wall can dominate.
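To see why, compare the same global batch expressed as images instead of tokens. This is an illustrative sketch: the 150 KB-per-image figure is an assumption, not a measured value.

```{python}
# Illustrative comparison (assumed numbers, not a solver run): the same global
# batch as tokenized text vs. compressed images.
samples_per_step = 4 * 512                 # 4 samples/GPU x 512 GPUs
text_bytes = samples_per_step * 2048 * 2   # 2048 tokens x 2 bytes/token
image_bytes = samples_per_step * 150e3     # ~150 KB/image (assumption)

print(f"Text batch:  ~{text_bytes / 1e6:.0f} MB/step")
print(f"Image batch: ~{image_bytes / 1e6:.0f} MB/step "
      f"(~{image_bytes / text_bytes:.0f}x more I/O per step)")
```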
The data pipeline can keep up. The GPUs are compute-bound and well-fed. But are we spending our compute budget wisely? A 30-day run on 512 GPUs is an enormous investment --- the scaling laws tell us whether we are allocating it optimally.

---

## 4. Algorithm (Walls 11--13): Compute-Optimal Budget

Is our training budget compute-optimal? The Chinchilla scaling law says D = 20P (tokens = 20x parameters) for optimal allocation.

```{python}
from mlsysim import ScalingModel

# MFU (Model FLOP Utilization): the fraction of peak hardware FLOP/s that goes
# to useful model computation (excluding communication, idle time, overhead).
# MFU = 0.4 means 40% of theoretical peak -- typical for large-scale LLM training.
# Published values: 0.30-0.45 (Llama-2/3), up to 0.50 (highly optimized runs).

# Compute budget: 512 GPUs * 989 TFLOP/s * 30 days * 86400 s/day * 0.4 MFU
gpu_flops = h100.compute.peak_flops.to("flop/s").magnitude
total_flops = 512 * gpu_flops * 30 * 86400 * 0.4
compute_budget = Q_(total_flops, "flop")

scaling_solver = ScalingModel()
scaling_result = scaling_solver.solve(
    compute_budget=compute_budget,
    target_model_size=model.parameters
)

banner("Domain: Algorithm (Walls 11-13)")
info(Compute_budget=compute_budget.to('EFLOP'),
     Optimal_tokens=f"{scaling_result.optimal_tokens.magnitude:.2e}",
     Tokens_per_parameter=f"{scaling_result.tokens_per_parameter:.1f}",
     Chinchilla_ratio=f"{'OVER' if scaling_result.tokens_per_parameter > 20 else 'UNDER'}-trained")
```

If the tokens-per-parameter ratio is significantly above or below 20, the training budget is not optimally allocated. Over-training wastes compute; under-training wastes model capacity.

So far, everything looks manageable: compute-bound GPUs, adequate data pipeline, reasonable training budget. If we throw 512 GPUs at this, we should scale linearly, right? The fleet-level analysis reveals what single-node reasoning misses.

---

## 5. Fleet (Walls 14--16): Communication and Reliability

The distributed solver models AllReduce overhead and pipeline bubbles. The reliability solver computes cluster MTBF and optimal checkpoint intervals.

```{python}
from mlsysim import DistributedModel, ReliabilityModel

# Parallelism strategy: TP=8 (within node), PP=1 (no pipeline), DP=64 (across nodes)
dist_solver = DistributedModel()
dist_result = dist_solver.solve(
    model=model,
    fleet=fleet,
    batch_size=2048,
    precision="fp16",
    tp_size=8,
    pp_size=1,
    overlap_comm=True,
    seq_len=2048
)

banner("Domain: Fleet (Walls 14-16)")
info(Scaling_efficiency=f"{dist_result.scaling_efficiency:.2%}",
     Step_latency=dist_result.step_latency_total.to('ms'),
     DP_comm_latency=dist_result.dp_communication_latency.to('ms'),
     TP_comm_latency=dist_result.tp_communication_latency.to('ms'),
     Bubble_fraction=f"{dist_result.bubble_fraction:.2%}")
```

```{python}
# Reliability: 30-day training job
rel_solver = ReliabilityModel()
rel_result = rel_solver.solve(
    fleet=fleet,
    job_duration_hours=30*24,
    checkpoint_time_s=120
)

info(Fleet_MTBF=rel_result.fleet_mtbf.to('hour'),
     Failure_probability=f"{rel_result.failure_probability:.2%}",
     Expected_failures=f"{rel_result.expected_failures:.1f}",
     Optimal_ckpt_interval=rel_result.optimal_checkpoint_interval.to('minute'))
```

At 512 GPUs, the cluster MTBF shrinks significantly. Checkpoint overhead becomes a non-trivial fraction of wall-clock time --- this is the "hidden cost" that single-node analysis misses entirely.
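To make that hidden cost concrete, here is a minimal sketch of the Young-Daly approximation applied to the numbers above: the 120 s checkpoint write time passed to the reliability solver, and the fleet MTBF it computed. It assumes checkpointing follows that simple model; the solver's own `optimal_checkpoint_interval` may be derived differently.

```{python}
# Young-Daly sketch: optimal interval tau = sqrt(2 * delta * MTBF), and the
# first-order overhead fraction ~ delta/tau + tau/(2 * MTBF)
# (checkpoint writes plus expected recomputation after a failure).
mtbf_s = rel_result.fleet_mtbf.to('s').magnitude
delta_s = 120                    # checkpoint write time used above

tau_s = (2 * delta_s * mtbf_s) ** 0.5
overhead = delta_s / tau_s + tau_s / (2 * mtbf_s)

print(f"Young-Daly interval:  ~{tau_s / 60:.0f} min")
print(f"Checkpoint overhead:  ~{overhead:.1%} of wall-clock time")
```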
The reliability analysis tells us *how often* the cluster fails. But failures cost money --- and so does the energy to keep 512 GPUs running for 30 days. The operational domain quantifies these costs.

---

## 6. Ops (Walls 17--20): TCO, Energy, Carbon, Water

The economics solver rolls CapEx and OpEx into a single financial model; the sustainability solver quantifies the energy, carbon, and water behind it.

```{python}
from mlsysim import EconomicsModel, SustainabilityModel

# 30-day training run
econ_solver = EconomicsModel()
econ_result = econ_solver.solve(
    fleet=fleet,
    duration_days=30,
    grid=Grids.Quebec,
    mfu=0.4
)

banner("Domain: Ops (Walls 17-20)")
info(CapEx=f"${econ_result.capex_usd:,.0f}",
     OpEx_energy=f"${econ_result.opex_energy_usd:,.0f}",
     OpEx_maintenance=f"${econ_result.opex_maintenance_usd:,.0f}",
     Total_TCO=f"${econ_result.tco_usd:,.0f}")
```

```{python}
sust_solver = SustainabilityModel()
sust_result = sust_solver.solve(
    fleet=fleet,
    duration_days=30,
    datacenter=Grids.Quebec,
    mfu=0.4
)

info(IT_Energy=sust_result.it_energy_kwh.to('MWh'),
     Total_Energy_PUE=sust_result.total_energy_kwh.to('MWh'),
     Carbon_footprint=f"{sust_result.carbon_footprint_kg:.0f} kg CO2",
     Water_usage=f"{sust_result.water_usage_liters:.0f} liters",
     PUE=sust_result.pue,
     Region=sust_result.region_name)
```

Quebec's hydroelectric grid makes this one of the lowest-carbon training locations in the world. The same run in Poland (coal-heavy grid) would produce dramatically more CO2 --- infrastructure geography is a first-class engineering variable.

---

## 7. Analysis (Walls 21--22): Sensitivity and Synthesis

Finally, confirm the binding constraint and derive minimum hardware for a 14-day completion target.

```{python}
from mlsysim import SensitivitySolver, SynthesisSolver

# Sensitivity: confirm compute is the binding constraint for training
sens_solver = SensitivitySolver()
sens_result = sens_solver.solve(
    model=model,
    hardware=h100,
    precision="fp16"
)

banner("Domain: Analysis (Walls 21-22)")
info(Binding_constraint=sens_result.binding_constraint)

sens_rows = [[param, f"{val:+.4f}"] for param, val in sens_result.sensitivities.items()]
table(["Parameter", "Sensitivity"], sens_rows)
```

```{python}
# Synthesis: what per-GPU step latency is needed to finish in 14 days?
# Total training FLOPs / (N_GPUs * MFU * peak_FLOPS) = wall_clock_seconds
target_days = 14
target_seconds = target_days * 86400

# Per-GPU step target: total_steps * step_latency = target_seconds
# Approximate: we need each step to complete within a target latency
synth_solver = SynthesisSolver()
synth_result = synth_solver.solve(
    model=model,
    target_latency=Q_("200 ms"),  # per-GPU training step target
    precision="fp16"
)

info("Synthesis (200ms per-GPU training step target)",
     Required_BW=synth_result.required_bw.to('TB/s'),
     Required_FLOPS=synth_result.required_flops.to('TFLOPs/s'),
     Required_memory=synth_result.required_memory.to('GB'))
```
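Before moving to the summary, a quick sanity check on the 14-day target. This is a back-of-envelope sketch, not a solver run: it uses the standard 6PD approximation for total training FLOPs, together with the assumptions already used above (989 TFLOP/s dense FP16 per H100, 40% MFU, 20 tokens per parameter).

```{python}
# Back-of-envelope (a sketch, not a solver run): Chinchilla-optimal 70B training
# time on this cluster, using the common C ~ 6 * P * D approximation.
P = 70e9                          # parameters
D = 20 * P                        # Chinchilla-optimal tokens (20 tok/param)
C = 6 * P * D                     # total training FLOPs

sustained = 512 * 989e12 * 0.4    # 512 H100s at 40% MFU, FLOP/s
days_on_512 = C / sustained / 86400
needed_for_14 = C / (14 * 86400)  # aggregate sustained FLOP/s for a 14-day finish

print(f"Total training FLOPs:          {C:.2e}")
print(f"Wall-clock on 512 GPUs:        ~{days_on_512:.0f} days")
print(f"Sustained FLOP/s for 14 days:  {needed_for_14:.2e}")
```

Exercise 3 asks you to turn this required FLOP rate into a minimum cluster size and verify it with the distributed solver.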
---

## 8. Summary Table: The Complete Picture

We have now traced a single workload through all six domains. Each solver answered one question in isolation. But the systems engineer's job is synthesis: seeing the complete picture at once. The table below is that picture --- and its most important property is that no single row captures the full story.

```{python}
mtbf_hours = rel_result.fleet_mtbf.to('hour').magnitude

summary_rows = [
    ["Node", "SingleNodeModel", f"Bottleneck: {node_result.bottleneck}", "Wall 1: Compute"],
    ["Data", "DataModel", f"Util: {data_result.utilization:.0%}", "Not binding"],
    ["Algorithm", "ScalingModel", f"Tok/param: {scaling_result.tokens_per_parameter:.0f}", "Wall 11"],
    ["Fleet", "DistributedModel", f"Efficiency: {dist_result.scaling_efficiency:.0%}", "Wall 14: Comm"],
    ["Fleet", "ReliabilityModel", f"MTBF: {mtbf_hours:.0f}h", "Wall 19: Ckpt"],
    ["Ops", "EconomicsModel", f"TCO: ${econ_result.tco_usd:,.0f}", "Wall 17: Cost"],
    ["Ops", "SustainabilityModel", f"CO2: {sust_result.carbon_footprint_kg:.0f} kg", "Wall 18: Energy"],
    ["Analysis", "SensitivitySolver", f"Binding: {sens_result.binding_constraint}", "Wall 21"],
]

table(["Domain", "Solver", "Key Metric", "Binding Wall"], summary_rows, "<<>>")
```

::: {.callout-important}
## Key Insight

**No single solver captures the full picture --- the systems view emerges from composition.**

This end-to-end trace exercises 12 of 22 walls through a single model. The per-GPU binding constraint is compute (Wall 1), but the **hidden costs** only appear at fleet scale: checkpoint overhead (Wall 19) consumes wall-clock time proportional to the MTBF-driven checkpoint frequency, and infrastructure geography (Quebec vs. Poland) can change the carbon footprint by 40x (as [Tutorial 7](07_geography.qmd) demonstrated).

A complete systems analysis is not one solver run --- it is the composition of all six domains.
:::

---

## Your Turn

::: {.callout-caution}
## Exercises

**Exercise 1: Predict before you compute.** What if you train in Poland instead of Quebec? Before running code, predict how the TCO and carbon footprint will change. (Hint: Poland's grid is coal-heavy with ~800 g CO2/kWh vs. Quebec's ~20 g CO2/kWh, and Poland has a higher PUE.) Then re-run the economics and sustainability solvers with `Grids.Poland` and compare. How close was your prediction? A starter sketch follows this callout.

**Exercise 2: Double the cluster.** Scale the fleet to 1024 GPUs (128 nodes). Re-run the distributed solver and reliability solver. Does scaling efficiency hold? How does the MTBF change? At what cluster size does the checkpoint overhead exceed 5% of wall-clock time?

**Exercise 3: Minimum viable cluster.** What is the minimum cluster size to complete Llama-3 70B training in 14 days? Use the scaling result to determine the required total FLOPs, then work backward to find the number of H100 GPUs needed at 40% MFU. Verify with the distributed solver that the communication overhead is acceptable at that scale.

**Exercise 4: Propose a design change.** Using the full-stack analysis, identify the single highest-leverage change --- hardware upgrade, parallelism strategy, region change, or precision change --- that would reduce TCO by at least 20%. Re-run the relevant solvers with your proposed change and compute the new TCO. *Write one paragraph justifying why this change has the largest impact, referencing at least two domains from the summary table.*

**Self-check:** If the fleet MTBF is 4 hours and each checkpoint takes 2 minutes, what fraction of wall-clock time is spent checkpointing? (Use the Young-Daly formula: optimal interval = sqrt(2 * delta * MTBF).)
:::
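A minimal starting point for Exercise 1. It reuses the Section 6 solver calls unchanged and only swaps the region; `Grids.Poland` is assumed to exist in the registry alongside `Grids.Quebec`.

```python
# Exercise 1 starter: same fleet, same 30-day run, Polish grid instead of Quebec.
# Assumes Grids.Poland is defined in mlsysim.infra.registry.
econ_poland = econ_solver.solve(fleet=fleet, duration_days=30,
                                grid=Grids.Poland, mfu=0.4)
sust_poland = sust_solver.solve(fleet=fleet, duration_days=30,
                                datacenter=Grids.Poland, mfu=0.4)

info("Quebec vs. Poland",
     TCO_delta=f"${econ_poland.tco_usd - econ_result.tco_usd:,.0f}",
     CO2_ratio=f"{sust_poland.carbon_footprint_kg / sust_result.carbon_footprint_kg:.0f}x")
```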
---

## Key Takeaways

::: {.callout-tip}
## Summary

- **Composition is the method**: no single solver spans all six taxonomy domains; the systems view emerges only from composing 6+ solvers
- **Compute binds at the node level**, but checkpoint overhead and communication are the hidden costs at fleet scale
- **Infrastructure geography matters**: Quebec vs. Poland can change carbon footprint by 40x and TCO by 20--30%
- **The summary table** is the deliverable: one row per domain, solver, key metric, and binding wall
- **12 of 22 walls** are exercised through a single model-fleet pair --- this is what a complete analysis looks like
:::

---

## Next Steps

- **[Sensitivity Analysis](09_sensitivity.qmd)** --- Dive deeper into the Analysis domain solvers
- **[GPU vs. Wafer-Scale](10_gpu_vs_wafer.qmd)** --- See how architecture shifts the binding wall
- **[Geography of AI](07_geography.qmd)** --- Explore how datacenter location changes sustainability
- **[The \$9 Million GPU](08_nine_million_dollar.qmd)** --- Deep dive into TCO modeling