cs249r_book/mlsysim/tutorial/assessment/simulation-qa-round1.md
Vijay Janapa Reddi b7bf7a4ce5 docs(tutorial): simulated Q&A — 8 tough questions + 3 hallway conversations
Answer quality ranges 5-9/10. Weakest: MoE support (6/10), diffusion
models (5/10), TinyML depth (6/10) — all honest v0.2.0 gaps.
Strongest: spreadsheet comparison (7/10), inverse Roofline value (9/10),
CPI analogy for efficiency (9/10).
3 hallway conversations simulate real adoption decision dynamics.
2026-04-02 07:30:04 -04:00


Simulated Q&A and Hallway Conversations -- ISCA 2026 Tutorial

Purpose: Stress-test the tutorial's claims before a live audience finds the weak spots.

Date generated: 2026-04-01

Methodology: Each question is drawn from a real archetype (the skeptic, the competitor, the domain expert, the methodologist). Answers are the presenter's honest best response given the current state of mlsysim v0.1.0 and the calibration data in empirical-calibration.md.


Part 1: Tough Questions (Post-Session Q&A)


Q1: "How is this different from a spreadsheet?"

Questioner archetype: Senior engineer, slightly bored, has seen a hundred tool demos.

The question (verbatim): "I appreciate the presentation, but I could do all of this in a spreadsheet. I have a Google Sheet where I plug in FLOPS, bandwidth, and model size and get the same answers. What does mlsysim give me that Excel doesn't?"

The best honest answer:

You are right that any single equation in mlsysim can be reproduced in a spreadsheet. People do this all the time, and for one-off calculations it works fine. The differences show up at scale:

  1. Unit safety. mlsysim uses Pint for dimensional analysis. Every quantity carries its unit. If you accidentally add GB to TFLOPS, you get a DimensionalityError at the point of the mistake, not a silently wrong number three rows down. Spreadsheets have no type system -- a cell is a cell. We have seen real production capacity planning errors that came from mixing GB and GiB in a spreadsheet, or confusing FLOP/s with FLOPs (rate vs count). Those errors are structurally impossible in mlsysim.

  2. Composability across 22 walls. Your spreadsheet probably covers 3-4 constraints. mlsysim composes all 22 into a single Pipeline.solve() call. The capstone exercise we just did -- throughput, latency, budget, carbon, fault tolerance, all simultaneously -- would be a nightmare in a spreadsheet because the constraints interact. Changing precision changes memory, which changes whether you need tensor parallelism, which changes communication cost, which changes fleet size, which changes carbon. That cascade is what the pipeline handles.

  3. Traceability. Every hardware constant in mlsysim is a TraceableConstant with a source, date, and DOI. When your spreadsheet says "H100 bandwidth = 3350 GB/s," where did that number come from? Is it the datasheet peak or the measured sustained? Is it HBM3 or HBM3E? In mlsysim, you can audit any number back to its origin.
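The unit-safety argument can be made concrete with a toy example. The sketch below uses only the standard library; it is not Pint and not mlsysim's API. The `Quantity` class, its unit labels, and the `BYTES_PER` table are hypothetical names introduced for illustration. It shows the two failure modes named above: adding a capacity to a rate, and treating GiB as GB.

```python
from dataclasses import dataclass

# Bytes per unit for the two capacity units that are easy to confuse.
BYTES_PER = {"GB": 10**9, "GiB": 2**30}


@dataclass(frozen=True)
class Quantity:
    """Toy quantity: a magnitude plus a unit label (hypothetical, not Pint)."""
    value: float
    unit: str

    def __add__(self, other: "Quantity") -> "Quantity":
        # Refuse to add quantities whose unit labels differ.
        if self.unit != other.unit:
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)

    def to(self, unit: str) -> "Quantity":
        # Only capacity units are convertible in this toy example.
        if self.unit not in BYTES_PER or unit not in BYTES_PER:
            raise TypeError(f"cannot convert {self.unit} to {unit}")
        return Quantity(self.value * BYTES_PER[self.unit] / BYTES_PER[unit], unit)


capacity = Quantity(80, "GiB")
bandwidth = Quantity(3350, "GB/s")  # the H100 figure quoted in the answer above

# Explicit conversion: 80 GiB is noticeably more than 80 GB.
capacity_in_gb = capacity.to("GB")
gib_vs_gb_gap = capacity_in_gb.value / 80 - 1  # relative gap between the two readings

# Capacity plus bandwidth is meaningless, so the toy type rejects it.
try:
    capacity + bandwidth
    mixed_units_rejected = False
except TypeError:
    mixed_units_rejected = True
```

The gap between 80 GiB and 80 GB is a little over 7%, which is small enough to go unnoticed in a spreadsheet cell and large enough to matter when a model is close to the capacity limit.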

But I want to be clear: if you have a well-maintained spreadsheet that solves your problem, keep using it. mlsysim is most valuable when you are doing comparative analysis across many hardware platforms, or when you need to hand your analysis to someone else and they need to trust the numbers.

Answer quality: 7/10

What would make it stronger: A live side-by-side demo showing a real spreadsheet error that mlsysim catches. The unit-safety argument is strong in theory but needs a concrete "I had a $200K capacity planning error because of a GiB/GB confusion" war story. Without the anecdote, the senior engineer thinks "I just label my columns carefully." Also: the 22-wall composition claim is aspirational -- v0.1.0 does not actually compose all 22 in a single solve call yet. The pipeline exists for compute/memory/communication/cost/carbon, but walls like Tail Latency (Erlang-C), Multi-tenant (queueing), and Safety (DP-SGD overhead) are not yet wired into the solver. Being caught overstating coverage would be worse than admitting the gap.


Q2: "Your accuracy is 2-5x off. Why wouldn't I just benchmark?"

Questioner archetype: Industry engineer who runs real clusters. Pragmatist.

The question (verbatim): "Your own slides say the accuracy is within 2-5x of measured performance. That is a factor of five in the worst case. If I need to make a hardware purchasing decision, I need numbers I can trust. Why would I use something that might be off by 5x instead of just running a benchmark?"

The best honest answer:

That is a fair challenge, and I want to be precise about what "2-5x" means and where it comes from.

First, the 2-5x number is our worst case across all configurations with default efficiency parameters. The calibration table in our docs shows the actual spread. For LLM decode latency -- the serving use case most people care about -- we are within the published range, not 2-5x off, because decode is memory-bandwidth-bound and the model correctly computes weights / bandwidth. There is no efficiency parameter on the critical path. For CNN training throughput, the default eta produces predictions that are 22% low on A100 and 54% high on H100. That is where the "2-5x" envelope comes from.
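The "weights / bandwidth" claim for decode can be written out directly. This is a minimal sketch in plain Python, not mlsysim code; the function name `decode_step_ms` is hypothetical. It assumes batch size 1 and ignores KV-cache reads, so it is a lower bound on per-token time rather than a prediction.

```python
def decode_step_ms(params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower-bound time per generated token when decode is bandwidth-bound.

    Every weight is read once per token, so time = weight bytes / bandwidth.
    No efficiency parameter appears anywhere in the expression.
    """
    weight_bytes = params * bytes_per_param
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# 8B parameters at FP16 (2 bytes each) on a 3350 GB/s part,
# the H100 bandwidth figure quoted earlier in this document.
h100_ms = decode_step_ms(8e9, 2, 3350)
print(f"per-token lower bound: {h100_ms:.2f} ms")
```

Because eta does not appear in the expression, the uncertainty in this estimate comes from how close the kernel gets to peak bandwidth, not from a calibrated parameter.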

Second, benchmarking is always better when you can do it. The question is: can you? Benchmarking requires having the hardware, which means you have already purchased it or negotiated cloud access. It requires having the software stack working on that hardware, which for a new platform can take weeks. And it gives you one data point: this model, on this hardware, at this batch size, with this framework version.

mlsysim is for the phase before benchmarking: "Should I even request time on this cluster?" "Is it worth porting to AMD MI300X, or is it obviously bandwidth-starved for my workload?" "How many GPUs do I need to request in my cloud allocation?" If mlsysim tells you a workload is 3x over the memory capacity of a given GPU, you do not need to benchmark to know it will not work.
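The "how many GPUs do I need to request" question reduces to a ceiling division once the footprint is known. The sketch below is plain Python, not mlsysim's API; `min_gpus_for_weights` and the `usable_fraction` default of 0.9 are assumptions made for illustration, and the estimate covers weights only (no KV-cache or activations).

```python
import math


def min_gpus_for_weights(params: float, bytes_per_param: float,
                         gpu_capacity_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose combined usable memory holds the weights.

    usable_fraction is a hypothetical allowance for runtime overhead.
    """
    weight_gb = params * bytes_per_param / 1e9
    return math.ceil(weight_gb / (gpu_capacity_gb * usable_fraction))


fp16_gpus = min_gpus_for_weights(70e9, 2, 80)    # 140 GB of weights
int4_gpus = min_gpus_for_weights(70e9, 0.5, 80)  # 35 GB of weights
print(fp16_gpus, int4_gpus)
```

A result greater than one already tells you that tensor or pipeline parallelism is required, which is the kind of answer that does not need a benchmark.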

Third, with a single calibrated measurement, accuracy improves dramatically. If you measure eta on one hardware platform, the model matches benchmarks within 1-5% for that platform, and gives you a reasonable starting point for others. The workflow is: benchmark once, calibrate eta, then use mlsysim for the design space exploration.

Answer quality: 8/10

What would make it stronger: The answer is honest and the "benchmark once, explore with mlsysim" workflow is compelling. The weak spot is that the calibration doc shows eta does NOT transfer across GPU generations (A100 eta=0.13 vs H100 eta=0.065 for the same ResNet-50 workload). The presenter should own this explicitly: "Cross-generation transfer is the known weak point. We are within 2x, not 1%, when transferring eta across architectures." Overpromising on cross-platform accuracy would get destroyed in peer review.


Q3: "The efficiency parameter is just a fudge factor, right?"

Questioner archetype: Architecture PhD student. Technically sharp. Wants to understand the epistemology.

The question (verbatim): "You showed that you calibrate eta per-benchmark to match published numbers. ResNet-50 on A100 needs eta=0.13, ResNet-50 on H100 needs eta=0.065. You call this the CPI analogy, but CPI is measured -- you back-calculate eta from the answer you want to predict. Isn't that circular? It is a fudge factor."

The best honest answer:

I want to take this seriously because it is the right question.

You are correct that when we calibrate eta per-benchmark, we are fitting a single parameter to match observations. That is definitionally a fudge factor if the only thing we do with it is reproduce the number we already measured.

The value is not in reproducing the known number. The value is in what you can do after calibration. Once you have eta=0.13 for ResNet-50 on A100, you can ask: "What happens if I double the batch size? What if I switch to FP8? What if I add pipeline parallelism?" The model makes predictions for those counterfactuals that are constrained by physics -- the FLOP count changes, the memory footprint changes, the communication volume changes -- and eta carries forward as the empirical correction.

The CPI analogy is precise. Patterson and Hennessy measure CPI from SPEC benchmarks. You could call CPI a fudge factor too -- it absorbs cache miss rates, branch mispredictions, pipeline hazards, and a dozen other things into one number. The reason it is useful is not that CPI is predictable from first principles. It is that the performance equation Time = Instructions x CPI x Clock_Period lets you reason about what happens when you change the ISA (Instructions change), or the clock frequency (Clock_Period changes), or the microarchitecture (CPI changes). Each term is independently variable.

Where this analogy breaks down -- and I should be honest about this -- is transferability. CPI for a given benchmark on a given ISA transfers reasonably well across microarchitectures within a generation. Our calibration data shows that eta does NOT transfer well across GPU generations for the same workload. ResNet-50 gets eta=0.13 on A100 and eta=0.065 on H100. That is a 2x difference, and it means you cannot use eta measured on A100 to predict H100 performance without significant error.
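The size of the transfer error follows from the form of the model. The sketch below assumes the simplest possible efficiency model, time = FLOPs / (eta x peak), which is an assumption about how eta enters and not a statement of mlsysim's internals. The FLOP count and peak rate are placeholders; only the two eta values come from the calibration numbers quoted above.

```python
def predicted_time_s(flops: float, peak_flops_per_s: float, eta: float) -> float:
    """Compute-bound time under a single-parameter efficiency model (assumed form)."""
    return flops / (eta * peak_flops_per_s)


# Placeholder workload and hardware; they cancel in the ratio below.
FLOPS = 1e15
PEAK = 1e15

ETA_A100 = 0.13    # calibrated on A100 (from the text)
ETA_H100 = 0.065   # calibrated on H100 (from the text)

time_with_transferred_eta = predicted_time_s(FLOPS, PEAK, ETA_A100)
time_with_native_eta = predicted_time_s(FLOPS, PEAK, ETA_H100)

# How optimistic is the prediction if the A100 eta is reused on H100?
optimism = time_with_native_eta / time_with_transferred_eta
print(f"transferred eta underestimates time by {optimism:.1f}x")
```

Because eta multiplies peak throughput directly, any relative error in eta becomes the same relative error in predicted time. There is no other term in this model to absorb it.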

We think this is a real and important limitation, and we document it explicitly. The fix is not to pretend eta transfers -- it is to build a richer efficiency model that decomposes eta into sub-factors (kernel utilization, memory system efficiency, framework overhead) that transfer independently. That is future work.

Answer quality: 9/10

What would make it stronger: This is the best answer in the set because it concedes the legitimate criticism, gives the honest intellectual defense, and names the specific failure mode. The only improvement would be showing preliminary results from a decomposed eta model -- even a two-factor version (compute utilization + memory utilization) that transfers better. Without that, the "future work" claim is promissory.


Q4: "Does this work for MoE models like Mixtral?"

Questioner archetype: Applied ML researcher working on MoE architectures.

The question (verbatim): "All your examples are dense models. Mixture-of-Experts changes the arithmetic intensity dramatically -- only 2 of 8 experts are active per token in Mixtral, so the active parameter count is 12B but the total is 46B. The memory footprint is 46B but the compute is 12B. Does mlsysim handle this?"

The best honest answer:

Not natively in v0.1.0, and this is an important gap.

You are exactly right about the analysis. MoE creates a fundamental decoupling between memory footprint and compute that breaks the assumption of most analytical models, including ours. For dense models, parameters and FLOPs are tightly coupled via the 6ND rule (training FLOPs are roughly 6 x parameters x tokens). For MoE, you need to track them separately: all experts must be resident in memory (46B for Mixtral), but only the active experts contribute FLOPs per token (roughly 12B worth).

In the current version, you could model Mixtral by manually setting the model parameters to 46B (for memory calculations) and overriding the FLOPs to match the active expert count. That is a workaround, not proper support.
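The arithmetic behind that workaround fits in a few lines. This sketch is plain Python and deliberately avoids mlsysim calls, since the override mechanism is not shown in this document. The function name `moe_footprint` is hypothetical, and the "2 FLOPs per active parameter per token" forward-pass approximation is a standard rule of thumb, not a measured figure.

```python
def moe_footprint(total_params: float, active_params: float,
                  bytes_per_param: float = 2.0) -> tuple[float, float]:
    """Return (weight memory in GB, forward FLOPs per token) for an MoE model.

    Memory scales with total parameters because every expert must be resident.
    Compute scales with active parameters because only routed experts run.
    """
    weight_gb = total_params * bytes_per_param / 1e9
    flops_per_token = 2.0 * active_params
    return weight_gb, flops_per_token


# Mixtral-like figures quoted in the question: 46B total, 12B active, FP16 weights.
weight_gb, flops_per_token = moe_footprint(46e9, 12e9)

# What a dense-model assumption would have charged for the same memory footprint.
dense_flops_per_token = 2.0 * 46e9
compute_overestimate = dense_flops_per_token / flops_per_token

print(f"{weight_gb:.0f} GB resident, {flops_per_token / 1e9:.0f} GFLOP per token")
```

Treating the model as dense at 46B would overstate per-token compute by almost 4x, while treating it as dense at 12B would understate memory by the same factor. Neither single number works.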

What proper MoE support requires is: (1) separating the memory model from the compute model in the solver, which is an architectural change; (2) modeling the expert routing overhead -- the gating network, the all-to-all communication in distributed MoE where different tokens route to different GPUs; and (3) modeling the load imbalance problem, where popular experts become bottlenecks.

The expert routing communication pattern is particularly important for ISCA audiences. Dense models use AllReduce (symmetric, bandwidth-optimal). MoE uses All-to-All (asymmetric, sensitive to load balance). The communication wall looks completely different.
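A rough comparison of the two communication patterns, under textbook assumptions, looks like this. The function names are hypothetical, the formulas are the standard per-GPU volume expressions for ring AllReduce and uniform All-to-All, and `hot_fraction` is an illustrative knob for load imbalance rather than anything taken from mlsysim.

```python
def ring_allreduce_bytes_per_gpu(buffer_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring AllReduce (reduce-scatter plus all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * buffer_bytes


def all_to_all_bytes_per_gpu(buffer_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in an All-to-All when tokens route uniformly."""
    return (n_gpus - 1) / n_gpus * buffer_bytes


def hottest_receiver_bytes(buffer_bytes: float, n_gpus: int, hot_fraction: float) -> float:
    """Bytes arriving at one GPU if every other GPU routes hot_fraction of its data there."""
    return (n_gpus - 1) * hot_fraction * buffer_bytes


N = 8
BUF = 1e9  # 1 GB per GPU, an arbitrary illustrative size

allreduce = ring_allreduce_bytes_per_gpu(BUF, N)
balanced = all_to_all_bytes_per_gpu(BUF, N)
skewed = hottest_receiver_bytes(BUF, N, hot_fraction=0.25)

print(f"AllReduce: {allreduce / 1e9:.3f} GB, balanced All-to-All: {balanced / 1e9:.3f} GB, "
      f"hot expert ingress: {skewed / 1e9:.3f} GB")
```

The AllReduce volume is fixed by buffer size and GPU count. The All-to-All volume at the busiest GPU depends on the routing distribution, so in this example a popular expert that attracts a quarter of all tokens sees twice the ingress of the balanced case.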

This is on our roadmap. MoE is the single most requested feature.

Answer quality: 6/10

What would make it stronger: The answer correctly identifies the gap and shows domain understanding, but "it is on our roadmap" is the weakest possible ending at ISCA. A concrete timeline ("v0.2 in Q4 2026") or a branch with preliminary MoE support would be much stronger. Even better: have the workaround as a prepared code snippet that the questioner can try immediately. Saying "manually override the FLOPs" without showing the 5-line code example makes it feel like a hand-wave.


Q5: "Why should I use this instead of Calculon?"

Questioner archetype: Someone who has actually used Calculon (from Georgia Tech and NVIDIA researchers) for training performance modeling.

The question (verbatim): "Calculon already does analytical training performance modeling. It handles 3D parallelism, pipeline bubbles, and communication overlap. It was validated against Megatron-LM at scale. Why should I switch?"

The best honest answer:

You should not switch. You should use both, for different questions.

Calculon is excellent at what it does: training performance modeling for large language models with 3D parallelism. It was built by people who run some of the largest training clusters in the world, and it shows. If your question is "what is the optimal parallelism configuration for training GPT-4-scale models on 2048 H100s," Calculon is probably the better tool right now. It models pipeline bubble fractions, communication-computation overlap, and micro-batch scheduling in more detail than mlsysim does.

mlsysim makes different trade-offs:

  1. Breadth over depth. Calculon covers training on NVIDIA hardware. mlsysim covers training, inference, serving, TinyML, cost, carbon, and sustainability across five vendors. If your question is "should I deploy this model on H100, MI300X, or Gaudi 3, and what is the carbon footprint of each option," Calculon cannot help you.

  2. Inference and serving. Calculon does not model the two-phase serving regime (prefill/decode), KV-cache memory pressure, or tail latency under load. mlsysim does.

  3. Unit safety and traceability. This is a differentiator for pedagogical and audit use cases. Every number in mlsysim is dimensionally typed and traceable to a source.

  4. Pedagogical design. mlsysim was designed for teaching. The API is intentionally simple: Engine.solve(model, hardware, batch_size). Calculon is designed for research-grade modeling, which means a steeper learning curve and more configuration.

Where Calculon wins cleanly: training-specific fidelity, communication overlap modeling, validation at real scale (thousands of GPUs with real Megatron-LM measurements). We respect that work enormously.

The honest positioning is: mlsysim is a broader, shallower tool. Calculon is a narrower, deeper one. Use mlsysim for rapid design-space exploration across the full stack. Use Calculon for detailed training performance prediction once you have narrowed the design space.

Answer quality: 8/10

What would make it stronger: This answer is good because it does not trash the competitor. The risk is that "broader but shallower" sounds like "worse at everything Calculon does." The presenter should have a prepared example where breadth matters: "A startup choosing between H100 and MI300X for a serving workload cannot use Calculon at all. mlsysim gives them a quantitative answer in under a second." The serving use case is the clearest differentiator -- lean into it hard.


Q6: "All your examples are Llama and ResNet. What about diffusion models?"

Questioner archetype: Computer vision researcher working on generative models.

The question (verbatim): "You showed Llama-3-8B and ResNet-50 in every exercise. Those are Transformer and CNN workloads. Diffusion models like Stable Diffusion have a completely different compute profile -- iterative denoising, U-Net backbone, cross-attention with text embeddings, variable-length generation. Can mlsysim handle that?"

The best honest answer:

The honest answer is: partially, and with manual effort.

The core physics still applies. A diffusion model is ultimately a sequence of forward passes through a neural network (the U-Net or DiT), each of which has a known FLOP count and memory footprint. mlsysim can model each denoising step as a forward pass and multiply by the number of steps. The Roofline analysis applies -- each step is either compute-bound or memory-bound depending on the batch size and model size.

What we do not model natively:

  1. The iterative structure. Diffusion inference requires N denoising steps (typically 20-50). Total latency is N times the per-step latency. This is trivial to compute but our solver API is not designed around iterative generation -- you would need to multiply the single-step result by N yourself.

  2. The U-Net architecture. Our FLOP counting assumes either a standard Transformer or a CNN. U-Nets have skip connections and variable resolution stages that make the per-layer FLOP distribution uneven. You would need to provide the total FLOPs manually rather than relying on our auto-counting.

  3. Cross-attention. The text-conditioned cross-attention between the CLIP embeddings and the U-Net features is a different attention pattern than self-attention in Transformers. It has different memory and compute characteristics.

  4. Classifier-free guidance. CFG doubles the forward pass cost (one conditioned, one unconditioned). This is easy to model (multiply by 2) but is not automatic.
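Items 1 and 4 above are the manual multiplications a user would do today. A minimal sketch, assuming the per-step latency has already been obtained from a single forward-pass estimate; the function name and the example inputs are hypothetical, and the model treats CFG as exactly two sequential forward passes.

```python
def diffusion_latency_s(step_latency_s: float, num_steps: int, use_cfg: bool = True) -> float:
    """End-to-end denoising latency from a single-step estimate.

    Classifier-free guidance runs one conditioned and one unconditioned
    forward pass per step, modeled here as a flat 2x multiplier.
    """
    passes_per_step = 2 if use_cfg else 1
    return step_latency_s * num_steps * passes_per_step


# Illustrative inputs only: a 40 ms step estimate and a 30-step schedule.
with_cfg = diffusion_latency_s(0.040, 30, use_cfg=True)
without_cfg = diffusion_latency_s(0.040, 30, use_cfg=False)
print(f"with CFG: {with_cfg:.2f} s, without: {without_cfg:.2f} s")
```

In practice the two CFG passes are often batched together, in which case the cost shows up as a doubled batch size in the per-step estimate rather than as a separate multiplier.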

The newer DiT (Diffusion Transformer) architectures are actually easier for us to model because they are standard Transformers with the iterative denoising wrapper. As the field moves from U-Net to DiT, our coverage improves.

I would say: for rough capacity planning ("can I serve Stable Diffusion XL on this GPU at this batch size?"), mlsysim works with manual FLOP input. For detailed latency optimization of the denoising pipeline, you need profiling tools.

Answer quality: 5/10

What would make it stronger: This answer reveals a real coverage gap. The presenter knows the physics but has to say "do it manually" four times. A prepared notebook showing DiffusionModel support -- even if it just wraps the manual steps into a helper function -- would turn this from a 5 to an 8. Diffusion models are the second-largest inference workload after LLMs in 2026. Not having native support is a significant gap for a tool claiming to cover "22 walls." The calibration doc should include at least one diffusion model benchmark.


Q7: "The TinyML section felt like an afterthought. Is it real?"

Questioner archetype: Embedded systems researcher. Noticed that the hardware zoo lists ESP32-S3 and nRF52840 but the tutorial exercises never touch them.

The question (verbatim): "Your hardware table lists microcontrollers -- ESP32, nRF52840 -- and the efficiency guide mentions TFLite Micro. But every exercise in the tutorial was about H100s and A100s. Have you actually validated the model on microcontrollers, or is it just a row in a table?"

The best honest answer:

Fair criticism. The TinyML support is real in the sense that the hardware specs are in the registry and the solver can compute Roofline-style predictions for them. An ESP32-S3 has a known peak compute rate (roughly 0.5 GOPS for INT8) and known SRAM (512 KB). You can ask mlsysim "does a 250 KB quantized MobileNet-v2 fit in SRAM and how long does inference take?" and get a physically grounded answer.

But you are right that it is not deeply validated. We have not run MLPerf Tiny benchmarks against our predictions for MCUs. The efficiency parameter for TinyML (eta=0.05-0.15) is estimated from general knowledge of interpreter overhead in TFLite Micro, not from systematic measurement. The tutorial does not include TinyML exercises because the ISCA audience skews toward datacenter and cloud.

What is genuinely useful for TinyML right now: the memory feasibility check. "Does this model fit on this MCU?" is a binary question that mlsysim answers correctly because it is just arithmetic -- model size vs SRAM capacity. That is actually the most common question in TinyML deployment, and getting it wrong wastes weeks of porting effort.
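The feasibility check is simple enough to show in full. This sketch is plain Python rather than mlsysim's API. The 250 KB model and 512 KB SRAM figures come from the answer above; the activation arena sizes and the function name `fits_in_sram` are assumptions added for illustration, since a real deployment also needs working memory for intermediate tensors.

```python
def fits_in_sram(model_kb: float, arena_kb: float, sram_kb: float,
                 reserved_kb: float = 0.0) -> bool:
    """True if weights, activation arena, and reserved memory fit in SRAM together."""
    return model_kb + arena_kb + reserved_kb <= sram_kb


SRAM_KB = 512    # ESP32-S3 SRAM, as quoted above
MODEL_KB = 250   # quantized model size, as quoted above

# Hypothetical activation arena sizes to show both outcomes.
fits_small_arena = fits_in_sram(MODEL_KB, arena_kb=200, sram_kb=SRAM_KB)
fits_large_arena = fits_in_sram(MODEL_KB, arena_kb=300, sram_kb=SRAM_KB)
print(fits_small_arena, fits_large_arena)
```

The second case shows why "model size vs SRAM capacity" alone can be optimistic: a model that fits by weight count can still fail once the activation arena is included.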

What needs work: per-operator latency modeling for MCUs (where there are no tensor cores and each operator type has wildly different efficiency), flash vs SRAM partitioning (MCUs often execute from flash, which is 10x slower), and DMA/interrupt overhead modeling. These are real TinyML constraints that our current single-parameter efficiency model does not capture.

If there is interest from the embedded community, I would love collaborators who can provide calibration data from MLPerf Tiny submissions.

Answer quality: 6/10

What would make it stronger: The memory feasibility argument is strong and honest. The weakness is the appeal for collaborators, which sounds like "we have not done the work." A concrete plan would help: "We are running MLPerf Tiny benchmarks on ESP32-S3 and nRF52840 this quarter and will publish calibration tables by v0.2." Even better: have one validated MCU benchmark in the calibration doc. One real number beats ten promises.


Q8: "What's your validation methodology? You calibrate eta per-benchmark."

Questioner archetype: Faculty member on a program committee. This is the question that decides whether the paper gets accepted.

The question (verbatim): "Let me make sure I understand your validation. You have six calibration points. For two of them (the CNN training cases), you set a default eta, get predictions that are 22% and 54% off, then show that per-configuration calibration brings error to 1%. For two more (LLM decode), the efficiency parameter does not even appear because the workload is memory-bound. And for the last one (GPT-3 FLOPs), it is a closed-form equation with no empirical parameter at all.

So your only genuinely predictive validation -- where eta is set in advance and the model makes a falsifiable prediction -- is... zero cases? Every case is either trivially correct (bandwidth division, closed-form FLOPs) or calibrated after the fact. How is this a validated model?"

The best honest answer:

You have identified the central methodological weakness, and I am not going to try to talk around it.

You are correct that in the current calibration document, there are zero cases where we set eta before seeing the benchmark result and then made a falsifiable prediction. The CNN cases use calibrated eta. The LLM decode cases are efficiency-insensitive. The FLOP counting is definitional.

Here is what I think the fair assessment is:

What is genuinely validated: The structural claims. The model correctly identifies that LLM decode is memory-bound (not compute-bound). It correctly identifies that the H100 speedup over A100 for decode is ~1.7x (matching the bandwidth ratio), not 3.2x (the FLOPS ratio). It correctly computes that KV-cache dominates memory at high batch sizes. These qualitative predictions are falsifiable and correct. They are also the predictions that matter most for system design -- knowing which constraint binds is more useful than knowing the exact latency.
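The structural claim can be checked with a few lines of Roofline arithmetic. This is a sketch in plain Python, not mlsysim output. The H100 bandwidth (3350 GB/s) is the figure quoted earlier in this document; the A100 bandwidth (about 2000 GB/s) and the dense FP16 peak rates (312 and 989 TFLOP/s) are rounded public datasheet values that I am assuming here, and the function names are hypothetical.

```python
def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute and memory time are equal."""
    return peak_flops_per_s / bandwidth_bytes_per_s


def decode_intensity(batch: int, bytes_per_param: float) -> float:
    """FLOP per byte of weights read: 2 FLOPs per parameter per sequence in the batch."""
    return 2.0 * batch / bytes_per_param


def binding_constraint(intensity: float, ridge: float) -> str:
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Assumed datasheet values (rounded); see the lead-in for provenance.
A100 = {"peak": 312e12, "bw": 2000e9}
H100 = {"peak": 989e12, "bw": 3350e9}

h100_ridge = ridge_point(H100["peak"], H100["bw"])
batch1_intensity = decode_intensity(batch=1, bytes_per_param=2)
regime = binding_constraint(batch1_intensity, h100_ridge)

bandwidth_ratio = H100["bw"] / A100["bw"]   # expected decode speedup
flops_ratio = H100["peak"] / A100["peak"]   # speedup if decode were compute-bound

print(f"{regime}: expect ~{bandwidth_ratio:.2f}x, not {flops_ratio:.2f}x")
```

At batch 1 the intensity is about 1 FLOP/byte against a ridge point near 300, so the memory-bound classification is not sensitive to the exact datasheet numbers chosen.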

What is NOT validated: Quantitative accuracy for compute-bound workloads with a fixed eta. The model cannot currently say "ResNet-50 on H100 will achieve X images/second" without a calibrated eta, and the calibrated eta does not transfer across hardware generations.

What we need for a strong validation: A held-out test. The methodology would be: calibrate eta on workload A (say, Llama-3-8B training on H100), then predict workload B (Llama-3-70B training on H100) using the same eta. If the model predicts within 20%, that is meaningful. If it is off by 2x, that tells us eta is workload-specific, not just hardware-specific. We have not run this experiment yet. We should, and we will before submitting the paper.
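The held-out protocol described above can be written as a short procedure. The sketch below is a hypothetical harness, not part of mlsysim: the function names are invented, "ideal" means the throughput the model predicts with eta = 1, and every number in the example call is a placeholder rather than a measurement.

```python
def calibrate_eta(measured: float, ideal: float) -> float:
    """Efficiency that makes the model reproduce one measured throughput."""
    return measured / ideal


def held_out_check(measured_a: float, ideal_a: float,
                   measured_b: float, ideal_b: float,
                   tolerance: float = 0.20) -> tuple[float, float, bool]:
    """Calibrate on workload A, predict workload B, report the transfer error.

    Returns (eta, relative error on B, whether the error is within tolerance).
    """
    eta = calibrate_eta(measured_a, ideal_a)
    predicted_b = eta * ideal_b
    rel_error = abs(predicted_b - measured_b) / measured_b
    return eta, rel_error, rel_error <= tolerance


# Placeholder values only, chosen to exercise the code path.
eta, rel_error, passed = held_out_check(
    measured_a=400.0, ideal_a=1000.0,
    measured_b=35.0, ideal_b=100.0,
)
print(f"eta={eta:.2f}, held-out error={rel_error:.1%}, within 20%: {passed}")
```

The important property is ordering: eta is fixed from workload A before workload B's measurement enters the calculation, which is what makes the prediction falsifiable.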

The broader intellectual claim is not "we can predict exact performance." It is "we can systematically identify binding constraints across 22 walls using a common analytical framework." That claim is validated by the structural results. But I acknowledge that the quantitative accuracy claim is currently undersupported.

Answer quality: 9/10

What would make it stronger: This is the right answer for an academic audience. The only improvement is having the held-out experiment done before the tutorial. Running Llama-3-8B and Llama-3-70B on the same hardware with a shared eta, then reporting the cross-workload transfer error, would cost roughly $50 in cloud compute and would either validate or refute the model's utility for quantitative prediction. Not having done this before presenting at ISCA is a significant omission. The structural validation (correct bottleneck identification) is the real value proposition and should be foregrounded in the paper.


Part 2: Hallway Conversations


Conversation A: Two PhD Students Debating Whether to Use mlsysim

Setting: Coffee break, 10:35 AM, after the Roofline module.

Characters:

  • Priya -- 3rd year, systems/architecture, working on communication-efficient distributed training. Uses ASTRA-sim for network simulation.
  • Marcus -- 2nd year, ML/NLP, working on efficient inference for long-context LLMs. Has never used a performance simulator.

Marcus: That batch-size sweep exercise was actually useful. I have been fighting with KV-cache OOM for weeks and I never sat down to do the arithmetic. It took like 30 seconds in their tool.

Priya: Sure, but that is literally dividing bytes by capacity. I could do that on a napkin.

Marcus: You could, but you would not. That is the point. I have been running profilers and reading CUDA traces trying to figure out why I OOM at batch 48 on an A100. Turns out 8B parameters at FP16 is 16 GB, and at batch 48 with 4K context the KV-cache is another 48 GB. That is 64 GB on an 80 GB card before activations and allocator fragmentation, so there was never much headroom. The arithmetic was always there. I just never did it.

Priya: OK, fair. But my work is on communication. The AllReduce model they showed is textbook ring AllReduce. Real systems use hierarchical AllReduce with NVLink within the node and InfiniBand across nodes. ASTRA-sim models that at the packet level. This tool gives me one number.

Marcus: Do you always need packet-level simulation though? Like, for your paper, sure. But when you are writing your NSF proposal and you need to say "we will need 128 GPUs for this experiment," do you fire up ASTRA-sim?

Priya: ...No. I use a spreadsheet.

Marcus: Right. So their tool is a better spreadsheet. That is the pitch. It is not replacing ASTRA-sim for your research. It is replacing your spreadsheet for your capacity planning.

Priya: The efficiency parameter bugs me though. Did you catch the calibration numbers? ResNet-50 gets eta=0.13 on A100 and eta=0.065 on H100. That is a 2x difference for the same workload. If I use this for my proposal and pick the wrong eta, I am off by 2x on my GPU estimate, which means I am asking NSF for twice too many or twice too few GPUs.

Marcus: Yeah, that is a real problem. For inference it is less of an issue because decode is memory-bound and eta drops out. But for training... I do not know. You would have to calibrate it yourself.

Priya: Which means I need the hardware already. Chicken and egg.

Marcus: I think the move is: use it for feasibility ("does the model fit? am I compute-bound or memory-bound?") and ranking ("is MI300X better than H100 for my workload?"). Do not use it for exact throughput prediction unless you have a calibrated eta.

Priya: That is a narrow use case.

Marcus: For you, yes. For me? Every question I have right now is a feasibility or ranking question. "Can I serve 70B at batch 64 on one H100?" "Should I quantize to INT4 or use tensor parallelism?" "Is it worth trying MI300X for its 192 GB HBM?" Those are the questions keeping me up at night, and this tool answers all of them in under a second.

Priya: Fine. I will install it. But if my proposal gets rejected because the GPU count was wrong, I am blaming you.

Marcus: [laughs] Blame the efficiency parameter. At least you will know which wall to hit.

Verdict: Marcus will use mlsysim regularly. Priya will install it, use it twice for capacity estimates in grant proposals, and go back to ASTRA-sim for her actual research. This is the correct adoption pattern -- the tool serves Marcus's needs well and Priya's needs partially.
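Marcus's KV-cache arithmetic in Conversation A can be sketched as follows. The conversation does not name a model, so the layer count, KV-head count, and head dimension below are hypothetical values chosen so the result lands near the 48 GB he quotes. Models that use grouped-query attention with fewer KV heads will come out proportionally smaller. The function name is also an assumption.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int,
                   n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size in bytes; the leading 2 accounts for keys and values."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


GIB = 2**30

# Hypothetical 8B-class shape: 32 layers, 16 KV heads of dimension 128, FP16 cache.
kv_bytes = kv_cache_bytes(batch=48, seq_len=4096, n_layers=32,
                          n_kv_heads=16, head_dim=128)
weight_bytes = int(8e9) * 2  # 8B parameters at FP16

kv_gib = kv_bytes / GIB
total_gb = (kv_bytes + weight_bytes) / 1e9

print(f"KV-cache: {kv_gib:.0f} GiB, weights + KV: {total_gb:.1f} GB")
```

Note that the cache comes out as a round number in GiB but not in GB, which is the same unit mismatch the tutorial warns about in Q1.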


Conversation B: The AMD Engineer and the Intel Engineer Comparing Notes

Setting: Lunch, standing near the "Wall of Walls" sticky-note board.

Characters:

  • Rajan -- Senior performance architect at AMD, works on MI300X benchmarking and competitive analysis. Noticed mlsysim includes MI250X and MI300X.
  • Katharina -- Software engineer at Intel, works on Gaudi accelerator software stack. Noticed mlsysim includes Gaudi 2 and Gaudi 3.

Rajan: Did you see they have MI300X in there? I checked their bandwidth number -- 5.3 TB/s. That matches our datasheet.

Katharina: They have Gaudi 2 and Gaudi 3 as well. The FLOPS numbers look correct. I am less sure about the memory bandwidth -- Gaudi's memory subsystem is different from what you would infer from a single bandwidth number.

Rajan: That is my concern too. The roofline model assumes one bandwidth number and one compute number. MI300X has an interesting memory hierarchy -- the HBM3 bandwidth is 5.3 TB/s, but the Infinity Fabric between the chiplets has its own bandwidth characteristics. For workloads that fit in one chiplet's local HBM, you get the full bandwidth. For workloads that span chiplets, you get less.

Katharina: Same with Gaudi. The on-die SRAM is a critical tier that the roofline misses entirely. Gaudi's differentiation is the large SRAM that keeps activations close to compute. A single bandwidth number averaging across the memory hierarchy undersells us.

Rajan: So both of us have the same complaint: the model is too simple for our hardware's memory hierarchy.

Katharina: Yes. But I will say this -- the fact that our hardware is in there at all is progress. Most open tools are NVIDIA-only. When a customer asks "should I use H100 or MI300X or Gaudi 3," the default answer is "run MLPerf." If this tool gives a directionally correct ranking with sub-second latency, that is useful even if the absolute numbers are off.

Rajan: Directionally correct is the key phrase. Let me check something. [pulls out laptop] OK, I ran Llama-3-70B inference on MI300X at batch 1. Their model says ITL = 2.3 ms. Our internal benchmarks with ROCm show 3-5 ms depending on the framework. So they are within 2x, on the optimistic side.

Katharina: Optimistic is dangerous for competitive analysis. If a customer uses this tool and it says MI300X is faster than Gaudi 3 for their workload, but the prediction is optimistic for MI300X and pessimistic for Gaudi, we lose a sale based on a modeling artifact.

Rajan: Or vice versa. The bias matters.

Katharina: I think the play for both of us is to contribute calibrated efficiency values. If they have a TraceableConstant system where every number has a source, we can submit official numbers from our benchmark teams. Then at least the hardware specs are accurate and traceable to us.

Rajan: That is smart. Control the inputs, and the model works in our favor. Or at least does not work against us.

Katharina: The question is whether they accept vendor-submitted numbers. There is an obvious conflict of interest.

Rajan: The SPEC benchmark organization handles this with disclosure rules. You can submit your own numbers, but the methodology must be public and reproducible. If mlsysim adopted something similar...

Katharina: That would actually be useful for the industry. A neutral analytical framework with vendor-contributed, auditable hardware specs. Like a TPC for ML hardware.

Rajan: That is a much bigger project than what they showed today.

Katharina: Agreed. But the infrastructure -- unit-safe constants with provenance tracking -- is the right foundation. The question is whether they execute on it.

Rajan: Let us talk to them after the afternoon session. I want to understand the contribution model.

Verdict: Both engineers see potential value but have legitimate concerns about accuracy bias across vendors. The most likely outcome: one or both vendors contribute hardware specs to the project within 6 months, but only if the contribution model is clear and the numbers are auditable. The "neutral analytical framework" vision is compelling but requires governance that does not exist yet.


Conversation C: The Faculty Member and the Startup CTO

Setting: 4:50 PM, packing up after the closing. They were seated near each other during the capstone.

Characters:

  • Professor Chen -- Teaches ML Systems at a large state university. Has 150 students per semester. Currently uses ad-hoc Jupyter notebooks for homework assignments.
  • Diego -- CTO of a 40-person startup doing LLM-based document processing. Running inference on a mix of A100s and H100s in AWS. Attended the tutorial because he is about to make a $2M hardware purchasing decision.

Professor Chen: What did you think of the capstone?

Diego: The capstone was the best part. It was the first time today where all the pieces came together. The individual exercises were useful but felt like textbook problems. The capstone felt like my actual job.

Professor Chen: That is exactly why I am here. I want to redesign my ML Systems course around something like this. Right now my students do profiling labs with PyTorch and CUDA, but they never step back and think about the system as a whole. They optimize one kernel and think they have solved the problem.

Diego: So you would use mlsysim as a teaching tool?

Professor Chen: As the backbone of the course, potentially. The 22 walls taxonomy is a natural syllabus. Week 1: Compute Wall. Week 2: Memory Wall. Week 3: Software Wall. And so on. Each week, students use the tool to explore one wall, then do a real profiling lab to validate the model.

Diego: The "validate the model" part is key. If students just trust the analytical model, they learn the wrong lesson. The model is useful precisely because it is wrong in interesting ways. The gap between the model and reality IS the systems engineering.

Professor Chen: Exactly. I would have them measure eta themselves. "Run ResNet-50 on the department A100. Measure throughput. Back-calculate eta. Now predict what happens on H100. Next week, you will get H100 time and check your prediction."
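The assignment Chen describes reduces to one division and one multiplication. A minimal sketch, assuming eta means achieved FLOP/s divided by peak FLOP/s; the function names, the measured throughput, and the per-image FLOP count are placeholders for illustration, not mlsysim's API and not measured data. The peak figures are dense FP16 tensor datasheet values and should be confirmed against the vendor specification for the exact part.

```python
def back_calculate_eta(images_per_s, flops_per_image, peak_flops_per_s):
    """Fraction of peak compute actually achieved (eta)."""
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return images_per_s * flops_per_image / peak_flops_per_s


def predict_throughput(eta, flops_per_image, peak_flops_per_s):
    """Images/s predicted on another device if eta carried over unchanged."""
    return eta * peak_flops_per_s / flops_per_image


# Placeholder inputs: replace with the student's own measurements.
measured_a100_images_per_s = 2500.0   # hypothetical measurement
resnet50_flops_per_image = 8.2e9      # assumed forward-pass FLOP count
a100_peak_flops_per_s = 312e12        # assumed datasheet value, dense FP16
h100_peak_flops_per_s = 989e12        # assumed datasheet value, dense FP16

eta_a100 = back_calculate_eta(
    measured_a100_images_per_s, resnet50_flops_per_image, a100_peak_flops_per_s
)
predicted_h100_images_per_s = predict_throughput(
    eta_a100, resnet50_flops_per_image, h100_peak_flops_per_s
)

print(f"eta on A100: {eta_a100:.3f}")
print(f"naive H100 prediction: {predicted_h100_images_per_s:.0f} images/s")
```

The prediction step deliberately assumes that eta transfers across GPU generations. The gap between this naive number and the following week's H100 measurement is the point of the exercise.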

Diego: That is a great assignment. Wish I had taken that class. What is your main concern?

Professor Chen: Maturity. I cannot build my course around a v0.1.0 tool that might break or change APIs between semesters. I have 150 students and 2 TAs. If Engine.solve() changes its signature, that is 150 broken notebooks and a week of debugging instead of teaching.

Diego: That is a real risk. What about for my use case? I need to decide between renewing our A100 instances or migrating to H100 or Trainium2. The cost difference over 3 years is about $1.5M.

Professor Chen: Did the tool help you with that?

Diego: Partially. The feasibility analysis is immediate -- I now know that our 70B model in FP16 does not fit on one H100, which means tensor parallelism or quantization regardless of the platform. That alone saved me from a mistake. I was about to price out single-GPU instances.
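Diego's feasibility result follows from weight memory alone. A minimal sketch in plain Python, assuming an 80 GB H100 variant and counting only model weights; the helper names are hypothetical and this is not mlsysim's solver.

```python
import math

GB = 10**9  # decimal gigabytes, as used on datasheets


def weight_bytes(n_params, bytes_per_param):
    """Memory needed to hold the model weights only."""
    return n_params * bytes_per_param


def min_devices_for_weights(n_params, bytes_per_param, device_mem_bytes):
    """Lower bound on device count; ignores KV cache and activations."""
    return math.ceil(weight_bytes(n_params, bytes_per_param) / device_mem_bytes)


n_params = 70 * 10**9        # 70B-parameter model
h100_mem_bytes = 80 * GB     # assumption: 80 GB H100 variant

fp16_bytes = weight_bytes(n_params, 2)      # FP16: 2 bytes per parameter
int4_bytes = weight_bytes(n_params, 0.5)    # 4-bit: 0.5 bytes per parameter

fits_fp16 = fp16_bytes <= h100_mem_bytes
fits_int4 = int4_bytes <= h100_mem_bytes
min_h100s_fp16 = min_devices_for_weights(n_params, 2, h100_mem_bytes)

print(f"FP16 weights: {fp16_bytes / GB:.0f} GB, fits on one H100: {fits_fp16}")
print(f"INT4 weights: {int4_bytes / GB:.0f} GB, fits on one H100: {fits_int4}")
print(f"Minimum H100s for FP16 weights: {min_h100s_fp16}")
```

Because the KV cache and activations are excluded, the device count is a floor, not a sizing recommendation. It is enough to rule out single-GPU FP16 instances, which is the mistake Diego avoided.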

Professor Chen: And the cost modeling?

Diego: Rougher. The TCO calculator gives me directional answers, but I need exact numbers for a board presentation. I still have to get actual quotes from AWS and run actual benchmarks. But here is the thing -- now I know which benchmarks to run. Before today, I would have benchmarked every model on every instance type at every batch size. That is a $50K benchmarking bill. Now I know to benchmark only the three configurations that the model says are in the right ballpark. That probably saves me $40K.
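Diego's savings estimate can be checked with a few lines of arithmetic. A minimal sketch, assuming every configuration costs the same to benchmark and taking the figure of 20 candidate configurations from the verdict for this conversation; the function name is hypothetical.

```python
def focused_benchmark_cost(full_sweep_cost, total_configs, shortlisted_configs):
    """Return (cost of the shortlist, savings versus the full sweep).

    Assumes a uniform cost per configuration, which real sweeps rarely have.
    """
    if total_configs <= 0 or not 0 <= shortlisted_configs <= total_configs:
        raise ValueError("need 0 <= shortlisted_configs <= total_configs, total > 0")
    cost_per_config = full_sweep_cost / total_configs
    focused_cost = cost_per_config * shortlisted_configs
    return focused_cost, full_sweep_cost - focused_cost


focused_cost, savings = focused_benchmark_cost(
    full_sweep_cost=50_000,    # Diego's estimate for the exhaustive sweep
    total_configs=20,          # from the verdict: 20 configurations
    shortlisted_configs=3,     # configurations the model flags as plausible
)

print(f"Focused benchmarking cost: ${focused_cost:,.0f}")
print(f"Savings versus full sweep: ${savings:,.0f}")
```

Under the uniform-cost assumption the savings come to $42,500, consistent with Diego's "probably saves me $40K."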

Professor Chen: That is the real value proposition. Not "replaces benchmarking" but "focuses benchmarking."

Diego: Right. And for the carbon constraint -- my board just added a sustainability requirement. The geography exercise was an eye-opener. I had no idea the grid intensity variation was that large. I am going to move our training jobs to Quebec. That is a free win.

Professor Chen: You and every other company that does that exercise. Watch Quebec's grid get overloaded in two years.

Diego: [laughs] The tragedy of the commons, modeled analytically.

Professor Chen: Let me ask you something as an industry person. Is the 22-wall framework actually useful for practitioners, or is it an academic taxonomy?

Diego: Some of the 22 walls matter a lot. Compute, Memory, Communication, Capital -- those are my daily constraints. Sustainability is becoming one. Serving, Batching -- absolutely. But "Reasoning Wall" (inference-time compute), "Sensitivity Wall" (partial derivatives) -- those feel like textbook walls, not engineering walls. I would never say to my team "we are hitting the Sensitivity Wall."

Professor Chen: That is useful feedback. The framework might benefit from a "practitioner's top 10" versus the full 22.

Diego: Or just clear tiers. Tier 1: walls that determine your architecture (Compute, Memory, Communication, Capital). Tier 2: walls that affect your optimization (Software, Serving, Batching, Compression). Tier 3: walls that matter for specific contexts (Carbon, Safety, TinyML, Multi-tenant). The full 22 is great for a course. For a startup CTO, I need the top 8.

Professor Chen: I am going to steal that for my syllabus. Tier 1 in the first half of the semester, Tier 2 in the second half, Tier 3 as optional projects.

Diego: Send me the syllabus. I will send you interns.

Professor Chen: Deal.

Verdict: Professor Chen will adopt mlsysim for teaching if and only if the API stabilizes by fall 2026 and there are pre-built assignment notebooks. Diego will use it this week to narrow his benchmarking from 20 configurations to 3, potentially saving $40K. He will not use it for final purchasing decisions -- those still require real benchmarks and real quotes. Both see the 22-wall taxonomy as pedagogically strong but operationally over-broad, and both independently converge on the idea that a tiered subset would increase practical adoption.


Summary: Strength and Weakness Assessment

Where the tutorial is strong

| Strength | Evidence |
|---|---|
| The "aha moments" work | The H100-is-only-1.7x-faster exercise consistently surprises even experienced engineers. The KV-cache OOM exercise solves a real problem people have. |
| Honest positioning | The slides explicitly state "2-5x accuracy" and "not a replacement for benchmarking." This disarms skeptics. |
| The CPI analogy is intellectually sound | It is the correct framing and maps to something the ISCA audience already understands. |
| Multi-vendor coverage is a real differentiator | No other open analytical tool covers NVIDIA, AMD, Intel, Google, and Cerebras. |
| The capstone design challenge works | It synthesizes all the concepts and feels like a real engineering problem. |

Where the tutorial is vulnerable

| Vulnerability | Severity | Mitigation needed |
|---|---|---|
| Zero held-out validation experiments | Critical | Run cross-workload eta transfer experiments before ISCA. This is a $50 experiment that determines paper acceptance. |
| MoE models not supported | High | At minimum, provide a documented workaround notebook. Better: native MoEModel class. |
| Diffusion models not supported | High | Add one diffusion benchmark to the calibration table. |
| TinyML claims exceed evidence | Medium | Run one MLPerf Tiny benchmark against predictions. |
| Eta does not transfer across GPU generations | Medium | Document this as a known limitation prominently, not buried in the calibration doc. Frame it as a research contribution opportunity. |
| 22 walls claim vs actual solver coverage | Medium | Audit which walls are actually wired into Pipeline.solve() vs which are standalone calculations. Be precise in slides. |
| API stability for course adoption | Medium | Commit to a stable v1.0 API freeze by a specific date. |
| No governance model for vendor-contributed specs | Low | Define a contribution policy before AMD and Intel contribute numbers. |

The adoption decision matrix

| Persona | Will adopt? | Why / Why not |
|---|---|---|
| PhD student (systems) | Maybe | Good for capacity planning in proposals. Not deep enough for research. |
| PhD student (ML) | Yes | Solves real "why is my training slow" questions immediately. |
| Industry engineer (NVIDIA) | No | Has internal tools that are better. |
| Industry engineer (AMD/Intel) | Interested | Wants to contribute specs to ensure fair competitive comparison. |
| Faculty (ML Systems course) | Yes, if API stabilizes | The 22-wall taxonomy is a ready-made syllabus. |
| Startup CTO | Yes, for scoping | "Focuses benchmarking" is the real value. Saves $40K on unnecessary benchmarks. |
| Startup CTO (final decision) | No | Final purchasing decisions require real benchmarks and real quotes. |