Benchmarking Checklist Thoughts (Jeff) #326
Originally created by @18jeffreyma on GitHub (Feb 3, 2025).
Originally assigned to: @profvjreddi on GitHub.
Benchmarking Thoughts
12.1
12.2.1
Love the 3DMark reference (I remember using this when I built my first desktop computer).
12.3
12.3.{1, 2, 3}
12.4
https://arxiv.org/pdf/2202.02842 is a good paper to cite to explain why metric choice matters and why you should choose a metric that correlates with your actual task.
Also probably worth mentioning techniques like containerization (Docker) here.
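To make that point concrete, the chapter could include something like this rank-correlation check. This is just a sketch with hypothetical scores; the question it illustrates is "does my proxy metric rank systems the same way the real task does?":

```python
# Sketch: checking whether a candidate benchmark metric ranks models the
# same way the true downstream task does. All scores are hypothetical.
from scipy.stats import spearmanr

# Per-model scores on a cheap proxy metric (e.g., perplexity, negated so
# higher is better) and on the expensive "true" task (e.g., human eval).
proxy_scores = [-12.1, -10.4, -9.8, -9.1, -8.7]  # hypothetical
task_scores = [0.41, 0.55, 0.52, 0.63, 0.71]     # hypothetical

rho, p_value = spearmanr(proxy_scores, task_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# High rank correlation suggests the proxy is a reasonable stand-in for
# the task; low correlation means optimizing the metric may not improve
# what you actually care about.
```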
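A minimal sketch of what that could look like in a benchmark harness (the image tag and benchmark script below are hypothetical placeholders):

```python
# Sketch: running each benchmark trial inside a pinned Docker image so the
# software environment is identical across machines and over time.
import subprocess

IMAGE = "mlbench/resnet50:cuda12.1-v1.0"  # hypothetical pinned image tag

def run_trial(batch_size: int) -> str:
    """Run one benchmark trial in a fresh container and return its stdout."""
    result = subprocess.run(
        [
            "docker", "run", "--rm", "--gpus", "all",
            IMAGE,
            "python", "benchmark.py", f"--batch-size={batch_size}",
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(run_trial(batch_size=32))
```

Pinning the image means every run sees the same framework, libraries, and userspace driver stack, which removes a whole class of "works on my machine" benchmark noise.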
Other thoughts: I think 12.4 in general feels a bit too big: what do you think about splitting out 12.4.7 and 12.4.8 (maybe plus 12.5) into a "Key Benchmark Considerations" section or similar?
I feel like 12.4 should focus on "what should you include in a benchmark?" and the section after it should be "how should you interpret and evaluate your benchmark once it's created?"
12.5
Really love this section (great figures)
12.6
For Fault Tolerance, you can refer them to https://www.adept.ai/blog/sherlock-sdc and the Llama 3 training paper (covered in https://www.datacenterdynamics.com/en/news/meta-report-details-hundreds-of-gpu-and-hbm3-related-interruptions-to-llama-3-training-run/ ).
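The core silent-data-corruption (SDC) screening idea is simple enough to sketch. This is just the shape of it; real fleet-scale checkers like the ones in those posts run known-answer kernels across accelerators at scale:

```python
# Sketch: run a deterministic computation twice and compare fingerprints.
# A mismatch on the same host indicates possible silent data corruption.
import hashlib
import numpy as np

def fingerprint_matmul(seed: int) -> str:
    """Deterministic matmul whose result we can fingerprint."""
    rng = np.random.default_rng(seed)  # fixed seed -> fixed inputs
    a = rng.standard_normal((256, 256))
    b = rng.standard_normal((256, 256))
    c = a @ b
    return hashlib.sha256(c.tobytes()).hexdigest()

first = fingerprint_matmul(seed=0)
second = fingerprint_matmul(seed=0)
if first != second:
    print("Mismatch: possible silent data corruption on this host.")
else:
    print("Results agree; no SDC detected by this probe.")
```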
12.7
https://arxiv.org/abs/2407.03211
https://arxiv.org/pdf/2212.09720
I’ll take a read of the papers.
12.8
In my mind it would go like this: bring in our earlier discussion of E2E benchmarking and discuss how datacenter-level benchmarking is (close to) the truly final E2E benchmark.
Discuss how datacenters are very heterogeneous, but power unifies them in terms of efficiency for a fixed workload.
Enter MLPerf Power as an example benchmark!
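A toy version of that power-normalized calculation could help here (made-up numbers; trapezoidal integration of sampled wall power over a fixed workload):

```python
# Sketch: integrate sampled wall power over a fixed workload to get energy,
# then report work done per joule. Numbers are made up for illustration.
import numpy as np

timestamps_s = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])        # sample times (s)
power_w = np.array([310.0, 480.0, 495.0, 490.0, 470.0, 320.0])  # watts per sample
queries_done = 12_000                                           # fixed workload size

# Trapezoidal rule: average adjacent power samples, weight by time step.
energy_j = float(np.sum(np.diff(timestamps_s) * (power_w[:-1] + power_w[1:]) / 2))

print(f"Energy: {energy_j:.0f} J")
print(f"Efficiency: {queries_done / energy_j:.2f} queries/J")
```

Two very different systems (different chips, node counts, cooling) become directly comparable once you fix the workload and divide by measured energy.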
Otherwise good diagrams here, learned a ton on this :-)
12.9
An example of good benchmark-to-task correlation might be LMSYS Chatbot Arena, which grades LLMs based on real user usage while also being largely unhackable (a sketch of its pairwise-rating idea follows these notes).
https://en.wikipedia.org/wiki/Goodhart%27s_law
Many "bad" benchmarks these days are bad because people over-optimized for them instead of optimizing for true task performance.
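For reference, the pairwise-rating idea behind Arena-style leaderboards is roughly Elo: each human vote between two anonymous models nudges their ratings. This is a hedged sketch with hypothetical models and votes (Arena's actual statistical model has since evolved):

```python
# Sketch: Elo-style rating updates from pairwise human votes.
K = 32  # update step size

def expected(r_a: float, r_b: float) -> float:
    """Probability model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """Return new ratings for A and B after one pairwise vote."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1 - s_a) - (1 - e_a))

ratings = {"model_x": 1000.0, "model_y": 1000.0}  # hypothetical models
votes = [("model_x", "model_y", True),            # hypothetical votes
         ("model_x", "model_y", False),
         ("model_x", "model_y", True)]
for a, b, a_won in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_won)
print(ratings)
```

Because the "metric" here is aggregated live human preference on real queries, there is no fixed test set to overfit, which is exactly the Goodhart's-law resistance worth highlighting.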
Other notes
My feeling is that after 12.8, the organization gets a bit chaotic (unlike the structure of the previous sections, with clear general benchmark components, inference benchmarks, training benchmarks, etc.): maybe let's brainstorm Thursday to figure this out :-)
@profvjreddi commented on GitHub (Aug 23, 2025):
Resolved - Benchmarking checklist implemented