Benchmarking Checklist Thoughts (Jeff) #326
Originally created by @18jeffreyma on GitHub (Feb 3, 2025).
Originally assigned to: @profvjreddi on GitHub.
Benchmarking Thoughts
12.1
12.2.1
Love the 3DMark reference (I remember using this when I built my first desktop computer).
12.3
12.3.{1, 2, 3}
12.4
https://arxiv.org/pdf/2202.02842 is a good paper to cite to explain why metric choice matters and why you should choose a metric that correlates with your actual task.
Also probably worth mentioning techniques like containerization (Docker) here.
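To make that point concrete, the chapter could include something like this rank-correlation check. This is just a sketch with hypothetical scores; the question it illustrates is "does my proxy metric rank systems the same way the real task does?":

```python
# Sketch: checking whether a candidate benchmark metric ranks models the
# same way the true downstream task does. All scores are hypothetical.
from scipy.stats import spearmanr

# Per-model scores on a cheap proxy metric (e.g., perplexity, negated so
# higher is better) and on the expensive "true" task (e.g., human eval).
proxy_scores = [-12.1, -10.4, -9.8, -9.1, -8.7]  # hypothetical
task_scores = [0.41, 0.55, 0.52, 0.63, 0.71]     # hypothetical

rho, p_value = spearmanr(proxy_scores, task_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# High rank correlation suggests the proxy is a reasonable stand-in for
# the task; low correlation means optimizing the metric may not improve
# what you actually care about.
```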
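A minimal sketch of what that could look like in a benchmark harness (the image tag and benchmark script below are hypothetical placeholders):

```python
# Sketch: running each benchmark trial inside a pinned Docker image so the
# software environment is identical across machines and over time.
import subprocess

IMAGE = "mlbench/resnet50:cuda12.1-v1.0"  # hypothetical pinned image tag

def run_trial(batch_size: int) -> str:
    """Run one benchmark trial in a fresh container and return its stdout."""
    result = subprocess.run(
        [
            "docker", "run", "--rm", "--gpus", "all",
            IMAGE,
            "python", "benchmark.py", f"--batch-size={batch_size}",
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(run_trial(batch_size=32))
```

Pinning the image means every run sees the same framework, libraries, and userspace driver stack, which removes a whole class of "works on my machine" benchmark noise.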
Other thoughts: I think 12.4 in general feels a bit too big: what do you think about splitting out 12.4.7 and 12.4.8 (maybe plus 12.5) into a "Key Benchmark Considerations" section or similar?
I feel like 12.4 should focus on "what should you include in a benchmark?" and the section after it should be "how should you interpret and evaluate your benchmark once it's created?"
12.5
Really love this section (great figures)
12.6
For Fault Tolerance, you can refer them to https://www.adept.ai/blog/sherlock-sdc and the Llama 3 training paper (covered in https://www.datacenterdynamics.com/en/news/meta-report-details-hundreds-of-gpu-and-hbm3-related-interruptions-to-llama-3-training-run/ ).
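The core silent-data-corruption (SDC) screening idea is simple enough to sketch. This is just the shape of it; real fleet-scale checkers like the ones in those posts run known-answer kernels across accelerators at scale:

```python
# Sketch: run a deterministic computation twice and compare fingerprints.
# A mismatch on the same host indicates possible silent data corruption.
import hashlib
import numpy as np

def fingerprint_matmul(seed: int) -> str:
    """Deterministic matmul whose result we can fingerprint."""
    rng = np.random.default_rng(seed)  # fixed seed -> fixed inputs
    a = rng.standard_normal((256, 256))
    b = rng.standard_normal((256, 256))
    c = a @ b
    return hashlib.sha256(c.tobytes()).hexdigest()

first = fingerprint_matmul(seed=0)
second = fingerprint_matmul(seed=0)
if first != second:
    print("Mismatch: possible silent data corruption on this host.")
else:
    print("Results agree; no SDC detected by this probe.")
```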
12.7
https://arxiv.org/abs/2407.03211
https://arxiv.org/pdf/2212.09720
I’ll take a read of the papers.
12.8
In my mind it would go like this: bring in our earlier discussion of E2E benchmarking and discuss how datacenter-level benchmarking is (close to) the truly final E2E benchmark.
Discuss how datacenters are very heterogeneous, but power unifies them in terms of efficiency for a fixed workload.
Enter MLPerf Power as an example benchmark!
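A toy version of that power-normalized calculation could help here (made-up numbers; trapezoidal integration of sampled wall power over a fixed workload):

```python
# Sketch: integrate sampled wall power over a fixed workload to get energy,
# then report work done per joule. Numbers are made up for illustration.
import numpy as np

timestamps_s = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])        # sample times (s)
power_w = np.array([310.0, 480.0, 495.0, 490.0, 470.0, 320.0])  # watts per sample
queries_done = 12_000                                           # fixed workload size

# Trapezoidal rule: average adjacent power samples, weight by time step.
energy_j = float(np.sum(np.diff(timestamps_s) * (power_w[:-1] + power_w[1:]) / 2))

print(f"Energy: {energy_j:.0f} J")
print(f"Efficiency: {queries_done / energy_j:.2f} queries/J")
```

Two very different systems (different chips, node counts, cooling) become directly comparable once you fix the workload and divide by measured energy.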
Otherwise good diagrams here, learned a ton on this :-)
12.9
An example of good benchmark-to-task correlation might be LMSYS Chatbot Arena, which grades LLMs based on real user usage while also being largely unhackable (a sketch of its pairwise-rating idea follows these notes).
https://en.wikipedia.org/wiki/Goodhart%27s_law
Many "bad" benchmarks these days are bad because people over-optimized for them instead of optimizing for true task performance.
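For reference, the pairwise-rating idea behind Arena-style leaderboards is roughly Elo: each human vote between two anonymous models nudges their ratings. This is a hedged sketch with hypothetical models and votes (Arena's actual statistical model has since evolved):

```python
# Sketch: Elo-style rating updates from pairwise human votes.
K = 32  # update step size

def expected(r_a: float, r_b: float) -> float:
    """Probability model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """Return new ratings for A and B after one pairwise vote."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1 - s_a) - (1 - e_a))

ratings = {"model_x": 1000.0, "model_y": 1000.0}  # hypothetical models
votes = [("model_x", "model_y", True),            # hypothetical votes
         ("model_x", "model_y", False),
         ("model_x", "model_y", True)]
for a, b, a_won in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_won)
print(ratings)
```

Because the "metric" here is aggregated live human preference on real queries, there is no fixed test set to overfit, which is exactly the Goodhart's-law resistance worth highlighting.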
Other notes
My feeling is that after 12.8, the organization gets a bit chaotic (unlike the structure of the previous sections, with clear general benchmark components, inference benchmarks, training benchmarks, etc.): maybe let's brainstorm Thursday to figure this out :-)
@profvjreddi commented on GitHub (Aug 23, 2025):
Resolved - Benchmarking checklist implemented