Benchmarking visualization additions (Arya + Andy) #328

Closed
opened 2026-03-22 15:36:46 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @aryatschand on GitHub (Feb 6, 2025).

Originally assigned to: @profvjreddi on GitHub.

# 12.3

- [x] Sub-chapter on challenges: model diversity
- [x] Sub-chapter on challenges: system measurements are hard, different scales; keep this at a high level

# 12.4

- [x] Graph with all the components, showing their relationships

![Image](https://github.com/user-attachments/assets/6d5d0ab9-963d-426a-9176-6c0cddf41b59)

# 12.7

- [ ] Figure comparing the considerations and metrics of inference vs. training (find in MLPerf slides)

# 12.7.4

- [x] Comparison between MLPerf benchmarks (model size vs. accuracy, performance, energy, etc.?)

![Image](https://github.com/user-attachments/assets/fed436c2-705c-4f52-918e-32b86e7e5c97)

# 12.8

- [ ] Include MLPerf Power trends

![Image](https://github.com/user-attachments/assets/a5f01c7f-a960-48f0-95dc-9b277b90c78d)

- [ ] Maybe? Shows how MACs vs. energy efficiency scales proportionally across workload sizes and across system scales

![Image](https://github.com/user-attachments/assets/db8fb0d8-ec79-41a6-a2fe-0f25ed32904e)
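To make the "MACs vs. energy efficiency" relationship above concrete in the text, a short sketch could be included. All workload names and numbers below are hypothetical placeholders, not MLPerf Power data; the point is only how the metric (operations per joule) is computed across system scales.

```python
# Hypothetical workloads at different system scales (made-up values,
# NOT MLPerf results): (name, total MACs, measured energy in joules).
workloads = [
    ("tiny",       2.0e9,   0.5),
    ("mobile",     5.7e10,  12.0),
    ("edge",       4.1e12,  800.0),
    ("datacenter", 9.0e14,  1.6e5),
]

def macs_per_joule(macs: float, joules: float) -> float:
    """Energy efficiency: multiply-accumulate operations per joule."""
    return macs / joules

for name, macs, joules in workloads:
    print(f"{name:>10}: {macs_per_joule(macs, joules):.2e} MACs/J")
```

Plotting this metric against workload size would illustrate the proportional scaling the bullet describes.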

# 12.9.3

- [ ] Do we want to show performance measurement methodology at different scales (similar to the MLPerf Power graph)?
- [x] Discuss how LLMs can train on open-source benchmarks and “cheat”
- [x] https://arxiv.org/pdf/2404.18824
- [ ] Reproducibility and cherry-picking: report the range of the hyperparameter search (error bars) instead of just the best run
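The cherry-picking point could be illustrated with a minimal sketch. The scores below are hypothetical stand-ins for a hyperparameter search; the contrast is between reporting only the best run and reporting the mean with a spread (error bars).

```python
import statistics

# Hypothetical accuracy scores from a hyperparameter search (made-up numbers).
trial_scores = [0.71, 0.74, 0.69, 0.77, 0.72, 0.75]

best = max(trial_scores)              # what a cherry-picked report shows
mean = statistics.mean(trial_scores)  # what an honest report centers on
std = statistics.stdev(trial_scores)  # spread of the search -> error bars

print(f"cherry-picked: {best:.3f}")
print(f"honest: {mean:.3f} +/- {std:.3f} (n={len(trial_scores)})")
```

Reporting `mean +/- std` (or a min-max range) over all trials is exactly the error-bar style the bullet asks for.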

# 12.9.4

- [ ] Add evolving benchmarks

![Image](https://github.com/user-attachments/assets/d6ce2970-58b3-4439-b836-feaa377ee60a)

- [ ] Discuss how LLM benchmarks are evolving (FrontierMath, Humanity's Last Exam) and show how new models are climbing on them
- [ ] Feedback loop graph (development cycle)
- [ ] High-level goal → benchmark → submissions → higher scores → new high-level goal + new benchmarks → repeat
- [ ] The good, the bad, the ugly
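The feedback loop above could also be shown as a toy simulation alongside the graph. Everything here is hypothetical (the function name, the score dynamics, the saturation threshold); it only mimics the cycle of submissions driving a benchmark's top score up until saturation forces a new, harder benchmark.

```python
# Toy model of the benchmark development cycle (all values hypothetical):
# high-level goal -> benchmark -> submissions -> higher scores ->
# saturation -> new benchmark -> repeat.
def benchmark_cycle(generations: int, start_score: float = 0.40,
                    gain_per_round: float = 0.15, saturation: float = 0.95):
    """For each generation, count submission rounds until the top score
    saturates, then retire the benchmark and start a harder one."""
    history = []
    for gen in range(1, generations + 1):
        score, rounds = start_score, 0
        while score < saturation:  # submissions push the top score up
            score = min(saturation, score + gain_per_round)
            rounds += 1
        history.append((f"benchmark-v{gen}", rounds))  # saturated -> new goal
    return history

for name, rounds in benchmark_cycle(3):
    print(f"{name}: saturated after {rounds} submission rounds")
```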

# Chapter 9

- [ ] Jevons paradox: better hardware pushes models forward, and better models push hardware forward
GiteaMirror added the area: book label 2026-03-22 15:36:46 -05:00
Reference: github-starred/cs249r_book#328