Benchmarking visualization additions (Arya + Andy) #328

Closed
opened 2026-03-22 15:36:46 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @aryatschand on GitHub (Feb 6, 2025).

Originally assigned to: @profvjreddi on GitHub.

# 12.3

- [x] Sub-chapter on challenges: model diversity
- [x] Sub-chapter on challenges: system measurements are hard, different scales; keep this at a high level

# 12.4

- [x] Graph with all the components, showing their relationships

![Image](https://github.com/user-attachments/assets/6d5d0ab9-963d-426a-9176-6c0cddf41b59)

# 12.7

- [ ] Figure comparing the considerations and metrics of inference vs. training (find in MLPerf slides)

# 12.7.4

- [x] Comparison between MLPerf benchmarks (model size vs. accuracy, performance, energy, etc.?)

![Image](https://github.com/user-attachments/assets/fed436c2-705c-4f52-918e-32b86e7e5c97)

# 12.8

- [ ] Include MLPerf Power trends

![Image](https://github.com/user-attachments/assets/a5f01c7f-a960-48f0-95dc-9b277b90c78d)

- [ ] Maybe? Shows how MACs vs. energy efficiency scales proportionally across workload sizes and across system scales

![Image](https://github.com/user-attachments/assets/db8fb0d8-ec79-41a6-a2fe-0f25ed32904e)
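To make the "MACs vs. energy efficiency" relationship above concrete in the text, a short sketch could be included. All workload names and numbers below are hypothetical placeholders, not MLPerf Power data; the point is only how the metric (operations per joule) is computed across system scales.

```python
# Hypothetical workloads at different system scales (made-up values,
# NOT MLPerf results): (name, total MACs, measured energy in joules).
workloads = [
    ("tiny",       2.0e9,   0.5),
    ("mobile",     5.7e10,  12.0),
    ("edge",       4.1e12,  800.0),
    ("datacenter", 9.0e14,  1.6e5),
]

def macs_per_joule(macs: float, joules: float) -> float:
    """Energy efficiency: multiply-accumulate operations per joule."""
    return macs / joules

for name, macs, joules in workloads:
    print(f"{name:>10}: {macs_per_joule(macs, joules):.2e} MACs/J")
```

Plotting this metric against workload size would illustrate the proportional scaling the bullet describes.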

# 12.9.3

- [ ] Do we want to show performance measurement methodology at different scales (similar to the MLPerf Power graph)?
- [x] Discuss how LLMs can train on open-source benchmarks and “cheat”
- [x] https://arxiv.org/pdf/2404.18824
- [ ] Reproducibility and cherry-picking: report the range of the hyperparameter search (error bars) instead of just the best run
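The cherry-picking point could be illustrated with a minimal sketch. The scores below are hypothetical stand-ins for a hyperparameter search; the contrast is between reporting only the best run and reporting the mean with a spread (error bars).

```python
import statistics

# Hypothetical accuracy scores from a hyperparameter search (made-up numbers).
trial_scores = [0.71, 0.74, 0.69, 0.77, 0.72, 0.75]

best = max(trial_scores)              # what a cherry-picked report shows
mean = statistics.mean(trial_scores)  # what an honest report centers on
std = statistics.stdev(trial_scores)  # spread of the search -> error bars

print(f"cherry-picked: {best:.3f}")
print(f"honest: {mean:.3f} +/- {std:.3f} (n={len(trial_scores)})")
```

Reporting `mean +/- std` (or a min-max range) over all trials is exactly the error-bar style the bullet asks for.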

# 12.9.4

- [ ] Add evolving benchmarks

![Image](https://github.com/user-attachments/assets/d6ce2970-58b3-4439-b836-feaa377ee60a)

- [ ] Discuss how LLM benchmarks are evolving (FrontierMath, Humanity's Last Exam) and show how new models are climbing on them
- [ ] Feedback loop graph (development cycle)
- [ ] High-level goal → benchmark → submissions → higher scores → new high-level goal + new benchmarks → repeat
- [ ] The good, the bad, the ugly
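The feedback loop above could also be shown as a toy simulation alongside the graph. Everything here is hypothetical (the function name, the score dynamics, the saturation threshold); it only mimics the cycle of submissions driving a benchmark's top score up until saturation forces a new, harder benchmark.

```python
# Toy model of the benchmark development cycle (all values hypothetical):
# high-level goal -> benchmark -> submissions -> higher scores ->
# saturation -> new benchmark -> repeat.
def benchmark_cycle(generations: int, start_score: float = 0.40,
                    gain_per_round: float = 0.15, saturation: float = 0.95):
    """For each generation, count submission rounds until the top score
    saturates, then retire the benchmark and start a harder one."""
    history = []
    for gen in range(1, generations + 1):
        score, rounds = start_score, 0
        while score < saturation:  # submissions push the top score up
            score = min(saturation, score + gain_per_round)
            rounds += 1
        history.append((f"benchmark-v{gen}", rounds))  # saturated -> new goal
    return history

for name, rounds in benchmark_cycle(3):
    print(f"{name}: saturated after {rounds} submission rounds")
```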

# Chapter 9

- [ ] Jevons paradox: better hardware pushes models forward, and better models push hardware forward
GiteaMirror added the area: book label 2026-03-22 15:36:46 -05:00
Reference: github-starred/cs249r_book#328