[GH-ISSUE #709] Improve Chapter 10 Model Optimization #4174

Closed
opened 2026-04-19 12:10:58 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @18jeffreyma on GitHub (Feb 15, 2025).
Original GitHub issue: https://github.com/harvard-edge/cs249r_book/issues/709

Originally assigned to: @profvjreddi, @18jeffreyma on GitHub.

  • Purpose: I’d like to rewrite this purpose statement as: “How do we translate theoretical neural networks and naive implementations into efficient, practical solutions, and what techniques are available to bridge gaps in efficiency?”

  • 10.2 (Efficient Model Representation): the phrase “you were introduced to pruning and model compression” needs to be removed, since the model compression section was removed.

  • 10.2.1 (Pruning): I think this section should touch a bit more on sparsity (maybe with some references to sparsity becoming a first-party compute primitive). We should give an explicit shout-out to sparsity early on in structured pruning (maybe a “deciding desired sparsity” step under “structures to target for pruning”).

  • The “Advantages of Structured Pruning” section should also mention sparsity acceleration on hardware as a benefit.

  • 10.2.1: Update the Figure 10.6 caption to be more informative about what’s going on (i.e., training lottery tickets replicates the original model’s curve and often yields better performance).
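The sparsity points above could be illustrated in the chapter with a small sketch like the following (my own toy example, not from the book): magnitude pruning of a weight matrix, contrasting unstructured sparsity with structured (whole-row) pruning. The 50% sparsity target and row-norm criterion are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))

# Unstructured pruning: zero out the smallest 50% of individual weights.
threshold = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: drop entire rows (e.g., output channels) with the
# smallest L2 norm -- the pattern hardware sparsity units can exploit.
row_norms = np.linalg.norm(W, axis=1)
keep = row_norms >= np.quantile(row_norms, 0.5)
W_structured = W * keep[:, None]

print("unstructured sparsity:", np.mean(W_unstructured == 0.0))
print("structured sparsity:  ", np.mean(W_structured == 0.0))
```

Both reach ~50% sparsity, but only the structured version leaves a regular pattern (whole zero rows) that dense hardware can skip cheaply.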

  • 10.2.2

    • A diagram showing what temperature scaling does to the softmax would help convey the idea of soft correctness, e.g.:
      https://aman.ai/primers/ai/assets/token-sampling/T.jpg
    • Explain the intuition of KL divergence (i.e., quantifying the discrepancy between two probability distributions over tokens).
    • Add a note that the teacher and student are generally differently sized, so there’s an ML infrastructure problem here in terms of “slightly” heterogeneous workloads.
    • In the low-rank section, maybe add a quick pointer to LoRA as an example, where the authors find that fine-tuning changes are low rank, which is why LoRA works.
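To pair with the temperature-scaling and KL-divergence points above, a toy sketch like this might help (my own example, not from the chapter; the logits are made up): it shows how raising the temperature softens the softmax, and computes the KL divergence between teacher and student distributions as used in distillation losses.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: larger T flattens the distribution,
    # exposing "soft correctness" across non-argmax classes.
    z = logits / T
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): the extra nats q costs when the true distribution is p.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

teacher_logits = np.array([4.0, 2.0, 1.0, -1.0])   # hypothetical values
student_logits = np.array([3.0, 2.5, 0.5, -0.5])

for T in (1.0, 2.0, 5.0):
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    print(f"T={T}: teacher={np.round(p_t, 3)}  "
          f"KL(teacher||student)={kl_divergence(p_t, p_s):.4f}")
```

At T=1 the teacher's mass concentrates on the top class; at T=5 the "dark knowledge" in the wrong classes becomes visible, which is what the student distills from.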
  • 10.2.3

    • For TinyNAS, maybe include an image from the paper: https://hanlab.mit.edu/projects/mcunet

  • 10.3.3

    • Add some examples of hardware compatibility in this section and show how, over time, lower precision has become first-party (i.e., V100 (float16) → A100 (bfloat16) → H100/B200 (float8) shows the float16 → bfloat16 → float8 transition).
    • The Precision and Accuracy Trade-offs section should note that lower precision is notoriously less stable: https://proceedings.neurips.cc/paper_files/paper/2018/file/335d3d1cd7ef05ec77714a215134914c-Paper.pdf

  • 10.3.5 through 10.3.9 should probably be subsections of 10.3.4 (the structural organization is a bit odd here), or the sections should be renamed to flow better (i.e., add “of Quantization” to each title to make it clear the section is still about quantization even though it isn’t nested).

  • Zero-Shot Quantization: add some example techniques and citations (e.g., AWQ; some other techniques here: https://huggingface.co/docs/transformers/main/en/quantization/overview).
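For the quantization sections, a minimal round-trip sketch might make the precision trade-off concrete (my own toy example, not from the chapter; per-tensor symmetric int8 is just the simplest scheme to show):

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: map [-max|x|, max|x|] onto [-127, 127].
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=1000).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Rounding error is bounded by half a quantization step (scale / 2).
print("scale:", scale)
print("max abs error:", np.max(np.abs(w - w_hat)))
```

The point for the chapter: the error budget is set entirely by the scale (dynamic range / bit width), which is why outlier-aware schemes like AWQ focus on choosing better scales.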

  • 10.4: Since we’re going for an ML systems twist, do we want to include any GPU/datacenter-level hardware-aware neural architecture search?

Some ideas:

    • Could discuss how Transformers win the hardware lottery here and are an example of hardware-aware NN design.
    • Could discuss things like tensor cores and FMA units (and how kernel fusion maps to them).
    • Could discuss FlashAttention here as a hardware-aware model optimization (symbolically equivalent to standard attention).
    • Possibly other ideas.
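On the FlashAttention point: the "symbolically equivalent" claim could be demonstrated in the chapter with a tiny sketch like this (my own single-query, unscaled toy version, not from the paper or the book). It computes attention block-by-block with an online softmax, never materializing all scores at once, and matches the naive result exactly.

```python
import numpy as np

def naive_attention(q, K, V):
    # Reference: full score vector, one softmax, then the weighted sum.
    s = K @ q
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ V

def streaming_attention(q, K, V, block=4):
    # FlashAttention-style online softmax over key/value blocks:
    # keep a running max m, normalizer l, and weighted accumulator acc,
    # rescaling previous partial results whenever the max increases.
    m, l, acc = -np.inf, 0.0, 0.0
    for i in range(0, len(K), block):
        s = K[i:i + block] @ q
        m_new = max(m, s.max())
        rescale = np.exp(m - m_new)
        p = np.exp(s - m_new)
        l = l * rescale + p.sum()
        acc = acc * rescale + p @ V[i:i + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=16)
K = rng.normal(size=(12, 16))
V = rng.normal(size=(12, 8))

print(np.allclose(naive_attention(q, K, V), streaming_attention(q, K, V)))
```

Same math, different memory-access pattern: that is exactly the hardware-aware point, since the blocked version keeps its working set small enough to live in fast on-chip memory.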

GiteaMirror added the area: book, type: improvement labels 2026-04-19 12:10:58 -05:00
Author
Owner

@profvjreddi commented on GitHub (Feb 17, 2025):

Thanks Jeff, I was going to draft an outline from scratch and see what it comes out to be, and then see what we can reuse. But this is good material anyways!


Reference: github-starred/cs249r_book#4174