mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-07 18:18:42 -05:00
[GH-ISSUE #709] Improve Chapter 10 Model Optimization #4174
Originally created by @18jeffreyma on GitHub (Feb 15, 2025).
Original GitHub issue: https://github.com/harvard-edge/cs249r_book/issues/709
Originally assigned to: @profvjreddi, @18jeffreyma on GitHub.
Purpose: I’d like to rewrite this purpose statement to “How do we translate theoretical neural network designs and naive implementations into efficient, practical solutions, and what techniques are available to bridge gaps in efficiency?”
10.2 (Efficient Model Representation): “you were introduced to pruning and model compression” — this needs to be removed since the model compression section was removed.
10.2.1 (Pruning): I think this section should touch a bit more on sparsity (maybe some references to sparsity also becoming a first-class compute primitive in hardware). We should call out sparsity explicitly early in the structured pruning discussion (perhaps a “deciding desired sparsity” step within “structures to target for pruning”).
The “Advantages of Structured Pruning” section should also mention sparsity acceleration on hardware as a benefit.
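To make the “deciding desired sparsity” idea concrete, the chapter could include something like the following minimal sketch of structured pruning: whole rows (output channels) are removed based on an importance score, with the target sparsity chosen up front. The function name and the L2-norm magnitude criterion are illustrative choices, not necessarily what the chapter will use.

```python
import numpy as np

def structured_prune_rows(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the rows (output channels) with the smallest L2 norm.

    `sparsity` is the fraction of rows to remove, decided up front
    (the "deciding desired sparsity" step).
    """
    n_rows = weight.shape[0]
    n_prune = int(n_rows * sparsity)
    if n_prune == 0:
        return weight.copy()
    norms = np.linalg.norm(weight, axis=1)   # importance score per row
    prune_idx = np.argsort(norms)[:n_prune]  # least-important rows
    pruned = weight.copy()
    pruned[prune_idx, :] = 0.0               # structured: entire rows zeroed
    return pruned

w = np.random.randn(8, 16)
pw = structured_prune_rows(w, sparsity=0.5)
print((np.abs(pw).sum(axis=1) == 0).sum())  # -> 4 rows are now exactly zero
```

Because the zeros come in whole rows rather than scattered elements, this is exactly the pattern that hardware sparsity acceleration can exploit, which ties back to the “advantages” point above.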
10.2.1: Update the Figure 10.6 caption to be more informative about what’s going on (i.e., training lottery tickets replicates the original model’s curve and often yields better performance).
10.2.2
https://aman.ai/primers/ai/assets/token-sampling/T.jpg
10.2.3
10.3.3
10.3.5 through 10.3.9 should probably be subsections of 10.3.4 (the section organization is a bit odd here), or the sections should be renamed to flow better (i.e., append “of Quantization” to each title so it is clear the section is still about quantization even though it is not nested).
Zero-Shot Quantization: add some example techniques and citations? (e.g., AWQ; other techniques are listed here: https://huggingface.co/docs/transformers/main/en/quantization/overview)
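For the quantization sections, a small worked example of the uniform affine scheme that methods like AWQ build on might help ground the discussion. This is a generic post-training quantization sketch, not any specific paper’s algorithm; the function names are illustrative.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric (affine) quantization of a float tensor to int8.

    Returns the quantized tensor plus the (scale, zero_point) needed
    to dequantize. A minimal sketch of the uniform quantization that
    most post-training schemes build on.
    """
    qmin, qmax = -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid scale=0 for constant tensors
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

x = np.linspace(-1.0, 1.0, 11).astype(np.float32)
q, s, z = quantize_int8(x)
x_hat = dequantize(q, s, z)
print(np.abs(x - x_hat).max() <= s)  # -> True: round-trip error stays within one step
```

Zero-shot / weight-only methods mostly differ in how they pick the scale (per-channel, activation-aware, etc.), so showing this baseline first would make those refinements easier to motivate.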
10.4: Since we’re going for an ML systems twist, do we want to include any GPU/datacenter-level hardware-aware neural architecture search?
Some ideas:
Could discuss how Transformers win the hardware lottery and are an example of hardware-aware NN design.
Could discuss things like tensor cores and FMA units (and how kernel fusion maps onto them).
Could discuss FlashAttention as a hardware-aware model optimization (mathematically equivalent to standard attention).
Possibly other ideas.
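If the FlashAttention idea makes it into the chapter, a toy sketch could show the key point: a tiled loop with an online softmax (running max and running sum) produces the same output as naive attention without ever materializing the N×N score matrix. This is a NumPy illustration of the equivalence only, assuming toy sizes; it is not a performance implementation.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    """FlashAttention-style tiling: process K/V in blocks, keeping a
    running max and running denominator so only a block of scores is
    ever in memory. Mathematically equivalent to naive_attention."""
    d, n = Q.shape[-1], Q.shape[0]
    out = np.zeros_like(Q, dtype=np.float64)
    m = np.full(n, -np.inf)   # running row max
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        Sb = Q @ Kb.T / np.sqrt(d)          # scores for this block only
        m_new = np.maximum(m, Sb.max(axis=-1))
        alpha = np.exp(m - m_new)           # rescale earlier partial sums
        Pb = np.exp(Sb - m_new[:, None])
        l = l * alpha + Pb.sum(axis=-1)
        out = out * alpha[:, None] + Pb @ Vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V)))  # -> True
```

The hardware-aware part of the story is that the block size is chosen to fit in on-chip SRAM, trading recomputation for reduced memory traffic; the sketch above only demonstrates the algebraic equivalence.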
@profvjreddi commented on GitHub (Feb 17, 2025):
Thanks Jeff, I was going to draft an outline from scratch and see what it comes out to be, and then see what we can reuse. But this is good material anyways!