mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
Clarifies memoization computation savings
Refines the explanation of K,V computation savings in the memoization module, quantifying redundant computations and highlighting the efficiency gain. The paper and module now specify that generating 100 tokens requires 5,050 total K,V computations, but only 100 are necessary, resulting in 4,950 redundant calculations.
@@ -560,7 +560,7 @@ Students transition from ``models that train'' to ``systems that deploy.'' The o
The tier then divides into two optimization categories. \textbf{Model-level optimizations} (Modules 15--16) change the model itself: Quantization (15) achieves 4$\times$ compression (FP32$\rightarrow$INT8) with 1--2\% accuracy cost, while Compression (16) applies pruning and distillation for 10$\times$ shrinkage. These techniques permanently modify model weights and architecture.
-\textbf{Runtime optimizations} (Modules 17--18) change how execution happens without modifying model weights. Acceleration (17) teaches general-purpose optimization: vectorization exploits SIMD instructions for 10--100$\times$ convolution speedups, memory access pattern optimization improves cache locality, and kernel fusion eliminates intermediate memory traffic. These techniques apply to any numerical computation. Memoization (18) then applies domain-specific optimization to transformers through KV caching: students discover that naive autoregressive generation recomputes attention keys and values at every step, so generating 100 tokens requires 5,050 redundant computations (1+2+...+100). By caching these values, students transform $O(n^2)$ generation into $O(n)$, achieving 10--100$\times$ speedup and understanding why this optimization is essential in systems like ChatGPT and Claude for economically viable inference.
+\textbf{Runtime optimizations} (Modules 17--18) change how execution happens without modifying model weights. Acceleration (17) teaches general-purpose optimization: vectorization exploits SIMD instructions for 10--100$\times$ convolution speedups, memory access pattern optimization improves cache locality, and kernel fusion eliminates intermediate memory traffic. These techniques apply to any numerical computation. Memoization (18) then applies domain-specific optimization to transformers through KV caching: students discover that naive autoregressive generation recomputes attention keys and values at every step, so generating 100 tokens requires 5,050 total K,V computations (1+2+\dots+100), of which 4,950 are redundant. By caching these values, students transform $O(n^2)$ generation into $O(n)$, achieving 10--100$\times$ speedup and understanding why this optimization is essential in systems like ChatGPT and Claude for economically viable inference.
Benchmarking (19) teaches statistical rigor in performance measurement: students learn that single measurements are meaningless (performance varies 10--30\% across runs due to thermal throttling, OS noise, and cache state), implement confidence intervals and warmup protocols, and discover when a 5\% speedup is statistically significant versus noise.
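The KV-caching count in the hunk above can be sketched concretely. A minimal single-head attention loop that caches K and V per token; the dimensions, weight matrices, and function names here are illustrative assumptions, not code from the course modules:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def generate_with_cache(embeddings):
    """Run one attention step per token, computing each K,V exactly once."""
    k_cache, v_cache, kv_count = [], [], 0
    for x in embeddings:            # one new token embedding per step
        k_cache.append(Wk @ x)      # new token's K,V only; old entries reused
        v_cache.append(Wv @ x)
        kv_count += 1
        q = Wq @ x
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        _output = weights @ V       # attention output for the newest token
    return kv_count

tokens = rng.standard_normal((100, d))
print(generate_with_cache(tokens))  # → 100 K,V computations, not 5,050
```

Without the cache, the loop body would recompute K and V for every prior token at each step, giving the 1+2+...+100 = 5,050 total the text describes.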
@@ -237,7 +237,8 @@ Step n: n K,V computations
Total: 1 + 2 + 3 + ... + n = n(n+1)/2 = O(n²) complexity!
```
-For a 100-token sequence, this means **5,050 redundant computations**!
+For a 100-token sequence, this means **5,050 total K,V computations** — but only 100 are
+actually necessary (one per token). That's **4,950 redundant computations**!
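The corrected counts in this hunk follow from the closed form n(n+1)/2. A quick arithmetic check (plain Python, not module code):

```python
# Verify the module's counts: total, necessary, and redundant K,V computations
n = 100
total = n * (n + 1) // 2   # 1 + 2 + ... + 100, the naive recomputation total
necessary = n              # with a KV cache: one K,V computation per token
print(total, necessary, total - necessary)  # → 5050 100 4950
```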
### Real-World Impact