From 69f46d4f7e00104cc455a359f4bb84f79dbdc532 Mon Sep 17 00:00:00 2001
From: Vijay Janapa Reddi
Date: Thu, 19 Feb 2026 17:59:10 -0500
Subject: [PATCH] Clarifies memoization computation savings

Refines the explanation of K,V computation savings in the memoization
module, quantifying redundant computations and highlighting the
efficiency gain.

The paper and module now specify that generating 100 tokens requires
5,050 total K,V computations, but only 100 are necessary, resulting in
4,950 redundant calculations.
---
 tinytorch/paper/paper.tex                      | 2 +-
 tinytorch/src/18_memoization/18_memoization.py | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/tinytorch/paper/paper.tex b/tinytorch/paper/paper.tex
index 37aefbe12..e8c0ca6f3 100644
--- a/tinytorch/paper/paper.tex
+++ b/tinytorch/paper/paper.tex
@@ -560,7 +560,7 @@ Students transition from ``models that train'' to ``systems that deploy.'' The o
 
 The tier then divides into two optimization categories. \textbf{Model-level optimizations} (Modules 15--16) change the model itself: Quantization (15) achieves 4$\times$ compression (FP32$\rightarrow$INT8) with 1--2\% accuracy cost, while Compression (16) applies pruning and distillation for 10$\times$ shrinkage. These techniques permanently modify model weights and architecture.
 
-\textbf{Runtime optimizations} (Modules 17--18) change how execution happens without modifying model weights. Acceleration (17) teaches general-purpose optimization: vectorization exploits SIMD instructions for 10--100$\times$ convolution speedups, memory access pattern optimization improves cache locality, and kernel fusion eliminates intermediate memory traffic. These techniques apply to any numerical computation. Memoization (18) then applies domain-specific optimization to transformers through KV caching: students discover that naive autoregressive generation recomputes attention keys and values at every step, so generating 100 tokens requires 5,050 redundant computations (1+2+...+100). By caching these values, students transform $O(n^2)$ generation into $O(n)$, achieving 10--100$\times$ speedup and understanding why this optimization is essential in systems like ChatGPT and Claude for economically viable inference.
+\textbf{Runtime optimizations} (Modules 17--18) change how execution happens without modifying model weights. Acceleration (17) teaches general-purpose optimization: vectorization exploits SIMD instructions for 10--100$\times$ convolution speedups, memory access pattern optimization improves cache locality, and kernel fusion eliminates intermediate memory traffic. These techniques apply to any numerical computation. Memoization (18) then applies domain-specific optimization to transformers through KV caching: students discover that naive autoregressive generation recomputes attention keys and values at every step, so generating 100 tokens requires 5,050 total K,V computations (1+2+\dots+100), of which 4,950 are redundant. By caching these values, students transform $O(n^2)$ generation into $O(n)$, achieving 10--100$\times$ speedup and understanding why this optimization is essential in systems like ChatGPT and Claude for economically viable inference.
 
 Benchmarking (19) teaches statistical rigor in performance measurement: students learn that single measurements are meaningless (performance varies 10--30\% across runs due to thermal throttling, OS noise, and cache state), implement confidence intervals and warmup protocols, and discover when a 5\% speedup is statistically significant versus noise.
 
diff --git a/tinytorch/src/18_memoization/18_memoization.py b/tinytorch/src/18_memoization/18_memoization.py
index 961143fb3..aac0dec30 100644
--- a/tinytorch/src/18_memoization/18_memoization.py
+++ b/tinytorch/src/18_memoization/18_memoization.py
@@ -237,7 +237,8 @@ Step n: n K,V computations
 
 Total: 1 + 2 + 3 + ... + n = n(n+1)/2 = O(n²) complexity!
 ```
-For a 100-token sequence, this means **5,050 redundant computations**!
+For a 100-token sequence, this means **5,050 total K,V computations** — but only 100 are
+actually necessary (one per token). That's **4,950 redundant computations**!
 
 ### Real-World Impact
 
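The counting argument the patch corrects can be sanity-checked in a few lines of Python. This is an illustrative sketch, not code from the tinytorch repository; the function names are hypothetical:

```python
def naive_kv_computations(n_tokens: int) -> int:
    """Naive autoregressive decoding: step t recomputes K,V for all t tokens seen so far."""
    return sum(range(1, n_tokens + 1))  # 1 + 2 + ... + n = n(n+1)/2

def cached_kv_computations(n_tokens: int) -> int:
    """With a KV cache: each step computes K,V only for the newly generated token."""
    return n_tokens

n = 100
total = naive_kv_computations(n)    # total K,V computations without caching
needed = cached_kv_computations(n)  # computations actually necessary (one per token)
print(total, needed, total - needed)  # 5050 100 4950
```

This matches the wording the patch introduces: 5,050 total K,V computations for 100 tokens, of which only 100 are necessary, leaving 4,950 redundant.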