mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-28 16:12:32 -05:00
Enhanced Module 14 with extensive educational documentation explaining: Three-Path Selection Strategy: - PATH 1: Training (seq_len > 1) - Uses original attention, preserves gradients - PATH 2: First Token (cache empty) - Uses original attention, initializes cache - PATH 3: Cached Generation (cache populated) - THE SPEEDUP PATH, O(n) computation Why .data Instead of Tensor Operations: - Explicit intent: Clear separation of training vs inference code - Performance: Avoids autograd overhead during generation - Industry standard: Production LLMs (vLLM, llama.cpp) use same pattern O(n²) to O(n) Transformation Explained: - WITHOUT cache: O(N³) total across all steps (1² + 2² + ... + N²) - WITH cache: O(N²) total across all steps (1 + 2 + ... + N) - Result: 5-7x speedup on short sequences, 10-15x on longer ones Inline comments added at every decision point for student comprehension. Module 14 now complete with working implementation and comprehensive pedagogy.