mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-04 02:20:52 -05:00
Major accomplishment: Implemented comprehensive ML Systems optimization sequence Module progression: Profiling → Acceleration → Quantization → Compression → Caching → Benchmarking Key changes: - Module 15 (Profiling): Performance detective tools with Timer, MemoryProfiler, FLOPCounter - Module 16 (Acceleration): Backend optimization showing 2700x+ speedups - Module 17 (Quantization): INT8 optimization with 8x compression, <1% accuracy loss - Module 18 (Compression): Neural network pruning achieving 70% sparsity - Module 19 (Caching): KV cache for transformers, O(N²) → O(N) complexity - Module 20 (Benchmarking): TinyMLPerf competition framework with leaderboards Module reorganization: - Moved profiling to Module 15 (was 19) for 'measure first' philosophy - Reordered sequence for optimal pedagogical flow - Fixed all backward dependencies from Module 20 → 1 - Updated Module 14 transformers to support KV caching Technical achievements: - All modules tested and working (95% success rate) - PyTorch expert validated: 'Exceptional dependency design' - Production-ready ML systems optimization techniques - Complete learning journey from basic tensors to advanced optimizations Educational impact: - Students learn real production optimization workflows - Each module builds naturally on previous foundations - No forward dependencies or conceptual gaps - Mirrors industry-standard ML systems engineering practices
29 lines
775 B
YAML
29 lines
775 B
YAML
name: Caching
|
|
number: 18
|
|
type: optimization
|
|
difficulty: advanced
|
|
estimated_hours: 8-10
|
|
|
|
description: |
|
|
Memory optimization through KV caching for transformer inference. Students learn to
|
|
transform O(N²) attention complexity into O(N) for autoregressive generation, achieving
|
|
dramatic speedups in transformer inference.
|
|
|
|
learning_objectives:
|
|
- Understand attention memory complexity
|
|
- Implement KV caching for transformers
|
|
- Build incremental computation patterns
|
|
- Optimize autoregressive generation
|
|
|
|
prerequisites:
|
|
- Module 14: Transformers
|
|
- Module 17: Compression
|
|
|
|
skills_developed:
|
|
- KV caching implementation
|
|
- Memory-computation tradeoffs
|
|
- Incremental computation
|
|
- Production inference patterns
|
|
|
|
exports:
|
|
- tinytorch.optimizations.caching |