mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-03-12 02:53:34 -05:00
Added comprehensive documentation clarifying that KV caching is designed ONLY for inference (generation), not training.

Key Clarifications:
- Cache operations use .data (no gradient tracking) - correct and intentional for maximum speed
- During generation: no gradients are computed (model.eval() mode)
- During training: the cache is not used (standard forward pass)
- DO NOT use caching during training

Why This is Safe:
1. Training uses the standard forward pass (full gradient flow)
2. Generation runs no backward pass (no gradients needed)
3. The cache is an inference optimization, not a training component
4. .data usage is correct for this generation-only use case

Documentation Updates:
- Added a prominent warning in the class docstring
- Updated update() method docs
- Updated get() method docs
- Added inline comments explaining .data usage

This addresses gradient flow concerns by making it crystal clear that caching is never used when gradients are needed.
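The pattern described above can be sketched as follows. This is a minimal illustration, not the TinyTorch implementation: the `Tensor` and `KVCache` classes and their `update()`/`get()` signatures are hypothetical stand-ins showing why storing `.data` (raw values, no gradient graph) is safe when the cache is only ever touched during generation.

```python
import numpy as np

class Tensor:
    """Hypothetical stand-in for a TinyTorch-style tensor."""
    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data)
        self.requires_grad = requires_grad

class KVCache:
    """Inference-only key/value cache: stores raw .data, never gradients."""
    def __init__(self):
        self.keys = []
        self.values = []

    def update(self, k: Tensor, v: Tensor):
        # Store .data only: no gradient tracking (inference optimization).
        # DO NOT use during training; training runs the standard forward pass.
        self.keys.append(k.data)
        self.values.append(v.data)

    def get(self):
        # Return all cached keys/values concatenated along the sequence axis.
        return np.concatenate(self.keys, axis=0), np.concatenate(self.values, axis=0)

# Generation loop sketch: one token per step, cache grows incrementally.
cache = KVCache()
for step in range(3):
    k = Tensor(np.ones((1, 4)), requires_grad=True)  # grad flag is irrelevant here
    v = Tensor(np.zeros((1, 4)))
    cache.update(k, v)
keys, values = cache.get()
print(keys.shape, values.shape)  # (3, 4) (3, 4)
```

Because `get()` returns plain arrays with no autograd history, calling it during training would silently break gradient flow, which is exactly why the documentation warns against it.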