Mirror of https://github.com/MLSysBook/TinyTorch.git (synced 2026-05-05 12:24:20 -05:00)
Clean up repository: remove temp files, organize modules, prepare for PyPI publication
- Removed temporary test files and audit reports
- Deleted backup and temp_holding directories
- Reorganized module structure (07->09 spatial, 09->07 dataloader)
- Added new modules: 11-14 (tokenization, embeddings, attention, transformers)
- Updated examples with historical ML milestones
- Cleaned up documentation structure
modules/13_attention/README.md (new file, +97 lines)
# Module 13: Attention - The Mechanism That Revolutionized Language Understanding

## Overview

This module implements the attention mechanisms that power modern transformer architectures. You'll build scaled dot-product attention, multi-head attention, and a KV-cache system while learning how attention's quadratic scaling shapes practical transformer deployment and optimization strategies.

## What You'll Learn

### Core Implementations

- **Scaled Dot-Product Attention**: The fundamental attention mechanism, with masking support (see the sketch after this list)
- **Multi-Head Attention**: Parallel attention heads with linear projections and output combination
- **KV-Cache System**: Efficient caching for autoregressive text generation
- **Causal Masking**: Support for autoregressive language modeling patterns
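
All four pieces build on the same formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k)·V. As a rough reference point, here is a minimal NumPy sketch of that computation with an optional causal mask; it is illustrative only (the single-head shapes and boolean-mask convention are assumptions), not the module's actual implementation:

```python
import numpy as np

def toy_scaled_dot_product_attention(query, key, value, mask=None):
    """Illustrative softmax(Q K^T / sqrt(d_k)) V for one (seq_len, d_k) head."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)        # (seq_q, seq_k) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ value                       # weighted combination of values

seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))      # position i sees only <= i
out = toy_scaled_dot_product_attention(x, x, x, mask=causal)   # self-attention
```

Multi-head attention runs this same computation in parallel across several lower-dimensional projections of the inputs and concatenates the results.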
### ML Systems Concepts

- **Quadratic Scaling**: How O(N²) memory scaling limits transformer sequence length
- **Memory Bottlenecks**: Understanding attention as the memory constraint in transformers
- **Generation Efficiency**: KV-cache optimization for production text generation
- **Hardware Optimization**: Attention parallelization and memory bandwidth optimization

### Performance Engineering

- **Attention Profiling**: Measuring how computation time and memory usage scale with sequence length (a minimal timing sketch follows this list)
- **Scaling Analysis**: Understanding the practical limits of attention-based architectures
- **Optimization Techniques**: Memory-efficient attention patterns and cache management
- **Production Patterns**: Real-world attention system design and deployment strategies
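
As a starting point for that profiling work, the sketch below times a single-head attention computation in NumPy at increasing sequence lengths (the dimensions and repeat count are arbitrary; it profiles the toy math above, not the module's classes):

```python
import time
import numpy as np

def time_single_head_attention(seq_len, d_model=256, repeats=3):
    """Return the best-of-N wall-clock time for one attention pass (seconds)."""
    q = np.random.randn(seq_len, d_model)
    k = np.random.randn(seq_len, d_model)
    v = np.random.randn(seq_len, d_model)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        scores = q @ k.T / np.sqrt(d_model)      # O(N^2 * d) compute, O(N^2) memory
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        _ = weights @ v
        best = min(best, time.perf_counter() - start)
    return best

for n in (128, 256, 512, 1024, 2048):
    print(f"seq_len={n:5d}  time={time_single_head_attention(n) * 1e3:8.2f} ms")
```

Doubling the sequence length should roughly quadruple both the runtime and the size of the score matrix, which is the quadratic scaling the analysis in this module focuses on.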
## Key Learning Outcomes

By completing this module, you'll understand:

1. **Attention Mathematics**: The scaled dot-product attention formula and its implementation
2. **Multi-Head Architecture**: How parallel attention heads capture diverse relationships
3. **Memory Scaling**: Why attention's O(N²) complexity fundamentally limits sequence length
4. **Generation Optimization**: How a KV-cache dramatically improves autoregressive efficiency
5. **Production Systems**: How real transformers optimize attention for deployment constraints

## Files in This Module

- `attention_dev.py` - Main implementation with all attention mechanisms
- `attention_dev.ipynb` - Jupyter notebook (auto-generated)
- `module.yaml` - Module configuration and metadata
- `README.md` - This documentation file

## Usage Example

```python
from tinytorch.core.attention import ScaledDotProductAttention, MultiHeadAttention
from tinytorch.core.embeddings import Embedding, PositionalEncoding

# Create attention mechanisms
scaled_attn = ScaledDotProductAttention()
multi_head_attn = MultiHeadAttention(embed_dim=256, num_heads=8)

# Process sequences with attention; `embeddings` is a sequence of token
# embeddings produced with Module 12's Embedding and PositionalEncoding
query = key = value = embeddings  # Self-attention: Q, K, V come from the same sequence
output = multi_head_attn(query, key, value)

# Causal masking for generation; create_causal_mask is implemented in this
# module (attention_dev.py), and seq_length is the number of tokens in the sequence
causal_mask = create_causal_mask(seq_length)
masked_output = multi_head_attn(query, key, value, mask=causal_mask)
```
## Integration with TinyTorch

This module exports to `tinytorch.core.attention` and provides the attention foundation for:

- **Transformer blocks** (Module 14) - Complete transformer layer implementation
- **Language generation** - Efficient autoregressive text generation
- **Sequence modeling** - Advanced sequence processing architectures

## Systems Engineering Focus

This module emphasizes the systems engineering aspects of attention:

### Memory Characteristics

- **Quadratic scaling**: Attention score memory grows as O(batch_size × seq_length²) per attention head
- **Memory bottleneck**: Attention often limits practical transformer sequence length
- **KV-cache benefits**: Reduces per-step generation cost from O(N²) to O(N) by reusing cached keys and values (see the back-of-envelope calculation after this list)
- **GPU memory limits**: Determine the maximum feasible sequence length
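
To make these numbers concrete, here is a back-of-envelope calculation (float32 activations and the batch-of-8, 8-head, 256-dimensional configuration are illustrative assumptions, not module defaults):

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_elem=4):
    """Memory for the (seq_len x seq_len) score matrices: grows quadratically."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

def kv_cache_bytes(batch, seq_len, embed_dim, bytes_per_elem=4):
    """Memory for cached keys and values: grows linearly with sequence length."""
    return 2 * batch * seq_len * embed_dim * bytes_per_elem

for seq_len in (512, 2048, 8192):
    scores = attention_score_bytes(batch=8, heads=8, seq_len=seq_len)
    cache = kv_cache_bytes(batch=8, seq_len=seq_len, embed_dim=256)
    print(f"seq_len={seq_len:5d}: scores {scores / 2**20:9.1f} MiB, "
          f"KV cache {cache / 2**20:6.1f} MiB")
```

In this configuration the score matrices grow from 64 MiB at 512 tokens to roughly 16 GiB at 8K tokens, while the KV-cache only grows from 8 MiB to 128 MiB, which is why the attention matrices, not the cached keys and values, usually cap sequence length.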
### Performance Considerations

- **Matrix multiplication bound**: Attention performance is limited by GEMM operations
- **Memory bandwidth**: Large attention matrices stress the memory subsystem
- **Parallelization**: Multi-head attention enables parallel computation
- **Generation patterns**: Autoregressive vs. parallel processing trade-offs (a minimal KV-cache sketch follows this list)
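
The autoregressive trade-off is easiest to see in code. Below is a minimal, generic KV-cache sketch in NumPy (not the module's KV-cache API; the class and method names are made up for illustration): each decode step appends one key/value row and attends a single query against the whole cache, so per-step cost is O(N) instead of re-running full O(N²) attention over the prefix.

```python
import numpy as np

class ToyKVCache:
    """Store keys/values once; reuse them on every subsequent decode step."""
    def __init__(self, d_model):
        self.d_model = d_model
        self.keys = np.zeros((0, d_model))
        self.values = np.zeros((0, d_model))

    def decode_step(self, new_q, new_k, new_v):
        # Append this token's key/value instead of recomputing the whole prefix
        self.keys = np.vstack([self.keys, new_k])
        self.values = np.vstack([self.values, new_v])
        scores = new_q @ self.keys.T / np.sqrt(self.d_model)  # shape (1, N), not (N, N)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ self.values                           # shape (1, d_model)

cache = ToyKVCache(d_model=64)
for _ in range(5):                        # stand-in for projected tokens during generation
    q = k = v = np.random.randn(1, 64)
    out = cache.decode_step(q, k, v)
```

The cache itself grows linearly, so generation trades O(N) extra memory for avoiding O(N²) recomputation at every step.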
## Prerequisites

- Module 02: Tensor (for matrix operations and data structures)
- Module 12: Embeddings (for understanding sequence representations)
- Understanding of matrix multiplication and softmax operations

## Estimated Time

5-6 hours including implementation, testing, and performance analysis

## Next Steps

After completing this module, you'll be ready for:

- **Module 14: Transformers** - Complete transformer block implementation
- Advanced transformer architectures and optimization techniques
- Production language model deployment and serving systems