Clean up repository: remove temp files, organize modules, prepare for PyPI publication

- Removed temporary test files and audit reports
- Deleted backup and temp_holding directories
- Reorganized module structure (07->09 spatial, 09->07 dataloader)
- Added new modules: 11-14 (tokenization, embeddings, attention, transformers)
- Updated examples with historical ML milestones
- Cleaned up documentation structure
Commit 6491a7512e (parent 60569cfaaa)
Vijay Janapa Reddi, 2025-09-24 10:13:37 -04:00
124 changed files with 26011 additions and 66763 deletions

# Module 13: Attention - The Mechanism That Revolutionized Language Understanding
## Overview
This module implements the attention mechanisms that power modern transformer architectures. You'll build scaled dot-product attention, multi-head attention, and KV-cache systems, and learn how attention's quadratic scaling shapes practical transformer deployment and optimization.
## What You'll Learn
### Core Implementations
- **Scaled Dot-Product Attention**: The fundamental attention mechanism with masking support
- **Multi-Head Attention**: Parallel attention heads with linear projections and output combination
- **KV-Cache System**: Efficient caching for autoregressive text generation
- **Causal Masking**: Support for autoregressive language modeling patterns
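To make the first two bullets concrete, here is a minimal NumPy sketch of scaled dot-product attention with optional masking. It is illustrative only, not the module's actual `tinytorch` implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.shape[-1]
    # Scores: (seq_q, seq_k), scaled by sqrt(d_k) to keep logits well-ranged.
    scores = q @ k.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a large negative score, so softmax
        # assigns them ~0 attention weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Self-attention over a toy sequence of 4 tokens with 8-dim embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape)        # (4, 8)
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```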
### ML Systems Concepts
- **Quadratic Scaling**: How O(N²) memory scaling limits transformer sequence length
- **Memory Bottlenecks**: Understanding attention as the memory constraint in transformers
- **Generation Efficiency**: KV-cache optimization for production text generation
- **Hardware Optimization**: Attention parallelization and memory bandwidth optimization
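The quadratic-scaling point above is easy to verify with back-of-envelope arithmetic: one float32 score matrix per head costs batch × heads × seq_length² × 4 bytes. The batch and head counts below are illustrative, not taken from the module:

```python
def attn_scores_bytes(batch, heads, seq_len, bytes_per_el=4):
    # Memory for the raw attention score matrices alone (before the
    # softmax output, activations, or KV tensors are counted).
    return batch * heads * seq_len ** 2 * bytes_per_el

for n in (512, 2048, 8192):
    gib = attn_scores_bytes(batch=8, heads=12, seq_len=n) / 2**30
    print(f"seq_len={n:>5}: {gib:8.2f} GiB of attention scores")
```

Doubling the sequence length quadruples this figure, which is why long contexts are memory-limited long before they are compute-limited.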
### Performance Engineering
- **Attention Profiling**: Measuring computation time and memory usage scaling
- **Scaling Analysis**: Understanding practical limits of attention-based architectures
- **Optimization Techniques**: Memory-efficient attention patterns and cache management
- **Production Patterns**: Real-world attention system design and deployment strategies
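A tiny profiling sketch in the spirit of the first bullet: time the score matmul plus softmax at growing sequence lengths and watch the roughly quadratic cost emerge (absolute numbers vary by machine; this is not the module's profiling harness):

```python
import time
import numpy as np

def time_attention(seq_len, d=64, reps=3):
    rng = np.random.default_rng(0)
    q = rng.normal(size=(seq_len, d)).astype(np.float32)
    k = rng.normal(size=(seq_len, d)).astype(np.float32)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        scores = q @ k.T / np.sqrt(d)           # O(N^2 * d) FLOPs
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)      # O(N^2) memory touched
        best = min(best, time.perf_counter() - t0)
    return best

for n in (256, 512, 1024):
    print(f"seq_len={n:>5}: {time_attention(n) * 1e3:7.2f} ms")
```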
## Key Learning Outcomes
By completing this module, you'll understand:
1. **Attention Mathematics**: The scaled dot-product attention formula and its implementation
2. **Multi-Head Architecture**: How parallel attention heads capture diverse relationships
3. **Memory Scaling**: Why attention's O(N²) complexity fundamentally limits sequence length
4. **Generation Optimization**: How KV-cache dramatically improves autoregressive efficiency
5. **Production Systems**: How real transformers optimize attention for deployment constraints
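Outcome 4 above can be sketched in a few lines: during autoregressive decoding, keys and values for past tokens are stored and reused, so each step attends over the cache plus one new token instead of recomputing everything. This is a toy single-head version, not the module's KV-cache system:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    def __init__(self):
        self.k, self.v = [], []

    def step(self, q, k_new, v_new):
        # Append this step's key/value, then attend over the whole cache.
        self.k.append(k_new)
        self.v.append(v_new)
        K = np.stack(self.k)                     # (t, d)
        V = np.stack(self.v)                     # (t, d)
        scores = (q @ K.T) / np.sqrt(q.shape[-1])
        return softmax(scores) @ V               # (d,)

rng = np.random.default_rng(0)
cache = KVCache()
for t in range(5):
    # In a real decoder q/k/v come from learned projections of the new token.
    q = k = v = rng.normal(size=(8,))
    out = cache.step(q, k, v)
print(len(cache.k), out.shape)   # 5 (8,)
```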
## Files in This Module
- `attention_dev.py` - Main implementation with all attention mechanisms
- `attention_dev.ipynb` - Jupyter notebook (auto-generated)
- `module.yaml` - Module configuration and metadata
- `README.md` - This documentation file
## Usage Example
```python
from tinytorch.core.attention import ScaledDotProductAttention, MultiHeadAttention, create_causal_mask
from tinytorch.core.embeddings import Embedding, PositionalEncoding
# Create attention mechanisms
scaled_attn = ScaledDotProductAttention()
multi_head_attn = MultiHeadAttention(embed_dim=256, num_heads=8)
# Process sequences with self-attention
# (embeddings: a tensor of token embeddings built with Module 12's
#  Embedding and PositionalEncoding layers)
query = key = value = embeddings
output = multi_head_attn(query, key, value)
# Causal masking for autoregressive generation
causal_mask = create_causal_mask(seq_length)
masked_output = multi_head_attn(query, key, value, mask=causal_mask)
```
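The causal mask used above is just a lower-triangular boolean matrix: position i may attend to positions j ≤ i and nothing ahead of it. An illustrative standalone version (not necessarily the module's implementation):

```python
import numpy as np

def create_causal_mask(seq_length):
    # True where attention is allowed: row i can see columns 0..i.
    return np.tril(np.ones((seq_length, seq_length), dtype=bool))

print(create_causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```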
## Integration with TinyTorch
This module exports to `tinytorch.core.attention` and provides the attention foundation for:
- **Transformer blocks** (Module 14) - Complete transformer layer implementation
- **Language generation** - Efficient autoregressive text generation
- **Sequence modeling** - Advanced sequence processing architectures
## Systems Engineering Focus
This module emphasizes the systems engineering aspects of attention:
### Memory Characteristics
- **Quadratic scaling**: Attention score memory = O(batch_size × num_heads × seq_length²)
- **Memory bottleneck**: Attention often limits practical transformer sequence length
- **KV-cache benefits**: Reuses past keys/values so each generation step costs O(N) instead of recomputing O(N²) scores
- **GPU memory limits**: Determines maximum feasible sequence lengths
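The KV-cache benefit above can be quantified with a simplified count of query-key dot products during generation (ignoring projections and the value matmul):

```python
def score_dots_no_cache(n):
    # Without a cache, step t recomputes the full t x t score matrix:
    # t^2 query-key dot products.
    return sum(t * t for t in range(1, n + 1))

def score_dots_with_cache(n):
    # With a cache, step t computes one new 1 x t score row: t dot products.
    return sum(t for t in range(1, n + 1))

n = 1024
print(score_dots_no_cache(n) // score_dots_with_cache(n))
# ratio is (2n + 1) / 3, i.e. ~683x fewer score computations at n = 1024
```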
### Performance Considerations
- **Matrix multiplication bound**: Attention performance limited by GEMM operations
- **Memory bandwidth**: Large attention matrices stress memory subsystem
- **Parallelization**: Multi-head attention enables parallel computation
- **Generation patterns**: Autoregressive vs parallel processing trade-offs
## Prerequisites
- Module 02: Tensor (for matrix operations and data structures)
- Module 12: Embeddings (for understanding sequence representations)
- Understanding of matrix multiplication and softmax operations
## Estimated Time
5-6 hours including implementation, testing, and performance analysis
## Next Steps
After completing this module, you'll be ready for:
- **Module 14: Transformers** - Complete transformer block implementation
- Advanced transformer architectures and optimization techniques
- Production language model deployment and serving systems