Mirror of https://github.com/MLSysBook/TinyTorch.git, synced 2026-05-04 16:58:00 -05:00
- Removed temporary test files and audit reports
- Deleted backup and temp_holding directories
- Reorganized module structure (07->09 spatial, 09->07 dataloader)
- Added new modules: 11-14 (tokenization, embeddings, attention, transformers)
- Updated examples with historical ML milestones
- Cleaned up documentation structure
33 lines
1.2 KiB
YAML
name: "Attention"
number: 13
description: "Scaled dot-product and multi-head attention mechanisms that enable transformer architectures"

learning_objectives:
  - "Implement scaled dot-product attention with proper masking and numerical stability"
  - "Build multi-head attention with parallel head processing and output projection"
  - "Design KV-cache systems for efficient autoregressive generation"
  - "Understand attention's O(N²) scaling and memory optimization techniques"
  - "Analyze attention performance bottlenecks and production optimization strategies"

prerequisites:
  - "02_tensor"
  - "12_embeddings"

exports:
  - "ScaledDotProductAttention"
  - "MultiHeadAttention"
  - "KVCache"
  - "AttentionProfiler"

systems_concepts:
  - "Quadratic memory scaling O(N²) with sequence length"
  - "Memory-bandwidth bound attention computation"
  - "KV-cache optimization for autoregressive generation"
  - "Multi-head parallelization and hardware optimization"
  - "Attention masking patterns and causal dependencies"

ml_systems_focus: "Attention memory scaling, generation efficiency optimization, sequence length limitations"

estimated_time: "5-6 hours"

next_modules:
  - "14_transformers"