Module 13: Attention - The Mechanism That Revolutionized Language Understanding

Overview

This module implements the attention mechanisms that power modern transformer architectures. You'll build scaled dot-product attention, multi-head attention, and KV-cache systems while understanding how attention's quadratic scaling affects practical transformer deployment and optimization strategies.

What You'll Learn

Core Implementations

  • Scaled Dot-Product Attention: The fundamental attention mechanism with masking support (see the sketch after this list)
  • Multi-Head Attention: Parallel attention heads with linear projections and output combination
  • KV-Cache System: Efficient caching for autoregressive text generation
  • Causal Masking: Support for autoregressive language modeling patterns
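
As a reference point for the first bullet, here is a minimal NumPy sketch of scaled dot-product attention with optional masking. It illustrates the math only, not the module's actual ScaledDotProductAttention class; the shapes and the mask convention (True = attend) are assumptions.

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Minimal sketch: softmax(Q K^T / sqrt(d_k)) V with optional masking.

    Q, K, V: (seq_len, d_k) arrays; mask: (seq_len, seq_len) boolean
    array where False positions are blocked (an assumed convention).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) logits
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~-inf
    # Numerically stable softmax over the key dimension
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # weighted sum of values

# Self-attention over a toy sequence: 4 tokens, d_k = 8
x = np.random.randn(4, 8)
out = scaled_dot_product_attention(x, x, x)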

ML Systems Concepts

  • Quadratic Scaling: How O(N²) memory scaling limits transformer sequence length (worked numbers after this list)
  • Memory Bottlenecks: Understanding attention as the memory constraint in transformers
  • Generation Efficiency: KV-cache optimization for production text generation
  • Hardware Optimization: Attention parallelization and memory bandwidth optimization
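
To make the quadratic-scaling bullet concrete, the snippet below computes the size of one raw float32 attention-weight matrix at a few sequence lengths, for a single batch element and head. The numbers are illustrative arithmetic, not measurements of this module's implementation.

# Bytes for one (seq_len x seq_len) float32 attention matrix,
# per batch element and per head -- the O(N^2) term in attention memory.
for seq_len in [512, 2048, 8192, 32768]:
    n_bytes = seq_len * seq_len * 4        # float32 = 4 bytes
    print(f"N={seq_len:>6}: {n_bytes / 2**20:>8.1f} MiB")

# Prints 1.0, 16.0, 256.0, 4096.0 MiB: doubling N quadruples the matrix,
# so N=32768 already needs 4 GiB per head per batch element.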

Performance Engineering

  • Attention Profiling: Measuring how computation time and memory usage scale with sequence length (see the timing sketch after this list)
  • Scaling Analysis: Understanding practical limits of attention-based architectures
  • Optimization Techniques: Memory-efficient attention patterns and cache management
  • Production Patterns: Real-world attention system design and deployment strategies
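
A minimal timing sketch for the profiling bullet, using NumPy attention as a stand-in for the module's implementation (the function name and dimensions are illustrative):

import time
import numpy as np

def attention_time(seq_len, d_k=64):
    """Time one NumPy attention pass over a random sequence."""
    Q = K = V = np.random.randn(seq_len, d_k)
    start = time.perf_counter()
    scores = Q @ K.T / np.sqrt(d_k)        # O(N^2) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    _ = weights @ V
    return time.perf_counter() - start

# Doubling seq_len should roughly quadruple the runtime
for n in [256, 512, 1024, 2048]:
    print(f"N={n:>5}: {attention_time(n) * 1e3:8.2f} ms")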

Key Learning Outcomes

By completing this module, you'll understand:

  1. Attention Mathematics: The scaled dot-product attention formula (shown below) and its implementation
  2. Multi-Head Architecture: How parallel attention heads capture diverse relationships
  3. Memory Scaling: Why attention's O(N²) complexity fundamentally limits sequence length
  4. Generation Optimization: How KV-cache dramatically improves autoregressive efficiency
  5. Production Systems: How real transformers optimize attention for deployment constraints
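
For reference, the formula behind outcome 1 is

  Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

where d_k is the per-head key dimension; dividing by √d_k keeps the dot-product logits at unit scale so the softmax does not saturate.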

Files in This Module

  • attention_dev.py - Main implementation with all attention mechanisms
  • attention_dev.ipynb - Jupyter notebook (auto-generated)
  • module.yaml - Module configuration and metadata
  • README.md - This documentation file

Usage Example

from tinytorch.core.attention import ScaledDotProductAttention, MultiHeadAttention
from tinytorch.core.embeddings import Embedding, PositionalEncoding

# Create attention mechanisms
scaled_attn = ScaledDotProductAttention()
multi_head_attn = MultiHeadAttention(embed_dim=256, num_heads=8)

# Process sequences with self-attention: query, key, and value all come
# from the same embeddings (produced by Module 12's Embedding +
# PositionalEncoding pipeline)
query = key = value = embeddings
output = multi_head_attn(query, key, value)

# Causal masking for autoregressive generation; seq_length matches the
# sequence dimension of the embeddings
causal_mask = create_causal_mask(seq_length)
masked_output = multi_head_attn(query, key, value, mask=causal_mask)
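
The create_causal_mask helper above comes from this module's build; a lower-triangular boolean mask like the NumPy sketch below captures the idea (the True-means-attend convention is an assumption):

import numpy as np

def causal_mask_sketch(seq_length):
    """Illustrative stand-in for create_causal_mask: position i may
    attend to positions 0..i only (True = attend, assumed convention)."""
    return np.tril(np.ones((seq_length, seq_length), dtype=bool))

print(causal_mask_sketch(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]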

Integration with TinyTorch

This module exports to tinytorch.core.attention and provides the attention foundation for:

  • Transformer blocks (Module 14) - Complete transformer layer implementation
  • Language generation - Efficient autoregressive text generation
  • Sequence modeling - Advanced sequence processing architectures

Systems Engineering Focus

This module emphasizes the systems engineering aspects of attention:

Memory Characteristics

  • Quadratic scaling: Attention memory = O(batch_size × seq_length²)
  • Memory bottleneck: Attention often limits practical transformer sequence length
  • KV-cache benefits: Caching keys and values cuts per-step generation cost from O(N²) (recomputing full attention) to O(N) (one query row; see the sketch after this list)
  • GPU memory limits: Determines maximum feasible sequence lengths
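
A sketch of the KV-cache pattern behind these bullets: at each generation step, the new token's key and value are appended to a cache, and only the single new query attends over the cached keys. The class name and API here are illustrative, not the module's actual interface.

import numpy as np

class KVCacheSketch:
    """Illustrative KV-cache: append-only key/value storage per step."""
    def __init__(self):
        self.keys = []    # one (d_k,) vector per generated position
        self.values = []

    def step(self, q, k, v):
        """Attend the single new query over all cached keys: O(N) work,
        versus O(N^2) if the full sequence were recomputed each step."""
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)                # (N, d_k)
        V = np.stack(self.values)              # (N, d_k)
        scores = K @ q / np.sqrt(q.shape[-1])  # (N,) -- one row, not N x N
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V

cache = KVCacheSketch()
for _ in range(5):                             # 5 autoregressive steps
    q = k = v = np.random.randn(8)
    out = cache.step(q, k, v)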

Performance Considerations

  • Matrix multiplication bound: Attention performance limited by GEMM operations
  • Memory bandwidth: Large attention matrices stress memory subsystem
  • Parallelization: Multi-head attention enables parallel computation (see the sketch after this list)
  • Generation patterns: Autoregressive vs parallel processing trade-offs
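
To illustrate the parallelization bullet, this sketch splits an embedding into heads and runs every head's attention as one batched matrix multiply. The head count and dimensions are illustrative, and the learned Q/K/V/output projections of a real MultiHeadAttention are omitted for brevity.

import numpy as np

def multi_head_sketch(x, num_heads):
    """Split (seq_len, embed_dim) into heads and attend all heads at once
    via batched matmuls -- the parallelism that hardware exploits."""
    seq_len, embed_dim = x.shape
    d_head = embed_dim // num_heads
    # (num_heads, seq_len, d_head): each head is an independent slice
    h = x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = h @ h.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, N, N) batched GEMM
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ h                                    # (H, N, d_head)
    # Recombine heads into (seq_len, embed_dim)
    return out.transpose(1, 0, 2).reshape(seq_len, embed_dim)

x = np.random.randn(16, 256)                             # 16 tokens, embed 256
y = multi_head_sketch(x, num_heads=8)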

Prerequisites

  • Module 02: Tensor (for matrix operations and data structures)
  • Module 12: Embeddings (for understanding sequence representations)
  • Understanding of matrix multiplication and softmax operations

Estimated Time

5-6 hours including implementation, testing, and performance analysis

Next Steps

After completing this module, you'll be ready for:

  • Module 14: Transformers - Complete transformer block implementation
  • Advanced transformer architectures and optimization techniques
  • Production language model deployment and serving systems