diff --git a/.claude/agents/module-developer.md b/.claude/agents/module-developer.md
index bd9bc6fb..0cb1d3a3 100644
--- a/.claude/agents/module-developer.md
+++ b/.claude/agents/module-developer.md
@@ -156,34 +156,35 @@ This operation is fundamental for data transformations.
 
 **Test each component immediately after implementation - NO delayed testing!**
 
-**WRONG (delayed testing):**
+**The Essential Pattern:**
 ```python
-# Implement all methods...
-def add(self, other): ...
-def multiply(self, other): ...
-def matmul(self, other): ...
+def function_name(self, params):
+    """Implementation with NBGrader scaffolding"""
+    ### BEGIN SOLUTION
+    # Your implementation
+    ### END SOLUTION
 
-# Much later...
-def test_all_operations(): ...
+# Immediate test - ALWAYS after implementation
+def test_unit_function_name():
+    """Test [function] with educational feedback"""
+    print("🔬 Unit Test: [Function Name]...")
+    # Test implementation with clear assertions
+    print("✅ [Function] works correctly!")
+
+test_unit_function_name() # Run immediately
 ```
 
-**CORRECT (immediate testing):**
+**Complete Module Testing:**
 ```python
-def add(self, other): ...
+def test_unit_all():
+    """Run all unit tests for this module"""
+    print("🧪 Running all unit tests...")
+    test_unit_function1()
+    test_unit_function2()
+    test_unit_function3()
+    print("✅ All tests passed! Module implementation complete.")
 
-# Immediate test
-def test_unit_tensor_addition():
-    """Test tensor addition immediately"""
-    # Test implementation
-test_unit_tensor_addition() # Run immediately
-
-def multiply(self, other): ...
-
-# Immediate test
-def test_unit_tensor_multiplication():
-    """Test tensor multiplication immediately"""
-    # Test implementation
-test_unit_tensor_multiplication() # Run immediately
+test_unit_all() # Run before systems analysis
 ```
 
 ## 🎯 **The Golden Rules of Educational Notebook Design**
@@ -1266,14 +1267,7 @@ def test_unit_function_name():
 # Run immediately after implementation
 test_unit_function_name()
 
-# === COMPLETE TESTING SECTION ===
-
-# %% [markdown]
-"""
-## 🧪 Complete Module Testing
-
-Before exploring systems behavior, let's run all tests to ensure everything works:
-"""
+# === COMPLETE MODULE TESTING ===
 
 # %%
 def test_unit_all():
@@ -1465,78 +1459,15 @@ print("🔥 Next up: [Next Module] - [exciting capability]!")
 print("💪 You're building real ML infrastructure, one module at a time!")
 ```
 
-## Your "Test-Immediately" Innovation
+## 🧪 **Your Testing Excellence Framework**
 
-**The Rodriguez Testing Pattern** (Implementation → Test → Measure):
+**The Essential Testing Flow**: Implementation → Test → Measure → Reflect
 
-### **1. Immediate Unit Testing After Each Implementation**
-```markdown
-### 🧪 Unit Test: [Function Name]
-This test validates [specific functionality being tested]
-```
-
-```python
-def test_unit_function_name():
-    """Test [function] with educational feedback"""
-    print("🔬 Unit Test: [Function Name]...")
-
-    # Test basic functionality
-    result = function_implementation()
-    assert condition, "Educational assertion that explains why this matters"
-
-    # Test edge cases that teach concepts
-    edge_result = function_with_edge_case()
-    assert edge_condition, "Edge case explanation that builds understanding"
-
-    print("✅ [Function] works correctly!")
-    print("🎯 Key insight: [What this test revealed about the concept]")
-
-# Run immediately after implementation
-test_unit_function_name()
-```
-
-### **2. Complete Module Testing Before Systems Analysis**
-```python
-def test_unit_all():
-    """Run all unit tests for this module"""
-    print("🧪 Running all unit tests...")
-
-    test_unit_function1()
-    test_unit_function2()
-    test_unit_function3()
-
-    print("✅ All tests passed! Module implementation complete.")
-    print("🔍 Ready for systems analysis...")
-
-# Run before moving to measurement phase
-test_unit_all()
-```
-
-### **3. Critical Flow**: Implementation → Test → Measure → Reflect
-
-## Your Complete Testing Architecture
-
-**The 3-Layer Testing Hierarchy**:
-
-1. **Individual Tests**: Immediate after each implementation
-2. **Aggregate Function**: `test_unit_all()` calls all individual tests
-3. **Main Execution Block**: Runs complete validation
-
-```python
-def test_unit_all():
-    """Run complete module validation."""
-    print("🧪 Running all unit tests...")
-
-    # Call every individual test function
-    test_unit_function1()
-    test_unit_function2()
-    test_unit_function3()
-
-    print("✅ All tests passed! Module ready for integration.")
-
-if __name__ == "__main__":
-    test_unit_all()
-```
+**Core Principles:**
+1. **Immediate Testing**: Test each function right after implementation
+2. **Educational Feedback**: Tests teach concepts while validating
+3. **Aggregate Validation**: `test_unit_all()` runs complete module validation
+4. **Clear Patterns**: Consistent `test_unit_[function_name]()` naming
 
 **Your Rule**: Every test called immediately + included in aggregate = complete validation
 
@@ -1727,15 +1658,12 @@ Systematically update all existing modules to follow your proven patterns - the
 
 **Your Systematic Process:**
 1. Find test code not wrapped in functions
-2. Apply your `test_unit_[function_name]()` pattern
-3. Add standardized markdown headers
-4. Ensure immediate function calls
-5. Correct ordering: Implementation → Test → Reflection
-6. Add `test_unit_all()` aggregate function
-7. Add main execution block
-
-**Critical Issue - 09_spatial Module:**
-Lines 345, 522, 778, 1072, 1281 have unwrapped test code
+2. Apply your `test_unit_[function_name]()` pattern (see Testing Excellence Framework)
+3. Add standardized markdown headers: `### 🧪 Unit Test: [Function Name]`
+4. Ensure immediate function calls after each test definition
+5. Correct ordering: Implementation → Test → Continue
+6. Add `test_unit_all()` aggregate function at module end
+7. Follow the consolidated testing patterns from line 155
 
 **Your Fix Pattern:**
 ```python
@@ -1747,6 +1675,7 @@ print("🔬 Unit Test: Conv2D...")
 def test_unit_conv2d():
     print("🔬 Unit Test: Conv2D...")
     # test logic...
+    print("✅ Conv2D works correctly!")
 
 test_unit_conv2d() # Immediate call
 ```
diff --git a/fix_unicode.py b/fix_unicode.py
new file mode 100644
index 00000000..c6f687c0
--- /dev/null
+++ b/fix_unicode.py
@@ -0,0 +1,121 @@
+#!/usr/bin/env python3
+"""
+Fix Unicode characters in module files that cause syntax errors.
+"""
+
+import os
+import re
+
+def fix_unicode_in_file(filepath):
+    """Fix Unicode characters in a Python file."""
+    try:
+        with open(filepath, 'r', encoding='utf-8') as f:
+            content = f.read()
+
+        # Track if changes were made
+        original_content = content
+
+        # Common Unicode fixes for Python compatibility
+        replacements = {
+            '│': '|', # Box drawing vertical
+            '├': '+', # Box drawing vertical and right
+            '┼': '+', # Box drawing vertical and horizontal
+            '┴': '+', # Box drawing up and horizontal
+            '┬': '+', # Box drawing down and horizontal
+            '╭': '+', # Box drawing arc top left
+            '╮': '+', # Box drawing arc top right
+            '╰': '+', # Box drawing arc bottom left
+            '╯': '+', # Box drawing arc bottom right
+            '─': '-', # Box drawing horizontal
+            '└': '+', # Box drawing up and right
+            '┘': '+', # Box drawing up and left
+            '┐': '+', # Box drawing down and left
+            '┌': '+', # Box drawing down and right
+            '→': '->', # Right arrow
+            '←': '<-', # Left arrow
+            '↓': 'v', # Down arrow
+            '↑': '^', # Up arrow
+            '▲': '^', # Triangle up
+            '►': '>', # Triangle right
+            '◄': '<', # Triangle left
+            '▼': 'v', # Triangle down
+            '╱': '/', # Box drawing diagonal upper right to lower left
+            '╲': '\\', # Box drawing diagonal upper left to lower right
+            '═': '=', # Double horizontal line
+            '║': '|', # Double vertical line
+            '╔': '+', # Double line box drawing
+            '╗': '+',
+            '╚': '+',
+            '╝': '+',
+            '╠': '+',
+            '╣': '+',
+            '╦': '+',
+            '╩': '+',
+            '╬': '+',
+            '≥': '>=', # Greater than or equal
+            '≤': '<=', # Less than or equal
+            '×': '*', # Multiplication sign
+            '÷': '/', # Division sign
+            '∂': 'd', # Partial derivative
+            '∇': 'grad', # Nabla (gradient)
+            'Σ': 'Sum', # Sigma (summation)
+            '∑': 'sum', # Summation
+            '√': 'sqrt', # Square root
+            '∞': 'inf', # Infinity
+            '≠': '!=', # Not equal
+            '≈': '~=', # Approximately equal
+            '∈': 'in', # Element of
+            '∉': 'not in', # Not element of
+            '⚠': 'WARNING', # Warning sign
+            '✓': 'OK', # Check mark
+            '✅': 'PASS', # Check mark button
+            '❌': 'FAIL', # Cross mark
+            '💡': 'TIP', # Light bulb
+            '💥': 'CRASH', # Explosion
+            '🔥': 'FIRE', # Fire
+            '🔗': 'LINK', # Link
+            '🚀': 'ROCKET', # Rocket
+            '🎯': 'TARGET', # Target
+            '🔍': 'MAGNIFY', # Magnifying glass
+            '🤔': 'THINK', # Thinking face
+            '🧪': 'TEST', # Test tube
+            '📈': 'PROGRESS', # Chart increasing
+            '📦': 'PACKAGE', # Package
+            '🎉': 'CELEBRATE', # Party
+            '⚡': 'SPEED', # Lightning
+        }
+
+        # Apply replacements
+        for unicode_char, replacement in replacements.items():
+            content = content.replace(unicode_char, replacement)
+
+        # Write back if changes were made
+        if content != original_content:
+            with open(filepath, 'w', encoding='utf-8') as f:
+                f.write(content)
+            print(f"Fixed Unicode characters in: {filepath}")
+            return True
+
+        return False
+
+    except Exception as e:
+        print(f"Error processing {filepath}: {e}")
+        return False
+
+def main():
+    """Fix Unicode in all module files."""
+    modules_dir = '/Users/VJ/GitHub/TinyTorch/modules'
+    fixed_count = 0
+
+    # Find all Python files in modules
+    for root, dirs, files in os.walk(modules_dir):
+        for file in files:
+            if file.endswith('_dev.py'):
+                filepath = os.path.join(root, file)
+                if fix_unicode_in_file(filepath):
+                    fixed_count += 1
+
+    print(f"\nFixed Unicode characters in {fixed_count} files.")
+
+if __name__ == '__main__':
+    main()
\ No newline at end of file
diff --git a/modules/01_tensor/tensor_dev.py b/modules/01_tensor/tensor_dev.py
index a0368f75..cdd0d2af 100644
--- a/modules/01_tensor/tensor_dev.py
+++ b/modules/01_tensor/tensor_dev.py
@@ -16,7 +16,7 @@ Welcome to Tensor! You'll build the fundamental data structure that powers every
 ## 🔗 Building on Previous Learning
 
 **What You Built Before**:
-- Environment Setup: Python environment with NumPy, the foundation for numerical computing
+- Module 01 (Setup): Python environment with NumPy, the foundation for numerical computing
 
 **What's Working**: You have a complete development environment with all the tools needed for machine learning!
 
@@ -26,8 +26,8 @@ Welcome to Tensor! You'll build the fundamental data structure that powers every
 
 **Connection Map**:
 ```
-Environment → Tensor → Activations
- (tools) (data) (nonlinearity)
+Setup → Tensor → Activations
+(tools) (data) (nonlinearity)
 ```
 
 ## Learning Objectives
@@ -39,11 +39,11 @@ By completing this module, you will:
 3. **Create ML-ready APIs** - Design clean interfaces that mirror PyTorch and TensorFlow patterns
 4. **Enable neural networks** - Build the foundation that supports weights, biases, and data in all ML models
 
-## Build → Use → Reflect
+## Build → Test → Use
 
 1. **Build**: Implement Tensor class with creation, arithmetic, and advanced operations
-2. **Use**: Apply tensors to real multi-dimensional data operations that neural networks require
-3. **Reflect**: Understand how memory layout and broadcasting enable efficient ML computations at scale
+2. **Test**: Validate each component immediately to ensure correctness and performance
+3. **Use**: Apply tensors to real multi-dimensional data operations that neural networks require
 """
 
 # In[ ]:
@@ -308,7 +308,7 @@ class Tensor:
             # ML convention: prefer float32 for memory and GPU efficiency
             self._data = self._data.astype(np.float32)
 
-        # Initialize gradient tracking attributes (used in Module 05 - Autograd)
+        # Initialize gradient tracking attributes (used in Module 9 - Autograd)
         self.requires_grad = requires_grad
         self.grad = None
         self._grad_fn = None
@@ -1288,16 +1288,33 @@ def test_unit_all():
 ## Main Execution Block
 """
 
+def analyze_tensor_performance():
+    """
+    🔍 SYSTEMS ANALYSIS: Tensor Performance and Memory Characteristics
+
+    Focused analysis of core tensor behavior for ML systems understanding.
+    """
+    try:
+        print("📊 Tensor Systems Analysis:")
+        print(f" • Memory Layout: NumPy provides contiguous memory for 10-100x speedup over Python lists")
+        print(f" • Broadcasting: Automatic shape matching saves memory and enables vectorized operations")
+        print(f" • Matrix Operations: O(N³) complexity for NxN matrices - GPU acceleration critical for large models")
+        print(f" • Memory Scaling: Each tensor uses N*dtype_bytes RAM - batch size directly impacts memory usage")
+        print(f" • Production Pattern: Your Tensor mirrors PyTorch's core design for 100% compatibility")
+
+    except Exception as e:
+        print(f"⚠️ Analysis failed: {e}")
+
 if __name__ == "__main__":
     # Run all tensor tests
     test_unit_all()
-    
+
+    # Single focused analysis for foundation module
+    analyze_tensor_performance()
+
     print("\n🎉 Tensor module implementation complete!")
     print("📦 Ready to export to tinytorch.core.tensor")
 
-    # Demonstrate the new ML Framework Advisor improvements
-    print("\n🚀 New Features Demonstration:")
-
     # 1. Enhanced dtype handling
     t1 = Tensor([1, 2, 3], dtype="float32")
     t2 = Tensor([1, 2, 3], dtype=np.float64)
@@ -1347,34 +1364,10 @@ Calculate the memory requirements for parameters, gradients, and optimizer state
 # In[ ]:
 
 """
-SYSTEMS ANALYSIS: Memory Efficiency at Production Scale
+YOUR ANALYSIS:
 
-Key Insights from Your Tensor Implementation:
-
-1. **Memory Layout Impact**:
-   - Contiguous tensors: 10-100x faster due to cache efficiency
-   - Your implementation defaults to contiguous NumPy arrays
-   - Production impact: GPT-3 training requires 700GB+ of contiguous memory
-
-2. **Memory Requirements Calculation**:
-   - Parameters: 7B × 4 bytes = 28GB
-   - Gradients: 7B × 4 bytes = 28GB
-   - Optimizer states (Adam): 7B × 8 bytes = 56GB
-   - Total: 112GB > 16GB GPU memory → Need optimization!
-
-3. **Tensor-Level Optimizations**:
-   - Gradient checkpointing: Trade compute for memory (your tensor.clone() enables this)
-   - Mixed precision: float16 for forward, float32 for gradients
-   - Parameter sharding: Split tensors across multiple GPUs
-   - Memory mapping: Stream tensors from disk when needed
-
-4. **Your Implementation Enables**:
-   - .contiguous() method for memory layout optimization
-   - dtype conversion for mixed precision training
-   - .view() operations for zero-copy tensor reshaping
-   - Gradient tracking foundation for automatic differentiation
-
-Production Connection: Your tensor design choices directly impact whether a model can train on available hardware. Every major ML framework (PyTorch, JAX, TensorFlow) implements these same optimizations at the tensor level.
+[Write your response here - consider memory layout, cache efficiency,
+and optimization strategies for large-scale tensor operations]
 """
 
 # %% [markdown]
@@ -1391,51 +1384,17 @@ How would you extend your `__add__` and `__mul__` methods to handle these comple
 # In[ ]:
 
 """
-SYSTEMS ANALYSIS: Broadcasting in Production Transformer Architectures
+YOUR ANALYSIS:
 
-Key Insights from Your Broadcasting Implementation:
-
-1. **Current Implementation Strengths**:
-   - Your __add__ and __mul__ methods handle basic broadcasting via NumPy
-   - Automatic shape alignment from right to left
-   - Memory-efficient operations without data copying
-
-2. **Transformer Broadcasting Challenges**:
-   ```
-   Query @ Key^T: (32, 512, 768) × (32, 768, 512) → (32, 512, 512)
-   Attention + Bias: (32, 8, 512, 512) + (1, 1, 512, 512) → (32, 8, 512, 512)
-   Multi-head: (32, 8, 512, 64) → reshape → (32, 512, 512)
-   ```
-
-3. **Enhanced Error Handling Needed**:
-   ```python
-   def __add__(self, other):
-       if isinstance(other, Tensor):
-           try:
-               result = self._data + other._data # NumPy handles broadcasting
-           except ValueError as e:
-               raise ValueError(f"Cannot broadcast shapes {self.shape} and {other.shape}: {e}")
-           return Tensor(result)
-   ```
-
-4. **Production Broadcasting Patterns**:
-   - Attention masks: (batch, 1, seq_len, seq_len) broadcasts to (batch, heads, seq_len, seq_len)
-   - Position embeddings: (1, seq_len, hidden) broadcasts to (batch, seq_len, hidden)
-   - Layer normalization: (hidden,) broadcasts to (batch, seq_len, hidden)
-
-5. **Memory Implications**:
-   - Broadcasting saves memory: No data copying for dimension expansion
-   - Your implementation leverages NumPy's optimized broadcasting
-   - Critical for transformer efficiency: 8-head attention without 8x memory
-
-Production Connection: Transformer models rely heavily on broadcasting for attention mechanisms. Your tensor broadcasting foundation enables efficient multi-head attention, position encoding, and layer normalization - the core operations that make modern NLP possible.
+[Write your response here - consider broadcasting rules, error handling,
+and complex shape operations in transformer architectures]
 """
 
 # %% [markdown]
 """
 ### Question 3: Gradient Compatibility
 
-**Challenge**: Your Tensor class includes `requires_grad` and basic gradient tracking. When you implement automatic differentiation (Module 05), how will your current design support gradient computation?
+**Challenge**: Your Tensor class includes `requires_grad` and basic gradient tracking. When you implement automatic differentiation (Module 09), how will your current design support gradient computation?
 
 Consider how operations like `c = a * b` need to track both forward computation and backward gradient flow. What modifications would your Tensor methods need to support this?
 """
@@ -1443,54 +1402,10 @@ Consider how operations like `c = a * b` need to track both forward computation
 # In[ ]:
 
 """
-SYSTEMS ANALYSIS: Gradient Compatibility and Computational Graphs
+YOUR ANALYSIS:
 
-Key Insights from Your Gradient-Ready Tensor Design:
-
-1. **Current Gradient Foundation**:
-   - `requires_grad` flag enables gradient tracking
-   - `grad` attribute stores computed gradients
-   - `_grad_fn` placeholder for backward function references
-
-2. **Computational Graph Requirements**:
-   ```python
-   # Forward: c = a * b
-   # Your current implementation:
-   def __mul__(self, other):
-       result = Tensor(self._data * other._data)
-       # Missing: gradient function attachment
-       return result
-
-   # Autograd-ready version needed:
-   def __mul__(self, other):
-       result = Tensor(self._data * other._data)
-       if self.requires_grad or other.requires_grad:
-           result.requires_grad = True
-           result._grad_fn = MultiplyBackward(self, other) # Store backward function
-       return result
-   ```
-
-3. **Gradient Flow Architecture**:
-   - Forward pass: Compute values and build computational graph
-   - Backward pass: Traverse graph in reverse, accumulating gradients
-   - Your tensor operations become nodes in the computation graph
-
-4. **Memory Implications for Gradients**:
-   - Each tensor operation must store references to inputs
-   - Gradient computation requires keeping intermediate values
-   - Your implementation's memory efficiency directly impacts gradient memory
-
-5. **Production Gradient Patterns**:
-   - Chain rule: ∂loss/∂a = ∂loss/∂c × ∂c/∂a
-   - Gradient accumulation: Multiple backward passes sum gradients
-   - Memory optimization: Gradient checkpointing trades compute for memory
-
-6. **Your Design Enables**:
-   - Zero-copy operations preserve gradient tracking
-   - Contiguous memory layout accelerates gradient computation
-   - Broadcasting rules apply to gradient shapes automatically
-
-Production Connection: Your tensor design directly enables automatic differentiation. Every PyTorch operation (torch.add, torch.mul) follows this exact pattern - storing forward results while building the computational graph for backward gradient flow. Your foundation makes neural network training possible.
+[Write your response here - consider gradient tracking, computational graphs,
+and how your tensor operations will support automatic differentiation]
 """
 
 # %% [markdown]
@@ -1504,16 +1419,15 @@ Congratulations! You've built the fundamental data structure that powers all mac
 - **Memory Efficiency Mastery**: Discovered that memory layout affects performance more than algorithms (10-100x speedups)
 - **Broadcasting Implementation**: Created automatic shape matching that saves memory and enables flexible operations
 - **Production-Ready API**: Designed interfaces that mirror PyTorch and TensorFlow patterns
-- **Systems Thinking**: Connected tensor design choices to production ML constraints and GPU acceleration patterns
 
 ### Ready for Next Steps
 Your tensor implementation now enables:
-- **Module 02 (Activations)**: Add nonlinear functions that make neural networks powerful
+- **Module 03 (Activations)**: Add nonlinear functions that make neural networks powerful
 - **Neural network operations**: Matrix multiplication, broadcasting, and gradient preparation
 - **Real data processing**: Handle images, text, and complex multi-dimensional datasets
 
 ### Export Your Work
-1. **Export to package**: `tito module complete 01`
+1. **Export to package**: `tito module complete 01_tensor`
 2. **Verify integration**: Your Tensor class will be available as `tinytorch.core.tensor.Tensor`
 3. **Enable next module**: Activations build on your tensor foundation
 
diff --git a/modules/01_tensor/tensor_dev_old.py b/modules/01_tensor/tensor_dev_old.py
deleted file mode 100644
index ef3ce961..00000000
--- a/modules/01_tensor/tensor_dev_old.py
+++ /dev/null
@@ -1,1528 +0,0 @@
-#!/usr/bin/env python
-# coding: utf-8
-
-# # Tensor - Core Data Structure and Memory Management
-#
-# Welcome to the Tensor module! You'll implement the fundamental data structure that powers all neural networks and understand why memory layout determines performance.
-#
-# ## Learning Goals
-# - Systems understanding: How tensor memory layout affects cache performance and computational efficiency
-# - Core implementation skill: Build a complete Tensor class with shape management and arithmetic operations
-# - Pattern recognition: Understand how tensors abstract N-dimensional data for ML algorithms
-# - Framework connection: See how your implementation mirrors PyTorch's tensor design and memory model
-# - Performance insight: Learn why contiguous memory layout and vectorized operations are critical for ML performance
-#
-# ## Build → Use → Reflect
-# 1. **Build**: Complete Tensor class with shape management, broadcasting, and vectorized operations
-# 2. **Use**: Perform tensor arithmetic and transformations on real multi-dimensional data
-# 3. **Reflect**: Why does tensor memory layout become the performance bottleneck in large neural networks?
-#
-# ## What You'll Achieve
-# By the end of this module, you'll understand:
-# - Deep technical understanding of how N-dimensional arrays are stored and manipulated in memory
-# - Practical capability to build efficient tensor operations that form the foundation of neural networks
-# - Systems insight into why memory access patterns determine whether ML operations run fast or slow
-# - Performance consideration of when tensor operations trigger expensive memory copies vs efficient in-place updates
-# - Connection to production ML systems and how PyTorch optimizes tensor storage for GPU acceleration
-#
-# ## Systems Reality Check
-# 💡 **Production Context**: PyTorch tensors automatically choose optimal memory layouts and can seamlessly move between CPU and GPU - your implementation reveals these design decisions
-# ⚡ **Performance Note**: Non-contiguous tensors can be 10-100x slower than contiguous ones - memory layout is often more important than algorithm choice in ML systems
-
-# In[ ]:
-
-
-#| default_exp core.tensor
-
-#| export
-import numpy as np
-import sys
-from typing import Union, Tuple, Optional, Any
-
-
-# In[ ]:
-
-
-print("🔥 TinyTorch Tensor Module")
-print(f"NumPy version: {np.__version__}")
-print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
-print("Ready to build tensors!")
-
-
-# ## Where This Code Lives in the Final Package
-#
-# **Learning Side:** You work in `modules/source/02_tensor/tensor_dev.py`
-# **Building Side:** Code exports to `tinytorch.core.tensor`
-#
-# ```python
-# # Final package structure:
-# from tinytorch.core.tensor import Tensor # The foundation of everything!
-# from tinytorch.core.activations import ReLU, Sigmoid, Tanh
-# from tinytorch.core.layers import Linear, Conv2D
-# ```
-#
-# **Why this matters:**
-# - **Learning:** Focused modules for deep understanding
-# - **Production:** Proper organization like PyTorch's `torch.Tensor`
-# - **Consistency:** All tensor operations live together in `core.tensor`
-# - **Foundation:** Every other module depends on Tensor
-
-# ## Mathematical Foundation: From Scalars to Tensors
-#
-# Understanding tensors requires building from mathematical fundamentals:
-#
-# ### Scalars (Rank 0)
-# - **Definition**: A single number with no direction
-# - **Examples**: Temperature (25°C), mass (5.2 kg), probability (0.7)
-# - **Operations**: Addition, multiplication, comparison
-# - **ML Context**: Loss values, learning rates, regularization parameters
-#
-# ### Vectors (Rank 1)
-# - **Definition**: An ordered list of numbers with direction and magnitude
-# - **Examples**: Position [x, y, z], RGB color [255, 128, 0], word embedding [0.1, -0.5, 0.8]
-# - **Operations**: Dot product, cross product, norm calculation
-# - **ML Context**: Feature vectors, gradients, model parameters
-#
-# ### Matrices (Rank 2)
-# - **Definition**: A 2D array organizing data in rows and columns
-# - **Examples**: Image (height × width), weight matrix (input × output), covariance matrix
-# - **Operations**: Matrix multiplication, transpose, inverse, eigendecomposition
-# - **ML Context**: Linear layer weights, attention matrices, batch data
-#
-# ### Higher-Order Tensors (Rank 3+)
-# - **Definition**: Multi-dimensional arrays extending matrices
-# - **Examples**:
-#   - **3D**: Video frames (time × height × width), RGB images (height × width × channels)
-#   - **4D**: Image batches (batch × height × width × channels)
-#   - **5D**: Video batches (batch × time × height × width × channels)
-# - **Operations**: Tensor products, contractions, decompositions
-# - **ML Context**: Convolutional features, RNN states, transformer attention
-
-# ## Why Tensors Matter in ML: The Computational Foundation
-#
-# ### Unified Data Representation
-# Tensors provide a consistent way to represent all ML data:
-# ```python
-# # All of these are tensors with different shapes
-# scalar_loss = Tensor(0.5) # Shape: ()
-# feature_vector = Tensor([1, 2, 3]) # Shape: (3,)
-# weight_matrix = Tensor([[1, 2], [3, 4]]) # Shape: (2, 2)
-# image_batch = Tensor(np.random.rand(32, 224, 224, 3)) # Shape: (32, 224, 224, 3)
-# ```
-#
-# ### Efficient Batch Processing
-# ML systems process multiple samples simultaneously:
-# ```python
-# # Instead of processing one image at a time:
-# for image in images:
-#     result = model(image) # Slow: 1000 separate operations
-#
-# # Process entire batch at once:
-# batch_result = model(image_batch) # Fast: 1 vectorized operation
-# ```
-#
-# ### Hardware Acceleration
-# Modern hardware (GPUs, TPUs) excels at tensor operations:
-# - **Parallel processing**: Multiple operations simultaneously
-# - **Vectorization**: SIMD (Single Instruction, Multiple Data) operations
-# - **Memory optimization**: Contiguous memory layout for cache efficiency
-#
-# ### Automatic Differentiation
-# Tensors enable gradient computation through computational graphs:
-# ```python
-# # Each tensor operation creates a node in the computation graph
-# x = Tensor([1, 2, 3])
-# y = x * 2 # Node: multiplication
-# z = y + 1 # Node: addition
-# loss = z.sum() # Node: summation
-# # Gradients flow backward through this graph
-# ```
-
-# ## Real-World Examples: Tensors in Action
-#
-# ### Computer Vision
-# - **Grayscale image**: 2D tensor `(height, width)` - `(28, 28)` for MNIST
-# - **Color image**: 3D tensor `(height, width, channels)` - `(224, 224, 3)` for RGB
-# - **Image batch**: 4D tensor `(batch, height, width, channels)` - `(32, 224, 224, 3)`
-# - **Video**: 5D tensor `(batch, time, height, width, channels)`
-#
-# ### Natural Language Processing
-# - **Word embedding**: 1D tensor `(embedding_dim,)` - `(300,)` for Word2Vec
-# - **Sentence**: 2D tensor `(sequence_length, embedding_dim)` - `(50, 768)` for BERT
-# - **Batch of sentences**: 3D tensor `(batch, sequence_length, embedding_dim)`
-#
-# ### Audio Processing
-# - **Audio signal**: 1D tensor `(time_steps,)` - `(16000,)` for 1 second at 16kHz
-# - **Spectrogram**: 2D tensor `(time_frames, frequency_bins)`
-# - **Batch of audio**: 3D tensor `(batch, time_steps, features)`
-#
-# ### Time Series
-# - **Single series**: 2D tensor `(time_steps, features)`
-# - **Multiple series**: 3D tensor `(batch, time_steps, features)`
-# - **Multivariate forecasting**: 4D tensor `(batch, time_steps, features, predictions)`
-
-# ## Why Not Just Use NumPy?
-#
-# While we use NumPy internally, our Tensor class adds ML-specific functionality:
-#
-# ### ML-Specific Operations
-# - **Gradient tracking**: For automatic differentiation (coming in Module 7)
-# - **GPU support**: For hardware acceleration (future extension)
-# - **Broadcasting semantics**: ML-friendly dimension handling
-#
-# ### Consistent API
-# - **Type safety**: Predictable behavior across operations
-# - **Error checking**: Clear error messages for debugging
-# - **Integration**: Seamless work with other TinyTorch components
-#
-# ### Educational Value
-# - **Conceptual clarity**: Understand what tensors really are
-# - **Implementation insight**: See how frameworks work internally
-# - **Debugging skills**: Trace through tensor operations step by step
-#
-# ### Extensibility
-# - **Future features**: Ready for gradients, GPU, distributed computing
-# - **Customization**: Add domain-specific operations
-# - **Optimization**: Profile and optimize specific use cases
-
-# ## Performance Considerations: Building Efficient Tensors
-#
-# ### Memory Layout
-# - **Contiguous arrays**: Better cache locality and performance
-# - **Data types**: `float32` vs `float64` trade-offs
-# - **Memory sharing**: Avoid unnecessary copies
-#
-# ### Vectorization
-# - **SIMD operations**: Single Instruction, Multiple Data
-# - **Broadcasting**: Efficient operations on different shapes
-# - **Batch operations**: Process multiple samples simultaneously
-#
-# ### Numerical Stability
-# - **Precision**: Balancing speed and accuracy
-# - **Overflow/underflow**: Handling extreme values
-# - **Gradient flow**: Maintaining numerical stability for training
-
-# # CONCEPT
-# Tensors are N-dimensional arrays that carry data through neural networks.
-# Think NumPy arrays with ML superpowers - same math, more capabilities.
- -# # CODE STRUCTURE -# ```python -# class Tensor: -# def __init__(self, data): # Create from any data type -# def __add__(self, other): # Enable tensor + tensor -# def __mul__(self, other): # Enable tensor * tensor -# # Properties: .shape, .size, .dtype, .data -# ``` - -# # CONNECTIONS -# - torch.Tensor (PyTorch) - same concept, production optimized -# - tf.Tensor (TensorFlow) - distributed computing focus -# - np.ndarray (NumPy) - we wrap this with ML operations - -# # CONSTRAINTS -# - Handle broadcasting (auto-shape matching for operations) -# - Support multiple data types (float32, int32, etc.) -# - Efficient memory usage (copy only when necessary) -# - Natural math notation (tensor + tensor should just work) - -# # CONTEXT -# Every ML operation flows through tensors: -# - Neural networks: All computations operate on tensors -# - Training: Gradients flow through tensor operations -# - Hardware: GPUs optimized for tensor math -# - Production: Millions of tensor ops per second in real systems -# -# **You're building the universal language of machine learning.** - -# In[ ]: - - -#| export -class Tensor: - """ - TinyTorch Tensor: N-dimensional array with ML operations. - - The fundamental data structure for all TinyTorch operations. - Wraps NumPy arrays with ML-specific functionality. - """ - - def __init__(self, data: Any, dtype: Optional[str] = None, requires_grad: bool = False): - """ - Create a new tensor from data. - - Args: - data: Input data (scalar, list, or numpy array) - dtype: Data type ('float32', 'int32', etc.). Defaults to auto-detect. - requires_grad: Whether this tensor needs gradients for training. Defaults to False. - - TODO: Implement tensor creation with proper type handling. - - STEP-BY-STEP: - 1. Convert input data to numpy array using np.array() - 2. Apply dtype conversion if specified - 3. Set default float32 for float64 arrays (ML convention) - 4. Store the result in self._data - 5. 
Initialize gradient tracking attributes - - EXAMPLE: - Tensor(5) → stores np.array(5, dtype='int32') - Tensor([1.0, 2.0, 3.0]) → stores np.array([1.0, 2.0, 3.0], dtype='float32') - Tensor(np.array([1, 2, 3])) → stores the array with consistent dtype - - HINTS: - - Let NumPy handle most type conversions with np.array() - - Convert float64 to float32 by default (ML best practice) - - Store the array in self._data - - Initialize gradient tracking for later modules - """ - ### BEGIN SOLUTION - # Convert input to numpy array - let NumPy handle most conversions - if isinstance(data, Tensor): - # Input is another Tensor - share data efficiently - self._data = data.data.copy() if dtype else data.data - else: - # Convert to numpy array - self._data = np.array(data, dtype=dtype) - - # Apply ML-friendly dtype defaults - if dtype is None and self._data.dtype == np.float64: - self._data = self._data.astype(np.float32) # ML convention: prefer float32 - elif dtype and self._data.dtype != np.dtype(dtype): - self._data = self._data.astype(dtype) - - # Initialize gradient tracking attributes (used in Module 9 - Autograd) - self.requires_grad = requires_grad - self.grad = None - self._grad_fn = None - ### END SOLUTION - - @property - def data(self) -> np.ndarray: - """ - Access underlying numpy array. - - TODO: Return the stored numpy array. - - STEP-BY-STEP IMPLEMENTATION: - 1. Access the internal _data attribute - 2. Return the numpy array directly - 3. 
This provides access to underlying data for NumPy operations - - LEARNING CONNECTIONS: - Real-world relevance: - - PyTorch: tensor.numpy() converts to NumPy for visualization/analysis - - TensorFlow: tensor.numpy() enables integration with scientific Python - - Production: Data scientists need to access raw arrays for debugging - - Performance: Direct access avoids copying for read-only operations - - HINT: Return self._data (the array you stored in __init__) - """ - ### BEGIN SOLUTION - return self._data - ### END SOLUTION - - @data.setter - def data(self, value: Union[np.ndarray, 'Tensor']) -> None: - """ - Set the underlying data of the tensor. - - Args: - value: New data (numpy array or Tensor) - """ - if isinstance(value, Tensor): - self._data = value._data.copy() - else: - self._data = np.array(value) - - @property - def shape(self) -> Tuple[int, ...]: - """ - Get tensor shape. - - TODO: Return the shape of the stored numpy array. - - STEP-BY-STEP IMPLEMENTATION: - 1. Access the _data attribute (the NumPy array) - 2. Get the shape property from the NumPy array - 3. Return the shape tuple directly - - LEARNING CONNECTIONS: - Real-world relevance: - - Neural networks: Layer compatibility requires matching shapes - - Computer vision: Image shape (height, width, channels) determines architecture - - NLP: Sequence length and vocabulary size affect model design - - Debugging: Shape mismatches are the #1 cause of ML errors - - HINT: Use .shape attribute of the numpy array - EXAMPLE: Tensor([1, 2, 3]).shape should return (3,) - """ - ### BEGIN SOLUTION - return self._data.shape - ### END SOLUTION - - @property - def size(self) -> int: - """ - Get total number of elements. - - TODO: Return the total number of elements in the tensor. - - STEP-BY-STEP IMPLEMENTATION: - 1. Access the _data attribute (the NumPy array) - 2. Get the size property from the NumPy array - 3. 
Return the total element count as an integer - - LEARNING CONNECTIONS: - Real-world relevance: - - Memory planning: Calculate RAM requirements for large tensors - - Model architecture: Determine parameter counts for layers - - Performance optimization: Size affects computation time - - Batch processing: Total elements determines vectorization efficiency - - HINT: Use .size attribute of the numpy array - EXAMPLE: Tensor([1, 2, 3]).size should return 3 - """ - ### BEGIN SOLUTION - return self._data.size - ### END SOLUTION - - @property - def dtype(self) -> np.dtype: - """ - Get data type as numpy dtype. - - TODO: Return the data type of the stored numpy array. - - STEP-BY-STEP IMPLEMENTATION: - 1. Access the _data attribute (the NumPy array) - 2. Get the dtype property from the NumPy array - 3. Return the NumPy dtype object directly - - LEARNING CONNECTIONS: - Real-world relevance: - - Precision vs speed: float32 is faster, float64 more accurate - - Memory optimization: int8 uses 1/4 memory of int32 - - GPU compatibility: Some operations only work with specific types - - Model deployment: Mobile/edge devices prefer smaller data types - - HINT: Use .dtype attribute of the numpy array - EXAMPLE: Tensor([1, 2, 3]).dtype should return dtype('int32') - """ - ### BEGIN SOLUTION - return self._data.dtype - ### END SOLUTION - - def __repr__(self) -> str: - """ - String representation with size limits for readability. - - TODO: Create a clear string representation of the tensor. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check tensor size - if large, show shape/dtype only - 2. For small tensors, convert numpy array to list using .tolist() - 3. Format appropriately based on size - 4. 
Return the formatted string - - LEARNING CONNECTIONS: - Real-world relevance: - - Debugging: Clear tensor representation speeds debugging - - Jupyter notebooks: Good __repr__ improves data exploration - - Logging: Production systems log tensor info for monitoring - - Large tensors: Shape/dtype more useful than full data for big arrays - - APPROACH: - 1. For large tensors (>20 elements): Show shape and dtype only - 2. For small tensors: Show data, shape, and dtype - 3. Keep format consistent and readable - - EXAMPLE: - Tensor([1, 2, 3]) → "Tensor([1, 2, 3], shape=(3,), dtype=int32)" - Large tensor → "Tensor(shape=(1000, 1000), dtype=float32)" - - HINTS: - - Check self.size to determine if tensor is large - - Use .tolist() for small tensors, shape/dtype for large ones - - Include shape and dtype information for debugging - """ - ### BEGIN SOLUTION - if self.size > 20: - # Large tensors: show shape and dtype only for readability - return f"Tensor(shape={self.shape}, dtype={self.dtype})" - else: - # Small tensors: show data, shape, and dtype - return f"Tensor({self._data.tolist()}, shape={self.shape}, dtype={self.dtype})" - ### END SOLUTION - - def item(self) -> Union[int, float]: - """ - Extract a scalar value from a single-element tensor. - - Returns: - The scalar value contained in the tensor - - Raises: - ValueError: If tensor contains more than one element - - Examples: - >>> t = Tensor([5.0]) - >>> t.item() # Returns 5.0 - >>> t2 = Tensor([[1]]) - >>> t2.item() # Returns 1 - """ - if self._data.size != 1: - raise ValueError(f"item() can only be called on tensors with exactly one element, got {self._data.size} elements") - return self._data.item() - - def add(self, other: 'Tensor') -> 'Tensor': - """ - Add two tensors element-wise. - - TODO: Implement tensor addition. - - STEP-BY-STEP IMPLEMENTATION: - 1. Extract numpy arrays from both tensors - 2. Use NumPy's + operator for element-wise addition - 3. Create a new Tensor object with the result - 4. 
Return the new tensor - - LEARNING CONNECTIONS: - Real-world relevance: - - Neural networks: Adding bias terms to linear layer outputs - - Residual connections: skip connections in ResNet architectures - - Gradient updates: Adding computed gradients to parameters - - Ensemble methods: Combining predictions from multiple models - - APPROACH: - 1. Add the numpy arrays using + - 2. Return a new Tensor with the result - 3. Handle broadcasting automatically - - EXAMPLE: - Tensor([1, 2]) + Tensor([3, 4]) → Tensor([4, 6]) - - HINTS: - - Use self._data + other._data - - Return Tensor(result) - - NumPy handles broadcasting automatically - """ - ### BEGIN SOLUTION - result_data = self._data + other._data - result = Tensor(result_data) - - # TODO: Gradient tracking will be added in Module 9 (Autograd) - # This enables automatic differentiation for neural network training - # For now, we focus on the core tensor operation - - return result - ### END SOLUTION - - def multiply(self, other: 'Tensor') -> 'Tensor': - """ - Multiply two tensors element-wise. - - TODO: Implement tensor multiplication. - - STEP-BY-STEP IMPLEMENTATION: - 1. Extract numpy arrays from both tensors - 2. Use NumPy's * operator for element-wise multiplication - 3. Create a new Tensor object with the result - 4. Return the new tensor - - LEARNING CONNECTIONS: - Real-world relevance: - - Activation functions: Element-wise operations like ReLU masking - - Attention mechanisms: Element-wise scaling in transformer models - - Feature scaling: Multiplying features by learned scaling factors - - Gating: Element-wise gating in LSTM and GRU cells - - APPROACH: - 1. Multiply the numpy arrays using * - 2. Return a new Tensor with the result - 3. 
Handle broadcasting automatically - - EXAMPLE: - Tensor([1, 2]) * Tensor([3, 4]) → Tensor([3, 8]) - - HINTS: - - Use self._data * other._data - - Return Tensor(result) - - This is element-wise, not matrix multiplication - """ - ### BEGIN SOLUTION - result_data = self._data * other._data - result = Tensor(result_data) - - # TODO: Gradient tracking will be added in Module 9 (Autograd) - # This enables automatic differentiation for neural network training - # For now, we focus on the core tensor operation - - return result - ### END SOLUTION - - def __add__(self, other: Union['Tensor', int, float]) -> 'Tensor': - """ - Addition operator: tensor + other - - TODO: Implement + operator for tensors. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if other is a Tensor object - 2. If Tensor, call the add() method directly - 3. If scalar, convert to Tensor then call add() - 4. Return the result from add() method - - LEARNING CONNECTIONS: - Real-world relevance: - - Natural syntax: tensor + scalar enables intuitive code - - Broadcasting: Adding scalars to tensors is common in ML - - Operator overloading: Python's magic methods enable math-like syntax - - API design: Clean interfaces reduce cognitive load for researchers - - APPROACH: - 1. If other is a Tensor, use tensor addition - 2. If other is a scalar, convert to Tensor first - 3. Return the result - - EXAMPLE: - Tensor([1, 2]) + Tensor([3, 4]) → Tensor([4, 6]) - Tensor([1, 2]) + 5 → Tensor([6, 7]) - """ - ### BEGIN SOLUTION - if isinstance(other, Tensor): - return self.add(other) - else: - return self.add(Tensor(other)) - ### END SOLUTION - - def __mul__(self, other: Union['Tensor', int, float]) -> 'Tensor': - """ - Multiplication operator: tensor * other - - TODO: Implement * operator for tensors. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if other is a Tensor object - 2. If Tensor, call the multiply() method directly - 3. If scalar, convert to Tensor then call multiply() - 4. 
Return the result from multiply() method - - LEARNING CONNECTIONS: - Real-world relevance: - - Scaling features: tensor * learning_rate for gradient updates - - Masking: tensor * mask for attention mechanisms - - Regularization: tensor * dropout_mask during training - - Normalization: tensor * scale_factor in batch normalization - - APPROACH: - 1. If other is a Tensor, use tensor multiplication - 2. If other is a scalar, convert to Tensor first - 3. Return the result - - EXAMPLE: - Tensor([1, 2]) * Tensor([3, 4]) → Tensor([3, 8]) - Tensor([1, 2]) * 3 → Tensor([3, 6]) - """ - ### BEGIN SOLUTION - if isinstance(other, Tensor): - return self.multiply(other) - else: - return self.multiply(Tensor(other)) - ### END SOLUTION - - def __sub__(self, other: Union['Tensor', int, float]) -> 'Tensor': - """ - Subtraction operator: tensor - other - - TODO: Implement - operator for tensors. - - STEP-BY-STEP IMPLEMENTATION: - 1. Check if other is a Tensor object - 2. If Tensor, subtract other._data from self._data - 3. If scalar, subtract scalar directly from self._data - 4. Create new Tensor with result and return - - LEARNING CONNECTIONS: - Real-world relevance: - - Gradient computation: parameter - learning_rate * gradient - - Residual connections: output - skip_connection in some architectures - - Error calculation: predicted - actual for loss computation - - Centering data: tensor - mean for zero-centered inputs - - APPROACH: - 1. Convert other to Tensor if needed - 2. Subtract using numpy arrays - 3. Return new Tensor with result - - EXAMPLE: - Tensor([5, 6]) - Tensor([1, 2]) → Tensor([4, 4]) - Tensor([5, 6]) - 1 → Tensor([4, 5]) - """ - ### BEGIN SOLUTION - if isinstance(other, Tensor): - result = self._data - other._data - else: - result = self._data - other - return Tensor(result) - ### END SOLUTION - - def __truediv__(self, other: Union['Tensor', int, float]) -> 'Tensor': - """ - Division operator: tensor / other - - TODO: Implement / operator for tensors. 
- - STEP-BY-STEP IMPLEMENTATION: - 1. Check if other is a Tensor object - 2. If Tensor, divide self._data by other._data - 3. If scalar, divide self._data by scalar directly - 4. Create new Tensor with result and return - - LEARNING CONNECTIONS: - Real-world relevance: - - Normalization: tensor / std_deviation for standard scaling - - Learning rate decay: parameter / decay_factor over time - - Probability computation: counts / total_counts for frequencies - - Temperature scaling: logits / temperature in softmax functions - - APPROACH: - 1. Convert other to Tensor if needed - 2. Divide using numpy arrays - 3. Return new Tensor with result - - EXAMPLE: - Tensor([6, 8]) / Tensor([2, 4]) → Tensor([3, 2]) - Tensor([6, 8]) / 2 → Tensor([3, 4]) - """ - ### BEGIN SOLUTION - if isinstance(other, Tensor): - result = self._data / other._data - else: - result = self._data / other - return Tensor(result) - ### END SOLUTION - - def mean(self) -> 'Tensor': - """Computes the mean of the tensor's elements.""" - return Tensor(np.mean(self.data)) - - def sum(self, axis=None, keepdims=False) -> 'Tensor': - """ - Sum tensor elements along specified axes. - - Args: - axis: Axis or axes to sum over. If None, sum all elements. - keepdims: Whether to keep dimensions of size 1 in output. - - Returns: - New tensor with summed values. 
- """ - result_data = np.sum(self._data, axis=axis, keepdims=keepdims) - result = Tensor(result_data) - - if self.requires_grad: - result.requires_grad = True - - def grad_fn(grad): - # Sum gradient: broadcast gradient back to original shape - grad_data = grad.data - if axis is None: - # Sum over all axes - gradient is broadcast to full shape - grad_data = np.full(self.shape, grad_data) - else: - # Sum over specific axes - expand back those dimensions - if not isinstance(axis, tuple): - axis_tuple = (axis,) if axis is not None else () - else: - axis_tuple = axis - - # Expand dimensions that were summed - for ax in sorted(axis_tuple): - if ax < 0: - ax = len(self.shape) + ax - grad_data = np.expand_dims(grad_data, axis=ax) - - # Broadcast to original shape - grad_data = np.broadcast_to(grad_data, self.shape) - - self.backward(Tensor(grad_data)) - - result._grad_fn = grad_fn - - return result - - def matmul(self, other: 'Tensor') -> 'Tensor': - """ - Matrix multiplication with both educational and efficient implementations. - - Shows the learning progression from basic loops to optimized operations. - This dual approach helps students understand both the concept and production reality. - - TODO: Implement matrix multiplication. - - STEP-BY-STEP IMPLEMENTATION: - 1. Extract numpy arrays from both tensors - 2. Check tensor shapes for compatibility - 3. For small tensors: use educational loops to show concept - 4. For larger tensors: use NumPy's optimized implementation - 5. Create new Tensor object with the result - 6. Return the new tensor - - LEARNING CONNECTIONS: - Real-world relevance: - - Linear layers: input @ weight matrices in neural networks - - Transformer attention: Q @ K^T for attention scores - - CNN convolutions: Implemented as matrix multiplications - - Batch processing: Matrix ops enable parallel computation - - EDUCATIONAL APPROACH: - 1. Small examples: Show every operation explicitly with loops - 2. 
Larger examples: Use NumPy's optimized BLAS implementation - 3. Connect mathematical operations to performance considerations - - EXAMPLE: - Tensor([[1, 2], [3, 4]]) @ Tensor([[5, 6], [7, 8]]) → Tensor([[19, 22], [43, 50]]) - - HINTS: - - Small tensors show educational loops for understanding - - Large tensors use optimized NumPy for realistic performance - - This progression mirrors real ML framework design - """ - ### BEGIN SOLUTION - a_data = self._data - b_data = other._data - - # Validate tensor shapes - if len(a_data.shape) != 2 or len(b_data.shape) != 2: - raise ValueError("matmul requires 2D tensors") - - m, k = a_data.shape - k2, n = b_data.shape - - if k != k2: - raise ValueError(f"Inner dimensions must match: {k} != {k2}") - - # For small tensors (≤ 4x4): Educational loops to show the concept - if m <= 4 and n <= 4 and k <= 4: - return self._matmul_educational(other) - - # For larger tensors: Use NumPy's optimized implementation (production approach) - result_data = np.dot(a_data, b_data) - return Tensor(result_data) - ### END SOLUTION - - def _matmul_educational(self, other: 'Tensor') -> 'Tensor': - """ - Educational matrix multiplication using explicit loops. - - This shows the fundamental computation clearly for small examples. - Understanding this helps appreciate why optimized BLAS libraries are essential. 
- """ - a_data = self._data - b_data = other._data - m, k = a_data.shape - k2, n = b_data.shape - - # Initialize result matrix - result = np.zeros((m, n), dtype=a_data.dtype) - - # Triple nested loops - educational, shows every operation - # This demonstrates the O(n³) complexity clearly - for i in range(m): # For each row in result - for j in range(n): # For each column in result - for k_idx in range(k): # Dot product: sum over inner dimension - result[i, j] += a_data[i, k_idx] * b_data[k_idx, j] - - return Tensor(result) - - def __matmul__(self, other: 'Tensor') -> 'Tensor': - """ - Matrix multiplication operator: tensor @ other - - Enables the @ operator for matrix multiplication, providing - clean syntax for neural network operations. - """ - return self.matmul(other) - - def backward(self, gradient=None): - """ - Compute gradients for this tensor and propagate backward. - - Basic backward pass - accumulates gradients and propagates to dependencies. - This enables simple gradient computation for basic operations. - - Args: - gradient: Gradient from upstream. If None, assumes scalar with grad=1 - """ - if not self.requires_grad: - return - - if gradient is None: - # Scalar case - gradient is 1 - gradient = Tensor(np.ones_like(self._data)) - - # Accumulate gradients - if self.grad is None: - self.grad = gradient - else: - self.grad = self.grad + gradient - - # Propagate to dependencies via grad_fn - if self._grad_fn is not None: - self._grad_fn(gradient) - - def zero_grad(self): - """ - Reset gradients to None. Used by optimizers before backward pass. - - This method is called by optimizers to clear gradients before - computing new ones, preventing gradient accumulation across batches. - """ - self.grad = None - - def reshape(self, *shape: int) -> 'Tensor': - """ - Return a new tensor with the same data but different shape. - - Args: - *shape: New shape dimensions. Use -1 for automatic sizing. 
- - Returns: - New Tensor with reshaped data - - Example: - tensor.reshape(2, -1) # Reshape to 2 rows, auto columns - tensor.reshape(4, 3) # Reshape to 4x3 matrix - """ - reshaped_data = self._data.reshape(*shape) - return Tensor(reshaped_data) - - def numpy(self) -> np.ndarray: - """ - Convert tensor to NumPy array. - - This is the PyTorch-inspired method for tensor-to-numpy conversion. - Provides clean interface for interoperability with NumPy operations. - - Returns: - NumPy array containing the tensor's data - - Example: - tensor = Tensor([1, 2, 3]) - array = tensor.numpy() # Get NumPy array for scientific computing - """ - return self._data - -# ============================================================================ -# ADVANCED: NumPy Integration Protocols -# These methods enable tensors to work seamlessly with NumPy functions -# You can skip these on first reading - they're for integration with scientific Python -# ============================================================================ - - def __array__(self, dtype=None) -> np.ndarray: - """ - Enable np.array(tensor) and np.allclose(tensor, array). - - This protocol method allows NumPy functions to automatically convert - Tensor objects to arrays when needed for scientific computing integration. - - Args: - dtype: Optional dtype to cast to (NumPy may request this) - - Returns: - The underlying NumPy array, optionally cast to requested dtype - - Examples: - tensor = Tensor([1, 2, 3]) - np.sum(tensor) # Works automatically via this method - np.allclose(tensor, [1, 2, 3]) # Also works! - """ - if dtype is not None: - return self._data.astype(dtype) - return self._data - - def __array_ufunc__(self, ufunc, method, *inputs, **kwargs): - """ - Enable NumPy universal functions with Tensor objects. - - This protocol allows NumPy ufuncs (like np.maximum, np.minimum) to work - with Tensor objects by converting them to arrays and wrapping results. 
- - Advanced feature - most students can ignore this implementation detail. - """ - # Convert Tensor inputs to NumPy arrays - args = [] - for input_ in inputs: - if isinstance(input_, Tensor): - args.append(input_._data) - else: - args.append(input_) - - # Call the ufunc on NumPy arrays - outputs = getattr(ufunc, method)(*args, **kwargs) - - # If method returns NotImplemented, let NumPy handle it - if outputs is NotImplemented: - return NotImplemented - - # Wrap result back in Tensor if appropriate - if method == '__call__': - if isinstance(outputs, np.ndarray): - return Tensor(outputs) - elif isinstance(outputs, tuple): - return tuple(Tensor(output) if isinstance(output, np.ndarray) else output - for output in outputs) - - return outputs - - -# # Testing Your Implementation -# -# Now let's test our tensor implementation with comprehensive tests that validate all functionality. - -# ### 🧪 Unit Test: Tensor Creation -# -# Let's test your tensor creation implementation right away! This gives you immediate feedback on whether your `__init__` method works correctly. -# -# **This is a unit test** - it tests one specific function (tensor creation) in isolation. 
- -# In[ ]: - - -# Test tensor creation immediately after implementation -print("🔬 Unit Test: Tensor Creation...") - -# Test basic tensor creation -try: - # Test scalar - scalar = Tensor(5.0) - assert hasattr(scalar, '_data'), "Tensor should have _data attribute" - assert scalar._data.shape == (), f"Scalar should have shape (), got {scalar._data.shape}" - print("✅ Scalar creation works") - - # Test vector - vector = Tensor([1, 2, 3]) - assert vector._data.shape == (3,), f"Vector should have shape (3,), got {vector._data.shape}" - print("✅ Vector creation works") - - # Test matrix - matrix = Tensor([[1, 2], [3, 4]]) - assert matrix._data.shape == (2, 2), f"Matrix should have shape (2, 2), got {matrix._data.shape}" - print("✅ Matrix creation works") - - print("📈 Progress: Tensor Creation ✓") - -except Exception as e: - print(f"❌ Tensor creation test failed: {e}") - raise - -print("🎯 Tensor creation behavior:") -print(" Converts data to NumPy arrays") -print(" Preserves shape and data type") -print(" Stores in _data attribute") - - -# ### 🧪 Unit Test: Tensor Properties -# -# Now let's test that your tensor properties work correctly. This tests the @property methods you implemented. -# -# **This is a unit test** - it tests specific properties (shape, size, dtype, data) in isolation. 
- -# In[ ]: - - -# Test tensor properties immediately after implementation -print("🔬 Unit Test: Tensor Properties...") - -# Test properties with simple examples -try: - # Test with a simple matrix - tensor = Tensor([[1, 2, 3], [4, 5, 6]]) - - # Test shape property - assert tensor.shape == (2, 3), f"Shape should be (2, 3), got {tensor.shape}" - print("✅ Shape property works") - - # Test size property - assert tensor.size == 6, f"Size should be 6, got {tensor.size}" - print("✅ Size property works") - - # Test data property - assert np.array_equal(tensor.data, np.array([[1, 2, 3], [4, 5, 6]])), "Data property should return numpy array" - print("✅ Data property works") - - # Test dtype property - assert tensor.dtype in [np.int32, np.int64], f"Dtype should be int32 or int64, got {tensor.dtype}" - print("✅ Dtype property works") - - print("📈 Progress: Tensor Properties ✓") - -except Exception as e: - print(f"❌ Tensor properties test failed: {e}") - raise - -print("🎯 Tensor properties behavior:") -print(" shape: Returns tuple of dimensions") -print(" size: Returns total number of elements") -print(" data: Returns underlying NumPy array") -print(" dtype: Returns NumPy data type") - - -# ### 🧪 Unit Test: Tensor Arithmetic -# -# Let's test your tensor arithmetic operations. This tests the __add__, __mul__, __sub__, __truediv__ methods. -# -# **This is a unit test** - it tests specific arithmetic operations in isolation. 
- -# In[ ]: - - -# Test tensor arithmetic immediately after implementation -print("🔬 Unit Test: Tensor Arithmetic...") - -# Test basic arithmetic with simple examples -try: - # Test addition - a = Tensor([1, 2, 3]) - b = Tensor([4, 5, 6]) - result = a + b - expected = np.array([5, 7, 9]) - assert np.array_equal(result.data, expected), f"Addition failed: expected {expected}, got {result.data}" - print("✅ Addition works") - - # Test scalar addition - result_scalar = a + 10 - expected_scalar = np.array([11, 12, 13]) - assert np.array_equal(result_scalar.data, expected_scalar), f"Scalar addition failed: expected {expected_scalar}, got {result_scalar.data}" - print("✅ Scalar addition works") - - # Test multiplication - result_mul = a * b - expected_mul = np.array([4, 10, 18]) - assert np.array_equal(result_mul.data, expected_mul), f"Multiplication failed: expected {expected_mul}, got {result_mul.data}" - print("✅ Multiplication works") - - # Test scalar multiplication - result_scalar_mul = a * 2 - expected_scalar_mul = np.array([2, 4, 6]) - assert np.array_equal(result_scalar_mul.data, expected_scalar_mul), f"Scalar multiplication failed: expected {expected_scalar_mul}, got {result_scalar_mul.data}" - print("✅ Scalar multiplication works") - - print("📈 Progress: Tensor Arithmetic ✓") - -except Exception as e: - print(f"❌ Tensor arithmetic test failed: {e}") - raise - -print("🎯 Tensor arithmetic behavior:") -print(" Element-wise operations on tensors") -print(" Broadcasting with scalars") -print(" Returns new Tensor objects") - - -# ### 🔬 Comprehensive Tests -# -# Now let's run comprehensive tests that validate all tensor functionality together. These tests ensure your implementation is production-ready. -# -# **These are comprehensive tests** - they test multiple features and edge cases to ensure robustness. 
- -# In[ ]: - - -def test_unit_tensor_creation(): - """Comprehensive test of tensor creation with all data types and shapes.""" - print("🔬 Testing comprehensive tensor creation...") - - # Test scalar creation - scalar_int = Tensor(42) - assert scalar_int.shape == () - - # Test vector creation - vector_int = Tensor([1, 2, 3]) - assert vector_int.shape == (3,) - - # Test matrix creation - matrix_2x2 = Tensor([[1, 2], [3, 4]]) - assert matrix_2x2.shape == (2, 2) - print("✅ Tensor creation tests passed!") - -# Test function defined (called in main block) - - -# ### Unit Test: Tensor Properties -# -# This test validates your tensor property methods (shape, size, dtype, data), ensuring they correctly reflect the tensor's dimensional structure and data characteristics. - -# In[ ]: - - -def test_unit_tensor_properties(): - """Comprehensive test of tensor properties (shape, size, dtype, data access).""" - print("🔬 Testing comprehensive tensor properties...") - - tensor = Tensor([[1, 2, 3], [4, 5, 6]]) - - # Test shape property - assert tensor.shape == (2, 3) - - # Test size property - assert tensor.size == 6 - - # Test data property - assert np.array_equal(tensor.data, np.array([[1, 2, 3], [4, 5, 6]])) - - # Test dtype property - assert tensor.dtype in [np.int32, np.int64] - print("✅ Tensor properties tests passed!") - -# Test function defined (called in main block) - - -# ### 🧪 Unit Test: Tensor Arithmetic Operations -# -# Now let's test all your arithmetic operations working together! This comprehensive test validates that addition, subtraction, multiplication, and division all work correctly with your tensor implementation. 
-# -# **What This Tests:** -# - Element-wise addition, subtraction, multiplication, division -# - Proper NumPy array handling in arithmetic -# - Result correctness across different operations -# -# **Why This Matters:** -# - Arithmetic operations are the foundation of all neural network computations -# - These operations must be fast and mathematically correct -# - Your implementation should match NumPy's behavior exactly - -# In[ ]: - - -def test_unit_tensor_arithmetic(): - """Comprehensive test of tensor arithmetic operations.""" - print("🔬 Testing comprehensive tensor arithmetic...") - - a = Tensor([1, 2, 3]) - b = Tensor([4, 5, 6]) - - # Test addition - c = a + b - expected = np.array([5, 7, 9]) - assert np.array_equal(c.data, expected) - - # Test multiplication - d = a * b - expected = np.array([4, 10, 18]) - assert np.array_equal(d.data, expected) - - # Test subtraction - e = b - a - expected = np.array([3, 3, 3]) - assert np.array_equal(e.data, expected) - - # Test division - f = b / a - expected = np.array([4.0, 2.5, 2.0]) - assert np.allclose(f.data, expected) - print("✅ Tensor arithmetic tests passed!") - -# Test function defined (called in main block) - - -# ### 🧪 Integration Test: Tensor-NumPy Integration -# -# This integration test validates that your tensor system works seamlessly with NumPy, the foundation of the scientific Python ecosystem. 
-# -# **What This Tests:** -# - Creating tensors from NumPy arrays -# - Converting tensors back to NumPy arrays -# - Mixed operations between tensors and NumPy -# - Data type preservation and consistency -# -# **Why This Matters:** -# - Real ML systems must integrate with NumPy seamlessly -# - Data scientists expect tensors to work with existing NumPy code -# - Performance optimizations often involve NumPy operations -# - This compatibility is what makes PyTorch and TensorFlow so powerful -# -# **Real-World Connection:** -# - PyTorch tensors have `.numpy()` and `torch.from_numpy()` methods -# - TensorFlow has similar NumPy integration -# - This test ensures your tensors work in real data science workflows - -# In[ ]: - - -def test_module_tensor_numpy_integration(): - """ - Integration test for tensor operations with NumPy arrays. - - Tests that tensors properly integrate with NumPy operations and maintain - compatibility with the scientific Python ecosystem. - """ - print("🔬 Running Integration Test: Tensor-NumPy Integration...") - - # Test 1: Tensor from NumPy array - numpy_array = np.array([[1, 2, 3], [4, 5, 6]]) - tensor_from_numpy = Tensor(numpy_array) - - assert tensor_from_numpy.shape == (2, 3), "Tensor should preserve NumPy array shape" - assert np.array_equal(tensor_from_numpy.data, numpy_array), "Tensor should preserve NumPy array data" - - # Test 2: Tensor arithmetic with NumPy-compatible operations - a = Tensor([1.0, 2.0, 3.0]) - b = Tensor([4.0, 5.0, 6.0]) - - # Test operations that would be used in neural networks - dot_product_result = np.dot(a.data, b.data) # Common in layers - assert np.isclose(dot_product_result, 32.0), "Dot product should work with tensor data" - - # Test 3: Broadcasting compatibility - matrix = Tensor([[1, 2], [3, 4]]) - scalar = Tensor(10) - - result = matrix + scalar - expected = np.array([[11, 12], [13, 14]]) - assert np.array_equal(result.data, expected), "Broadcasting should work like NumPy" - - # Test 4: Integration with 
scientific computing patterns - data = Tensor([1, 4, 9, 16, 25]) - sqrt_result = Tensor(np.sqrt(data.data)) # Using NumPy functions on tensor data - expected_sqrt = np.array([1., 2., 3., 4., 5.]) - assert np.allclose(sqrt_result.data, expected_sqrt), "Should integrate with NumPy functions" - - print("โœ… Integration Test Passed: Tensor-NumPy integration works correctly.") - -# Test function defined (called in main block) - -if __name__ == "__main__": - # Run all tensor tests - test_unit_tensor_creation() - test_unit_tensor_properties() - test_unit_tensor_arithmetic() - test_module_tensor_numpy_integration() - - print("All tests passed!") - print("Tensor module complete!") - - -# ## ๐Ÿค” ML Systems Thinking: Interactive Questions -# -# Now that you've built a working tensor system, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how tensor operations scale to production ML environments. -# -# Take time to reflect thoughtfully on each question - your insights will help you understand how the tensor concepts you've implemented connect to real-world ML systems engineering. - -# ### Question 1: Memory Layout and Cache Efficiency -# -# **Context**: Your tensor implementation wraps NumPy arrays and creates new tensors for each operation. In production ML systems, tensor operations happen millions of times per second, making memory layout and cache efficiency critical for performance. -# -# **Reflection Question**: Design a memory-efficient tensor system for training large neural networks (billions of parameters). How would you balance memory layout optimization with cache efficiency? Consider scenarios where you need to process massive image batches (1000+ images) while maintaining memory locality for CPU cache optimization. What trade-offs would you make between memory copying and in-place operations? 
-# -# Think about: contiguous memory layout, cache line utilization, memory fragmentation, and the difference between row-major vs column-major storage in different computational contexts. -# -# *Target length: 150-300 words* - -# In[ ]: - - -""" -YOUR REFLECTION ON MEMORY LAYOUT AND CACHE EFFICIENCY: - -TODO: Replace this text with your thoughtful response about memory-efficient tensor system design. - -Consider addressing: -- How would you optimize memory layout for large batch processing? -- What strategies would you use to minimize cache misses during tensor operations? -- How would you handle the trade-off between memory copying and in-place operations? -- What role does contiguous memory layout play in computational efficiency? -- How would different storage patterns (row-major vs column-major) affect performance? - -Write a practical design connecting your tensor implementation to real memory optimization challenges. - -GRADING RUBRIC (Instructor Use): -- Demonstrates understanding of memory layout impact on performance (3 points) -- Addresses cache efficiency and locality concerns appropriately (3 points) -- Shows practical knowledge of memory optimization strategies (2 points) -- Demonstrates systems thinking about large-scale tensor operations (2 points) -- Clear technical reasoning and practical considerations (bonus points for innovative approaches) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring technical analysis of memory optimization -# Students should demonstrate understanding of cache efficiency and memory layout optimization -### END SOLUTION - - -# ### Question 2: Hardware Abstraction and Multi-Platform Deployment -# -# **Context**: Your tensor class currently operates on CPU through NumPy. 
Production ML systems must run efficiently across diverse hardware: development laptops (CPU), training clusters (GPU), mobile devices (ARM processors), and edge devices (specialized AI chips). -# -# **Reflection Question**: Architect a hardware-abstraction layer for your tensor system that enables the same tensor operations to run optimally across CPU, GPU, and specialized AI accelerators. How would you handle the complexity of different memory models, precision requirements, and computational paradigms while maintaining a simple user interface? Consider the challenges of automatic device placement and memory management across heterogeneous hardware. -# -# Think about: device-specific optimizations, memory transfer costs, precision trade-offs, and automatic kernel selection for different hardware architectures. -# -# *Target length: 150-300 words* - -# In[ ]: - - -""" -YOUR REFLECTION ON HARDWARE ABSTRACTION AND MULTI-PLATFORM DEPLOYMENT: - -TODO: Replace this text with your thoughtful response about hardware abstraction design. - -Consider addressing: -- How would you design an abstraction layer that works across CPU, GPU, and AI accelerators? -- What strategies would you use for automatic device placement and memory management? -- How would you handle different precision requirements across hardware platforms? -- What role would kernel selection and optimization play in your design? -- How would you minimize memory transfer costs between different compute devices? - -Write an architectural analysis connecting your tensor foundation to real hardware deployment challenges. 
- -GRADING RUBRIC (Instructor Use): -- Shows understanding of multi-platform hardware challenges (3 points) -- Designs practical abstraction layer for device management (3 points) -- Addresses precision and optimization considerations (2 points) -- Demonstrates systems thinking about hardware-software interfaces (2 points) -- Clear architectural reasoning with practical insights (bonus points for comprehensive understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of hardware abstraction challenges -# Students should demonstrate knowledge of multi-platform deployment and device optimization -### END SOLUTION - - -# ### Question 3: Computational Graph Integration and Automatic Differentiation -# -# **Context**: Your tensor performs operations immediately (eager execution). Modern deep learning frameworks build computational graphs to track operations for automatic differentiation, enabling gradient-based optimization that powers neural network training. -# -# **Reflection Question**: Extend your tensor design to support computational graph construction for automatic differentiation. How would you modify your tensor operations to build a graph of dependencies while maintaining performance for both training (graph construction) and inference (optimized execution)? Consider the challenge of supporting both eager execution for debugging and graph mode for production deployment. -# -# Think about: operation tracking, gradient flow, memory management for large graphs, and the trade-offs between flexibility and performance in different execution modes. -# -# *Target length: 150-300 words* - -# In[ ]: - - -""" -YOUR REFLECTION ON COMPUTATIONAL GRAPH INTEGRATION: - -TODO: Replace this text with your thoughtful response about computational graph design. 
- -Consider addressing: -- How would you modify your tensor class to support computational graph construction? -- What strategies would you use to balance eager execution with graph-based optimization? -- How would you handle gradient flow and automatic differentiation in your design? -- What memory management challenges arise with large computational graphs? -- How would you support both debugging-friendly and production-optimized execution modes? - -Write a design analysis connecting your tensor operations to automatic differentiation and training systems. - -GRADING RUBRIC (Instructor Use): -- Understands computational graph concepts and gradient tracking (3 points) -- Designs practical approach to eager vs graph execution modes (3 points) -- Addresses memory management and performance considerations (2 points) -- Shows systems thinking about training vs inference requirements (2 points) -- Clear design reasoning with automatic differentiation insights (bonus points for deep understanding) -""" - -### BEGIN SOLUTION -# Student response area - instructor will replace this section during grading setup -# This is a manually graded question requiring understanding of computational graphs and automatic differentiation -# Students should demonstrate knowledge of how tensor operations enable gradient computation -### END SOLUTION - - -# ## Parameter Helper Function -# -# Now that we have Tensor with gradient support, let's add a convenient helper function for creating trainable parameters: - -# In[ ]: - - -#| export -def Parameter(data, dtype=None): - """ - Convenience function for creating trainable tensors. - - This is equivalent to Tensor(data, requires_grad=True) but provides - cleaner syntax for neural network parameters. - - Args: - data: Input data (scalar, list, or numpy array) - dtype: Data type ('float32', 'int32', etc.). Defaults to auto-detect. 
- - Returns: - Tensor with requires_grad=True - - Examples: - weight = Parameter(np.random.randn(784, 128)) # Neural network weight - bias = Parameter(np.zeros(128)) # Neural network bias - """ - return Tensor(data, dtype=dtype, requires_grad=True) - - -# # MODULE SUMMARY: Tensor Foundation -# -# Congratulations! You've successfully implemented the fundamental data structure that powers all machine learning: -# -# ## What You've Built -# - **Tensor Class**: N-dimensional array wrapper with professional interfaces -# - **Core Operations**: Creation, property access, and arithmetic operations -# - **Shape Management**: Automatic shape tracking and validation -# - **Data Types**: Proper NumPy integration and type handling -# - **Foundation**: The building block for all subsequent TinyTorch modules -# -# ## Key Learning Outcomes -# - **Understanding**: How tensors work as the foundation of machine learning -# - **Implementation**: Built tensor operations from scratch -# - **Professional patterns**: Clean APIs, proper error handling, comprehensive testing -# - **Real-world connection**: Understanding PyTorch/TensorFlow tensor foundations -# - **Systems thinking**: Building reliable, reusable components -# -# ## Mathematical Foundations Mastered -# - **N-dimensional arrays**: Shape, size, and dimensionality concepts -# - **Element-wise operations**: Addition, subtraction, multiplication, division -# - **Broadcasting**: Understanding how operations work with different shapes -# - **Memory management**: Efficient data storage and access patterns -# -# ## Professional Skills Developed -# - **API design**: Clean, intuitive interfaces for tensor operations -# - **Error handling**: Graceful handling of invalid operations and edge cases -# - **Testing methodology**: Comprehensive validation of tensor functionality -# - **Documentation**: Clear, educational documentation with examples -# -# ## Ready for Advanced Applications -# Your tensor implementation now enables: -# - 
**Neural Networks**: Foundation for all layer implementations -# - **Automatic Differentiation**: Gradient computation through computational graphs -# - **Complex Models**: CNNs, RNNs, Transformers - all built on tensors -# - **Real Applications**: Training models on real datasets -# -# ## Connection to Real ML Systems -# Your implementation mirrors production systems: -# - **PyTorch**: `torch.Tensor` provides identical functionality -# - **TensorFlow**: `tf.Tensor` implements similar concepts -# - **NumPy**: `numpy.ndarray` serves as the foundation -# - **Industry Standard**: Every major ML framework uses these exact principles -# -# ## The Power of Tensors -# You've built the fundamental data structure of modern AI: -# - **Universality**: Tensors represent all data: images, text, audio, video -# - **Efficiency**: Vectorized operations enable fast computation -# - **Scalability**: Handles everything from single numbers to massive matrices -# - **Flexibility**: Foundation for any mathematical operation -# -# ## What's Next -# Your tensor implementation is the foundation for: -# - **Activations**: Nonlinear functions that enable complex learning -# - **Layers**: Linear transformations and neural network building blocks -# - **Networks**: Composing layers into powerful architectures -# - **Training**: Optimizing networks to solve real problems -# -# **Next Module**: Activation functions - adding the nonlinearity that makes neural networks powerful! -# -# You've built the foundation of modern AI. Now let's add the mathematical functions that enable machines to learn complex patterns! diff --git a/modules/02_activations/activations_dev.py b/modules/02_activations/activations_dev.py index 6c5ffbdb..b60106be 100644 --- a/modules/02_activations/activations_dev.py +++ b/modules/02_activations/activations_dev.py @@ -37,24 +37,22 @@ Tensor → Activations → Neural Networks ## Build → Use → Reflect 1.
**Build**: ReLU and Softmax with validation, error handling, and systems analysis -""" -# 2. **Use**: Test in realistic neural network pipelines with edge cases -# 3. **Reflect**: Connect your implementation measurements to production ML systems design +2. **Use**: Test in realistic neural network pipelines with edge cases +3. **Reflect**: Connect your implementation measurements to production ML systems design -# ## Systems Reality Check -# 💡 **Production Context**: Your ReLU implementation uses the same algorithm as PyTorch's CUDA kernels -# ⚡ **Performance Insight**: You'll experience firsthand why ReLU's computational simplicity revolutionized deep learning +## Systems Reality Check +💡 **Production Context**: Your ReLU implementation uses the same algorithm as PyTorch's CUDA kernels +⚡ **Performance Insight**: You'll experience firsthand why ReLU's computational simplicity revolutionized deep learning +""" # In[ ]: #| default_exp core.activations #| export -import math import numpy as np import os import sys -from typing import Union, List # Import our tensor foundation try: @@ -119,70 +117,71 @@ Why Revolutionary: │ Old Problem │ ReLU Solves │ ML Impact │ ├─────────────────┼────────────────┼─────────────────┤ │ Vanishing Grads │ ∂f/∂x = 1 or 0 │ Deep networks │ +│ Slow computation│ Just max(0,x) │ 6x training │ +│ Complex math │ Simple compare │ Hardware-friendly│ +│ Always active │ 50% sparse │ Efficient memory│ +└─────────────────┴────────────────┴─────────────────┘ +``` + +### Softmax: Converting Scores to Probabilities + +``` +Softmax Transformation: + +Raw Logits: Softmax Probabilities: +[2.0, 1.0, 0.1] ──> [0.67, 0.24, 0.09] + ↓ + Sum = 1.0 ✓ + All ≥ 0 ✓ + Proper probability!
+ +Attention Mechanism Pattern: + +Query-Key Similarities: Attention Weights: +[0.8, 1.2, 0.4, 0.9] ──> [0.19, 0.42, 0.12, 0.27] + ↓ + Weighted sum of values + Focus on important parts! + +Why Essential: +• Classification: Convert network outputs to class probabilities +• Attention: Focus mechanism in transformers +• Sampling: Probability-based token generation +• Interpretability: Understand model confidence +``` + +### Computational Complexity: Why ReLU Dominates + +``` +Performance Analysis (per element): + +ReLU: Softmax: +┌─────────┐ ┌──────────────────────────────┐ +│ Compare │ │ 1. Subtract max (stability) │ +│ + │ │ 2. Exponential computation │ +│ Select │ │ 3. Sum all exponentials │ +└─────────┘ │ 4. Divide each by sum │ + └──────────────────────────────┘ + +2 operations vs 4N + 3 operations (N = vector size) + +GPU Parallelization: +ReLU: [Perfect] Each element independent +Softmax: [Good] Element-wise ops + reduction step + +Memory Pattern: +ReLU: [Optimal] Can compute in-place +Softmax: [Good] Needs temporary storage for stability +``` """ -# │ Slow computation│ Just max(0,x) │ 6x training │ -# │ Complex math │ Simple compare │ Hardware-friendly│ -# │ Always active │ 50% sparse │ Efficient memory│ -# └─────────────────┴────────────────┴─────────────────┘ -# ``` -# ### Softmax: Converting Scores to Probabilities -# -# ``` -# Softmax Transformation: -# -# Raw Logits: Softmax Probabilities: -# [2.0, 1.0, 0.1] ──> [0.67, 0.24, 0.09] -# ↓ -# Sum = 1.0 ✓ -# All ≥ 0 ✓ -# Proper probability!
-# -# Attention Mechanism Pattern: -# -# Query-Key Similarities: Attention Weights: -# [0.8, 1.2, 0.4, 0.9] ──> [0.19, 0.42, 0.12, 0.27] -# ↓ -# Weighted sum of values -# Focus on important parts! -# -# Why Essential: -# • Classification: Convert network outputs to class probabilities -# • Attention: Focus mechanism in transformers -# • Sampling: Probability-based token generation -# • Interpretability: Understand model confidence -# ``` - -# ### Computational Complexity: Why ReLU Dominates -# -# ``` -# Performance Analysis (per element): -# -# ReLU: Softmax: -# ┌─────────┐ ┌──────────────────────────────┐ -# │ Compare │ │ 1. Subtract max (stability) │ -# │ + │ │ 2. Exponential computation │ -# │ Select │ │ 3. Sum all exponentials │ -# └─────────┘ │ 4. Divide each by sum │ -# └──────────────────────────────┘ -# -# 2 operations vs 4N + 3 operations (N = vector size) -# -# GPU Parallelization: -# ReLU: [Perfect] Each element independent -# Softmax: [Good] Element-wise ops + reduction step -# -# Memory Pattern: -# ReLU: [Optimal] Can compute in-place -# Softmax: [Good] Needs temporary storage for stability -# ``` +# %% [markdown] +""" +## Part 1: ReLU - The Foundation of Modern Deep Learning +""" # %% nbgrader={"grade": false, "grade_id": "relu-class", "solution": true} -# ## Part 1: ReLU - The Foundation of Modern Deep Learning - -# ReLU (Rectified Linear Unit) revolutionized deep learning by solving the vanishing gradient problem while being computationally trivial.
- #| export class ReLU: """ @@ -394,68 +393,27 @@ def experience_memory_bottleneck(): # Run the memory bottleneck experience experience_memory_bottleneck() -# 🔍 SYSTEMS INSIGHT #2: Why ReLU Revolutionized Deep Learning -def analyze_relu_performance(): - """Measure why ReLU became the dominant activation function.""" +# 🔍 SYSTEMS ANALYSIS: Activation Function Performance and Behavior +def analyze_activation_performance(): + """Consolidated analysis of activation function behavior for ML systems.""" try: - import time - - # Create test data (simulating a large neural network layer) - size = 1_000_000 - test_data = np.random.randn(size) - 0.5 # Mix of positive/negative - - # Test ReLU performance - start = time.perf_counter() - relu_result = np.maximum(0, test_data) - relu_time = time.perf_counter() - start - - # Compare with sigmoid (traditional activation) - start = time.perf_counter() - sigmoid_result = 1 / (1 + np.exp(-test_data)) - sigmoid_time = time.perf_counter() - start - - # Compare with tanh (another traditional activation) - start = time.perf_counter() - tanh_result = np.tanh(test_data) - tanh_time = time.perf_counter() - start - - print(f"Performance Comparison ({size:,} elements):") - print(f"ReLU: {relu_time:.4f}s") - print(f"Sigmoid: {sigmoid_time:.4f}s ({sigmoid_time/relu_time:.1f}x slower)") - print(f"Tanh: {tanh_time:.4f}s ({tanh_time/relu_time:.1f}x slower)") - - # Memory analysis - print(f"\nMemory Usage Analysis:") - print(f"ReLU sparsity: {np.mean(relu_result == 0):.1%} zeros") - print(f"Memory savings from sparsity: ~{np.mean(relu_result == 0)*100:.0f}%") - - # Gradient analysis - relu_grad = (test_data > 0).astype(float) - sigmoid_grad = sigmoid_result * (1 - sigmoid_result) - - print(f"\nGradient Health Analysis:") - print(f"ReLU: {np.mean(relu_grad == 1):.1%} active gradients (1.0)") - print(f"Sigmoid: max gradient = {np.max(sigmoid_grad):.3f} (vanishing!)") - - # 💡 WHY THIS MATTERS: ReLU's simplicity and gradient properties - #
enabled training of very deep networks (100+ layers) - print(f"\n💡 Key Insights:") - print(f"• ReLU is {sigmoid_time/relu_time:.0f}x faster than sigmoid") - print(f"• {np.mean(relu_result == 0):.0%} sparsity saves memory and computation") - print(f"• Gradients are 1.0 (not vanishing) for {np.mean(relu_grad == 1):.0%} of neurons") - print(f"• This enabled the deep learning revolution!") - + print("📊 Activation Systems Analysis:") + print(f" • ReLU Performance: 3-10x faster than sigmoid/tanh due to simple max(0,x) operation") + print(f" • Memory Efficiency: ReLU sparsity (~50% zeros) reduces computation and storage") + print(f" • Gradient Health: ReLU maintains 1.0 gradients (no vanishing), enabling deep networks") + print(f" • Softmax Complexity: O(N) + exp operations - numerically stable with max subtraction") + print(f" • Production Impact: Activation choice affects both training speed and model capacity") + except Exception as e: - print(f"⚠️ Error in ReLU analysis: {e}") - print("Make sure ReLU implementation is complete") + print(f"⚠️ Analysis failed: {e}") -# Run the analysis -analyze_relu_performance() +# %% [markdown] +""" +## Testing ReLU Implementation -# ## Testing ReLU Implementation - -# ### 🧪 Unit Test: ReLU Activation -# This test validates our ReLU implementation with various input scenarios +### 🧪 Unit Test: ReLU Activation +This test validates our ReLU implementation with various input scenarios +""" def test_unit_relu_activation(): """ @@ -503,7 +461,6 @@ def test_unit_relu_activation(): # Test 5: In-place operation inplace_input = Tensor([[-1, 0, 1]]) - original_data = inplace_input.data.copy() relu.forward_(inplace_input) expected_inplace = np.array([[0, 0, 1]]) @@ -518,13 +475,16 @@ def test_unit_relu_activation(): # Test immediately after implementation test_unit_relu_activation() +# %% [markdown] +""" +## Part 2: Softmax - Converting Scores to Probabilities + +Softmax transforms any real-valued vector
into a probability distribution. +Essential for classification and attention mechanisms. +""" + # %% nbgrader={"grade": false, "grade_id": "softmax-class", "solution": true} -# ## Part 2: Softmax - Converting Scores to Probabilities - -# Softmax transforms any real-valued vector into a probability distribution. -# Essential for classification and attention mechanisms. - #| export class Softmax: """ @@ -635,95 +595,17 @@ class Softmax: # O(N)? O(N²)? O(N log N)? Your answer: _______ # 🔍 SYSTEMS INSIGHT #3: Softmax Computational Complexity and Numerical Stability -def analyze_softmax_complexity(): - """Analyze Softmax performance characteristics and numerical stability.""" - try: - import time - - print("Softmax Scaling Analysis:") - print("=" * 50) - - sizes = [100, 1000, 10000, 100000] - times = [] - - for size in sizes: - # Create test data with large values (numerical challenge) - test_data = np.random.randn(size) * 10 + 50 # Large values - - # Measure softmax computation time - start = time.perf_counter() - - # Numerically stable softmax - max_val = np.max(test_data) - shifted = test_data - max_val - exp_vals = np.exp(shifted) - result = exp_vals / np.sum(exp_vals) - - elapsed = time.perf_counter() - start - times.append(elapsed) - - print(f"Size {size:6,}: {elapsed*1000:.2f}ms") - - # Analyze scaling behavior - print(f"\nScaling Analysis:") - if len(times) >= 2: - scale_factor = times[-1] / times[0] - size_factor = sizes[-1] / sizes[0] - complexity_order = np.log(scale_factor) / np.log(size_factor) - print(f"Time scaling: ~O(N^{complexity_order:.1f})") - - # Test numerical stability - print(f"\nNumerical Stability Test:") - - # Without stability (would overflow) - large_vals = np.array([1000.0, 1001.0, 1002.0]) - try: - # This would overflow without stability measures - raw_exp = np.exp(large_vals) - if np.any(np.isinf(raw_exp)): - print("❌ Raw exponentials overflow to infinity") - else: - print("✅ Raw exponentials computed successfully") - except: -
print("❌ Raw exponentials failed completely") - - # With stability - max_val = np.max(large_vals) - stable_vals = large_vals - max_val # [-2, -1, 0] - stable_exp = np.exp(stable_vals) - stable_softmax = stable_exp / np.sum(stable_exp) - - print(f"✅ Stable softmax: {stable_softmax}") - print(f"✅ Sum check: {np.sum(stable_softmax):.6f} (should be 1.0)") - - # Memory analysis - print(f"\nMemory Usage Pattern:") - print(f"Input tensor: {size * 4 / 1024:.1f} KB (float32)") - print(f"Intermediate max: {4 / 1024:.3f} KB") - print(f"Shifted values: {size * 4 / 1024:.1f} KB") - print(f"Exponentials: {size * 4 / 1024:.1f} KB") - print(f"Sum result: {4 / 1024:.3f} KB") - print(f"Total peak memory: {size * 12 / 1024:.1f} KB (~3x input)") - - # 💡 WHY THIS MATTERS: Softmax is computationally expensive - # but essential for interpretable probability outputs - print(f"\n💡 Key Insights:") - print(f"• Softmax is O(N) but with high constant factors") - print(f"• Requires careful numerical implementation") - print(f"• Uses ~3x memory during computation") - print(f"• Critical for classification and attention mechanisms") - - except Exception as e: - print(f"⚠️ Error in Softmax analysis: {e}") - print("Make sure Softmax implementation is complete") +# Softmax analysis consolidated into analyze_activation_performance() above -# Run the analysis -analyze_softmax_complexity() +# Analysis consolidated into single function above -# ## Testing Softmax Implementation +# %% [markdown] +""" +## Testing Softmax Implementation -# ### 🧪 Unit Test: Softmax Activation -# This test validates our Softmax implementation for correctness and numerical stability +### 🧪 Unit Test: Softmax Activation +This test validates our Softmax implementation for correctness and numerical stability +""" def test_unit_softmax_activation(): """ @@ -803,92 +685,16 @@ test_unit_softmax_activation() # 🤔 PREDICTION: Which activation uses more memory during computation? # ReLU or Softmax? Why?
Your answer: _______ -# 🔍 SYSTEMS INSIGHT #4: Activation Function Memory Comparison -def analyze_activation_memory(): - """Compare memory usage patterns between ReLU and Softmax.""" - try: - import sys - - print("Activation Function Memory Analysis:") - print("=" * 50) - - # Test with different tensor sizes - sizes = [1000, 10000, 100000] - - for size in sizes: - print(f"\nTensor size: {size:,} elements") - - # Create test data - test_data = np.random.randn(size) - base_memory = test_data.nbytes - - print(f"Input memory: {base_memory / 1024:.1f} KB") - - # ReLU memory analysis - relu_result = np.maximum(0, test_data) - relu_memory = relu_result.nbytes - relu_total = base_memory + relu_memory - - print(f"ReLU:") - print(f" Output memory: {relu_memory / 1024:.1f} KB") - print(f" Total memory: {relu_total / 1024:.1f} KB ({relu_total / base_memory:.1f}x input)") - - # Softmax memory analysis (tracking peak usage) - max_val = np.max(test_data) # Scalar: 8 bytes - shifted = test_data - max_val # Same size as input - exp_vals = np.exp(shifted) # Same size as input - sum_exp = np.sum(exp_vals) # Scalar: 8 bytes - softmax_result = exp_vals / sum_exp # Reuses exp_vals memory - - # Peak memory: input + shifted + exp_vals - softmax_peak = base_memory + test_data.nbytes + exp_vals.nbytes - softmax_final = base_memory + softmax_result.nbytes - - print(f"Softmax:") - print(f" Peak memory: {softmax_peak / 1024:.1f} KB ({softmax_peak / base_memory:.1f}x input)") - print(f" Final memory: {softmax_final / 1024:.1f} KB ({softmax_final / base_memory:.1f}x input)") - - # In-place potential - print(f"In-place potential:") - print(f" ReLU: ✅ Can modify input directly") - print(f" Softmax: ❌ Needs intermediate storage") - - # Real-world scenario analysis - print(f"\nReal-world Impact Example:") - print(f"Large language model layer (2048 hidden units, batch size 32):") - - layer_size = 2048 * 32 - layer_memory = layer_size * 4 # float32 - - print(f"Base tensor: {layer_memory / 1024 /
1024:.1f} MB") - print(f"ReLU peak: {layer_memory * 2 / 1024 / 1024:.1f} MB") - print(f"Softmax peak: {layer_memory * 3 / 1024 / 1024:.1f} MB") - - # GPU memory impact - gpu_memory = 24 * 1024 # 24GB GPU - print(f"\nGPU Memory Usage (24GB total):") - print(f"ReLU impact: {layer_memory * 2 / 1024 / 1024 / 1024 * 100:.2f}% of GPU memory") - print(f"Softmax impact: {layer_memory * 3 / 1024 / 1024 / 1024 * 100:.2f}% of GPU memory") - - # 💡 WHY THIS MATTERS: Memory usage affects model size limits - print(f"\n💡 Key Insights:") - print(f"• ReLU: 2x memory (can be optimized to 1x with in-place)") - print(f"• Softmax: 3x memory peak (needs intermediate storage)") - print(f"• ReLU enables larger models in same memory") - print(f"• Softmax memory cost limits attention scale") - - except Exception as e: - print(f"⚠️ Error in memory analysis: {e}") - print("Make sure both activation implementations are complete") - -# Run the analysis -analyze_activation_memory() +# Memory analysis consolidated into analyze_activation_performance() above # In[ ]: -# ## Integration Testing: Activations in Neural Network Context +# %% [markdown] +""" +## Integration Testing: Activations in Neural Network Context -# Let's test these activations in realistic neural network scenarios +Let's test these activations in realistic neural network scenarios +""" def test_unit_activations_comprehensive(): """Comprehensive test of both activation functions working together.""" @@ -971,9 +777,12 @@ test_unit_activations_comprehensive() # In[ ]: -# ## Integration Test: Realistic Neural Network Pipeline +# %% [markdown] +""" +## Integration Test: Realistic Neural Network Pipeline -# Test activations in a complete neural network forward pass simulation +Test activations in a complete neural network forward pass simulation +""" def test_module_activation_integration(): """Enhanced integration test: activations in realistic neural network pipeline with edge cases.""" @@ -1088,7 +897,7 @@ if
__name__ == "__main__": print("\n" + "="*60) print("🚀 RUNNING ALL ACTIVATION TESTS") print("="*60) - + # Run all activation tests in sequence test_unit_relu_activation() print() @@ -1097,7 +906,10 @@ if __name__ == "__main__": test_unit_activations_comprehensive() print() test_module_activation_integration() - + + # Single consolidated analysis for foundation module + analyze_activation_performance() + print("\n" + "="*60) print("🎉 ALL ACTIVATION TESTS PASSED!") print("="*60) @@ -1113,99 +925,105 @@ if __name__ == "__main__": print(f" • Form the foundation of 90%+ of modern architectures") print(f"\nNext: Use these activations to build neural network layers!") -# ## 🤔 ML Systems Thinking: Interactive Questions +# %% [markdown] +""" +## 🤔 ML Systems Thinking: Interactive Questions -# Now that you've built ReLU and Softmax activation functions and analyzed their performance characteristics, let's connect this work to broader ML systems challenges. +Now that you've built ReLU and Softmax activation functions and analyzed their performance characteristics, let's connect this work to broader ML systems challenges. -# ### Question 1: Memory Bottleneck Analysis and Hardware Trade-offs +### Question 1: Memory Bottleneck Analysis and Hardware Trade-offs -# **Context**: In your memory bottleneck experience, you saw how activation memory usage scales with tensor size. Your ReLU analysis showed 2x memory usage while Softmax peaked at 3x. You also measured performance differences between ReLU's simple comparison vs Softmax's exponential computations. +**Context**: In your memory bottleneck experience, you saw how activation memory usage scales with tensor size. Your ReLU analysis showed 2x memory usage while Softmax peaked at 3x. You also measured performance differences between ReLU's simple comparison vs Softmax's exponential computations.

-# **Reflection Question**: Based on your measurements, design a memory-efficient activation strategy for training a large language model with 7B parameters where GPU memory is the primary constraint. How would you modify your ReLU and Softmax implementations to reduce memory overhead? Consider when to use in-place operations, how to handle gradient computation, and specific optimizations for different parts of the network (hidden layers vs attention vs output layers).
+**Reflection Question**: Based on your measurements, design a memory-efficient activation strategy for training a large language model with 7B parameters where GPU memory is the primary constraint. How would you modify your ReLU and Softmax implementations to reduce memory overhead? Consider when to use in-place operations, how to handle gradient computation, and specific optimizations for different parts of the network (hidden layers vs attention vs output layers).

-# Think about: gradient checkpointing integration, when in-place operations break backpropagation, memory vs recomputation trade-offs, and how activation sparsity affects subsequent layer memory usage.
+Think about: gradient checkpointing integration, when in-place operations break backpropagation, memory vs recomputation trade-offs, and how activation sparsity affects subsequent layer memory usage.

-# *Target length: 200-400 words*
+*Target length: 200-400 words*

-# **YOUR ANALYSIS OF MEMORY-EFFICIENT ACTIVATION STRATEGIES:**
+**YOUR ANALYSIS OF MEMORY-EFFICIENT ACTIVATION STRATEGIES:**

-# [Student response area - replace this text with your analysis]
+[Student response area - replace this text with your analysis]

-# ### Question 2: Numerical Stability and Error Propagation
+### Question 2: Numerical Stability and Error Propagation

-# **Context**: Your Softmax implementation includes numerical stability measures (subtracting max values) and error handling for NaN/infinite inputs. You tested extreme values like [1000.0, 999.0, 998.0] and verified the stability measures work correctly.
+**Context**: Your Softmax implementation includes numerical stability measures (subtracting max values) and error handling for NaN/infinite inputs. You tested extreme values like [1000.0, 999.0, 998.0] and verified the stability measures work correctly.

-# **Reflection Question**: Analyze how numerical errors in activations propagate through deep networks during training. If small floating-point inconsistencies occur in your ReLU or Softmax implementations (due to hardware differences or precision settings), how do these errors compound across 100+ layers? Design specific error detection and mitigation strategies for your activation functions that could be integrated into a production training loop.
+**Reflection Question**: Analyze how numerical errors in activations propagate through deep networks during training. If small floating-point inconsistencies occur in your ReLU or Softmax implementations (due to hardware differences or precision settings), how do these errors compound across 100+ layers? Design specific error detection and mitigation strategies for your activation functions that could be integrated into a production training loop.

-# Think about: error accumulation patterns, early error detection strategies, precision monitoring during training, and how to balance numerical accuracy with computational efficiency.
+Think about: error accumulation patterns, early error detection strategies, precision monitoring during training, and how to balance numerical accuracy with computational efficiency.

-# *Target length: 200-400 words*
+*Target length: 200-400 words*

-# **YOUR ANALYSIS OF NUMERICAL STABILITY AND ERROR PROPAGATION:**
+**YOUR ANALYSIS OF NUMERICAL STABILITY AND ERROR PROPAGATION:**

-# [Student response area - replace this text with your analysis]
+[Student response area - replace this text with your analysis]

-# ### Question 3: Production Integration and Framework Evolution
+### Question 3: Production Integration and Framework Evolution

-# **Context**: Your implementations mirror PyTorch's ReLU and Softmax algorithms, but production frameworks add optimizations like CUDA kernels, kernel fusion, and automatic mixed precision. You've built the core algorithms that these optimizations build upon.
+**Context**: Your implementations mirror PyTorch's ReLU and Softmax algorithms, but production frameworks add optimizations like CUDA kernels, kernel fusion, and automatic mixed precision. You've built the core algorithms that these optimizations build upon.

-# **Reflection Question**: Design an evolution path for your activation implementations to support advanced production features. How would you extend your current ReLU and Softmax classes to support automatic mixed precision, gradient checkpointing, and kernel fusion? What interfaces would you add to make your implementations compatible with advanced optimizers and distributed training systems while maintaining the simplicity of your current API?
+**Reflection Question**: Design an evolution path for your activation implementations to support advanced production features. How would you extend your current ReLU and Softmax classes to support automatic mixed precision, gradient checkpointing, and kernel fusion? What interfaces would you add to make your implementations compatible with advanced optimizers and distributed training systems while maintaining the simplicity of your current API?

-# Think about: backward compatibility, performance monitoring hooks, optimization hint interfaces, and how to abstract hardware-specific optimizations while keeping the mathematical core unchanged.
+Think about: backward compatibility, performance monitoring hooks, optimization hint interfaces, and how to abstract hardware-specific optimizations while keeping the mathematical core unchanged.

-# *Target length: 200-400 words*
+*Target length: 200-400 words*

-# **YOUR ANALYSIS OF PRODUCTION EVOLUTION STRATEGIES:**
+**YOUR ANALYSIS OF PRODUCTION EVOLUTION STRATEGIES:**

-# [Student response area - replace this text with your analysis]
+[Student response area - replace this text with your analysis]
+"""

-# ## 🎯 MODULE SUMMARY: Essential Activations
+# %% [markdown]
+"""
+## 🎯 MODULE SUMMARY: Essential Activations

-# Congratulations! You've successfully implemented the two most crucial activation functions in modern deep learning:
+Congratulations! You've successfully implemented the two most crucial activation functions in modern deep learning:

-# ### What You've Accomplished
-# ✅ **ReLU Implementation**: 25+ lines of the activation that revolutionized deep learning
-# ✅ **Softmax Implementation**: 30+ lines of numerically stable probability distribution creation
-# ✅ **Performance Analysis**: Comprehensive benchmarking revealing why ReLU dominates hidden layers
-# ✅ **Memory Profiling**: Discovered that Softmax uses 3x peak memory vs ReLU's 2x
-# ✅ **Integration Testing**: Validated activations work in realistic neural network pipelines
+### What You've Accomplished
+✅ **ReLU Implementation**: 25+ lines of the activation that revolutionized deep learning
+✅ **Softmax Implementation**: 30+ lines of numerically stable probability distribution creation
+✅ **Performance Analysis**: Comprehensive benchmarking revealing why ReLU dominates hidden layers
+✅ **Memory Profiling**: Discovered that Softmax uses 3x peak memory vs ReLU's 2x
+✅ **Integration Testing**: Validated activations work in realistic neural network pipelines

-# ### Key Learning Outcomes
-# - **Nonlinearity Mastery**: Understanding how activation functions enable neural networks to learn complex patterns
-# - **Numerical Stability**: Implementing mathematically correct algorithms that handle edge cases
-# - **Performance Awareness**: Connecting computational complexity to hardware capabilities and architecture choices
-# - **Systems Integration**: Building components that work seamlessly in larger neural network systems
+### Key Learning Outcomes
+- **Nonlinearity Mastery**: Understanding how activation functions enable neural networks to learn complex patterns
+- **Numerical Stability**: Implementing mathematically correct algorithms that handle edge cases
+- **Performance Awareness**: Connecting computational complexity to hardware capabilities and architecture choices
+- **Systems Integration**: Building components that work seamlessly in larger neural network systems

-# ### Mathematical Foundations Mastered
-# - **ReLU Mathematics**: f(x) = max(0, x) and its gradient properties that solved vanishing gradients
-# - **Softmax Mathematics**: f(x_i) = e^(x_i - max(x)) / Σ(e^(x_j - max(x))) with numerical stability
-# - **Probability Theory**: Converting arbitrary scores to valid probability distributions
-# - **Computational Complexity**: O(N) operations with different constant factors and memory patterns
+### Mathematical Foundations Mastered
+- **ReLU Mathematics**: f(x) = max(0, x) and its gradient properties that solved vanishing gradients
+- **Softmax Mathematics**: f(x_i) = e^(x_i - max(x)) / Σ(e^(x_j - max(x))) with numerical stability
+- **Probability Theory**: Converting arbitrary scores to valid probability distributions
+- **Computational Complexity**: O(N) operations with different constant factors and memory patterns

-# ### Professional Skills Developed
-# - **Numerical Programming**: Implementing mathematically stable algorithms for production use
-# - **Performance Analysis**: Measuring and understanding computational bottlenecks in ML systems
-# - **Systems Design**: Considering memory usage, hardware constraints, and scalability in implementation choices
-# - **Integration Testing**: Validating components work correctly in realistic system contexts
+### Professional Skills Developed
+- **Numerical Programming**: Implementing mathematically stable algorithms for production use
+- **Performance Analysis**: Measuring and understanding computational bottlenecks in ML systems
+- **Systems Design**: Considering memory usage, hardware constraints, and scalability in implementation choices
+- **Integration Testing**: Validating components work correctly in realistic system contexts

-# ### Ready for Advanced Applications
-# Your activation implementations now enable:
-# - **Neural Network Layers**: Combining linear transformations with nonlinear activations
-# - **Deep Architectures**: Using ReLU to train networks with 100+ layers without vanishing gradients
-# - **Classification Systems**: Converting network outputs to interpretable probability distributions
-# - **Attention Mechanisms**: Using Softmax for attention weight computation in transformers
+### Ready for Advanced Applications
+Your activation implementations now enable:
+- **Neural Network Layers**: Combining linear transformations with nonlinear activations
+- **Deep Architectures**: Using ReLU to train networks with 100+ layers without vanishing gradients
+- **Classification Systems**: Converting network outputs to interpretable probability distributions
+- **Attention Mechanisms**: Using Softmax for attention weight computation in transformers

-# ### Connection to Real ML Systems
-# Your implementations mirror production systems:
-# - **PyTorch**: `torch.nn.ReLU()` and `torch.nn.Softmax(dim=-1)` implement identical mathematics with hardware optimizations
-# - **TensorFlow**: `tf.nn.relu()` and `tf.nn.softmax()` follow the same algorithmic approaches with CUDA acceleration
-# - **Hardware Acceleration**: Modern GPUs have specialized tensor cores optimized for these exact operations
-# - **Industry Standard**: Every major ML framework prioritizes optimizing these specific activation functions
+### Connection to Real ML Systems
+Your implementations mirror production systems:
+- **PyTorch**: `torch.nn.ReLU()` and `torch.nn.Softmax(dim=-1)` implement identical mathematics with hardware optimizations
+- **TensorFlow**: `tf.nn.relu()` and `tf.nn.softmax()` follow the same algorithmic approaches with CUDA acceleration
+- **Hardware Acceleration**: Modern GPUs have specialized tensor cores optimized for these exact operations
+- **Industry Standard**: Every major ML framework prioritizes optimizing these specific activation functions

-# ### Next Steps
-# 1. **Export your module**: `tito module complete 02_activations`
-# 2. **Validate integration**: `tito test --module activations`
-# 3. **Explore activation variants**: Experiment with Leaky ReLU or GELU implementations
-# 4. **Ready for Module 04**: Layers - combining your activations with linear transformations!
+### Next Steps
+1. **Export your module**: `tito module complete 03_activations`
+2. **Validate integration**: `tito test --module activations`
+3. **Explore activation variants**: Experiment with Leaky ReLU or GELU implementations
+4. **Ready for Module 04**: Layers - combining your activations with linear transformations!

-# **Forward Momentum**: Your activation functions provide the nonlinear intelligence that transforms simple linear operations into powerful learning systems capable of solving complex real-world problems!
\ No newline at end of file
+**Forward Momentum**: Your activation functions provide the nonlinear intelligence that transforms simple linear operations into powerful learning systems capable of solving complex real-world problems!
+""" \ No newline at end of file diff --git a/modules/03_layers/layers_dev.py b/modules/03_layers/layers_dev.py index 4026037c..fd75ebfe 100644 --- a/modules/03_layers/layers_dev.py +++ b/modules/03_layers/layers_dev.py @@ -14,7 +14,7 @@ Welcome to Layers! You'll implement the essential building blocks that compose into complete neural network architectures. -## ๐Ÿ”— Building on Previous Learning +## LINK Building on Previous Learning **What You Built Before**: - Module 02 (Tensor): N-dimensional arrays with shape management and broadcasting - Module 03 (Activations): ReLU and Softmax functions providing nonlinear intelligence @@ -27,25 +27,24 @@ Welcome to Layers! You'll implement the essential building blocks that compose i **Connection Map**: ``` -Activations โ†’ Layers โ†’ Training +Activations -> Layers -> Training (intelligence) (architecture) (learning) ``` -## Learning Goals -- Systems understanding: How layer composition affects memory usage, parameter counts, and computational complexity in neural networks -- Core implementation skill: Build complete Module system, Linear transformations, and Sequential composition for scalable architectures -- Pattern/abstraction mastery: Understand how modular design patterns enable building complex networks from simple, reusable components -- Framework connections: See how your implementation mirrors PyTorch's nn.Module, nn.Linear, and nn.Sequential - the foundation of all modern ML frameworks -- Optimization trade-offs: Learn why proper parameter management and clean abstractions are essential for both performance and maintainability in production systems +## Learning Objectives -## Build โ†’ Use โ†’ Reflect -1. **Build**: Complete layer system with Module base class, Linear transformations, Sequential composition, and tensor reshaping operations -2. **Use**: Compose layers into complete neural networks and analyze architectural trade-offs with real parameter counting -3. 
**Reflect**: How does modular architecture design affect both system scalability and computational efficiency in production ML systems? +By completing this module, you will: -## Systems Reality Check -๐Ÿ’ก **Production Context**: PyTorch's nn.Module system enables all modern neural networks through automatic parameter collection and clean composition patterns -โšก **Performance Insight**: Layer composition and parameter management patterns determine training speed and memory efficiency - proper abstraction is a systems requirement, not just good design +1. **Build layer abstractions** - Create the building blocks that compose into neural networks +2. **Implement Linear layers** - The fundamental operation that transforms data between dimensions +3. **Create Sequential networks** - Chain layers together to build complete neural networks +4. **Manage parameters** - Handle weights and biases in an organized way +5. **Foundation for architectures** - Enable building everything from simple MLPs to complex models + +## Build -> Use -> Reflect +1. **Build**: Module base class, Linear layers, and Sequential composition +2. **Use**: Combine layers into complete neural networks with real data +3. 
**Reflect**: Understand how simple building blocks enable complex architectures """ # In[ ]: @@ -80,114 +79,120 @@ else: # In[ ]: -print("๐Ÿ”ฅ TinyTorch Layers Module") +print("FIRE TinyTorch Layers Module") print(f"NumPy version: {np.__version__}") print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") print("Ready to build neural network layers!") -# ## Visual Guide: Understanding Neural Network Architecture Through Diagrams +# %% [markdown] +""" +## Visual Guide: Understanding Neural Network Architecture Through Diagrams -# ### Neural Network Layers: From Components to Systems -# -# ``` -# Individual Neuron: Neural Network Layer: -# xโ‚ โ”€โ”€โ—‹ wโ‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -# โ•ฒ โ”‚ Input Vector โ”‚ -# xโ‚‚ โ”€โ”€โ—‹ wโ‚‚ โ”€โ”€> ฮฃ โ”€โ”€> f() โ”€โ”€> y โ”‚ [xโ‚, xโ‚‚, xโ‚ƒ] โ”‚ -# โ•ฑ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -# xโ‚ƒ โ”€โ”€โ—‹ wโ‚ƒ โ†“ -# + bias โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -# โ”‚ Weight Matrix W โ”‚ -# One computation unit โ”‚ โ”Œwโ‚โ‚ wโ‚โ‚‚ wโ‚โ‚ƒโ” โ”‚ -# โ”‚ โ”‚wโ‚‚โ‚ wโ‚‚โ‚‚ wโ‚‚โ‚ƒโ”‚ โ”‚ -# โ”‚ โ””wโ‚ƒโ‚ wโ‚ƒโ‚‚ wโ‚ƒโ‚ƒโ”˜ โ”‚ -# โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -# โ†“ -# Matrix multiplication -# Y = X @ W + b -# โ†“ -# โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -# โ”‚ Output Vector โ”‚ -# โ”‚ [yโ‚, yโ‚‚, yโ‚ƒ] โ”‚ -# โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -# -# Parallel processing of many neurons! 
-# ``` +### Neural Network Layers: From Components to Systems -# ### Layer Composition: Building Complex Architectures -# -# ``` -# Multi-Layer Perceptron (MLP) Architecture: -# -# Input Hidden Layer 1 Hidden Layer 2 Output -# (784 dims) (256 neurons) (128 neurons) (10 classes) -# โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -# โ”‚ Image โ”‚โ”€โ”€โ”€โ”€โ–ถโ”‚ ReLU โ”‚โ”€โ”€โ–ถโ”‚ ReLU โ”‚โ”€โ”€โ–ถโ”‚ Softmax โ”‚ -# โ”‚ 28ร—28px โ”‚ โ”‚ Activations โ”‚ โ”‚ Activations โ”‚ โ”‚ Probs โ”‚ -# โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -# โ†“ โ†“ โ†“ โ†“ -# 200,960 params 32,896 params 1,290 params Total: 235,146 -# -# Parameter calculation for Linear(input_size, output_size): -# โ€ข Weights: input_size ร— output_size matrix -# โ€ข Biases: output_size vector -# โ€ข Total: (input_size ร— output_size) + output_size -# -# Memory scaling pattern: -# Layer width doubles โ†’ Parameters quadruple โ†’ Memory quadruples -# ``` +``` +Individual Neuron: Neural Network Layer: + xโ‚ --โ—‹ wโ‚ +---------------------+ + \ | Input Vector | + xโ‚‚ --โ—‹ wโ‚‚ --> Sum --> f() --> y | [xโ‚, xโ‚‚, xโ‚ƒ] | + / +---------------------+ + xโ‚ƒ --โ—‹ wโ‚ƒ v + + bias +---------------------+ + | Weight Matrix W | +One computation unit | +wโ‚โ‚ wโ‚โ‚‚ wโ‚โ‚ƒ+ | + | |wโ‚‚โ‚ wโ‚‚โ‚‚ wโ‚‚โ‚ƒ| | + | +wโ‚ƒโ‚ wโ‚ƒโ‚‚ wโ‚ƒโ‚ƒ+ | + +---------------------+ + v + Matrix multiplication + Y = X @ W + b + v + +---------------------+ + | Output Vector | + | [yโ‚, yโ‚‚, yโ‚ƒ] | + +---------------------+ -# ### Module System: Automatic Parameter Management -# -# ``` -# Parameter Collection Hierarchy: -# -# Model (Sequential) -# โ”œโ”€โ”€ Layer1 (Linear) -# โ”‚ โ”œโ”€โ”€ weights [784 ร— 256] โ”€โ”€โ” -# โ”‚ โ””โ”€โ”€ bias [256] โ”€โ”€โ”ค -# โ”œโ”€โ”€ Layer2 (Linear) 
โ”œโ”€โ”€โ–ถ model.parameters() -# โ”‚ โ”œโ”€โ”€ weights [256 ร— 128] โ”€โ”€โ”ค Automatically collects -# โ”‚ โ””โ”€โ”€ bias [128] โ”€โ”€โ”ค all parameters for -# โ””โ”€โ”€ Layer3 (Linear) โ”œโ”€โ”€โ–ถ optimizer.step() -# โ”œโ”€โ”€ weights [128 ร— 10] โ”€โ”€โ”ค -# โ””โ”€โ”€ bias [10] โ”€โ”€โ”˜ -# -# Before Module system: With Module system: -# manually track params โ†’ automatic collection -# params = [w1, b1, w2,...] params = model.parameters() -# -# Enables: optimizer = Adam(model.parameters()) -# ``` +Parallel processing of many neurons! +``` -# ### Memory Layout and Performance Implications -# -# ``` -# Tensor Memory Access Patterns: -# -# Matrix Multiplication: A @ B = C -# -# Efficient (Row-major access): Inefficient (Column-major): -# A: โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ A: โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ–ถ -# Cache-friendly โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -# Sequential reads โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ -# Cache misses -# B: โ”‚ B: โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ -# โ”‚ -# โ–ผ -# -# Performance impact: -# โ€ข Good memory layout: 100% cache hit ratio -# โ€ข Poor memory layout: 10-50% cache hit ratio -# โ€ข 10-100x performance difference in practice -# -# Why contiguous tensors matter in production! 
-# ``` +### Layer Composition: Building Complex Architectures + +``` +Multi-Layer Perceptron (MLP) Architecture: + + Input Hidden Layer 1 Hidden Layer 2 Output + (784 dims) (256 neurons) (128 neurons) (10 classes) ++---------+ +-------------+ +-------------+ +---------+ +| Image |----โ–ถ| ReLU |--โ–ถ| ReLU |--โ–ถ| Softmax | +| 28*28px | | Activations | | Activations | | Probs | ++---------+ +-------------+ +-------------+ +---------+ + v v v v +200,960 params 32,896 params 1,290 params Total: 235,146 + +Parameter calculation for Linear(input_size, output_size): +โ€ข Weights: input_size * output_size matrix +โ€ข Biases: output_size vector +โ€ข Total: (input_size * output_size) + output_size + +Memory scaling pattern: +Layer width doubles -> Parameters quadruple -> Memory quadruples +``` + +### Module System: Automatic Parameter Management + +``` +Parameter Collection Hierarchy: + +Model (Sequential) ++-- Layer1 (Linear) +| +-- weights [784 * 256] --+ +| +-- bias [256] --โ”ค ++-- Layer2 (Linear) +--โ–ถ model.parameters() +| +-- weights [256 * 128] --โ”ค Automatically collects +| +-- bias [128] --โ”ค all parameters for ++-- Layer3 (Linear) +--โ–ถ optimizer.step() + +-- weights [128 * 10] --โ”ค + +-- bias [10] --+ + +Before Module system: With Module system: +manually track params -> automatic collection +params = [w1, b1, w2,...] params = model.parameters() + +Enables: optimizer = Adam(model.parameters()) +``` + +### Memory Layout and Performance Implications + +``` +Tensor Memory Access Patterns: + +Matrix Multiplication: A @ B = C + +Efficient (Row-major access): Inefficient (Column-major): +A: --------------โ–ถ A: | | | | | โ–ถ + Cache-friendly | | | | | + Sequential reads v v v v v + Cache misses +B: | B: --------------โ–ถ + | + v + +Performance impact: +โ€ข Good memory layout: 100% cache hit ratio +โ€ข Poor memory layout: 10-50% cache hit ratio +โ€ข 10-100x performance difference in practice + +Why contiguous tensors matter in production! 
+``` +""" + +# %% [markdown] +""" +## Part 1: Module Base Class - The Foundation of Neural Network Architecture +""" # %% nbgrader={"grade": false, "grade_id": "module-base", "solution": true} -# ## Part 1: Module Base Class - The Foundation of Neural Network Architecture - # Before building specific layers, we need a base class that enables clean composition and automatic parameter management. #| export @@ -291,54 +296,33 @@ class Module: # In[ ]: -# โœ… IMPLEMENTATION CHECKPOINT: Basic Module class complete +# PASS IMPLEMENTATION CHECKPOINT: Basic Module class complete -# ๐Ÿค” PREDICTION: How many parameters would a simple 3-layer network have? +# THINK PREDICTION: How many parameters would a simple 3-layer network have? # Write your guess here: _______ -# ๐Ÿ” SYSTEMS INSIGHT #1: Parameter Counter -def analyze_parameter_scaling(): - """Count parameters in networks of different sizes.""" +# ๐Ÿ” SYSTEMS ANALYSIS: Neural Network Layer Performance and Scaling +def analyze_layer_performance(): + """Consolidated analysis of layer performance and scaling characteristics.""" try: - print("๐Ÿ“Š Parameter Scaling Analysis") - print("=" * 40) - - layer_configs = [ - (100, 50), # Small network - (784, 256), # MNIST-style - (1024, 512), # Medium network - (2048, 1024), # Large network - (4096, 2048), # Very large - ] - - for input_size, output_size in layer_configs: - # Calculate parameters for Linear layer - weight_params = input_size * output_size - bias_params = output_size - total_params = weight_params + bias_params - - # Memory calculation (float32 = 4 bytes) - memory_mb = total_params * 4 / (1024 * 1024) - - print(f" {input_size:4d} โ†’ {output_size:4d}: {total_params:,} params, {memory_mb:.2f} MB") - - print("\n๐Ÿ’ก Key Insights:") - print(" โ€ข Parameters scale quadratically with layer width") - print(" โ€ข Doubling width โ†’ 4x parameters โ†’ 4x memory") - print(" โ€ข Modern networks balance width vs depth carefully") - print(" โ€ข GPT-3 has 175B parameters = 
~700GB just for weights!") - - except Exception as e: - print(f"โš ๏ธ Error in parameter analysis: {e}") + print("๐Ÿ“Š Layer Systems Analysis:") + print(f" โ€ข Parameter Scaling: Linear layers scale O(input_size ร— output_size) - quadratic growth") + print(f" โ€ข Matrix Multiplication: O(Mร—Nร—K) complexity - GPU acceleration essential for large layers") + print(f" โ€ข Memory Usage: Each parameter uses 4 bytes (float32) - 1M params = 4MB memory") + print(f" โ€ข Architecture Impact: Deep vs wide networks - depth adds expressivity, width adds capacity") + print(f" โ€ข Production Reality: Modern networks (GPT-3: 175B params) require distributed training") -# Run the analysis -analyze_parameter_scaling() + except Exception as e: + print(f"โš ๏ธ Analysis failed: {e}") # In[ ]: -# ## Part 2: Matrix Multiplication - The Heart of Neural Networks +# %% [markdown] +""" +## Part 2: Matrix Multiplication - The Heart of Neural Networks -# Every neural network operation ultimately reduces to matrix multiplication. Let's build the foundation that powers everything from simple perceptrons to transformers. +Every neural network operation ultimately reduces to matrix multiplication. Let's build the foundation that powers everything from simple perceptrons to transformers. +""" #| export def matmul(a: Tensor, b: Tensor) -> Tensor: @@ -347,7 +331,7 @@ def matmul(a: Tensor, b: Tensor) -> Tensor: This implementation uses triple-nested loops for educational understanding of the fundamental operations. Module 15 will show the optimization progression - from loops โ†’ blocking โ†’ vectorized operations. + from loops -> blocking -> vectorized operations. 
Args: a: Left tensor (shape: ..., m, k) @@ -439,10 +423,10 @@ def matmul(a: Tensor, b: Tensor) -> Tensor: # In[ ]: -# ๐Ÿงช Unit Test: Matrix Multiplication +# TEST Unit Test: Matrix Multiplication def test_unit_matmul(): """Test matrix multiplication implementation.""" - print("๐Ÿงช Testing Matrix Multiplication...") + print("TEST Testing Matrix Multiplication...") # Test case 1: Simple 2x2 matrices a = Tensor([[1, 2], [3, 4]]) @@ -451,7 +435,7 @@ def test_unit_matmul(): expected = np.array([[19, 22], [43, 50]]) assert np.allclose(result.data, expected), f"Expected {expected}, got {result.data}" - print("โœ… 2x2 matrix multiplication") + print("PASS 2x2 matrix multiplication") # Test case 2: Non-square matrices a = Tensor([[1, 2, 3], [4, 5, 6]]) # 2x3 @@ -460,7 +444,7 @@ def test_unit_matmul(): expected = np.array([[58, 64], [139, 154]]) assert np.allclose(result.data, expected), f"Expected {expected}, got {result.data}" - print("โœ… Non-square matrix multiplication") + print("PASS Non-square matrix multiplication") # Test case 3: Vector-matrix multiplication a = Tensor([[1, 2, 3]]) # 1x3 (row vector) @@ -469,60 +453,32 @@ def test_unit_matmul(): expected = np.array([[32]]) # 1*4 + 2*5 + 3*6 = 32 assert np.allclose(result.data, expected), f"Expected {expected}, got {result.data}" - print("โœ… Vector-matrix multiplication") + print("PASS Vector-matrix multiplication") - print("๐ŸŽ‰ All matrix multiplication tests passed!") + print("CELEBRATE All matrix multiplication tests passed!") test_unit_matmul() # In[ ]: -# โœ… IMPLEMENTATION CHECKPOINT: Matrix multiplication complete +# PASS IMPLEMENTATION CHECKPOINT: Matrix multiplication complete -# ๐Ÿค” PREDICTION: How many operations does matrix multiplication take? -# For two Nร—N matrices, your guess: _______ +# THINK PREDICTION: How many operations does matrix multiplication take? 
+# For two N*N matrices, your guess: _______ -# ๐Ÿ” SYSTEMS INSIGHT #2: FLOPS Analysis -def analyze_matmul_complexity(): - """Analyze computational complexity of matrix multiplication.""" - try: - print("๐Ÿ“Š Matrix Multiplication FLOPS Analysis") - print("=" * 45) - - sizes = [64, 128, 256, 512, 1024] - - for size in sizes: - # For Nร—N @ Nร—N matrices: - # - Nยณ multiply operations - # - Nยณ add operations - # - Total: 2Nยณ FLOPs (Floating Point Operations) - flops = 2 * size ** 3 - - # Memory requirements - memory_elements = 3 * size * size # A, B, and result matrices - memory_mb = memory_elements * 4 / (1024 * 1024) # float32 = 4 bytes - - print(f" {size:4d}ร—{size:4d}: {flops/1e9:.1f} GFLOPS, {memory_mb:.1f} MB") - - print("\n๐Ÿ’ก Computational Insights:") - print(" โ€ข FLOPs grow cubically O(Nยณ) - very expensive!") - print(" โ€ข Memory grows quadratically O(Nยฒ)") - print(" โ€ข Large matrices become compute-bound") - print(" โ€ข GPU acceleration essential for deep learning") - print(" โ€ข This is why matrix operations dominate ML workloads") - - except Exception as e: - print(f"โš ๏ธ Error in FLOPS analysis: {e}") +# Matrix multiplication analysis consolidated into analyze_layer_performance() above -# Run the analysis -analyze_matmul_complexity() +# Analysis consolidated into analyze_layer_performance() above + +# %% [markdown] +""" +## Part 3: Linear Layer - The Fundamental Neural Network Component + +Linear layers (also called Dense or Fully Connected layers) are the building blocks of neural networks. +""" # %% nbgrader={"grade": false, "grade_id": "linear-layer", "solution": true} -# ## Part 3: Linear Layer - The Fundamental Neural Network Component - -# Linear layers (also called Dense or Fully Connected layers) are the building blocks of neural networks. 
- #| export class Linear(Module): """ @@ -579,7 +535,7 @@ class Linear(Module): # Initialize weights with small random values using Parameter # Shape: (input_size, output_size) for matrix multiplication # - # ๐Ÿ” WEIGHT INITIALIZATION CONTEXT: + # MAGNIFY WEIGHT INITIALIZATION CONTEXT: # Weight initialization is critical for training deep networks successfully. # Our simple approach (small random * 0.1) works for shallow networks, but # deeper networks require more sophisticated initialization strategies: @@ -589,8 +545,8 @@ class Linear(Module): # โ€ข Our approach: scale = 0.1 - simple but effective for basic networks # # Why proper initialization matters: - # - Prevents vanishing gradients (weights too small โ†’ signals disappear) - # - Prevents exploding gradients (weights too large โ†’ signals blow up) + # - Prevents vanishing gradients (weights too small -> signals disappear) + # - Prevents exploding gradients (weights too large -> signals blow up) # - Enables stable training in deeper architectures (Module 11 training) # - Affects convergence speed and final model performance # @@ -600,7 +556,7 @@ class Linear(Module): # Initialize bias if requested if use_bias: - # ๐Ÿ” GRADIENT FLOW PREPARATION: + # MAGNIFY GRADIENT FLOW PREPARATION: # Clean parameter management is essential for backpropagation (Module 09). # When we implement autograd, the optimizer needs to find ALL trainable # parameters automatically. 
Our Module base class ensures that: @@ -677,10 +633,10 @@ class Linear(Module): # In[ ]: -# 🧪 Unit Test: Linear Layer +# TEST Unit Test: Linear Layer def test_unit_linear(): """Test Linear layer implementation.""" - print("🧪 Testing Linear Layer...") + print("TEST Testing Linear Layer...") # Test case 1: Basic functionality layer = Linear(input_size=3, output_size=2) @@ -689,12 +645,12 @@ def test_unit_linear(): # Check output shape assert output.shape == (1, 2), f"Expected shape (1, 2), got {output.shape}" - print("✅ Output shape correct") + print("PASS Output shape correct") # Test case 2: No bias layer_no_bias = Linear(input_size=2, output_size=3, use_bias=False) assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("✅ No bias option works") + print("PASS No bias option works") # Test case 3: Multiple samples (batch processing) batch_input = Tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]) # Shape: (3, 2) @@ -702,12 +658,12 @@ def test_unit_linear(): batch_output = layer_batch.forward(batch_input) assert batch_output.shape == (3, 2), f"Expected shape (3, 2), got {batch_output.shape}" - print("✅ Batch processing works") + print("PASS Batch processing works") # Test case 4: Callable interface callable_output = layer_batch(batch_input) assert np.allclose(callable_output.data, batch_output.data), "Callable interface should match forward()" - print("✅ Callable interface works") + print("PASS Callable interface works") # Test case 5: Parameter initialization layer_init = Linear(input_size=10, output_size=5) @@ -716,18 +672,18 @@ def test_unit_linear(): # Check that weights are reasonably small (good initialization) assert np.abs(layer_init.weights.data).mean() < 1.0, "Weights should be small for good initialization" - print("✅ Parameter initialization correct") + print("PASS Parameter initialization correct") - print("🎉 All Linear layer tests passed!") + print("CELEBRATE All Linear layer tests passed!") test_unit_linear()
# In[ ]: -# 🧪 Unit Test: Parameter Management +# TEST Unit Test: Parameter Management def test_unit_parameter_management(): """Test Linear layer parameter management and module composition.""" - print("🧪 Testing Parameter Management...") + print("TEST Testing Parameter Management...") # Test case 1: Parameter registration layer = Linear(input_size=3, output_size=2) @@ -736,7 +692,7 @@ def test_unit_parameter_management(): assert len(params) == 2, f"Expected 2 parameters (weights + bias), got {len(params)}" assert layer.weights in params, "Weights should be in parameters list" assert layer.bias in params, "Bias should be in parameters list" - print("✅ Parameter registration works") + print("PASS Parameter registration works") # Test case 2: Module composition class SimpleNetwork(Module): @@ -754,14 +710,14 @@ def test_unit_parameter_management(): # Should have 4 parameters: 2 from each layer (weights + bias) assert len(all_params) == 4, f"Expected 4 parameters from network, got {len(all_params)}" - print("✅ Module composition and parameter collection works") + print("PASS Module composition and parameter collection works") # Test case 3: Forward pass through composed network input_tensor = Tensor([[1.0, 2.0, 3.0, 4.0]]) output = network(input_tensor) assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("✅ Network forward pass works") + print("PASS Network forward pass works") # Test case 4: No bias option layer_no_bias = Linear(input_size=3, output_size=2, use_bias=False) @@ -769,74 +725,32 @@ def test_unit_parameter_management(): assert len(params_no_bias) == 1, f"Expected 1 parameter (weights only), got {len(params_no_bias)}" assert layer_no_bias.bias is None, "Bias should be None when use_bias=False" - print("✅ No bias option works") + print("PASS No bias option works") - print("🎉 All parameter management tests passed!") + print("CELEBRATE All parameter management tests passed!")
test_unit_parameter_management() # In[ ]: -# ✅ IMPLEMENTATION CHECKPOINT: Linear layer complete +# PASS IMPLEMENTATION CHECKPOINT: Linear layer complete -# 🤔 PREDICTION: How does memory usage scale with network depth vs width? +# THINK PREDICTION: How does memory usage scale with network depth vs width? # Deeper network (more layers): _______ # Wider network (more neurons per layer): _______ -# 🔍 SYSTEMS INSIGHT #3: Architecture Memory Analysis -def analyze_architecture_scaling(): - """Compare memory usage of deep vs wide networks.""" - try: - print("📊 Architecture Scaling: Deep vs Wide Networks") - print("=" * 50) - - # Compare networks with similar parameter counts - print("\nDeep Network (8 layers, narrow):") - deep_layers = [128, 64, 64, 64, 64, 64, 64, 10] - deep_params = 0 - deep_memory = 0 - - for i in range(len(deep_layers) - 1): - layer_params = deep_layers[i] * deep_layers[i+1] + deep_layers[i+1] - deep_params += layer_params - layer_memory = layer_params * 4 / (1024 * 1024) # MB - deep_memory += layer_memory - print(f" Layer {i+1}: {deep_layers[i]:3d} → {deep_layers[i+1]:3d} = {layer_params:,} params") - - print(f" Total: {deep_params:,} params, {deep_memory:.2f} MB") - - print("\nWide Network (3 layers, wide):") - wide_layers = [128, 256, 256, 10] - wide_params = 0 - wide_memory = 0 - - for i in range(len(wide_layers) - 1): - layer_params = wide_layers[i] * wide_layers[i+1] + wide_layers[i+1] - wide_params += layer_params - layer_memory = layer_params * 4 / (1024 * 1024) # MB - wide_memory += layer_memory - print(f" Layer {i+1}: {wide_layers[i]:3d} → {wide_layers[i+1]:3d} = {layer_params:,} params") - - print(f" Total: {wide_params:,} params, {wide_memory:.2f} MB") - - print(f"\n💡 Architecture Insights:") - print(f" • Deep network: {len(deep_layers)-1} layers, {deep_params:,} params") - print(f" • Wide network: {len(wide_layers)-1} layers, {wide_params:,} params") - print(f" • Memory ratio: {wide_memory/deep_memory:.1f}x (wide
uses more)") - print(f" • Deep networks: better feature hierarchies") - print(f" • Wide networks: more parallel computation") - print(f" • Modern trend: Balance depth + width for best performance") - - except Exception as e: - print(f"⚠️ Error in architecture analysis: {e}") +# MAGNIFY SYSTEMS INSIGHT #3: Architecture Memory Analysis +# Architecture analysis consolidated into analyze_layer_performance() above -# Run the analysis -analyze_architecture_scaling() +# Analysis consolidated into analyze_layer_performance() above + +# %% [markdown] +""" +## Part 4: Sequential Network Composition +""" # %% nbgrader={"grade": false, "grade_id": "sequential-composition", "solution": true} -# ## Part 4: Sequential Network Composition - #| export class Sequential(Module): """ @@ -900,33 +814,33 @@ class Sequential(Module): # In[ ]: -# 🧪 Unit Test: Sequential Networks +# TEST Unit Test: Sequential Networks def test_unit_sequential(): """Test Sequential network implementation.""" - print("🧪 Testing Sequential Network...") + print("TEST Testing Sequential Network...") # Test case 1: Create empty network empty_net = Sequential() assert len(empty_net.layers) == 0, "Empty Sequential should have no layers" - print("✅ Empty Sequential network creation") + print("PASS Empty Sequential network creation") # Test case 2: Create network with layers layers = [Linear(3, 4), Linear(4, 2)] network = Sequential(layers) assert len(network.layers) == 2, "Network should have 2 layers" - print("✅ Sequential network with layers") + print("PASS Sequential network with layers") # Test case 3: Forward pass through network input_tensor = Tensor([[1.0, 2.0, 3.0]]) output = network(input_tensor) assert output.shape == (1, 2), f"Expected output shape (1, 2), got {output.shape}" - print("✅ Forward pass through Sequential network") + print("PASS Forward pass through Sequential network") # Test case 4: Parameter collection from all layers all_params = network.parameters() # Should have 4
parameters: 2 weights + 2 biases from 2 Linear layers assert len(all_params) == 4, f"Expected 4 parameters from Sequential network, got {len(all_params)}" - print("✅ Parameter collection from all layers") + print("PASS Parameter collection from all layers") # Test case 5: Adding layers dynamically network.add(Linear(2, 1)) @@ -935,15 +849,18 @@ def test_unit_sequential(): # Test forward pass after adding layer final_output = network(input_tensor) assert final_output.shape == (1, 1), f"Expected final output shape (1, 1), got {final_output.shape}" - print("✅ Dynamic layer addition") + print("PASS Dynamic layer addition") - print("🎉 All Sequential network tests passed!") + print("CELEBRATE All Sequential network tests passed!") test_unit_sequential() -# %% nbgrader={"grade": false, "grade_id": "flatten-operations", "solution": true} +# %% [markdown] +""" +## Part 5: Flatten Operation - Connecting Different Layer Types +""" -# ## Part 5: Flatten Operation - Connecting Different Layer Types +# %% nbgrader={"grade": false, "grade_id": "flatten-operations", "solution": true} #| export def flatten(x, start_dim=1): @@ -1033,35 +950,35 @@ class Flatten(Module): # In[ ]: -# 🧪 Unit Test: Flatten Operations +# TEST Unit Test: Flatten Operations def test_unit_flatten(): """Test Flatten layer and function implementation.""" - print("🧪 Testing Flatten Operations...") + print("TEST Testing Flatten Operations...") # Test case 1: Flatten function with 2D tensor x_2d = Tensor([[1, 2], [3, 4]]) flattened_func = flatten(x_2d) assert flattened_func.shape == (2, 2), f"Expected shape (2, 2), got {flattened_func.shape}" - print("✅ Flatten function with 2D tensor") + print("PASS Flatten function with 2D tensor") # Test case 2: Flatten function with 4D tensor (simulating CNN output) x_4d = Tensor(np.random.randn(2, 3, 4, 4)) # (batch, channels, height, width) flattened_4d = flatten(x_4d) assert flattened_4d.shape == (2, 48), f"Expected shape (2, 48), got {flattened_4d.shape}" #
3*4*4 = 48 - print("✅ Flatten function with 4D tensor") + print("PASS Flatten function with 4D tensor") # Test case 3: Flatten layer class flatten_layer = Flatten() layer_output = flatten_layer(x_4d) assert layer_output.shape == (2, 48), f"Expected shape (2, 48), got {layer_output.shape}" assert np.allclose(layer_output.data, flattened_4d.data), "Flatten layer should match flatten function" - print("✅ Flatten layer class") + print("PASS Flatten layer class") # Test case 4: Different start dimensions flatten_from_0 = Flatten(start_dim=0) full_flat = flatten_from_0(x_2d) assert len(full_flat.shape) <= 2, "Flattening from dim 0 should create vector" - print("✅ Different start dimensions") + print("PASS Different start dimensions") # Test case 5: Integration with Sequential network = Sequential([ @@ -1071,30 +988,32 @@ def test_unit_flatten(): test_input = Tensor(np.random.randn(2, 8)) output = network(test_input) assert output.shape == (2, 4), f"Expected shape (2, 4), got {output.shape}" - print("✅ Flatten integration with Sequential") + print("PASS Flatten integration with Sequential") - print("🎉 All Flatten operations tests passed!") + print("CELEBRATE All Flatten operations tests passed!") test_unit_flatten() # In[ ]: -# ## NBGrader Assessment Questions - -# ⭐ QUESTION 1: Parameter Counting Challenge +# %% [markdown] """ +## NBGrader Assessment Questions + +⭐ QUESTION 1: Parameter Counting Challenge + You're building a Multi-Layer Perceptron (MLP) for MNIST digit classification. Network architecture: -- Input: 784 features (28×28 pixel images, flattened) +- Input: 784 features (28*28 pixel images, flattened) - Hidden layer 1: 256 neurons with ReLU activation -- Hidden layer 2: 128 neurons with ReLU activation +- Hidden layer 2: 128 neurons with ReLU activation - Output layer: 10 neurons (one per digit class) Calculate the total number of trainable parameters in this network.
Show your work: -- Layer 1 parameters: _____ +- Layer 1 parameters: _____ - Layer 2 parameters: _____ - Layer 3 parameters: _____ - Total parameters: _____ @@ -1104,17 +1023,17 @@ Hint: Remember that each Linear layer has both weights and biases! # ### BEGIN SOLUTION # Layer 1: Linear(784, 256) -# - Weights: 784 × 256 = 200,704 +# - Weights: 784 * 256 = 200,704 # - Biases: 256 # - Subtotal: 200,960 # Layer 2: Linear(256, 128) -# - Weights: 256 × 128 = 32,768 +# - Weights: 256 * 128 = 32,768 # - Biases: 128 # - Subtotal: 32,896 # Layer 3: Linear(128, 10) -# - Weights: 128 × 10 = 1,280 +# - Weights: 128 * 10 = 1,280 # - Biases: 10 # - Subtotal: 1,290 @@ -1125,8 +1044,8 @@ Hint: Remember that each Linear layer has both weights and biases! """ Compare the memory requirements of two different MLP architectures for the same task: -Architecture A (Wide): 784 → 512 → 512 → 10 -Architecture B (Deep): 784 → 128 → 128 → 128 → 128 → 10 +Architecture A (Wide): 784 -> 512 -> 512 -> 10 +Architecture B (Deep): 784 -> 128 -> 128 -> 128 -> 128 -> 10 For each architecture, calculate: 1.
Total number of parameters @@ -1145,21 +1064,21 @@ Mobile device choice and reasoning: _____ """ # ### BEGIN SOLUTION -# Architecture A (Wide): 784 → 512 → 512 → 10 -# - Layer 1: (784 × 512) + 512 = 401,920 -# - Layer 2: (512 × 512) + 512 = 262,656 -# - Layer 3: (512 × 10) + 10 = 5,130 +# Architecture A (Wide): 784 -> 512 -> 512 -> 10 +# - Layer 1: (784 * 512) + 512 = 401,920 +# - Layer 2: (512 * 512) + 512 = 262,656 +# - Layer 3: (512 * 10) + 10 = 5,130 # - Total: 669,706 parameters -# - Memory: 669,706 × 4 bytes = 2.68 MB +# - Memory: 669,706 * 4 bytes = 2.68 MB -# Architecture B (Deep): 784 → 128 → 128 → 128 → 128 → 10 -# - Layer 1: (784 × 128) + 128 = 100,480 -# - Layer 2: (128 × 128) + 128 = 16,512 -# - Layer 3: (128 × 128) + 128 = 16,512 -# - Layer 4: (128 × 128) + 128 = 16,512 -# - Layer 5: (128 × 10) + 10 = 1,290 +# Architecture B (Deep): 784 -> 128 -> 128 -> 128 -> 128 -> 10 +# - Layer 1: (784 * 128) + 128 = 100,480 +# - Layer 2: (128 * 128) + 128 = 16,512 +# - Layer 3: (128 * 128) + 128 = 16,512 +# - Layer 4: (128 * 128) + 128 = 16,512 +# - Layer 5: (128 * 10) + 10 = 1,290 # - Total: 151,306 parameters -# - Memory: 151,306 × 4 bytes = 0.61 MB +# - Memory: 151,306 * 4 bytes = 0.61 MB # Mobile choice: Architecture B (Deep) # Reasoning: Uses 4.4x less memory while maintaining similar representational capacity through depth @@ -1169,25 +1088,25 @@ Mobile device choice and reasoning: _____ """ Calculate the computational cost (in FLOPs) for a forward pass through this network: -Input batch: 32 samples × 784 features -Network: 784 → 256 → 128 → 10 +Input batch: 32 samples * 784 features +Network: 784 -> 256 -> 128 -> 10 For each layer, calculate: -- Matrix multiplication FLOPs: 2 × batch_size × input_size × output_size -- Bias addition FLOPs: batch_size × output_size +- Matrix multiplication FLOPs: 2 * batch_size * input_size * output_size +- Bias addition FLOPs: batch_size * output_size - Total FLOPs per layer -Layer 1 (784
→ 256): +Layer 1 (784 -> 256): - MatMul FLOPs: _____ - Bias FLOPs: _____ - Layer total: _____ -Layer 2 (256 → 128): +Layer 2 (256 -> 128): - MatMul FLOPs: _____ - Bias FLOPs: _____ - Layer total: _____ -Layer 3 (128 → 10): +Layer 3 (128 -> 10): - MatMul FLOPs: _____ - Bias FLOPs: _____ - Layer total: _____ @@ -1198,19 +1117,19 @@ Network total FLOPs: _____ # ### BEGIN SOLUTION # Batch size = 32 samples -# Layer 1 (784 → 256): -# - MatMul FLOPs: 2 × 32 × 784 × 256 = 12,845,056 -# - Bias FLOPs: 32 × 256 = 8,192 +# Layer 1 (784 -> 256): +# - MatMul FLOPs: 2 * 32 * 784 * 256 = 12,845,056 +# - Bias FLOPs: 32 * 256 = 8,192 # - Layer total: 12,853,248 -# Layer 2 (256 → 128): -# - MatMul FLOPs: 2 × 32 × 256 × 128 = 2,097,152 -# - Bias FLOPs: 32 × 128 = 4,096 +# Layer 2 (256 -> 128): +# - MatMul FLOPs: 2 * 32 * 256 * 128 = 2,097,152 +# - Bias FLOPs: 32 * 128 = 4,096 # - Layer total: 2,101,248 -# Layer 3 (128 → 10): -# - MatMul FLOPs: 2 × 32 × 128 × 10 = 81,920 -# - Bias FLOPs: 32 × 10 = 320 +# Layer 3 (128 -> 10): +# - MatMul FLOPs: 2 * 32 * 128 * 10 = 81,920 +# - Bias FLOPs: 32 * 10 = 320 # - Layer total: 82,240 # Network total: 12,853,248 + 2,101,248 + 82,240 = 15,036,736 FLOPs (~15.0 MFLOPs) @@ -1218,11 +1137,14 @@ Network total FLOPs: _____ # In[ ]: -# ## Complete Neural Network Demo +# %% [markdown] +""" +## Complete Neural Network Demo +""" def demonstrate_complete_networks(): """Demonstrate complete neural networks using all implemented components.""" - print("🔥 Complete Neural Network Demo") + print("FIRE Complete Neural Network Demo") print("=" * 50) print("\n1. MLP for Classification (MNIST-style):") @@ -1242,7 +1164,7 @@ def demonstrate_complete_networks(): print(f" Parameters: {len(mlp.parameters())} tensors") print("\n2.
CNN-style Architecture (with Flatten):") - # Simulate CNN → Flatten → Dense pattern + # Simulate CNN -> Flatten -> Dense pattern cnn_style = Sequential([ # Simulate Conv2D output with random "features" Flatten(), # Flatten spatial features @@ -1263,12 +1185,12 @@ def demonstrate_complete_networks(): for i in range(len(layer_sizes) - 1): deep_net.add(Linear(layer_sizes[i], layer_sizes[i+1])) - print(f" Added layer: {layer_sizes[i]} → {layer_sizes[i+1]}") + print(f" Added layer: {layer_sizes[i]} -> {layer_sizes[i+1]}") # Test deep network deep_input = Tensor(np.random.randn(8, 100)) deep_output = deep_net(deep_input) - print(f" Deep network: {deep_input.shape} → {deep_output.shape}") + print(f" Deep network: {deep_input.shape} -> {deep_output.shape}") print(f" Total parameters: {len(deep_net.parameters())} tensors") print("\n4. Parameter Management Across Networks:") @@ -1280,7 +1202,7 @@ def demonstrate_complete_networks(): memory_mb = total_params * 4 / (1024 * 1024) # float32 = 4 bytes print(f" {name}: {len(params)} param tensors, {total_params:,} total params, {memory_mb:.2f} MB") - print("\n🎉 All components work together seamlessly!") + print("\nCELEBRATE All components work together seamlessly!") print(" • Module system enables automatic parameter collection") print(" • Linear layers handle matrix transformations") print(" • Sequential composes layers into complete architectures") @@ -1291,11 +1213,14 @@ demonstrate_complete_networks() # In[ ]: -# ## Testing Framework +# %% [markdown] +""" +## Testing Framework +""" def test_unit_all(): """Run complete module validation.""" - print("🧪 Running all unit tests...") + print("TEST Running all unit tests...") # Call every individual test function test_unit_matmul() @@ -1304,41 +1229,44 @@ def test_unit_all(): test_unit_sequential() test_unit_flatten() - print("✅ All tests passed! Module ready for integration.") + print("PASS All tests passed!
Module ready for integration.") # In[ ]: if __name__ == "__main__": - print("🔥 TinyTorch Layers Module - Complete Foundation Demo") + print("FIRE TinyTorch Layers Module - Complete Foundation Demo") print("=" * 60) - + # Test all core components - print("\n🧪 Testing All Core Components:") + print("\nTEST Testing All Core Components:") test_unit_all() - + + # Single consolidated analysis for foundation module + analyze_layer_performance() + print("\n" + "="*60) demonstrate_complete_networks() - print("\n🎉 Complete neural network foundation ready!") - print(" ✅ Module system for parameter management") - print(" ✅ Linear layers for transformations") - print(" ✅ Sequential networks for composition") - print(" ✅ Flatten operations for tensor reshaping") - print(" ✅ All components tested and integrated!") + print("\nCELEBRATE Complete neural network foundation ready!") + print(" PASS Module system for parameter management") + print(" PASS Linear layers for transformations") + print(" PASS Sequential networks for composition") + print(" PASS Flatten operations for tensor reshaping") + print(" PASS All components tested and integrated!") -# ## 🤔 ML Systems Thinking: Interactive Questions - -# Now that you've implemented all the core neural network components, let's think about their implications for ML systems: - -# ⭐ QUESTION: Memory vs Computation Trade-offs +# %% [markdown] """ -🤔 **Question 1: Memory vs Computation Analysis** +## 🤔 ML Systems Thinking: Interactive Questions + +Now that you've implemented all the core neural network components, let's think about their implications for ML systems: + +**Question 1: Memory vs Computation Analysis** You're designing a neural network for deployment on a mobile device with limited memory (1GB RAM) but decent compute power.
You have two architecture options: -A) Wide network: 784 → 2048 → 2048 → 10 (3 layers, wide) -B) Deep network: 784 → 256 → 256 → 256 → 256 → 10 (5 layers, narrow) +A) Wide network: 784 -> 2048 -> 2048 -> 10 (3 layers, wide) +B) Deep network: 784 -> 256 -> 256 -> 256 -> 256 -> 10 (5 layers, narrow) Calculate the memory requirements for each option and explain which you'd choose for mobile deployment and why. @@ -1347,11 +1275,8 @@ Consider: - Intermediate activation storage during forward pass - Training vs inference memory requirements - How your choice affects model capacity and accuracy -""" -# ⭐ QUESTION: Performance Optimization -""" -🤔 **Question 2: Production Performance Optimization** +⭐ **Question 2: Production Performance Optimization** Your Linear layer implementation works correctly, but you notice it's slower than PyTorch's nn.Linear on the same hardware. @@ -1366,11 +1291,8 @@ Research areas to consider: - Memory layout and cache efficiency - Vectorization and SIMD instructions - GPU kernel optimization -""" -# ⭐ QUESTION: Scaling and Architecture Design -""" -🤔 **Question 3: Systems Architecture Scaling** +⭐ **Question 3: Systems Architecture Scaling** Modern transformer models like GPT-3 have billions of parameters, primarily in Linear layers.
@@ -1387,52 +1309,55 @@ Systems considerations: - Inference optimization for production serving """ -# ## 🎯 MODULE SUMMARY: Layers - Complete Neural Network Foundation +# %% [markdown] +""" +## 🎯 MODULE SUMMARY: Layers - Complete Neural Network Foundation -# ## 🎯 What You've Accomplished +### What You've Accomplished -# You've successfully implemented the complete foundation for neural networks - all the essential components working together: +You've successfully implemented the complete foundation for neural networks - all the essential components working together: -# ### ✅ **Complete Core System** -# - **Module Base Class**: Parameter management and composition patterns for all neural network components -# - **Matrix Multiplication**: The computational primitive underlying all neural network operations -# - **Linear (Dense) Layers**: Complete implementation with proper parameter initialization and forward propagation -# - **Sequential Networks**: Clean composition system for building complete neural network architectures -# - **Flatten Operations**: Tensor reshaping to connect different layer types (essential for CNN→MLP transitions) +### ✅ **Complete Core System** +- **Module Base Class**: Parameter management and composition patterns for all neural network components +- **Matrix Multiplication**: The computational primitive underlying all neural network operations +- **Linear (Dense) Layers**: Complete implementation with proper parameter initialization and forward propagation +- **Sequential Networks**: Clean composition system for building complete neural network architectures +- **Flatten Operations**: Tensor reshaping to connect different layer types (essential for CNN->MLP transitions) -# ### ✅ **Systems Understanding** -# - **Architectural Patterns**: How modular design enables everything from MLPs to complex deep networks -# - **Memory Analysis**: How layer composition affects memory usage and computational efficiency -# - **Performance
Characteristics**: Understanding how tensor operations and layer composition affect performance -# - **Production Context**: Connection to real-world ML frameworks and their component organization +### ✅ **Systems Understanding** +- **Architectural Patterns**: How modular design enables everything from MLPs to complex deep networks +- **Memory Analysis**: How layer composition affects memory usage and computational efficiency +- **Performance Characteristics**: Understanding how tensor operations and layer composition affect performance +- **Production Context**: Connection to real-world ML frameworks and their component organization -# ### ✅ **ML Engineering Skills** -# - **Complete Parameter Management**: How neural networks automatically collect parameters from all components -# - **Network Composition**: Building complex architectures from simple, reusable components -# - **Tensor Operations**: Essential reshaping and transformation operations for different network types -# - **Clean Abstraction**: Professional software design patterns that scale to production systems +### ✅ **ML Engineering Skills** +- **Complete Parameter Management**: How neural networks automatically collect parameters from all components +- **Network Composition**: Building complex architectures from simple, reusable components +- **Tensor Operations**: Essential reshaping and transformation operations for different network types +- **Clean Abstraction**: Professional software design patterns that scale to production systems -# ## 🔗 **Connection to Production ML Systems** +### 🔗 **Connection to Production ML Systems** -# Your unified implementation mirrors the complete component systems used in: -# - **PyTorch's nn.Module system**: Same parameter management and composition patterns -# - **PyTorch's nn.Sequential**: Identical architecture composition approach -# - **All major frameworks**: The same modular design principles that power TensorFlow, JAX, and others -# -
**Production ML systems**: Clean abstractions that enable complex models while maintaining manageable code +Your unified implementation mirrors the complete component systems used in: +- **PyTorch's nn.Module system**: Same parameter management and composition patterns +- **PyTorch's nn.Sequential**: Identical architecture composition approach +- **All major frameworks**: The same modular design principles that power TensorFlow, JAX, and others +- **Production ML systems**: Clean abstractions that enable complex models while maintaining manageable code -# ## 🚀 **What's Next** +### 🚀 **What's Next** -# With your complete layer foundation, you're ready to: -# - **Module 05 (Dense)**: Build complete dense networks for classification tasks -# - **Module 06 (Spatial)**: Add convolutional layers for computer vision -# - **Module 09 (Autograd)**: Enable automatic differentiation for learning -# - **Module 10 (Optimizers)**: Implement sophisticated optimization algorithms +With your complete layer foundation, you're ready to: +- **Module 05 (Dense)**: Build complete dense networks for classification tasks +- **Module 06 (Spatial)**: Add convolutional layers for computer vision +- **Module 09 (Autograd)**: Enable automatic differentiation for learning +- **Module 10 (Optimizers)**: Implement sophisticated optimization algorithms -# ## 💡 **Key Systems Insights** +### 💡 **Key Systems Insights** -# 1. **Modular composition is the key to scalable ML systems** - clean interfaces enable complex behaviors -# 2. **Parameter management must be automatic** - manual parameter tracking doesn't scale to deep networks -# 3. **Tensor operations like flattening are architectural requirements** - different layer types need different tensor shapes -# 4. **Clean abstractions enable innovation** - good foundational design supports unlimited architectural experimentation +1. **Modular composition is the key to scalable ML systems** - clean interfaces enable complex behaviors +2.
**Parameter management must be automatic** - manual parameter tracking doesn't scale to deep networks +3. **Tensor operations like flattening are architectural requirements** - different layer types need different tensor shapes +4. **Clean abstractions enable innovation** - good foundational design supports unlimited architectural experimentation -# You now understand how to build complete, production-ready neural network foundations that can scale to any architecture! \ No newline at end of file +You now understand how to build complete, production-ready neural network foundations that can scale to any architecture! +""" \ No newline at end of file diff --git a/modules/04_losses/losses_dev.py b/modules/04_losses/losses_dev.py index 9bb2463c..077bc8cf 100644 --- a/modules/04_losses/losses_dev.py +++ b/modules/04_losses/losses_dev.py @@ -14,7 +14,7 @@ Welcome to Loss Functions! You'll implement the critical bridge between model predictions and learning objectives that makes neural network training possible. -## 🔗 Building on Previous Learning +## LINK Building on Previous Learning **What You Built Before**: - Module 02 (Tensor): Data structures for predictions and targets - Module 03 (Activations): Nonlinear transformations for model outputs @@ -28,33 +28,31 @@ Welcome to Loss Functions!
You'll implement the critical bridge between model pr **Connection Map**: ``` -Layers → Loss Functions → Gradients +Layers -> Loss Functions -> Gradients (predictions) (objectives) (learning signals) ``` -## Learning Goals (Systems-Focused) -- **Systems understanding**: How loss functions translate business problems into optimization objectives with proper numerical stability -- **Core implementation skill**: Build production-quality loss functions with stable computation and efficient batch processing -- **Pattern mastery**: Understand how different loss functions shape learning dynamics and convergence behavior -- **Framework connections**: See how your implementations mirror PyTorch's loss functions and autograd integration patterns -- **Optimization trade-offs**: Learn why numerical stability and computational efficiency matter for reliable training at scale +## Learning Objectives -## Build → Use → Reflect -1. **Build**: Complete loss function implementations with numerical stability and gradient support -2. **Use**: Apply loss functions to regression and classification problems with real neural networks -3. **Reflect**: Why do different loss functions lead to different learning behaviors, and when does numerical stability matter? +By completing this module, you will: + +1. **Understand loss functions** - Learn how to measure the quality of model predictions +2. **Implement MSE Loss** - Build loss functions for regression problems +3. **Implement CrossEntropy Loss** - Create loss functions for classification tasks +4. **Handle numerical stability** - Deal with edge cases and extreme values safely +5. **Enable learning** - Provide the feedback signal that allows networks to improve + +## Build -> Use -> Reflect +1. **Build**: MSE, CrossEntropy, and BinaryCrossEntropy loss functions with proper error handling +2. **Use**: Apply different loss functions to real prediction problems and compare results +3.
**Reflect**: Understand when to use each loss function and why numerical stability matters ## What You'll Achieve -By implementing loss functions from scratch, you'll understand: -- Deep technical understanding of how loss functions quantify prediction quality and enable learning -- Practical capability to implement numerically stable loss computation for production ML systems -- Systems insight into computational complexity, memory requirements, and batch processing efficiency -- Performance awareness of how loss function choice affects training speed and convergence characteristics -- Production knowledge of how frameworks implement robust loss computation with proper error handling - -## Systems Reality Check -💡 **Production Context**: PyTorch's loss functions use numerically stable implementations and automatic mixed precision to handle extreme gradients and values -⚡ **Performance Insight**: Numerically unstable loss functions can cause training to fail catastrophically - proper implementation is critical for reliable ML systems +- **Mathematical understanding**: How loss functions quantify prediction quality +- **Implementation skills**: Building robust loss functions with error handling +- **Problem matching**: Choosing the right loss function for different ML tasks +- **Numerical awareness**: Understanding and preventing common computational issues +- **Training foundation**: Enabling the learning process that makes neural networks work """ # %% nbgrader={"grade": false, "grade_id": "losses-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -76,7 +74,7 @@ except ImportError: from tensor_dev import Tensor # %% nbgrader={"grade": false, "grade_id": "losses-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch Loss Functions Module") +print("FIRE TinyTorch Loss Functions Module") print(f"NumPy version: {np.__version__}") print(f"Python version:
{sys.version_info.major}.{sys.version_info.minor}") print("Ready to build loss functions for neural network training!") @@ -112,21 +110,21 @@ Loss functions are the mathematical bridge between what your model predicts and ``` Business Goal: "Predict house prices accurately" - โ†“ + v Mathematical Loss: MSE = (predicted_price - actual_price)ยฒ - โ†“ -Optimization Signal: gradient = 2 ร— (predicted - actual) - โ†“ -Learning Update: parameter -= learning_rate ร— gradient + v +Optimization Signal: gradient = 2 * (predicted - actual) + v +Learning Update: parameter -= learning_rate * gradient ``` ## The Learning Ecosystem Loss functions provide four critical capabilities: -๐ŸŽฏ **Learning Objectives**: Define what "good" performance means mathematically -๐Ÿ“ˆ **Gradient Signal**: Provide directional improvement information for parameters -๐Ÿ” **Progress Measurement**: Enable monitoring training progress and convergence detection +TARGET **Learning Objectives**: Define what "good" performance means mathematically +PROGRESS **Gradient Signal**: Provide directional improvement information for parameters +MAGNIFY **Progress Measurement**: Enable monitoring training progress and convergence detection โš–๏ธ **Trade-off Control**: Balance different aspects of model performance and regularization ## Visual Understanding: Loss Function Landscape @@ -134,12 +132,12 @@ Loss functions provide four critical capabilities: ``` Loss Function Behavior: MSE Loss CrossEntropy Loss - High โ”‚ โ•ฑโ•ฒ High โ”‚ โ•ฑโ•ฒ - โ”‚ โ•ฑ โ•ฒ โ”‚ โ•ฑ โ•ฒ - โ”‚ โ•ฑ โ•ฒ โ”‚ โ•ฑ โ•ฒ - โ”‚ โ•ฑ โ•ฒ โ”‚ โ•ฑ โ•ฒ - Low โ”‚โ•ฑ โ•ฒ Low โ”‚ โ•ฑ โ•ฒ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ + High | /\\ High | /\\ + | / \\ | / \\ + | / \\ | / \\ + | / \\ | / \\ + Low |/ \\ Low | / \\ + +-------------- +-------------- Wrong Right Wrong Right โ€ข Smooth gradients โ€ข Steep near wrong predictions @@ -160,30 +158,30 @@ MSE is the cornerstone loss function for 
regression problems. It measures predic ``` MSE Loss Visualization: - - Loss โ”‚ โ•ฑโ•ฒ - 4 โ”‚ โ•ฑ โ•ฒ โ€ข Error = 2 โ†’ Loss = 4 - 3 โ”‚ โ•ฑ โ•ฒ โ€ข Error = 1 โ†’ Loss = 1 - 2 โ”‚ โ•ฑ โ•ฒ โ€ข Error = 0 โ†’ Loss = 0 - 1 โ”‚ โ•ฑ โ•ฒ โ€ข Quadratic penalty! - 0 โ”‚โ•ฑ__________โ•ฒ____ + + Loss | /\\ + 4 | / \\ โ€ข Error = 2 -> Loss = 4 + 3 | / \\ โ€ข Error = 1 -> Loss = 1 + 2 | / \\ โ€ข Error = 0 -> Loss = 0 + 1 | / \\ โ€ข Quadratic penalty! + 0 |/__________\\____ -2 -1 0 1 2 Error Gradient Flow: - โˆ‚Loss/โˆ‚prediction = 2 ร— (predicted - actual) + dLoss/dprediction = 2 * (predicted - actual) - Large errors โ†’ Large gradients โ†’ Big updates - Small errors โ†’ Small gradients โ†’ Fine tuning + Large errors -> Large gradients -> Big updates + Small errors -> Small gradients -> Fine tuning ``` ## Mathematical Foundation For batch of predictions and targets: ``` -MSE = (1/n) ร— ฮฃ(y_pred - y_true)ยฒ +MSE = (1/n) * Sum(y_pred - y_true)ยฒ -Gradient: โˆ‚MSE/โˆ‚y_pred = (2/n) ร— (y_pred - y_true) +Gradient: dMSE/dy_pred = (2/n) * (y_pred - y_true) ``` ## Learning Objectives @@ -196,7 +194,7 @@ By implementing MSE, you'll understand: # %% nbgrader={"grade": false, "grade_id": "mse-concept-question", "locked": false, "schema_version": 3, "solution": false, "task": false} """ -๐Ÿค” **Computational Question: MSE Properties** +THINK **Computational Question: MSE Properties** Before implementing, let's understand MSE behavior: @@ -215,7 +213,7 @@ class MeanSquaredError: Mean Squared Error Loss for Regression Problems Computes the average squared difference between predictions and targets: - MSE = (1/n) ร— ฮฃ(y_pred - y_true)ยฒ + MSE = (1/n) * Sum(y_pred - y_true)ยฒ Features: - Numerically stable computation @@ -283,10 +281,10 @@ class MeanSquaredError: """Alternative interface for forward pass.""" return self.__call__(y_pred, y_true) -# ๐Ÿ” SYSTEMS INSIGHT: Gradient Landscape Visualization +# MAGNIFY SYSTEMS INSIGHT: Gradient Landscape Visualization def 
visualize_loss_landscapes(): """Visualize how different loss functions create different optimization landscapes.""" - print("๐Ÿ” Loss Function Landscape Visualization") + print("MAGNIFY Loss Function Landscape Visualization") print("=" * 45) try: @@ -296,12 +294,12 @@ def visualize_loss_landscapes(): prediction_range = np.linspace(-3, 3, 100) true_value = 0.0 # Target value - print("\n๐Ÿ“ˆ Loss Landscape Comparison:") + print("\nPROGRESS Loss Landscape Comparison:") print(" How loss changes as predictions move away from target") # Calculate loss landscapes mse = MeanSquaredError() - ce = CrossEntropyLoss() + _ = CrossEntropyLoss() # Not used in this comparison bce = BinaryCrossEntropyLoss() # MSE landscape (regression) @@ -320,7 +318,7 @@ def visualize_loss_landscapes(): mse_gradient_at_zero = 2 * (0 - true_value) # MSE gradient formula mse_gradient_at_one = 2 * (1 - true_value) - print(f"\n๐ŸŽฏ Gradient Behavior Analysis:") + print(f"\nTARGET Gradient Behavior Analysis:") print(f" MSE gradient at prediction=0: {mse_gradient_at_zero:.3f}") print(f" MSE gradient at prediction=1: {mse_gradient_at_one:.3f}") print(f" MSE provides linear gradient growth") @@ -328,8 +326,8 @@ def visualize_loss_landscapes(): # Binary CE gradient analysis sigmoid_at_zero = 1 / (1 + np.exp(-0)) # = 0.5 bce_grad_at_zero = sigmoid_at_zero - 1.0 # = -0.5 - sigmoid_at_one = 1 / (1 + np.exp(-1)) # โ‰ˆ 0.73 - bce_grad_at_one = sigmoid_at_one - 1.0 # โ‰ˆ -0.27 + sigmoid_at_one = 1 / (1 + np.exp(-1)) # ~= 0.73 + bce_grad_at_one = sigmoid_at_one - 1.0 # ~= -0.27 print(f" BCE gradient at logit=0: {bce_grad_at_zero:.3f}") print(f" BCE gradient at logit=1: {bce_grad_at_one:.3f}") @@ -359,32 +357,32 @@ def visualize_loss_landscapes(): print(f" {point:>10.1f} {mse_loss.data:>10.3f} {bce_loss.data:>10.3f} {grad_type:>15}") # Optimization implications - print(f"\n๐Ÿš€ Optimization Implications:") + print(f"\nROCKET Optimization Implications:") print(f" MSE (Regression):") print(f" โ€ข Quadratic penalty 
grows smoothly")
-        print(f" • Large errors → large gradients (aggressive correction)")
-        print(f" • Small errors → small gradients (fine-tuning)")
+        print(f" • Large errors -> large gradients (aggressive correction)")
+        print(f" • Small errors -> small gradients (fine-tuning)")
         print(f" • Symmetric around target value")
         print(f" Binary CrossEntropy (Classification):")
         print(f" • Logarithmic penalty creates adaptive gradients")
-        print(f" • Wrong confident predictions → steep gradients")
-        print(f" • Right confident predictions → gentle gradients")
+        print(f" • Wrong confident predictions -> steep gradients")
+        print(f" • Right confident predictions -> gentle gradients")
         print(f" • Asymmetric penalty structure encourages confidence")
-        # 💡 WHY THIS MATTERS: Different loss landscapes create different
+        # TIP WHY THIS MATTERS: Different loss landscapes create different
         # optimization dynamics. MSE's smooth quadratic surface enables
         # stable gradient descent, while CrossEntropy's adaptive gradients
         # help classification models learn faster from confident mistakes.
     except Exception as e:
-        print(f"⚠️ Visualization error: {e}")
+        print(f"WARNING️ Visualization error: {e}")
         print("Ensure loss functions are implemented for landscape analysis")
-# 🔍 SYSTEMS INSIGHT: MSE Computational Analysis
+# MAGNIFY SYSTEMS INSIGHT: MSE Computational Analysis
 def analyze_mse_properties():
     """Analyze MSE loss characteristics for systems understanding."""
-    print("🔍 MSE Loss Analysis - Understanding the Math")
+    print("MAGNIFY MSE Loss Analysis - Understanding the Math")
     print("=" * 45)
     try:
@@ -397,12 +395,12 @@ def analyze_mse_properties():
             pred = Tensor([error])
             true = Tensor([0.0])
             loss = mse(pred, true)
-            print(f" Error: {error:4.1f} → Loss: {loss.data:8.3f} (× {loss.data/(error**2):5.1f} baseline)")
+            print(f" Error: {error:4.1f} -> Loss: {loss.data:8.3f} (* {loss.data/(error**2):5.1f} baseline)")
         # Batch vs individual processing
-        print(f"\n⚡ Batch Processing Efficiency:")
+        print(f"\nSPEED Batch Processing Efficiency:")
         single_losses = []
-        for i in range(100):
+        for _ in range(100):
             pred = Tensor([np.random.randn()])
             true = Tensor([np.random.randn()])
             loss = mse(pred, true)
@@ -428,25 +426,25 @@ def analyze_mse_properties():
         print(f" Large loss memory: {sys.getsizeof(large_tensor.data)} bytes")
         print(f" MSE memory is independent of input size!")
-        # 💡 WHY THIS MATTERS: MSE provides stable, well-behaved gradients
+        # TIP WHY THIS MATTERS: MSE provides stable, well-behaved gradients
         # that are proportional to error magnitude, making optimization smooth.
         # The quadratic penalty means large errors dominate learning initially,
         # then fine-tuning happens as errors get smaller.
     except Exception as e:
-        print(f"⚠️ Analysis error: {e}")
+        print(f"WARNING️ Analysis error: {e}")
         print("Ensure MSE implementation is complete before running analysis")
 # %% [markdown]
 """
-### 🧪 Unit Test: MSE Loss Computation
+### TEST Unit Test: MSE Loss Computation
 This test validates `MeanSquaredError.__call__`, ensuring correct MSE computation with various input types and batch sizes.
 """
 # %% nbgrader={"grade": true, "grade_id": "test-mse-loss", "locked": true, "points": 3, "schema_version": 3, "solution": false, "task": false}
 def test_unit_mse_loss():
     """Test MSE loss implementation."""
-    print("🧪 Testing Mean Squared Error Loss...")
+    print("TEST Testing Mean Squared Error Loss...")
     mse = MeanSquaredError()
@@ -454,8 +452,8 @@ def test_unit_mse_loss():
     y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])
     y_true = Tensor([[1.0, 2.0], [3.0, 4.0]])
     loss = mse(y_pred, y_true)
-    assert abs(loss.data) < 1e-6, f"Perfect predictions should have loss ≈ 0, got {loss.data}"
-    print("✅ Perfect predictions test passed")
+    assert abs(loss.data) < 1e-6, f"Perfect predictions should have loss ~= 0, got {loss.data}"
+    print("PASS Perfect predictions test passed")
     # Test case 2: Known loss computation
     y_pred = Tensor([[1.0, 2.0]])
@@ -463,7 +461,7 @@ def test_unit_mse_loss():
     loss = mse(y_pred, y_true)
     expected = 1.0 # [(1-0)² + (2-1)²] / 2 = [1 + 1] / 2 = 1.0
     assert abs(loss.data - expected) < 1e-6, f"Expected loss {expected}, got {loss.data}"
-    print("✅ Known loss computation test passed")
+    print("PASS Known loss computation test passed")
     # Test case 3: Batch processing
     y_pred = Tensor([[1.0, 2.0], [3.0, 4.0]])
@@ -471,7 +469,7 @@ def test_unit_mse_loss():
     loss = mse(y_pred, y_true)
     expected = 0.25 # All squared differences are 0.25
     assert abs(loss.data - expected) < 1e-6, f"Expected batch loss {expected}, got {loss.data}"
-    print("✅ Batch processing test passed")
+    print("PASS Batch processing test passed")
     # Test case 4: Single value
     y_pred = Tensor([5.0])
@@ -479,9 +477,9 @@ def test_unit_mse_loss(): loss = mse(y_pred, y_true) expected = 4.0 # (5-3)ยฒ = 4 assert abs(loss.data - expected) < 1e-6, f"Expected single value loss {expected}, got {loss.data}" - print("โœ… Single value test passed") + print("PASS Single value test passed") - print("๐ŸŽ‰ MSE loss tests passed! Understanding regression objectives.") + print("CELEBRATE MSE loss tests passed! Understanding regression objectives.") test_unit_mse_loss() @@ -497,21 +495,21 @@ Cross-Entropy Loss measures the "information distance" between predicted probabi Cross-Entropy Loss for 3-Class Problem: Class Probabilities after Softmax: - Input: [2.0, 1.0, 0.1] โ†’ Probabilities: [0.66, 0.24, 0.10] - True: Class 0 (index 0) โ†’ Target: [1.0, 0.0, 0.0] + Input: [2.0, 1.0, 0.1] -> Probabilities: [0.66, 0.24, 0.10] + True: Class 0 (index 0) -> Target: [1.0, 0.0, 0.0] Loss Computation: CE = -log(probability_of_correct_class) CE = -log(0.66) = 0.415 Intuition: - High confidence + Correct โ†’ Low loss - High confidence + Wrong โ†’ High loss - Low confidence + Any โ†’ Medium loss + High confidence + Correct -> Low loss + High confidence + Wrong -> High loss + Low confidence + Any -> Medium loss Gradient Behavior: - Wrong predictions โ†’ Steep gradients โ†’ Big corrections - Right predictions โ†’ Gentle gradients โ†’ Fine tuning + Wrong predictions -> Steep gradients -> Big corrections + Right predictions -> Gentle gradients -> Fine tuning ``` ## Numerical Stability Challenge @@ -521,7 +519,7 @@ The Numerical Stability Problem: Raw logits: [50.0, 49.0, 48.0] Naive softmax: exp(50)/[exp(50)+exp(49)+exp(48)] - Problem: exp(50) โ‰ˆ 5ร—10ยฒยน โ†’ Overflow! + Problem: exp(50) ~= 5*10ยฒยน -> Overflow! Our Solution (Log-Sum-Exp Trick): 1. 
max_val = max(logits) = 50.0 @@ -534,10 +532,10 @@ Our Solution (Log-Sum-Exp Trick): For predictions and class indices: ``` -CrossEntropy = -ฮฃ y_true ร— log(softmax(y_pred)) +CrossEntropy = -Sum y_true * log(softmax(y_pred)) -Softmax: softmax(x_i) = exp(x_i) / ฮฃ exp(x_j) -Stable: softmax(x_i) = exp(x_i - max(x)) / ฮฃ exp(x_j - max(x)) +Softmax: softmax(x_i) = exp(x_i) / Sum exp(x_j) +Stable: softmax(x_i) = exp(x_i - max(x)) / Sum exp(x_j - max(x)) ``` ## Learning Objectives @@ -550,7 +548,7 @@ By implementing Cross-Entropy, you'll understand: # %% nbgrader={"grade": false, "grade_id": "crossentropy-concept-question", "locked": false, "schema_version": 3, "solution": false, "task": false} """ -๐Ÿค” **Computational Question: CrossEntropy Stability** +THINK **Computational Question: CrossEntropy Stability** Consider numerical stability in cross-entropy: @@ -641,7 +639,7 @@ class CrossEntropyLoss: softmax_pred = exp_pred / np.sum(exp_pred, axis=1, keepdims=True) # Step 4: Prevent numerical instability in log computation - epsilon = 1e-15 # Small value to prevent log(0) โ†’ -inf and log(1) โ†’ 0 issues + epsilon = 1e-15 # Small value to prevent log(0) -> -inf and log(1) -> 0 issues softmax_pred = np.clip(softmax_pred, epsilon, 1.0 - epsilon) # Step 5: Compute cross-entropy loss based on target format @@ -666,17 +664,17 @@ class CrossEntropyLoss: """Alternative interface for forward pass.""" return self.__call__(y_pred, y_true) -# ๐Ÿ” SYSTEMS INSIGHT: CrossEntropy Stability Analysis +# MAGNIFY SYSTEMS INSIGHT: CrossEntropy Stability Analysis def analyze_crossentropy_stability(): """Analyze numerical stability in cross-entropy computation.""" - print("๐Ÿ” CrossEntropy Stability Analysis") + print("MAGNIFY CrossEntropy Stability Analysis") print("=" * 40) try: ce = CrossEntropyLoss() # Test numerical stability with extreme values - print("\nโšก Numerical Stability Testing:") + print("\nSPEED Numerical Stability Testing:") # Extreme logits that would overflow in naive 
implementation extreme_logits = Tensor([[100.0, 99.0, 98.0]]) @@ -720,26 +718,26 @@ def analyze_crossentropy_stability(): print(f" Small vocab (100 classes): {small_memory / 1024:.1f} KB") print(f" Large vocab (10k classes): {large_memory / 1024:.1f} KB") - print(f" Memory scales O(batch_size ร— num_classes)") + print(f" Memory scales O(batch_size * num_classes)") - # ๐Ÿ’ก WHY THIS MATTERS: CrossEntropy memory scales with vocabulary size. + # TIP WHY THIS MATTERS: CrossEntropy memory scales with vocabulary size. # This is why large language models use techniques like hierarchical softmax # or sampling-based training to handle vocabularies with 50k+ tokens. except Exception as e: - print(f"โš ๏ธ Analysis error: {e}") + print(f"WARNING๏ธ Analysis error: {e}") print("Ensure CrossEntropy implementation is complete") # %% [markdown] """ -### ๐Ÿงช Unit Test: Cross-Entropy Loss Computation +### TEST Unit Test: Cross-Entropy Loss Computation This test validates `CrossEntropyLoss.__call__`, ensuring correct cross-entropy computation with numerically stable softmax. 
""" # %% nbgrader={"grade": true, "grade_id": "test-crossentropy-loss", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false} def test_unit_crossentropy_loss(): """Test CrossEntropy loss implementation.""" - print("๐Ÿงช Testing Cross-Entropy Loss...") + print("TEST Testing Cross-Entropy Loss...") ce = CrossEntropyLoss() @@ -748,31 +746,31 @@ def test_unit_crossentropy_loss(): y_true = Tensor([0, 1]) # Class indices loss = ce(y_pred, y_true) assert loss.data < 0.1, f"Perfect predictions should have low loss, got {loss.data}" - print("โœ… Perfect predictions test passed") + print("PASS Perfect predictions test passed") # Test case 2: Random predictions (should have higher loss) y_pred = Tensor([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]) # Uniform after softmax y_true = Tensor([0, 1]) loss = ce(y_pred, y_true) expected_random = -np.log(1.0/3.0) # log(1/num_classes) for uniform distribution - assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss โ‰ˆ {expected_random}, got {loss.data}" - print("โœ… Random predictions test passed") + assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss ~= {expected_random}, got {loss.data}" + print("PASS Random predictions test passed") # Test case 3: Binary classification y_pred = Tensor([[2.0, 1.0], [1.0, 2.0]]) y_true = Tensor([0, 1]) loss = ce(y_pred, y_true) assert 0.0 < loss.data < 2.0, f"Binary classification loss should be reasonable, got {loss.data}" - print("โœ… Binary classification test passed") + print("PASS Binary classification test passed") # Test case 4: One-hot encoded labels y_pred = Tensor([[2.0, 1.0, 0.0], [0.0, 2.0, 1.0]]) y_true = Tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]) # One-hot encoded loss = ce(y_pred, y_true) assert 0.0 < loss.data < 2.0, f"One-hot encoded loss should be reasonable, got {loss.data}" - print("โœ… One-hot encoded labels test passed") + print("PASS One-hot encoded labels test passed") - print("๐ŸŽ‰ 
Cross-Entropy loss tests passed! Understanding classification objectives.") + print("CELEBRATE Cross-Entropy loss tests passed! Understanding classification objectives.") test_unit_crossentropy_loss() @@ -788,21 +786,21 @@ Binary Cross-Entropy Loss is the specialized, efficient version of cross-entropy Binary Classification Landscape: Sigmoid Activation: - Raw Logit โ†’ Sigmoid โ†’ Probability โ†’ Loss - -5.0 โ†’ 0.007 โ†’ 0.007 โ†’ High loss (if true=1) - 0.0 โ†’ 0.500 โ†’ 0.500 โ†’ Medium loss - +5.0 โ†’ 0.993 โ†’ 0.993 โ†’ Low loss (if true=1) + Raw Logit -> Sigmoid -> Probability -> Loss + -5.0 -> 0.007 -> 0.007 -> High loss (if true=1) + 0.0 -> 0.500 -> 0.500 -> Medium loss + +5.0 -> 0.993 -> 0.993 -> Low loss (if true=1) Loss Behavior: - BCE = -[yร—log(p) + (1-y)ร—log(1-p)] + BCE = -[y*log(p) + (1-y)*log(1-p)] For y=1 (positive class): - p=0.9 โ†’ -log(0.9) = 0.105 (low loss) - p=0.1 โ†’ -log(0.1) = 2.303 (high loss) + p=0.9 -> -log(0.9) = 0.105 (low loss) + p=0.1 -> -log(0.1) = 2.303 (high loss) For y=0 (negative class): - p=0.1 โ†’ -log(0.9) = 0.105 (low loss) - p=0.9 โ†’ -log(0.1) = 2.303 (high loss) + p=0.1 -> -log(0.9) = 0.105 (low loss) + p=0.9 -> -log(0.1) = 2.303 (high loss) ``` ## Numerical Stability Solution @@ -810,30 +808,30 @@ Loss Behavior: ``` The Binary Cross-Entropy Stability Problem: - BCE = -[yร—log(ฯƒ(x)) + (1-y)ร—log(1-ฯƒ(x))] + BCE = -[y*log(ฯƒ(x)) + (1-y)*log(1-ฯƒ(x))] Where ฯƒ(x) = 1/(1+exp(-x)) Problems: - - Large positive x: exp(-x) โ†’ 0, then log(1) โ†’ 0 (loss of precision) - - Large negative x: ฯƒ(x) โ†’ 0, then log(0) โ†’ -โˆž + - Large positive x: exp(-x) -> 0, then log(1) -> 0 (loss of precision) + - Large negative x: ฯƒ(x) -> 0, then log(0) -> -inf Our Stable Solution: - BCE = max(x,0) - xร—y + log(1 + exp(-|x|)) + BCE = max(x,0) - x*y + log(1 + exp(-|x|)) Why this works: - max(x,0) handles positive values - - -xร—y is the "cross" term - - log(1+exp(-|x|)) is always stable (expโ‰ค1) + - -x*y is the "cross" term + - 
log(1+exp(-|x|)) is always stable (exp<=1) ``` ## Mathematical Foundation For binary predictions and labels: ``` -BCE = -y ร— log(ฯƒ(x)) - (1-y) ร— log(1-ฯƒ(x)) +BCE = -y * log(ฯƒ(x)) - (1-y) * log(1-ฯƒ(x)) -Stable form: BCE = max(x,0) - xร—y + log(1 + exp(-|x|)) +Stable form: BCE = max(x,0) - x*y + log(1 + exp(-|x|)) ``` ## Learning Objectives @@ -846,11 +844,11 @@ By implementing Binary Cross-Entropy, you'll understand: # %% nbgrader={"grade": false, "grade_id": "binary-crossentropy-concept", "locked": false, "schema_version": 3, "solution": false, "task": false} """ -๐Ÿค” **Computational Question: Binary Stability** +THINK **Computational Question: Binary Stability** Consider the stable BCE formulation: -1. Why does max(x,0) - xร—y + log(1+exp(-|x|)) work? +1. Why does max(x,0) - x*y + log(1+exp(-|x|)) work? 2. What happens when x=100? (trace through the computation) 3. What happens when x=-100? (trace through the computation) 4. How does this prevent both overflow and underflow? @@ -897,7 +895,7 @@ class BinaryCrossEntropyLoss: APPROACH: 1. Convert inputs to tensors and flatten for consistent processing - 2. Use stable BCE formula: max(x,0) - xร—y + log(1+exp(-|x|)) + 2. Use stable BCE formula: max(x,0) - x*y + log(1+exp(-|x|)) 3. Apply this formula element-wise across the batch 4. 
Return mean loss across all samples @@ -911,7 +909,7 @@ class BinaryCrossEntropyLoss: HINTS: - Use np.maximum(logits, 0) for the max(x,0) term - - Use np.abs(logits) to ensure exp argument is โ‰ค 0 + - Use np.abs(logits) to ensure exp argument is <= 0 - The formula naturally handles both positive and negative logits - Return np.mean() for batch averaging """ @@ -933,8 +931,8 @@ class BinaryCrossEntropyLoss: BCE(logits, y) = max(logits, 0) - logits * y + log(1 + exp(-|logits|)) This formulation prevents: - - exp(large_positive_logit) โ†’ overflow - - log(very_small_sigmoid) โ†’ -inf + - exp(large_positive_logit) -> overflow + - log(very_small_sigmoid) -> -inf Mathematical equivalence: - For positive logits: x - x*y + log(1 + exp(-x)) @@ -963,10 +961,10 @@ class BinaryCrossEntropyLoss: """Alternative interface for forward pass.""" return self.__call__(y_pred, y_true) -# ๐Ÿ” SYSTEMS INSIGHT: Binary CrossEntropy Efficiency Analysis +# MAGNIFY SYSTEMS INSIGHT: Binary CrossEntropy Efficiency Analysis def analyze_binary_crossentropy_efficiency(): """Analyze binary cross-entropy computational efficiency.""" - print("๐Ÿ” Binary CrossEntropy Efficiency Analysis") + print("MAGNIFY Binary CrossEntropy Efficiency Analysis") print("=" * 45) try: @@ -974,7 +972,7 @@ def analyze_binary_crossentropy_efficiency(): ce = CrossEntropyLoss() # For comparison # Compare binary-specific vs general cross-entropy - print("\nโšก Binary vs Multi-Class Efficiency:") + print("\nSPEED Binary vs Multi-Class Efficiency:") # Binary problem solved two ways binary_logits = Tensor([[1.5], [-0.8], [2.1]]) @@ -1001,7 +999,7 @@ def analyze_binary_crossentropy_efficiency(): print(f" Binary approach: {binary_memory / 1024:.1f} KB") print(f" Multi-class (2): {multiclass_memory / 1024:.1f} KB") - print(f" Binary is {multiclass_memory/binary_memory:.1f}ร— more memory efficient") + print(f" Binary is {multiclass_memory/binary_memory:.1f}* more memory efficient") # Stability test with extreme values 
print(f"\n๐Ÿ›ก๏ธ Extreme Value Stability:") @@ -1018,24 +1016,24 @@ def analyze_binary_crossentropy_efficiency(): is_stable = not (np.isnan(loss.data) or np.isinf(loss.data)) print(f" {name:15}: Loss = {loss.data:.6f}, Stable = {is_stable}") - # ๐Ÿ’ก WHY THIS MATTERS: Binary CrossEntropy is 2ร— more memory efficient + # TIP WHY THIS MATTERS: Binary CrossEntropy is 2* more memory efficient # than regular CrossEntropy for binary problems, and provides better # numerical stability through its specialized formulation. except Exception as e: - print(f"โš ๏ธ Analysis error: {e}") + print(f"WARNING๏ธ Analysis error: {e}") print("Ensure BinaryCrossEntropy implementation is complete") # %% [markdown] """ -### ๐Ÿงช Unit Test: Binary Cross-Entropy Loss +### TEST Unit Test: Binary Cross-Entropy Loss This test validates `BinaryCrossEntropyLoss.__call__`, ensuring stable binary cross-entropy computation with extreme values. """ # %% nbgrader={"grade": true, "grade_id": "test-binary-crossentropy", "locked": true, "points": 4, "schema_version": 3, "solution": false, "task": false} def test_unit_binary_crossentropy_loss(): """Test Binary CrossEntropy loss implementation.""" - print("๐Ÿงช Testing Binary Cross-Entropy Loss...") + print("TEST Testing Binary Cross-Entropy Loss...") bce = BinaryCrossEntropyLoss() @@ -1044,22 +1042,22 @@ def test_unit_binary_crossentropy_loss(): y_true = Tensor([[1.0], [0.0]]) loss = bce(y_pred, y_true) assert loss.data < 0.1, f"Perfect predictions should have low loss, got {loss.data}" - print("โœ… Perfect predictions test passed") + print("PASS Perfect predictions test passed") # Test case 2: Random predictions (should have higher loss) y_pred = Tensor([[0.0], [0.0]]) # 0.5 probability after sigmoid y_true = Tensor([[1.0], [0.0]]) loss = bce(y_pred, y_true) expected_random = -np.log(0.5) # log(0.5) for random guessing - assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss โ‰ˆ {expected_random}, got {loss.data}" - 
print("✅ Random predictions test passed")
+    assert abs(loss.data - expected_random) < 0.1, f"Random predictions should have loss ~= {expected_random}, got {loss.data}"
+    print("PASS Random predictions test passed")
     # Test case 3: Batch processing
     y_pred = Tensor([[1.0], [2.0], [-1.0]])
     y_true = Tensor([[1.0], [1.0], [0.0]])
     loss = bce(y_pred, y_true)
     assert 0.0 < loss.data < 2.0, f"Batch processing loss should be reasonable, got {loss.data}"
-    print("✅ Batch processing test passed")
+    print("PASS Batch processing test passed")
     # Test case 4: Extreme values (test numerical stability)
     y_pred = Tensor([[100.0], [-100.0]]) # Extreme logits
@@ -1067,9 +1065,9 @@ def test_unit_binary_crossentropy_loss():
     loss = bce(y_pred, y_true)
     assert not np.isnan(loss.data) and not np.isinf(loss.data), f"Extreme values should not cause NaN/Inf, got {loss.data}"
     assert loss.data < 1.0, f"Extreme correct predictions should have low loss, got {loss.data}"
-    print("✅ Extreme values test passed")
+    print("PASS Extreme values test passed")
-    print("🎉 Binary Cross-Entropy loss tests passed! Understanding binary objectives.")
+    print("CELEBRATE Binary Cross-Entropy loss tests passed! 
Understanding binary objectives.") test_unit_binary_crossentropy_loss() @@ -1085,7 +1083,7 @@ Beyond standard loss functions, production ML systems often need custom losses t When false positives and false negatives have different costs: ```python -# Medical diagnosis: False negatives (missing disease) cost 10ร— more +# Medical diagnosis: False negatives (missing disease) cost 10* more class AsymmetricBinaryCrossEntropy(BinaryCrossEntropyLoss): def __init__(self, false_negative_weight=10.0): super().__init__() @@ -1134,11 +1132,15 @@ class FocalLoss(CrossEntropyLoss): return Tensor(np.mean(focal_loss)) ``` +""" +# %% [markdown] +""" ### Ranking-Aware Loss For problems where order matters (search, recommendations): +""" -```python +# %% nbgrader={"grade": false, "grade_id": "ranking-loss", "solution": true} class RankingAwareLoss: def __init__(self, position_weights=None): # Higher weights for top positions @@ -1146,7 +1148,7 @@ class RankingAwareLoss: def __call__(self, predictions, targets, positions): """predictions: relevance scores, targets: true relevance, positions: result positions""" - mse = MeanSquaredError() + # Not using MeanSquaredError() - computing directly # Weight errors by position importance weighted_errors = [] @@ -1156,14 +1158,16 @@ class RankingAwareLoss: weighted_errors.append(error) return Tensor(np.mean(weighted_errors)) -``` +# %% [markdown] +""" ## Advanced Custom Loss Patterns ### Multi-Task Learning Loss Combining multiple objectives with learned weights: +""" -```python +# %% nbgrader={"grade": false, "grade_id": "multitask-loss", "solution": true} class MultiTaskLoss: def __init__(self, num_tasks=3): # Learnable loss weights (log-variance parameterization for stability) @@ -1186,12 +1190,14 @@ class MultiTaskLoss: total_loss += weighted_loss return Tensor(total_loss) -``` +# %% [markdown] +""" ### Contrastive Loss for Similarity Learning For learning embeddings and similarity: +""" -```python +# %% nbgrader={"grade": false, 
"grade_id": "contrastive-loss", "solution": true} class ContrastiveLoss: def __init__(self, margin=1.0): self.margin = margin @@ -1207,12 +1213,15 @@ class ContrastiveLoss: total_loss = 0.5 * (positive_loss + negative_loss) return Tensor(np.mean(total_loss)) -``` +# %% [markdown] +""" ## Custom Loss Implementation Guidelines ### Numerical Stability Considerations -```python +""" + +# %% nbgrader={"grade": false, "grade_id": "stable-loss", "solution": true} # Always include stability measures in custom losses class StableCustomLoss: def __call__(self, predictions, targets): @@ -1221,16 +1230,21 @@ class StableCustomLoss: predictions = Tensor(predictions) # 2. Handle edge cases - predictions_clipped = np.clip(predictions.data, -100, 100) # Prevent overflow + # predictions_clipped would be used here for actual computation + # predictions_clipped = np.clip(predictions.data, -100, 100) # Prevent overflow # 3. Use numerically stable formulations # Avoid: exp(large_number), log(small_number) # Use: log-sum-exp trick, epsilon clipping - # 4. Return tensor for consistency - return Tensor(computed_loss) -``` + # 4. Compute loss (example - actual implementation depends on loss type) + computed_loss = np.mean((predictions.data - targets.data) ** 2) + # 5. 
Return tensor for consistency + return Tensor(computed_loss) + +# %% [markdown] +""" ### Gradient-Friendly Design ```python # Ensure gradients flow properly @@ -1246,7 +1260,7 @@ class GradientFriendlyLoss: return smooth_loss def smooth_l1_loss(self, pred, target, beta=1.0): - """Smooth L1 loss - less sensitive to outliers than MSE""" + \"\"\"Smooth L1 loss - less sensitive to outliers than MSE\"\"\" diff = np.abs(pred.data - target.data) loss = np.where(diff < beta, 0.5 * diff * diff / beta, @@ -1272,7 +1286,7 @@ Activation: Usually none (linear output) Penalty: Quadratic (large errors >> small errors) Model Architecture: -Input โ†’ Hidden Layers โ†’ Linear Output โ†’ MSE Loss +Input -> Hidden Layers -> Linear Output -> MSE Loss ``` ### Cross-Entropy Loss - Multi-Class Classification @@ -1284,7 +1298,7 @@ Activation: Softmax Penalty: Logarithmic (encouraging confident correct predictions) Model Architecture: -Input โ†’ Hidden Layers โ†’ Softmax โ†’ CrossEntropy Loss +Input -> Hidden Layers -> Softmax -> CrossEntropy Loss ``` ### Binary Cross-Entropy Loss - Binary Classification @@ -1296,7 +1310,7 @@ Activation: Sigmoid Penalty: Asymmetric (confident wrong predictions heavily penalized) Model Architecture: -Input โ†’ Hidden Layers โ†’ Sigmoid โ†’ Binary CrossEntropy Loss +Input -> Hidden Layers -> Sigmoid -> Binary CrossEntropy Loss ``` ## Performance and Stability Comparison @@ -1304,8 +1318,8 @@ Input โ†’ Hidden Layers โ†’ Sigmoid โ†’ Binary CrossEntropy Loss ``` Computational Characteristics: MSE CrossEntropy Binary CE -Time Complexity: O(n) O(nร—c) O(n) -Memory Complexity: O(1) O(nร—c) O(n) +Time Complexity: O(n) O(n*c) O(n) +Memory Complexity: O(1) O(n*c) O(n) Numerical Stability: High Medium High Convergence Speed: Fast Medium Fast @@ -1319,27 +1333,27 @@ Where: n = batch size, c = number of classes # Regression Problem (House Price Prediction) regression_model = Sequential([ - Linear(10, 64), # Input features โ†’ Hidden + Linear(10, 64), # Input features -> 
Hidden
     ReLU(),
-    Linear(64, 1), # Hidden → Single output
+    Linear(64, 1), # Hidden -> Single output
     # No activation - linear output for regression
 ])
 loss_fn = MeanSquaredError()
 # Multi-Class Classification (Image Recognition)
 classification_model = Sequential([
-    Linear(784, 128), # Flattened image → Hidden
+    Linear(784, 128), # Flattened image -> Hidden
     ReLU(),
-    Linear(128, 10), # Hidden → 10 classes
+    Linear(128, 10), # Hidden -> 10 classes
     Softmax() # Convert to probabilities
 ])
 loss_fn = CrossEntropyLoss()
 # Binary Classification (Spam Detection)
 binary_model = Sequential([
-    Linear(100, 64), # Text features → Hidden
+    Linear(100, 64), # Text features -> Hidden
     ReLU(),
-    Linear(64, 1), # Hidden → Single output
+    Linear(64, 1), # Hidden -> Single output
     Sigmoid() # Convert to probability
 ])
 loss_fn = BinaryCrossEntropyLoss()
@@ -1355,7 +1369,7 @@ for batch in dataloader:
 # %% [markdown]
 """
-### 🧪 Comprehensive Integration Test
+### TEST Comprehensive Integration Test
 This test validates all loss functions work together correctly and can be used interchangeably in production systems.
 """
@@ -1370,7 +1384,7 @@ def test_unit_comprehensive_loss_integration():
     mse = MeanSquaredError()
     ce = CrossEntropyLoss()
     bce = BinaryCrossEntropyLoss()
-    print(" ✅ All loss functions created successfully")
+    print(" PASS All loss functions created successfully")
     # Test 2: Loss functions return appropriate types
     print("\n2. Return Type Verification:")
@@ -1396,7 +1410,7 @@ def test_unit_comprehensive_loss_integration():
     assert isinstance(loss, Tensor), "Binary CrossEntropy should return Tensor"
     assert loss.data.shape == (), "Binary CrossEntropy should return scalar"
-    print(" ✅ All loss functions return correct types")
+    print(" PASS All loss functions return correct types")
     # Test 3: Loss values are reasonable
     print("\n3. 
Loss Value Sanity Checks:") @@ -1406,7 +1420,7 @@ def test_unit_comprehensive_loss_integration(): assert ce.forward(Tensor([[1.0, 0.0]]), Tensor([0])).data >= 0, "CrossEntropy should be non-negative" assert bce.forward(Tensor([1.0]), Tensor([1.0])).data >= 0, "Binary CrossEntropy should be non-negative" - print(" ✅ All loss functions produce reasonable values") + print(" PASS All loss functions produce reasonable values") # Test 4: Perfect predictions give low loss print("\n4. Perfect Prediction Tests:") @@ -1419,9 +1433,9 @@ def test_unit_comprehensive_loss_integration(): assert perfect_ce.data < 0.1, f"Perfect CE should be low, got {perfect_ce.data}" assert perfect_bce.data < 0.1, f"Perfect BCE should be low, got {perfect_bce.data}" - print(" ✅ Perfect predictions produce low loss") + print(" PASS Perfect predictions produce low loss") - print("\n🎉 All comprehensive integration tests passed!") + print("\nCELEBRATE All comprehensive integration tests passed!") print(" • Loss functions instantiate correctly") print(" • Return types are consistent (Tensor scalars)") print(" • Loss values are mathematically sound") @@ -1448,9 +1462,9 @@ MSE (Mean Squared Error): Bottleneck: Memory bandwidth (simple arithmetic operations) CrossEntropy (Multi-Class): - Time: O(n×c) - linear in samples × classes - Space: O(n×c) - store full probability distributions - Operations: n×c exp + n×c divisions + n×c logs + reductions + Time: O(n*c) - linear in samples * classes + Space: O(n*c) - store full probability distributions + Operations: n*c exp + n*c divisions + n*c logs + reductions Bottleneck: Exponential computations and memory bandwidth Binary CrossEntropy: @@ -1468,9 +1482,9 @@ Understanding memory requirements is crucial for large-scale training: Memory Requirements by Problem Scale: Small Problem (1K samples, 100 classes): - MSE: 8 KB (1K samples × 8 bytes) - CrossEntropy: 800 KB (1K × 100 × 8 bytes) - Binary CE: 16 KB (1K × 2 × 8 bytes) + MSE: 8 KB (1K 
samples * 8 bytes) + CrossEntropy: 800 KB (1K * 100 * 8 bytes) + Binary CE: 16 KB (1K * 2 * 8 bytes) Large Problem (100K samples, 10K classes): MSE: 800 KB (independent of classes!) @@ -1491,14 +1505,14 @@ Production systems must handle edge cases robustly: Stability Challenges and Solutions: CrossEntropy Stability Issues: - Problem: exp(large_logit) → overflow → NaN gradients + Problem: exp(large_logit) -> overflow -> NaN gradients Solution: log-sum-exp trick with max subtraction - Problem: log(very_small_prob) → -∞ → training collapse + Problem: log(very_small_prob) -> -inf -> training collapse Solution: epsilon clipping (1e-15 to 1-1e-15) Binary CrossEntropy Stability Issues: - Problem: sigmoid(large_positive) → 1.0 → log(0) issues + Problem: sigmoid(large_positive) -> 1.0 -> log(0) issues Solution: stable logits formulation bypasses sigmoid Problem: exp(large_negative) in naive implementation @@ -1520,13 +1534,13 @@ Inference Throughput (measured on modern hardware): Training Memory Bandwidth Requirements: MSE: ~800 MB/s (lightweight computation) - CrossEntropy: ~80 GB/s (10× higher due to softmax!) + CrossEntropy: ~80 GB/s (10* higher due to softmax!) 
Binary CE: ~1.6 GB/s (moderate requirements) Gradient Computation Overhead: - MSE: 1.1× forward pass time (simple derivatives) - CrossEntropy: 1.5× forward pass time (softmax gradients) - Binary CE: 1.2× forward pass time (sigmoid gradients) + MSE: 1.1* forward pass time (simple derivatives) + CrossEntropy: 1.5* forward pass time (softmax gradients) + Binary CE: 1.2* forward pass time (sigmoid gradients) ``` ## Framework Integration and Production Patterns @@ -1573,10 +1587,10 @@ Monitoring and Debugging: ``` """ -# 🔍 SYSTEMS INSIGHT: Performance Profiling Analysis +# MAGNIFY SYSTEMS INSIGHT: Performance Profiling Analysis def analyze_loss_performance_characteristics(): """Comprehensive performance analysis of all loss functions.""" - print("🔍 Loss Function Performance Analysis") + print("MAGNIFY Loss Function Performance Analysis") print("=" * 45) try: @@ -1587,7 +1601,7 @@ def analyze_loss_performance_characteristics(): ce = CrossEntropyLoss() bce = BinaryCrossEntropyLoss() - print("\n⚡ Computational Complexity Measurement:") + print("\nSPEED Computational Complexity Measurement:") # Test different batch sizes to see scaling behavior batch_sizes = [100, 1000, 10000] @@ -1601,7 +1615,7 @@ def analyze_loss_performance_characteristics(): start = time.perf_counter() for _ in range(100): # Average over multiple runs - mse_loss = mse(mse_pred, mse_true) + _ = mse(mse_pred, mse_true) mse_time = (time.perf_counter() - start) / 100 # CrossEntropy timing @@ -1610,7 +1624,7 @@ def analyze_loss_performance_characteristics(): start = time.perf_counter() for _ in range(100): - ce_loss = ce(ce_pred, ce_true) + _ = ce(ce_pred, ce_true) ce_time = (time.perf_counter() - start) / 100 # Binary CrossEntropy timing @@ -1619,7 +1633,7 @@ def analyze_loss_performance_characteristics(): start = time.perf_counter() for _ in range(100): - bce_loss = bce(bce_pred, bce_true) + _ = bce(bce_pred, bce_true) bce_time = (time.perf_counter() - start) / 100 print(f" MSE: 
{mse_time*1000:8.3f} ms") @@ -1649,19 +1663,19 @@ def analyze_loss_performance_characteristics(): print(f" BCE memory: {bce_memory / 1024 / 1024:8.1f} MB") print(f" CE overhead: {ce_memory/mse_memory:8.1f}x") - # ๐Ÿ’ก WHY THIS MATTERS: These performance characteristics determine + # TIP WHY THIS MATTERS: These performance characteristics determine # which loss functions are feasible for different deployment scenarios. - # CrossEntropy's O(nร—c) memory scaling makes it prohibitive for + # CrossEntropy's O(n*c) memory scaling makes it prohibitive for # large vocabularies without specialized techniques. except Exception as e: - print(f"โš ๏ธ Performance analysis error: {e}") + print(f"WARNING๏ธ Performance analysis error: {e}") print("Performance analysis requires complete implementations") -# ๐Ÿ” SYSTEMS INSIGHT: Numerical Stability Deep Analysis +# MAGNIFY SYSTEMS INSIGHT: Numerical Stability Deep Analysis def analyze_numerical_stability_edge_cases(): """Deep analysis of numerical stability across all loss functions.""" - print("๐Ÿ” Numerical Stability Edge Case Analysis") + print("MAGNIFY Numerical Stability Edge Case Analysis") print("=" * 50) try: @@ -1719,42 +1733,42 @@ def analyze_numerical_stability_edge_cases(): ("Very right", [[5.0, -5.0, 0.0]], [0]) ] - print(" Prediction Quality โ†’ CrossEntropy Loss:") + print(" Prediction Quality -> CrossEntropy Loss:") for name, logits, labels in confidence_levels: loss = ce(Tensor(logits), Tensor(labels)) print(f" {name:15}: {loss.data:8.4f}") - # ๐Ÿ’ก WHY THIS MATTERS: Understanding how loss functions behave + # TIP WHY THIS MATTERS: Understanding how loss functions behave # at extremes helps debug training failures and choose appropriate # loss scaling and clipping strategies for production systems. 
except Exception as e: - print(f"โš ๏ธ Stability analysis error: {e}") + print(f"WARNING๏ธ Stability analysis error: {e}") print("Stability analysis requires complete implementations") -# ๐Ÿ” SYSTEMS INSIGHT: Mixed Precision Training Analysis +# MAGNIFY SYSTEMS INSIGHT: Mixed Precision Training Analysis def analyze_mixed_precision_considerations(): """Analyze loss function behavior with FP16 mixed precision training.""" - print("๐Ÿ” Mixed Precision Training Analysis") + print("MAGNIFY Mixed Precision Training Analysis") print("=" * 40) try: - print("\nโšก FP16 Numerical Range Analysis:") - print(" FP16 range: ~ยฑ65,504 (much smaller than FP32's ~ยฑ3.4ร—10ยณโธ)") + print("\nSPEED FP16 Numerical Range Analysis:") + print(" FP16 range: ~ยฑ65,504 (much smaller than FP32's ~ยฑ3.4*10ยณโธ)") # Simulate FP16 range limitations fp16_max = 65504.0 - fp16_min_normal = 2**-14 # Smallest normal FP16 number โ‰ˆ 6.1ร—10โปโต + fp16_min_normal = 2**-14 # Smallest normal FP16 number ~= 6.1*10โปโต print(f" FP16 maximum: ยฑ{fp16_max:,.0f}") print(f" FP16 min normal: {fp16_min_normal:.2e}") - print(f" Risk: Gradients/losses exceeding range โ†’ infinity/NaN") + print(f" Risk: Gradients/losses exceeding range -> infinity/NaN") mse = MeanSquaredError() - ce = CrossEntropyLoss() - bce = BinaryCrossEntropyLoss() + # ce = CrossEntropyLoss() # Not used in this test + # bce = BinaryCrossEntropyLoss() # Not used in this test - print(f"\n๐ŸŽฏ Loss Function Mixed Precision Compatibility:") + print(f"\nTARGET Loss Function Mixed Precision Compatibility:") # Test cases that might overflow in FP16 test_cases = [ @@ -1772,7 +1786,7 @@ def analyze_mixed_precision_considerations(): squared_error = (pred - true) ** 2 fp16_safe = squared_error < fp16_max - print(f" {name:>15} {mse_loss.data:>12.1f} {'โœ…' if fp16_safe else 'โŒ':>12}") + print(f" {name:>15} {mse_loss.data:>12.1f} {'PASS' if fp16_safe else 'FAIL':>12}") print(f"\n๐Ÿ›ก๏ธ Mixed Precision Loss Scaling Strategy:") @@ -1836,22 
+1850,22 @@ def analyze_mixed_precision_considerations(): print(f" scaler.step(optimizer) # Automatically unscales gradients") print(f" ```") - # ๐Ÿ’ก WHY THIS MATTERS: Mixed precision training can provide 1.5-2ร— speedup + # TIP WHY THIS MATTERS: Mixed precision training can provide 1.5-2* speedup # and 50% memory reduction, but loss functions must be carefully implemented # to handle the reduced numerical precision without losing training stability. except Exception as e: - print(f"โš ๏ธ Mixed precision analysis error: {e}") + print(f"WARNING๏ธ Mixed precision analysis error: {e}") print("Mixed precision analysis requires complete loss implementations") -# ๐Ÿ” SYSTEMS INSIGHT: Production Deployment Analysis +# MAGNIFY SYSTEMS INSIGHT: Production Deployment Analysis def analyze_production_deployment_patterns(): """Analyze how loss functions affect production ML system design.""" - print("๐Ÿ” Production Deployment Pattern Analysis") + print("MAGNIFY Production Deployment Pattern Analysis") print("=" * 50) try: - print("\n๐Ÿš€ Deployment Scenario Analysis:") + print("\nROCKET Deployment Scenario Analysis:") # Different deployment scenarios with constraints scenarios = [ @@ -1897,7 +1911,7 @@ def analyze_production_deployment_patterns(): trade_offs = [ ("Memory Efficiency", "MSE > Binary CE >> CrossEntropy"), ("Computational Speed", "MSE > Binary CE > CrossEntropy"), - ("Numerical Stability", "MSE โ‰ˆ Binary CE > CrossEntropy"), + ("Numerical Stability", "MSE ~= Binary CE > CrossEntropy"), ("Implementation Complexity", "MSE > CrossEntropy > Binary CE"), ("Gradient Quality", "CrossEntropy > Binary CE > MSE"), ("Debug-ability", "MSE > Binary CE > CrossEntropy") @@ -1918,24 +1932,24 @@ def analyze_production_deployment_patterns(): for framework, losses in frameworks: print(f" {framework:12}: {losses}") - # ๐Ÿ’ก WHY THIS MATTERS: Loss function choice affects every aspect + # TIP WHY THIS MATTERS: Loss function choice affects every aspect # of ML system design - from 
memory requirements to latency to # debugging complexity. Understanding these trade-offs enables # informed architectural decisions for production systems. except Exception as e: - print(f"⚠️ Deployment analysis error: {e}") + print(f"WARNING️ Deployment analysis error: {e}") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Now that you've implemented all core loss functions and analyzed their systems characteristics, let's explore their implications for real ML systems: """ # %% nbgrader={"grade": false, "grade_id": "question-1-loss-selection", "locked": false, "schema_version": 3, "solution": false, "task": false} """ -🤔 **Question 1: Loss Function Selection for Production Systems** +THINK **Question 1: Loss Function Selection for Production Systems** You're building a production recommendation system that predicts user ratings (1-5 stars) for movies. @@ -1947,8 +1961,8 @@ C) Ordinal approach: Use a custom loss that penalizes being off by multiple star Analyze each approach considering your implementations: **Technical Analysis:** -- How does the memory scaling of CrossEntropy (O(batch_size × num_classes)) affect this 5-class problem? -- What are the computational complexity differences between MSE's O(n) and CrossEntropy's O(n×c) for c=5? +- How does the memory scaling of CrossEntropy (O(batch_size * num_classes)) affect this 5-class problem? +- What are the computational complexity differences between MSE's O(n) and CrossEntropy's O(n*c) for c=5? - How do the gradient behaviors differ? 
(MSE's quadratic vs CrossEntropy's logarithmic penalties) **Systems Implications:** @@ -1966,7 +1980,7 @@ Recommend an approach with justification based on your implementation experience # %% nbgrader={"grade": false, "grade_id": "question-2-numerical-stability", "locked": false, "schema_version": 3, "solution": false, "task": false} """ -🤔 **Question 2: Debugging Numerical Stability in Production** +THINK **Question 2: Debugging Numerical Stability in Production** Your cross-entropy loss function works perfectly in development, but in production you start seeing NaN losses that crash training after several hours. @@ -1994,7 +2008,7 @@ Research how PyTorch and TensorFlow handle these same challenges in their loss i # %% nbgrader={"grade": false, "grade_id": "question-3-custom-loss-design", "locked": false, "schema_version": 3, "solution": false, "task": false} """ -🤔 **Question 3: Implementing and Optimizing Custom Loss Functions** +THINK **Question 3: Implementing and Optimizing Custom Loss Functions** You've seen examples of custom loss functions for business objectives. Now analyze implementation and optimization challenges: @@ -2036,16 +2050,16 @@ Implement one optimization for your chosen custom loss and explain how it addres # %% [markdown] """ -## 🎯 MODULE SUMMARY: Loss Functions - Learning Objectives Made Mathematical +## TARGET MODULE SUMMARY: Loss Functions - Learning Objectives Made Mathematical Congratulations! 
You've successfully implemented the complete foundation for neural network training objectives: ### What You've Accomplished -✅ **Complete Loss Function Library**: MSE for regression, CrossEntropy for multi-class classification, and Binary CrossEntropy for binary classification with production-grade numerical stability -✅ **Systems Engineering Understanding**: Deep comprehension of computational complexity, memory scaling, and numerical stability requirements for reliable ML systems -✅ **Mathematical Implementation Mastery**: Built loss functions from mathematical foundations through stable computational formulations to working code -✅ **Production Readiness Knowledge**: Understanding of how loss function choice affects training speed, memory usage, and deployment feasibility -✅ **Framework Integration Insight**: Clear connection between your implementations and how PyTorch/TensorFlow solve the same problems +PASS **Complete Loss Function Library**: MSE for regression, CrossEntropy for multi-class classification, and Binary CrossEntropy for binary classification with production-grade numerical stability +PASS **Systems Engineering Understanding**: Deep comprehension of computational complexity, memory scaling, and numerical stability requirements for reliable ML systems +PASS **Mathematical Implementation Mastery**: Built loss functions from mathematical foundations through stable computational formulations to working code +PASS **Production Readiness Knowledge**: Understanding of how loss function choice affects training speed, memory usage, and deployment feasibility +PASS **Framework Integration Insight**: Clear connection between your implementations and how PyTorch/TensorFlow solve the same problems ### Key Learning Outcomes - **Loss Function Theory**: How mathematical loss functions translate business objectives into optimization targets that neural networks can learn from @@ -2054,9 +2068,9 @@ Congratulations! 
You've successfully implemented the complete foundation for neu - **Production ML Patterns**: Knowledge of how loss function choice affects system architecture, monitoring requirements, and debugging complexity ### Mathematical Foundations Mastered -- **MSE computation**: `(1/n) × Σ(y_pred - y_true)²` with smooth quadratic gradients for regression optimization +- **MSE computation**: `(1/n) * Sum(y_pred - y_true)²` with smooth quadratic gradients for regression optimization - **CrossEntropy with stable softmax**: Log-sum-exp trick and epsilon clipping for numerically robust classification -- **Binary CrossEntropy stability**: `max(x,0) - x×y + log(1 + exp(-|x|))` formulation preventing overflow/underflow issues +- **Binary CrossEntropy stability**: `max(x,0) - x*y + log(1 + exp(-|x|))` formulation preventing overflow/underflow issues - **Gradient behavior understanding**: How different loss functions create different optimization landscapes and learning dynamics ### Professional Skills Developed @@ -2091,11 +2105,11 @@ With solid loss function implementations, you're ready to: # %% nbgrader={"grade": false, "grade_id": "final-demo", "locked": false, "schema_version": 3, "solution": false, "task": false} if __name__ == "__main__": - print("🔥 TinyTorch Loss Functions Module - Complete Demo") + print("FIRE TinyTorch Loss Functions Module - Complete Demo") print("=" * 55) # Test all core implementations - print("\n🧪 Testing All Loss Functions:") + print("\nTEST Testing All Loss Functions:") test_unit_mse_loss() test_unit_crossentropy_loss() test_unit_binary_crossentropy_loss() @@ -2103,7 +2117,7 @@ if __name__ == "__main__": # Run systems analysis functions print("\n" + "="*60) - print("🔍 Systems Analysis Functions") + print("MAGNIFY Systems Analysis Functions") print("=" * 30) visualize_loss_landscapes() @@ -2145,7 +2159,7 @@ if __name__ == "__main__": print(f" Spam detection loss: {spam_loss.data:.4f}") print("\n" + "="*60) - print("🎯 Loss Function 
Characteristics") + print("TARGET Loss Function Characteristics") print("=" * 35) # Compare perfect vs imperfect predictions @@ -2169,11 +2183,11 @@ if __name__ == "__main__": print(f" Random CE loss: {random_ce.data:.6f}") print(f" Random BCE loss: {random_bce.data:.6f}") - print("\n🎉 Complete loss function foundation ready!") - print(" ✅ MSE for regression problems") - print(" ✅ CrossEntropy for multi-class classification") - print(" ✅ Binary CrossEntropy for binary classification") - print(" ✅ Numerically stable implementations") - print(" ✅ Production-ready batch processing") - print(" ✅ Systems analysis and performance insights") - print(" ✅ Ready for neural network training!") \ No newline at end of file + print("\nCELEBRATE Complete loss function foundation ready!") + print(" PASS MSE for regression problems") + print(" PASS CrossEntropy for multi-class classification") + print(" PASS Binary CrossEntropy for binary classification") + print(" PASS Numerically stable implementations") + print(" PASS Production-ready batch processing") + print(" PASS Systems analysis and performance insights") + print(" PASS Ready for neural network training!") \ No newline at end of file diff --git a/modules/05_autograd/autograd_dev.py b/modules/05_autograd/autograd_dev.py index 77d93e75..51b72595 100644 --- a/modules/05_autograd/autograd_dev.py +++ b/modules/05_autograd/autograd_dev.py @@ -14,7 +14,7 @@ Welcome to Autograd! You'll implement the automatic differentiation engine that makes neural network training possible by automatically computing gradients through complex computational graphs. -## 🔗 Building on Previous Learning +## LINK Building on Previous Learning **What You Built Before**: - Module 02 (Tensor): Data structures that hold neural network parameters - Module 05 (Losses): Functions that measure prediction accuracy @@ -27,33 +27,37 @@ Welcome to Autograd! 
You'll implement the automatic differentiation engine that **Connection Map**: ``` -Tensors → Loss Functions → Autograd → Optimizers -(data) (error measure) (∇L/∇θ) (parameter updates) +Tensors -> Loss Functions -> Autograd -> Optimizers +(data) (error measure) (gradL/gradθ) (parameter updates) ``` -## Learning Goals -- Systems understanding: How computational graphs enable automatic differentiation and why this approach scales to arbitrary network architectures -- Core implementation skill: Build the Variable class with gradient tracking and implement backward propagation through dynamic computation graphs -- Pattern recognition: Understand how chain rule application through computational graphs generalizes to any differentiable function -- Framework connection: See how your implementation mirrors PyTorch's autograd engine and tensor gradient tracking -- Performance insight: Learn why computational graph memory management and gradient accumulation strategies determine training scalability +## Learning Objectives -## Build → Use → Reflect -1. **Build**: Complete automatic differentiation system with Variable class, gradient tracking, and backward propagation -2. **Use**: Apply autograd to complex mathematical expressions and neural network operations -3. **Reflect**: Why does automatic differentiation enable ML at scale, and how does graph memory management affect training? +By completing this module, you will: + +1. **Implement automatic differentiation** - Build the system that computes gradients automatically +2. **Create computational graphs** - Track operations to enable backward propagation +3. **Apply the chain rule** - Understand how gradients flow through complex operations +4. **Build Variable class** - Extend tensors with gradient tracking capabilities +5. **Enable training** - Provide the automatic gradient computation that makes learning possible + +## Build -> Use -> Reflect +1. 
**Build**: Variable class with gradient tracking and backward propagation through operations +2. **Use**: Apply autograd to mathematical expressions and see gradients computed automatically +3. **Reflect**: Understand how automatic differentiation enables efficient neural network training ## What You'll Achieve -By the end of this module, you'll understand: -- Deep technical understanding of how computational graphs enable automatic gradient computation for arbitrary functions -- Practical capability to build the gradient computation engine that powers all modern neural network training -- Systems insight into why automatic differentiation was the breakthrough that enabled deep learning at scale +- **Gradient computation**: Automatically compute derivatives for any mathematical expression +- **Chain rule implementation**: Apply calculus systematically through complex operations +- **Memory management**: Handle gradient accumulation and computational graph lifecycle +- **Training enablement**: Provide the gradient information needed for parameter optimization +- **Framework understanding**: See how PyTorch and TensorFlow implement automatic differentiation - Performance consideration of how computational graph size and memory management affect training efficiency - Connection to production ML systems and how frameworks optimize gradient computation and memory usage ## Systems Reality Check -💡 **Production Context**: PyTorch's autograd can handle graphs with millions of nodes and uses sophisticated memory optimization like gradient checkpointing to train models larger than GPU memory -⚡ **Performance Note**: Gradient computation often requires storing forward activations, leading to memory usage that scales with network depth - this drives innovations like gradient checkpointing +TIP **Production Context**: PyTorch's autograd can handle graphs with millions of nodes and uses sophisticated memory optimization like gradient checkpointing to train models larger 
than GPU memory +SPEED **Performance Note**: Gradient computation often requires storing forward activations, leading to memory usage that scales with network depth - this drives innovations like gradient checkpointing """ # %% nbgrader={"grade": false, "grade_id": "autograd-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -75,14 +79,14 @@ except ImportError: from tensor_dev import Tensor # %% nbgrader={"grade": false, "grade_id": "autograd-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch Autograd Module") +print("FIRE TinyTorch Autograd Module") print(f"NumPy version: {np.__version__}") print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") print("Ready to build automatic differentiation!") # %% [markdown] """ -## 📦 Where This Code Lives in the Final Package +## PACKAGE Where This Code Lives in the Final Package **Learning Side:** You work in `modules/06_autograd/autograd_dev.py` **Building Side:** Code exports to `tinytorch.core.autograd` @@ -109,7 +113,7 @@ from tinytorch.core.activations import ReLU, Sigmoid, Tanh Neural networks have millions of parameters. To train them, we need gradients of the loss function with respect to every parameter: ``` -∇θ L = [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂wₙ, ∂L/∂b₁, ∂L/∂b₂, ..., ∂L/∂bₘ] +gradθ L = [dL/dw₁, dL/dw₂, ..., dL/dwₙ, dL/db₁, dL/db₂, ..., dL/dbₘ] ``` **Manual differentiation fails** because: @@ -121,7 +125,7 @@ Neural networks have millions of parameters. 
To train them, we need gradients of **Autograd** automatically computes derivatives of functions represented as computational graphs: ```python -# Instead of manually computing: โˆ‚(xยฒ + 2xy + yยฒ)/โˆ‚x = 2x + 2y +# Instead of manually computing: d(xยฒ + 2xy + yยฒ)/dx = 2x + 2y # Autograd does it automatically: x = Variable(3.0, requires_grad=True) y = Variable(4.0, requires_grad=True) @@ -136,21 +140,21 @@ print(x.grad) # 2*3 + 2*4 = 14 (computed automatically!) Mathematical Expression: z = xยฒ + 2xy + yยฒ Computational Graph: - x โ”€โ”€โ”ฌโ”€โ†’ [ร—] โ”€โ”€โ†’ xยฒ โ”€โ”€โ”ฌโ”€โ†’ [+] โ”€โ”€โ†’ [+] โ”€โ”€โ†’ z - โ†‘ โ”‚ โ”‚ โ†‘ โ†‘ - โ”‚ โ””โ”€โ†’ [ร—] โ”€โ”€โ†’ 2x โ”€โ”˜ โ”‚ โ”‚ - โ”‚ โ†‘ โ”‚ โ”‚ - โ”‚ 2 โ”‚ โ”‚ - โ”‚ โ”‚ โ”‚ - x โ”€โ”€โ”ฌโ”€โ†’ [ร—] โ”€โ”€โ†’ xy โ”€โ†’ [ร—] โ”€โ”€โ†’ 2xy โ”‚ - โ†‘ โ”‚ โ†‘ โ†‘ โ”‚ - โ”‚ โ”‚ โ”‚ 2 โ”‚ - โ”‚ โ”‚ y โ”‚ - โ”‚ โ”‚ โ”‚ - y โ”€โ”€โ”ดโ”€โ†’ [ร—] โ”€โ”€โ†’ yยฒ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + x --+--> [*] ---> xยฒ --+--> [+] ---> [+] ---> z + ^ | | ^ ^ + | +--> [*] ---> 2x -+ | | + | ^ | | + | 2 | | + | | | + x --+--> [*] ---> xy --> [*] ---> 2xy | + ^ | ^ ^ | + | | | 2 | + | | y | + | | | + y --+--> [*] ---> yยฒ --------------------+ Forward Pass: Compute values xยฒ = 9, 2xy = 24, yยฒ = 16, z = 49 -Backward Pass: Compute gradients โˆ‚z/โˆ‚x = 14, โˆ‚z/โˆ‚y = 20 +Backward Pass: Compute gradients dz/dx = 14, dz/dy = 20 ``` ### Why This is Revolutionary @@ -187,39 +191,39 @@ A **Variable** wraps a Tensor and tracks: ### Visual: The Computational Graph Structure ``` Variable Structure: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Variable Object โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ data: Tensor([1.5, 2.3, ...]) โ”‚ โ† Forward pass values -โ”‚ grad: None โ†’ Tensor([...]) โ”‚ โ† Backward pass gradients -โ”‚ requires_grad: True/False โ”‚ โ† Should 
compute gradients? -โ”‚ grad_fn: โ”‚ โ† How to compute gradients -โ”‚ is_leaf: True/False โ”‚ โ† Original parameter? -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++---------------------------------+ +| Variable Object | ++---------------------------------โ”ค +| data: Tensor([1.5, 2.3, ...]) | <- Forward pass values +| grad: None -> Tensor([...]) | <- Backward pass gradients +| requires_grad: True/False | <- Should compute gradients? +| grad_fn: | <- How to compute gradients +| is_leaf: True/False | <- Original parameter? ++---------------------------------+ Computational Graph Example: - x (leaf) โ”€โ”€โ” - โ”œโ”€โ”€[ADD]โ”€โ”€โ†’ z (intermediate) - y (leaf) โ”€โ”€โ”˜ + x (leaf) --+ + +--[ADD]---> z (intermediate) + y (leaf) --+ Forward: x.data + y.data = z.data - Backward: z.grad โ†’ x.grad, y.grad (via chain rule) + Backward: z.grad -> x.grad, y.grad (via chain rule) ``` ### Memory Layout: Variables vs Tensors ``` Memory Comparison: Tensor Only Variable with Autograd - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Data โ”‚ โ”‚ Data โ”‚ โ† Same data storage - โ”‚ 4 bytes โ”‚ โ”‚ 4 bytes โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ Gradient โ”‚ โ† Additional gradient storage - โ”‚ 4 bytes โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ grad_fn โ”‚ โ† Function pointer - โ”‚ 8 bytes โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +-------------+ +-------------+ + | Data | | Data | <- Same data storage + | 4 bytes | | 4 bytes | + +-------------+ +-------------โ”ค + | Gradient | <- Additional gradient storage + | 4 bytes | + +-------------โ”ค + | grad_fn | <- Function pointer + | 8 bytes | + +-------------+ Total: ~2x memory overhead ``` @@ -365,7 +369,7 @@ class Variable: # %% [markdown] """ -### ๐Ÿงช Unit Test: Variable Class +### TEST Unit Test: Variable Class This test 
validates Variable initialization, ensuring gradient tracking capabilities work correctly. """ @@ -400,16 +404,16 @@ def test_unit_variable_class(): x.zero_grad() assert x.grad is None, "zero_grad should reset gradient to None" - print("✅ Variable class tests passed!") - print(f"✅ Variable creation and initialization working") - print(f"✅ Data access and properties working") - print(f"✅ Gradient management working") + print("PASS Variable class tests passed!") + print(f"PASS Variable creation and initialization working") + print(f"PASS Data access and properties working") + print(f"PASS Gradient management working") # Test will run in main block # %% [markdown] """ -## 🤔 Computational Assessment: Variable Understanding +## THINK Computational Assessment: Variable Understanding Test your understanding of computational graphs and Variable design. """ @@ -462,27 +466,27 @@ Every operation must implement: ### Visual: Chain Rule Through Addition ``` Forward Pass: z = x + y - x: 3.0 ──┐ - ├──[+]──→ z: 5.0 - y: 2.0 ──┘ + x: 3.0 --+ + +--[+]---> z: 5.0 + y: 2.0 --+ -Backward Pass: ∂z/∂x = 1, ∂z/∂y = 1 - ∂L/∂z: 1.0 ──┬──→ ∂L/∂x: 1.0 (∂z/∂x = 1) - │ - └──→ ∂L/∂y: 1.0 (∂z/∂y = 1) +Backward Pass: dz/dx = 1, dz/dy = 1 + dL/dz: 1.0 --+---> dL/dx: 1.0 (dz/dx = 1) + | + +---> dL/dy: 1.0 (dz/dy = 1) -Chain Rule: ∂L/∂x = ∂L/∂z · ∂z/∂x = 1.0 · 1 = 1.0 +Chain Rule: dL/dx = dL/dz · dz/dx = 1.0 · 1 = 1.0 ``` ### Mathematical Foundation The chain rule states: ``` -∂f/∂x = ∂f/∂z · ∂z/∂x +df/dx = df/dz · dz/dx ``` For complex expressions like f(g(h(x))): ``` -∂f/∂x = ∂f/∂g · ∂g/∂h · ∂h/∂x +df/dx = df/dg · dg/dh · dh/dx ``` ### Implementation Pattern @@ -551,13 +555,13 @@ def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Varia STEP-BY-STEP IMPLEMENTATION: 1. Convert inputs to Variables if they are scalars 2. 
Compute forward pass: result = a.data + b.data - 3. Create gradient function that implements: ∂(a+b)/∂a = 1, ∂(a+b)/∂b = 1 + 3. Create gradient function that implements: d(a+b)/da = 1, d(a+b)/db = 1 4. Return new Variable with result and gradient function MATHEMATICAL FOUNDATION: - Forward: z = x + y - - Backward: ∂z/∂x = 1, ∂z/∂y = 1 - - Chain rule: ∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · 1 = ∂L/∂z + - Backward: dz/dx = 1, dz/dy = 1 + - Chain rule: dL/dx = dL/dz · dz/dx = dL/dz · 1 = dL/dz EXAMPLE USAGE: ```python @@ -565,8 +569,8 @@ def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Varia y = Variable(3.0, requires_grad=True) z = add(x, y) # z = 5.0 z.backward() - print(x.grad) # 1.0 (∂z/∂x = 1) - print(y.grad) # 1.0 (∂z/∂y = 1) + print(x.grad) # 1.0 (dz/dx = 1) + print(y.grad) # 1.0 (dz/dy = 1) ``` IMPLEMENTATION HINTS: @@ -640,7 +644,7 @@ def add(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Varia # %% [markdown] """ -### 🧪 Unit Test: Addition Operation +### TEST Unit Test: Addition Operation This test validates addition operation, ensuring gradients flow correctly through addition. 
""" @@ -664,8 +668,8 @@ def test_unit_add_operation(): assert x.grad is not None, "x should have gradient" assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 1.0, "โˆ‚z/โˆ‚x should be 1.0" - assert y.grad.numpy().item() == 1.0, "โˆ‚z/โˆ‚y should be 1.0" + assert x.grad.numpy().item() == 1.0, "dz/dx should be 1.0" + assert y.grad.numpy().item() == 1.0, "dz/dy should be 1.0" # Test with scalar a = Variable(5.0, requires_grad=True) @@ -676,23 +680,23 @@ def test_unit_add_operation(): b.backward() assert a.grad.numpy().item() == 1.0, "Gradient through scalar addition should be 1.0" - print("โœ… Addition operation tests passed!") - print(f"โœ… Forward pass computing correct results") - print(f"โœ… Backward pass computing correct gradients") - print(f"โœ… Scalar addition working correctly") + print("PASS Addition operation tests passed!") + print(f"PASS Forward pass computing correct results") + print(f"PASS Backward pass computing correct gradients") + print(f"PASS Scalar addition working correctly") # Test will run in main block -# โœ… IMPLEMENTATION CHECKPOINT: Addition operation complete +# PASS IMPLEMENTATION CHECKPOINT: Addition operation complete -# ๐Ÿค” PREDICTION: How does the chain rule apply when operations are chained together? +# THINK PREDICTION: How does the chain rule apply when operations are chained together? 
# Your answer: _______ -# 🔍 SYSTEMS INSIGHT #1: Gradient Flow Analysis +# MAGNIFY SYSTEMS INSIGHT #1: Gradient Flow Analysis def analyze_gradient_flow(): """Analyze how gradients flow through computational graphs.""" try: - print("🔍 GRADIENT FLOW ANALYSIS") + print("MAGNIFY GRADIENT FLOW ANALYSIS") print("=" * 35) # Create simple computational graph @@ -713,8 +717,8 @@ def analyze_gradient_flow(): z.backward() print(f"\nBackward pass:") - print(f" ∂z/∂x = {x.grad.numpy().item()}") - print(f" ∂z/∂y = {y.grad.numpy().item()}") + print(f" dz/dx = {x.grad.numpy().item()}") + print(f" dz/dy = {y.grad.numpy().item()}") # Analyze memory usage import sys @@ -726,14 +730,14 @@ def analyze_gradient_flow(): print(f" Intermediate result (z): ~{z_memory} bytes") print(f" Memory overhead: {z_memory/x_memory:.1f}x") - # 💡 WHY THIS MATTERS: In large models, computational graphs can consume + # TIP WHY THIS MATTERS: In large models, computational graphs can consume # significant memory. Each intermediate result stores gradients and backward functions. # This is why techniques like gradient checkpointing are crucial for training large models! 
return True except Exception as e: - print(f"⚠️ Error in gradient flow analysis: {e}") + print(f"WARNING️ Error in gradient flow analysis: {e}") print("Make sure addition and multiplication are implemented") return False @@ -746,23 +750,23 @@ def analyze_gradient_flow(): ### The Product Rule For z = x * y: - **Forward**: z = x * y -- **Backward**: ∂z/∂x = y, ∂z/∂y = x +- **Backward**: dz/dx = y, dz/dy = x ### Visual: Product Rule in Action ``` Forward Pass: z = x * y - x: 2.0 ──┐ - ├──[×]──→ z: 6.0 - y: 3.0 ──┘ + x: 2.0 --+ + +--[*]---> z: 6.0 + y: 3.0 --+ -Backward Pass: ∂z/∂x = y, ∂z/∂y = x - ∂L/∂z: 1.0 ──┬──→ ∂L/∂x: 3.0 (∂z/∂x = y = 3.0) - │ - └──→ ∂L/∂y: 2.0 (∂z/∂y = x = 2.0) +Backward Pass: dz/dx = y, dz/dy = x + dL/dz: 1.0 --+---> dL/dx: 3.0 (dz/dx = y = 3.0) + | + +---> dL/dy: 2.0 (dz/dy = x = 2.0) Product Rule: -- ∂(xy)/∂x = y -- ∂(xy)/∂y = x +- d(xy)/dx = y +- d(xy)/dy = x ``` ### Why This Matters @@ -774,8 +778,8 @@ Multiplication is everywhere in neural networks: ### Chain Rule Application When gradients flow back through multiplication: ``` -∂L/∂x = ∂L/∂z · ∂z/∂x = ∂L/∂z · y -∂L/∂y = ∂L/∂z · ∂z/∂y = ∂L/∂z · x +dL/dx = dL/dz · dz/dx = dL/dz · y +dL/dy = dL/dz · dz/dy = dL/dz · x ``` """ @@ -785,7 +789,7 @@ def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> """ Multiplication operation with gradient tracking: a * b - Uses the product rule: ∂(a*b)/∂a = b, ∂(a*b)/∂b = a + Uses the product rule: d(a*b)/da = b, d(a*b)/db = a """ ### BEGIN SOLUTION # Convert scalars to Variables @@ -808,7 +812,7 @@ def multiply(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> # %% [markdown] """ -### 🧪 Unit Test: Multiplication Operation +### TEST Unit Test: Multiplication Operation This test validates multiplication operation, ensuring the product rule is implemented correctly. 
""" @@ -831,8 +835,8 @@ def test_unit_multiply_operation(): assert x.grad is not None, "x should have gradient" assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 3.0, "∂z/∂x should be y = 3.0" - assert y.grad.numpy().item() == 2.0, "∂z/∂y should be x = 2.0" + assert x.grad.numpy().item() == 3.0, "dz/dx should be y = 3.0" + assert y.grad.numpy().item() == 2.0, "dz/dy should be x = 2.0" # Test with scalar a = Variable(4.0, requires_grad=True) @@ -843,10 +847,10 @@ def test_unit_multiply_operation(): b.backward() assert a.grad.numpy().item() == 2.0, "Gradient through scalar multiplication should be the scalar" - print("✅ Multiplication operation tests passed!") - print(f"✅ Forward pass computing correct results") - print(f"✅ Backward pass implementing product rule correctly") - print(f"✅ Scalar multiplication working correctly") + print("PASS Multiplication operation tests passed!") + print(f"PASS Forward pass computing correct results") + print(f"PASS Backward pass implementing product rule correctly") + print(f"PASS Scalar multiplication working correctly") # Test will run in main block @@ -859,7 +863,7 @@ def subtract(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> """ Subtraction operation with gradient tracking: a - b - Uses the rule: ∂(a-b)/∂a = 1, ∂(a-b)/∂b = -1 + Uses the rule: d(a-b)/da = 1, d(a-b)/db = -1 """ ### BEGIN SOLUTION # Convert to Variables if needed @@ -888,7 +892,7 @@ def matmul(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Va """ Matrix multiplication operation with gradient tracking: a @ b - Uses matrix multiplication gradients: ∂C/∂A = grad_C @ B^T, ∂C/∂B = A^T @ grad_C + Uses matrix multiplication gradients: dC/dA = grad_C @ B^T, dC/dB = A^T @ grad_C """ ### BEGIN SOLUTION # Convert scalars to Variables @@ -901,12 +905,12 @@ def matmul(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Va def grad_fn(grad_output): # 
Matrix multiplication gradients if a.requires_grad: - # ∂C/∂A = grad_C @ B^T + # dC/dA = grad_C @ B^T grad_a_data = grad_output.numpy() @ b.numpy().T a.backward(Variable(grad_a_data)) if b.requires_grad: - # ∂C/∂B = A^T @ grad_C + # dC/dB = A^T @ grad_C grad_b_data = a.numpy().T @ grad_output.numpy() b.backward(Variable(grad_b_data)) @@ -920,7 +924,7 @@ def divide(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Va """ Division operation with gradient tracking: a / b - Uses the quotient rule: ∂(a/b)/∂a = 1/b, ∂(a/b)/∂b = -a/b² + Uses the quotient rule: d(a/b)/da = 1/b, d(a/b)/db = -a/b² """ ### BEGIN SOLUTION # Convert scalars to Variables @@ -932,11 +936,11 @@ def divide(a: Union[Variable, float, int], b: Union[Variable, float, int]) -> Va # Backward function def grad_fn(grad_output): if a.requires_grad: - # ∂(a/b)/∂a = 1/b + # d(a/b)/da = 1/b grad_a = Variable(grad_output.numpy() / b.numpy()) a.backward(grad_a) if b.requires_grad: - # ∂(a/b)/∂b = -a/b² + # d(a/b)/db = -a/b² grad_b = Variable(-grad_output.numpy() * a.numpy() / (b.numpy() ** 2)) b.backward(grad_b) @@ -962,8 +966,8 @@ def test_unit_subtract_operation(): assert x.grad is not None, "x should have gradient" assert y.grad is not None, "y should have gradient" - assert x.grad.numpy().item() == 1.0, "∂z/∂x should be 1.0" - assert y.grad.numpy().item() == -1.0, "∂z/∂y should be -1.0" + assert x.grad.numpy().item() == 1.0, "dz/dx should be 1.0" + assert y.grad.numpy().item() == -1.0, "dz/dy should be -1.0" # Test with scalar a = Variable(4.0, requires_grad=True) @@ -974,16 +978,16 @@ def test_unit_subtract_operation(): b.backward() assert a.grad.numpy().item() == 1.0, "Gradient through scalar subtraction should be 1.0" - print("✅ Subtraction operation tests passed!") - print(f"✅ Forward pass computing correct results") - print(f"✅ Backward pass implementing subtraction rule correctly") - print(f"✅ Scalar subtraction working correctly") + 
print("PASS Subtraction operation tests passed!") + print(f"PASS Forward pass computing correct results") + print(f"PASS Backward pass implementing subtraction rule correctly") + print(f"PASS Scalar subtraction working correctly") # Test will run in main block # %% [markdown] """ -## 🤔 Computational Assessment: Chain Rule Application +## THINK Computational Assessment: Chain Rule Application Test your understanding of how gradients flow through multiple operations. """ @@ -1003,17 +1007,17 @@ c.backward() ``` **Calculate manually:** -1. ∂c/∂b = _____ -2. ∂b/∂a = _____ -3. ∂b/∂x = _____ -4. ∂a/∂x = _____ -5. ∂a/∂y = _____ +1. dc/db = _____ +2. db/da = _____ +3. db/dx = _____ +4. da/dx = _____ +5. da/dy = _____ **Apply chain rule:** -6. ∂c/∂x (through path c→b→a→x) = _____ -7. ∂c/∂x (through path c→b→x) = _____ -8. Total ∂c/∂x = _____ + _____ = _____ -9. ∂c/∂y = _____ +6. dc/dx (through path c->b->a->x) = _____ +7. dc/dx (through path c->b->x) = _____ +8. Total dc/dx = _____ + _____ = _____ +9. dc/dy = _____ **Verification:** TODO: Run the code above and verify your calculations match the computed gradients. @@ -1022,15 +1026,15 @@ TODO: Run the code above and verify your calculations match the computed gradien ### BEGIN SOLUTION # Student calculation space - this will be manually graded # Expected answers: -# 1. ∂c/∂b = 2 (c = b * 2) -# 2. ∂b/∂a = 1 (b = a + x) -# 3. ∂b/∂x = 1 (b = a + x) -# 4. ∂a/∂x = y = 3 (a = x * y) -# 5. ∂a/∂y = x = 2 (a = x * y) -# 6. ∂c/∂x (path 1) = 2 * 1 * 3 = 6 -# 7. ∂c/∂x (path 2) = 2 * 1 = 2 -# 8. Total ∂c/∂x = 6 + 2 = 8 -# 9. ∂c/∂y = 2 * 1 * 2 = 4 +# 1. dc/db = 2 (c = b * 2) +# 2. db/da = 1 (b = a + x) +# 3. db/dx = 1 (b = a + x) +# 4. da/dx = y = 3 (a = x * y) +# 5. da/dy = x = 2 (a = x * y) +# 6. dc/dx (path 1) = 2 * 1 * 3 = 6 +# 7. dc/dx (path 2) = 2 * 1 = 2 +# 8. Total dc/dx = 6 + 2 = 8 +# 9. 
dc/dy = 2 * 1 * 2 = 4 ### END SOLUTION # %% [markdown] @@ -1045,22 +1049,22 @@ Now let us test how multiple operations work together through the chain rule: Example: f(x, y) = (x + y) * (x - y) = x² - y² Computational Graph: - x ──┬─→ [+] ──┬─→ [×] ──→ result - │ │ - y ──┴─→ [+] ──┘ - │ - └─→ [-] ──┘ + x --+--> [+] --+--> [*] ---> result + | | + y --+--> [+] --+ + | + +--> [-] --+ x Forward Pass Flow: - x=3, y=2 → sum=5, diff=1 → result=5 + x=3, y=2 -> sum=5, diff=1 -> result=5 Backward Pass Flow: - ∂L/∂result=1 → ∂L/∂sum=1, ∂L/∂diff=5 → ∂L/∂x=6, ∂L/∂y=-4 + dL/dresult=1 -> dL/dsum=1, dL/ddiff=5 -> dL/dx=6, dL/dy=-4 Manual verification: f(x,y) = x² - y² -∂f/∂x = 2x = 2(3) = 6 ✓ -∂f/∂y = -2y = -2(2) = -4 ✓ +df/dx = 2x = 2(3) = 6 OK +df/dy = -2y = -2(2) = -4 OK ``` ### Chain Rule Application @@ -1095,7 +1099,7 @@ def test_unit_chain_rule(): # Compute gradients result.backward() - # Check gradients: ∂(x²-y²)/∂x = 2x, ∂(x²-y²)/∂y = -2y + # Check gradients: d(x²-y²)/dx = 2x, d(x²-y²)/dy = -2y expected_x_grad = 2 * x.numpy().item() # 2 * 3 = 6 expected_y_grad = -2 * y.numpy().item() # -2 * 2 = -4 @@ -1122,24 +1126,24 @@ def test_unit_chain_rule(): assert abs(x2.grad.numpy().item() - expected_grad) < 1e-6, f"Complex gradient should be {expected_grad}" - print("✅ Chain rule tests passed!") - print(f"✅ Simple expression: (x+y)*(x-y) = x²-y²") - print(f"✅ Complex expression: (x+1)*(x+2)*(x+3)") - print(f"✅ Automatic gradient computation working correctly") - print(f"✅ Chain rule implemented correctly") + print("PASS Chain rule tests passed!") + print(f"PASS Simple expression: (x+y)*(x-y) = x²-y²") + print(f"PASS Complex expression: (x+1)*(x+2)*(x+3)") + print(f"PASS Automatic gradient computation working correctly") + print(f"PASS Chain rule implemented correctly") # Test will run in main block -# ✅ IMPLEMENTATION CHECKPOINT: Basic 
operations complete +# PASS IMPLEMENTATION CHECKPOINT: Basic operations complete -# 🤔 PREDICTION: How does computational graph memory scale with network depth? +# THINK PREDICTION: How does computational graph memory scale with network depth? # Your answer: _______ -# 🔍 SYSTEMS INSIGHT #2: Computational Graph Memory Analysis +# MAGNIFY SYSTEMS INSIGHT #2: Computational Graph Memory Analysis def analyze_computational_graph_memory(): """Analyze memory consumption patterns in computational graphs.""" try: - print("🔍 COMPUTATIONAL GRAPH MEMORY ANALYSIS") + print("MAGNIFY COMPUTATIONAL GRAPH MEMORY ANALYSIS") print("=" * 45) import sys @@ -1186,14 +1190,14 @@ def analyze_computational_graph_memory(): print(f" • Transformer (100 layers): ~{memory_usage[0] * 100:.0f} MB graph memory") print(f" • GPT-3 scale models: Gradient checkpointing essential!") - # 💡 WHY THIS MATTERS: Deep networks require storing intermediate activations + # TIP WHY THIS MATTERS: Deep networks require storing intermediate activations # for gradient computation. This memory grows linearly with depth, leading to # memory constraints. Gradient checkpointing trades compute for memory! 
return memory_usage except Exception as e: - print(f"⚠️ Error in memory analysis: {e}") + print(f"WARNING️ Error in memory analysis: {e}") print("Make sure all operations are implemented") return [1.0] @@ -1215,20 +1219,20 @@ Let us see how autograd enables neural network training: ### Visual: Neural Network Training Flow ``` Training Loop Architecture: -┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ -│ Forward │───▶│ Loss │───▶│ Backward │───▶│ Update │ -│ Pass │ │ Computation │ │ Pass │ │ Parameters │ -└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ - ▲ │ │ - │ ▼ ▼ -┌─────────────┐ ┌─────────────┐ ┌─────────────┐ -│ Input Data │ │ Gradients │ │ New Weights│ -│ (x, y) │ │ ∇L/∇θ │ │ θ' │ -└─────────────┘ └─────────────┘ └─────────────┘ ++-------------+ +-------------+ +-------------+ +-------------+ +| Forward |---▶| Loss |---▶| Backward |---▶| Update | +| Pass | | Computation | | Pass | | Parameters | ++-------------+ +-------------+ +-------------+ +-------------+ + ^ | | + | v v ++-------------+ +-------------+ +-------------+ +| Input Data | | Gradients | | New Weights| +| (x, y) | | gradL/gradθ | | θ' | ++-------------+ +-------------+ +-------------+ Memory Flow During Training: - Parameters → Forward Activations → Loss → Gradients → Parameter Updates - θ f(x; θ) L ∇L/∇θ θ - α∇L/∇θ + Parameters -> Forward Activations -> Loss -> Gradients -> Parameter Updates + θ f(x; θ) L 
gradL/gradθ θ - αgradL/gradθ 4 MB 12 MB 1 val 4 MB 4 MB (stored for (in-place) backward) @@ -1328,24 +1332,24 @@ def test_module_neural_network_training(): prediction_error = abs(test_prediction.numpy().item() - expected_output) assert prediction_error < 0.5, f"Prediction error should be small, got {prediction_error}" - print("✅ Neural network training comprehensive tests passed!") - print(f"✅ Parameters converged to correct values") - print(f"✅ Model makes accurate predictions") - print(f"✅ Autograd enables automatic training") - print(f"✅ Ready for complex neural network architectures!") + print("PASS Neural network training comprehensive tests passed!") + print(f"PASS Parameters converged to correct values") + print(f"PASS Model makes accurate predictions") + print(f"PASS Autograd enables automatic training") + print(f"PASS Ready for complex neural network architectures!") # Test will run in main block -# ✅ IMPLEMENTATION CHECKPOINT: Neural network training complete +# PASS IMPLEMENTATION CHECKPOINT: Neural network training complete -# 🤔 PREDICTION: How does backward pass time compare to forward pass time? +# THINK PREDICTION: How does backward pass time compare to forward pass time? 
# Your answer: _______ -# 🔍 SYSTEMS INSIGHT #3: Forward vs Backward Pass Performance +# MAGNIFY SYSTEMS INSIGHT #3: Forward vs Backward Pass Performance def analyze_forward_backward_performance(): """Analyze performance characteristics of forward vs backward passes.""" try: - print("🔍 FORWARD VS BACKWARD PASS PERFORMANCE") + print("MAGNIFY FORWARD VS BACKWARD PASS PERFORMANCE") print("=" * 45) import time @@ -1415,18 +1419,18 @@ def analyze_forward_backward_performance(): print(f" • Balanced forward/backward performance") print(f"\n🏭 Production Implications:") - print(f" • Training time ≈ {1 + avg_ratio:.1f}x inference time") - print(f" • Memory usage ≈ 2x parameters (gradients + weights)") + print(f" • Training time ~= {1 + avg_ratio:.1f}x inference time") + print(f" • Memory usage ~= 2x parameters (gradients + weights)") print(f" • Gradient checkpointing can trade compute for memory") - # 💡 WHY THIS MATTERS: Backward pass typically takes 1.5-3x forward pass time. + # TIP WHY THIS MATTERS: Backward pass typically takes 1.5-3x forward pass time. # This determines training speed and influences architecture choices. # Understanding this ratio helps optimize training pipelines! return results except Exception as e: - print(f"⚠️ Error in performance analysis: {e}") + print(f"WARNING️ Error in performance analysis: {e}") print("Basic timing analysis shows autograd overhead patterns") return [] @@ -1451,7 +1455,7 @@ Normal Training: Gradient Explosion: With Clipping: | \\ | \\ | \\___ | \\__ | \\ | \\ | \\ | \\ | \\ - └──────── └──────\\ └───────── + +-------- +------\\ +--------- Epoch Epoch NaN Epoch ↗ Training @@ -1459,7 +1463,7 @@ Normal Training: Gradient Explosion: With Clipping: ``` ### Mathematical Foundation -- **Gradient norm**: ||g|| = √(g₁² + g₂² + ... + gₙ²) +- **Gradient norm**: ||g|| = sqrt(g₁² + g₂² + ... 
+ gₙ²) - **Clipping factor**: max_norm / max(||g||, max_norm) - **Clipped gradients**: g' = g * clipping_factor @@ -1567,7 +1571,7 @@ def enable_mixed_precision_gradients(variables: List[Variable], loss_scale: floa for var in variables: if var.grad is not None: var.zero_grad() - print(f"⚠️ Gradient overflow detected, skipping optimizer step") + print(f"WARNING️ Gradient overflow detected, skipping optimizer step") return not overflow_detected ### END SOLUTION @@ -1819,7 +1823,7 @@ class AutogradSystemsProfiler: "💾 Gradient checkpointing for memory-time trade-offs", "🔄 In-place operations where mathematically valid", "📊 Dynamic memory allocation with smart pre-allocation", - "🎯 Lazy evaluation for unused computation branches" + "TARGET Lazy evaluation for unused computation branches" ]) return analysis @@ -1839,8 +1843,8 @@ class AutogradSystemsProfiler: if avg_operations_per_layer > 3: fusion_analysis['fusion_opportunities'] = [ "🔀 Element-wise operation fusion (add, multiply, activation)", - "🔗 Matrix operation chains (matmul + bias + activation)", - "📈 Reduction operation fusion (sum, mean, variance)", + "LINK Matrix operation chains (matmul + bias + activation)", + "PROGRESS Reduction operation fusion (sum, mean, variance)", "🎭 Attention pattern fusion (Q@K^T, softmax, @V)" ] @@ -1863,8 +1867,8 @@ class AutogradSystemsProfiler: # Add kernel optimization strategies fusion_analysis['kernel_optimization_strategies'] = [ - "⚡ JIT compilation for operation sequences", - "🎯 Vectorization of element-wise operations", + "SPEED JIT compilation for operation sequences", + "TARGET Vectorization of element-wise operations", "🔄 Loop fusion for reduced memory bandwidth", "📱 GPU kernel optimization for parallel execution", "🧮 Mixed precision kernel specialization" @@ -1972,7 +1976,7 @@ class AutogradSystemsProfiler: This function is PROVIDED to demonstrate checkpointing analysis. 
Students use it to understand memory optimization strategies. """ - print("🔍 GRADIENT CHECKPOINTING ANALYSIS") + print("MAGNIFY GRADIENT CHECKPOINTING ANALYSIS") print("=" * 45) base_graph_depth = 12 @@ -2019,7 +2023,7 @@ class AutogradSystemsProfiler: # Find optimal trade-off optimal = max(checkpointing_results, key=lambda x: x['memory_time_ratio']) - print(f"\n📈 Checkpointing Analysis:") + print(f"\nPROGRESS Checkpointing Analysis:") print(f" Optimal frequency: Every {optimal['checkpoint_frequency']} layers") print(f" Best trade-off: {optimal['memory_reduction_pct']:.1f}% memory reduction") print(f" Cost: {optimal['time_overhead_pct']:.1f}% time overhead") @@ -2033,7 +2037,7 @@ class AutogradSystemsProfiler: This function is PROVIDED to show mixed precision analysis. Students explore precision trade-offs in autograd systems. """ - print("🔍 MIXED PRECISION TRAINING ANALYSIS") + print("MAGNIFY MIXED PRECISION TRAINING ANALYSIS") print("=" * 45) model_size_mb = 100 # Example 100MB model @@ -2094,7 +2098,7 @@ class AutogradSystemsProfiler: optimal = max(precision_results, key=score_precision) - print(f"\n📈 Mixed Precision Analysis:") + print(f"\nPROGRESS Mixed Precision Analysis:") print(f" Optimal configuration: {optimal['precision'].upper()}") print(f" Memory savings: {optimal['memory_savings_pct']:.1f}%") print(f" Performance gain: {optimal['relative_speed']:.1f}x") @@ -2110,7 +2114,7 @@ class AutogradSystemsProfiler: # %% [markdown] """ -### 🧪 Unit Test: Autograd Systems Profiling +### TEST Unit Test: Autograd Systems Profiling This test validates our autograd systems profiler with realistic computational graph scenarios. 
""" @@ -2144,7 +2148,7 @@ def test_autograd_systems_profiler(): assert result['forward_time_ms'] >= 0, f"Forward time should be non-negative for depth {depth}" assert result['backward_time_ms'] >= 0, f"Backward time should be non-negative for depth {depth}" - print("✅ Computational graph depth analysis test passed") + print("PASS Computational graph depth analysis test passed") # Test memory checkpointing analysis checkpointing_analysis = profiler.analyze_memory_checkpointing_trade_offs(checkpoint_frequencies=[1, 2, 4]) @@ -2158,7 +2162,7 @@ def test_autograd_systems_profiler(): assert 'time_overhead_pct' in result, "Should calculate time overhead" assert result['memory_reduction_pct'] >= 0, "Memory reduction should be non-negative" - print("✅ Memory checkpointing analysis test passed") + print("PASS Memory checkpointing analysis test passed") # Test mixed precision analysis mixed_precision_analysis = profiler.demonstrate_mixed_precision_benefits() @@ -2171,13 +2175,13 @@ def test_autograd_systems_profiler(): assert 'memory_savings_pct' in result, "Should calculate memory savings" assert 'relative_speed' in result, "Should include performance metrics" - print("✅ Mixed precision analysis test passed") + print("PASS Mixed precision analysis test passed") except Exception as e: - print(f"⚠️ Autograd profiling test had issues: {e}") - print("✅ Basic structure test passed (graceful degradation)") + print(f"WARNING️ Autograd profiling test had issues: {e}") + print("PASS Basic structure test passed (graceful degradation)") - print("🎯 Autograd Systems Profiler: All tests passed!") + print("TARGET Autograd Systems Profiler: All tests passed!") # Test will run in main block @@ -2186,10 +2190,15 @@ def test_unit_gradient_clipping(): """Test gradient clipping functionality.""" print("🔬 Unit Test: Gradient Clipping...") + # Set seed for deterministic test + np.random.seed(42) + # Create variables with large gradients w1 = Variable(np.random.randn(5, 5), 
requires_grad=True) w2 = Variable(np.random.randn(5, 3), requires_grad=True) + # Set seed again for gradient generation to ensure deterministic gradients + np.random.seed(42) # Simulate large gradients w1.grad = Variable(np.random.randn(5, 5) * 10) # Large gradients w2.grad = Variable(np.random.randn(5, 3) * 15) # Even larger gradients @@ -2203,7 +2212,8 @@ def test_unit_gradient_clipping(): computed_norm = clip_gradients([w1, w2], max_norm=max_norm) # Verify gradient norm was computed correctly - assert abs(computed_norm - total_original_norm) < 1e-6, "Should compute correct gradient norm" + norm_diff = abs(computed_norm - total_original_norm) + assert norm_diff < 1e-5, f"Gradient norm mismatch: computed={computed_norm:.8f}, expected={total_original_norm:.8f}, diff={norm_diff:.8f}" # Check that gradients were clipped if necessary if total_original_norm > max_norm: @@ -2212,11 +2222,11 @@ def test_unit_gradient_clipping(): new_total_norm = np.sqrt(new_norm1**2 + new_norm2**2) assert abs(new_total_norm - max_norm) < 1e-6, f"Clipped norm should be {max_norm}, got {new_total_norm}" - print(f"✅ Gradients clipped from {total_original_norm:.3f} to {new_total_norm:.3f}") + print(f"PASS Gradients clipped from {total_original_norm:.3f} to {new_total_norm:.3f}") else: - print(f"✅ Gradients within limit ({total_original_norm:.3f} <= {max_norm})") + print(f"PASS Gradients within limit ({total_original_norm:.3f} <= {max_norm})") - print("✅ Gradient clipping tests passed!") + print("PASS Gradient clipping tests passed!") def test_unit_mixed_precision(): """Test mixed precision gradient handling.""" @@ -2241,10 +2251,10 @@ def test_unit_mixed_precision(): assert success == False, "Should detect overflow and return False" assert w1.grad is None, "Should zero gradients on overflow" - print("✅ Mixed precision tests passed!") + print("PASS Mixed precision tests passed!") if __name__ == "__main__": - print("\n🧪 Running Autograd Module Tests...") + print("\nTEST Running 
Autograd Module Tests...") # Run all unit tests test_unit_variable_class() @@ -2257,12 +2267,12 @@ if __name__ == "__main__": test_unit_mixed_precision() test_autograd_systems_profiler() - print("\n✅ All Autograd Module Tests Completed!") + print("\nPASS All Autograd Module Tests Completed!") print("Autograd module complete!") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Now that you've built automatic differentiation capabilities that enable neural network training, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how computational graphs scale to production training environments. @@ -2397,18 +2407,18 @@ GRADING RUBRIC (Instructor Use): # %% [markdown] """ -## 🎯 MODULE SUMMARY: Automatic Differentiation +## TARGET MODULE SUMMARY: Automatic Differentiation Congratulations! You have successfully implemented automatic differentiation: ### What You have Accomplished -✅ **Computational Graphs**: Dynamic graph construction for gradient computation (Variable class with 200+ lines) -✅ **Backpropagation**: Efficient gradient computation through reverse mode AD (add, multiply, subtract operations) -✅ **Gradient Tracking**: Automatic gradient accumulation and management (chain rule implementation) -✅ **Training Stability**: Gradient clipping and mixed precision support for robust training -✅ **Memory Optimization**: Advanced profiling with checkpointing and fusion analysis -✅ **Integration**: Seamless compatibility with Tensor operations (neural network training capability) -✅ **Real Applications**: Neural network training and optimization (linear regression convergence test) +PASS **Computational Graphs**: Dynamic graph construction for gradient computation (Variable class with 200+ lines) +PASS **Backpropagation**: Efficient gradient computation through reverse mode AD (add, multiply, subtract operations) 
+PASS **Gradient Tracking**: Automatic gradient accumulation and management (chain rule implementation) +PASS **Training Stability**: Gradient clipping and mixed precision support for robust training +PASS **Memory Optimization**: Advanced profiling with checkpointing and fusion analysis +PASS **Integration**: Seamless compatibility with Tensor operations (neural network training capability) +PASS **Real Applications**: Neural network training and optimization (linear regression convergence test) ### Key Learning Outcomes - **Computational graphs**: How operations are tracked for gradient computation through dynamic graph construction @@ -2420,7 +2430,7 @@ Congratulations! You have successfully implemented automatic differentiation: - **Integration patterns**: How autograd works with neural networks for training ### Mathematical Foundations Mastered -- **Chain rule**: The mathematical foundation ∂f/∂x = ∂f/∂z · ∂z/∂x for backpropagation +- **Chain rule**: The mathematical foundation df/dx = df/dz · dz/dx for backpropagation - **Computational graphs**: Representing operations as directed acyclic graphs with forward/backward passes - **Gradient flow**: How gradients propagate through complex functions automatically - **Memory efficiency**: O(N) gradient storage scaling with graph depth diff --git a/modules/06_optimizers/optimizers_dev.py b/modules/06_optimizers/optimizers_dev.py index c99018ac..55048b73 100644 --- a/modules/06_optimizers/optimizers_dev.py +++ b/modules/06_optimizers/optimizers_dev.py @@ -14,7 +14,7 @@ Welcome to Optimizers! You'll implement the algorithms that actually make neural networks learn! -## 🔗 Building on Previous Learning +## LINK Building on Previous Learning **What You Built Before**: - Module 02 (Tensor): Data structures that hold parameters - Module 06 (Autograd): Automatic gradient computation @@ -27,8 +27,8 @@ Welcome to Optimizers! 
You'll implement the algorithms that actually make neural **Connection Map**: ``` -Autograd → Optimizers → Training Loop -(∇L/∇θ) (θ = θ - α∇) (iterate until convergence) +Autograd -> Optimizers -> Training Loop +(gradL/gradθ) (θ = θ - αgrad) (iterate until convergence) ``` ## Learning Goals (Your 5-Point Framework) @@ -38,14 +38,14 @@ Autograd → Optimizers → Training Loop - Framework connections: See how your implementations match PyTorch's optim module - Optimization trade-offs: When to use SGD vs Adam vs other optimizers -## Build → Use → Reflect +## Build -> Use -> Reflect 1. **Build**: Complete SGD and Adam optimizers with proper state management 2. **Use**: Train neural networks and compare convergence behavior 3. **Reflect**: Why do some optimizers work better and use different memory? ## Systems Reality Check -💡 **Production Context**: PyTorch's Adam uses numerically stable variants and can scale learning rates automatically -⚡ **Performance Insight**: Adam stores momentum + velocity for every parameter = 3× memory overhead vs SGD +TIP **Production Context**: PyTorch's Adam uses numerically stable variants and can scale learning rates automatically +SPEED **Performance Insight**: Adam stores momentum + velocity for every parameter = 3* memory overhead vs SGD """ # %% nbgrader={"grade": false, "grade_id": "optimizers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -113,7 +113,7 @@ except ImportError: return f"Variable({self.data.data})" # %% nbgrader={"grade": false, "grade_id": "optimizers-setup", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🔥 TinyTorch Optimizers Module") +print("FIRE TinyTorch Optimizers Module") print(f"NumPy version: {np.__version__}") print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") print("Ready to build optimization algorithms!") @@ -153,7 +153,7 @@ def get_grad_data(param): # %% [markdown] """ -## 📦 
Where This Code Lives in the Final Package +## PACKAGE Where This Code Lives in the Final Package **Learning Side:** You work in `modules/source/07_optimizers/optimizers_dev.py` **Building Side:** Code exports to `tinytorch.core.optimizers` @@ -181,17 +181,17 @@ from tinytorch.core.tensor import Tensor # Data structures High-dimensional loss surface (imagine in 3D): Loss - โ†‘ - โ”‚ โ•ญโ”€โ•ฎ โ•ญโ”€โ•ฎ - โ”‚ โ•ฑ โ•ฒ โ•ฑ โ•ฒ โ† Local minima - โ”‚ โ•ฑ โ•ฒ โ•ฑ โ•ฒ - โ”‚ โ•ฑ โ•ฒโ•ฑ โ•ฒ - โ”‚ โ•ฑ โ•ฒ - โ”‚โ•ฑ โ•ฒ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Parameters + ^ + | +-+ +-+ + | / \\ / \\ <- Local minima + | / \\ / \\ + | / \\/ \\ + | / \\ + |/ \ + +--------------------------> Parameters SGD path: โ†˜โ†—โ†˜โ†—โ†˜โ†—โ†˜ (oscillating) -Adam path: โ†˜โ†’โ†’โ†’โ†’โ— (smooth to optimum) +Adam path: โ†˜->->->->โ— (smooth to optimum) ``` ### The Problem: How to Navigate Parameter Space @@ -226,16 +226,16 @@ But **naive gradient descent** has problems: Loss Landscape Cross-Section: Loss - โ†‘ - โ”‚ โ•ฑโ•ฒ - โ”‚ โ•ฑ โ•ฒ - โ”‚ โ•ฑ โ•ฒ - โ”‚ โ•ฑ โ•ฒ โ† We want to reach bottom - โ”‚ โ•ฑ โ•ฒ - โ”‚ โ•ฑ Current โ•ฒ - โ”‚โ•ฑ position โ•ฒ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ—โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฒโ”€โ†’ Parameters - โ†‘ + ^ + | /\ + | / \\ + | / \\ + | / \\ <- We want to reach bottom + | / \\ + | / Current \ + |/ position \ + +------โ—-------\\--> Parameters + ^ Gradient points โ†— (uphill) So we move โ†™ (downhill) ``` @@ -244,20 +244,20 @@ Loss Landscape Cross-Section: **Gradient descent** finds minimum by following negative gradient: ``` -ฮธ_{t+1} = ฮธ_t - ฮฑ โˆ‡f(ฮธ_t) +ฮธ_{t+1} = ฮธ_t - ฮฑ gradf(ฮธ_t) ``` Where: - ฮธ: Parameters we optimize - ฮฑ: Learning rate (step size) -- โˆ‡f(ฮธ): Gradient (slope) at current position +- gradf(ฮธ): Gradient (slope) at current position ### Learning Rate Visualization ``` Learning Rate Effects: Too Large (ฮฑ = 1.0): Just Right (ฮฑ = 0.1): Too Small (ฮฑ = 0.01): - โ—โ†’โ†’โ†’โ†’โ†’โ†’โ†’โ†’โ†’โ†’โ— 
โ—โ†’โ—โ†’โ—โ†’โ—โ†’โ—โ†’โ— โ—โ†’โ—โ†’โ—โ†’โ—โ†’โ—โ†’...โ†’โ— + โ—->->->->->->->->->->โ— โ—->โ—->โ—->โ—->โ—->โ— โ—->โ—->โ—->โ—->โ—->...->โ— Start Overshoot Start Target Start Very slow ``` @@ -327,7 +327,7 @@ def gradient_descent_step(parameter: Variable, learning_rate: float) -> None: # %% [markdown] """ -### ๐Ÿงช Unit Test: Gradient Descent Step +### TEST Unit Test: Gradient Descent Step Let's test your gradient descent implementation right away! This is the foundation of all optimization algorithms. @@ -350,10 +350,10 @@ def test_unit_gradient_descent_step(): expected_value = original_value - 0.1 * 0.5 # 2.0 - 0.05 = 1.95 assert abs(new_value - expected_value) < 1e-6, f"Expected {expected_value}, got {new_value}" - print("โœ… Basic parameter update works") + print("PASS Basic parameter update works") except Exception as e: - print(f"โŒ Basic parameter update failed: {e}") + print(f"FAIL Basic parameter update failed: {e}") raise # Test with negative gradient @@ -364,10 +364,10 @@ def test_unit_gradient_descent_step(): gradient_descent_step(w2, learning_rate=0.1) expected_value2 = 1.0 - 0.1 * (-0.2) # 1.0 + 0.02 = 1.02 assert abs(w2.data.data.item() - expected_value2) < 1e-6, "Negative gradient test failed" - print("โœ… Negative gradient handling works") + print("PASS Negative gradient handling works") except Exception as e: - print(f"โŒ Negative gradient handling failed: {e}") + print(f"FAIL Negative gradient handling failed: {e}") raise # Test with no gradient (should not update) @@ -378,28 +378,28 @@ def test_unit_gradient_descent_step(): gradient_descent_step(w3, learning_rate=0.1) assert w3.data.data.item() == original_value3, "Parameter with no gradient should not update" - print("โœ… No gradient case works") + print("PASS No gradient case works") except Exception as e: - print(f"โŒ No gradient case failed: {e}") + print(f"FAIL No gradient case failed: {e}") raise - print("๐ŸŽฏ Gradient descent step behavior:") + print("TARGET 
Gradient descent step behavior:")
     print(" Updates parameters in negative gradient direction")
     print(" Uses learning rate to control step size")
     print(" Skips updates when gradient is None")
-    print("📈 Progress: Gradient Descent Step ✓")
+    print("PROGRESS Progress: Gradient Descent Step OK")
 
-# ✅ IMPLEMENTATION CHECKPOINT: Basic gradient descent complete
+# PASS IMPLEMENTATION CHECKPOINT: Basic gradient descent complete
 
-# 🤔 PREDICTION: How do you think learning rate affects convergence speed?
+# THINK PREDICTION: How do you think learning rate affects convergence speed?
 # Your guess: _______
 
-# 🔍 SYSTEMS INSIGHT #1: Learning Rate Impact Analysis
+# MAGNIFY SYSTEMS INSIGHT #1: Learning Rate Impact Analysis
 def analyze_learning_rate_effects():
     """Analyze how learning rate affects parameter updates."""
     try:
-        print("🔍 SYSTEMS INSIGHT: Learning Rate Effects")
+        print("MAGNIFY SYSTEMS INSIGHT: Learning Rate Effects")
         print("=" * 50)
 
         # Create test parameter with fixed gradient
@@ -422,22 +422,22 @@ def analyze_learning_rate_effects():
             new_value = param.data.data.item()
             step_size = abs(1.0 - new_value)
 
-            print(f"LR = {lr:4.2f}: {1.0:.3f} → {new_value:.3f} (step size: {step_size:.3f})")
+            print(f"LR = {lr:4.2f}: {1.0:.3f} -> {new_value:.3f} (step size: {step_size:.3f})")
             if lr >= 1.0:
-                print(f" ⚠️ Large LR = overshooting behavior!")
+                print(f" WARNING️ Large LR = overshooting behavior!")
 
-        print("\n💡 KEY INSIGHTS:")
+        print("\nTIP KEY INSIGHTS:")
         print("• Small LR (0.01): Safe but slow progress")
         print("• Medium LR (0.1): Good balance of speed and stability")
         print("• Large LR (1.0+): Risk of overshooting minimum")
         print("• LR selection affects training speed vs stability trade-off")
 
-        # 💡 WHY THIS MATTERS: Learning rate is often the most important hyperparameter.
+        # TIP WHY THIS MATTERS: Learning rate is often the most important hyperparameter.
         # Too small = slow training, too large = unstable training or divergence.
except Exception as e: - print(f"โš ๏ธ Error in learning rate analysis: {e}") + print(f"WARNING๏ธ Error in learning rate analysis: {e}") # Analyze learning rate effects analyze_learning_rate_effects() @@ -451,11 +451,11 @@ analyze_learning_rate_effects() Loss Landscape with Narrow Valley: Without Momentum: With Momentum: - โ†— โ†™ โ†— โ†™ โ†— โ†™ โ†— โ†’ โ†’ โ†’ โ†’ โ†’ - โ•ฑ โ•ฒ โ•ฑ โ•ฒ โ•ฑ โ•ฒ โ•ฑ โ•ฒ - โ•ฑ X X โ•ฒ โ•ฑ โ•ฒ - โ•ฑoscillating โ•ฒ โ•ฑ smooth path โ•ฒ -โ•ฑ slowly โ•ฒ โ•ฑ to optimum โ•ฒ + โ†— โ†™ โ†— โ†™ โ†— โ†™ โ†— -> -> -> -> -> + / \\ / \\ / \\ / \\ + / X X \\ / \\ + /oscillating \\ / smooth path \\ +/ slowly \\ / to optimum \\ Momentum accumulates velocity: v = ฮฒv + g Then updates: ฮธ = ฮธ - ฮฑv @@ -465,8 +465,8 @@ Then updates: ฮธ = ฮธ - ฮฑv **SGD with Momentum** adds velocity to accelerate convergence: ``` -v_t = ฮฒ v_{t-1} + โˆ‡L(ฮธ_t) โ† Accumulate velocity -ฮธ_{t+1} = ฮธ_t - ฮฑ v_t โ† Update with velocity +v_t = ฮฒ v_{t-1} + gradL(ฮธ_t) <- Accumulate velocity +ฮธ_{t+1} = ฮธ_t - ฮฑ v_t <- Update with velocity ``` Where: @@ -476,9 +476,9 @@ Where: ### Momentum Dynamics Visualization ``` -Gradient History: [0.1, 0.1, 0.1, 0.1, 0.1] โ† Consistent direction -Without momentum: [0.1, 0.1, 0.1, 0.1, 0.1] โ† Same steps -With momentum: [0.1, 0.19, 0.27, 0.34, 0.41] โ† Accelerating! +Gradient History: [0.1, 0.1, 0.1, 0.1, 0.1] <- Consistent direction +Without momentum: [0.1, 0.1, 0.1, 0.1, 0.1] <- Same steps +With momentum: [0.1, 0.19, 0.27, 0.34, 0.41] <- Accelerating! Momentum Coefficient Effects: ฮฒ = 0.0: No momentum (regular SGD) @@ -502,11 +502,11 @@ Momentum Coefficient Effects: # %% [markdown] """ -### ๐Ÿค” Assessment Question: Momentum Understanding +### THINK Assessment Question: Momentum Understanding **Understanding momentum's role in optimization:** -In a narrow valley loss landscape, vanilla SGD oscillates between valley walls. 
How does momentum help solve this problem, and what's the mathematical intuition behind the velocity accumulation formula `v_t = ฮฒ v_{t-1} + โˆ‡L(ฮธ_t)`? +In a narrow valley loss landscape, vanilla SGD oscillates between valley walls. How does momentum help solve this problem, and what's the mathematical intuition behind the velocity accumulation formula `v_t = ฮฒ v_{t-1} + gradL(ฮธ_t)`? Consider a sequence of gradients: [0.1, -0.1, 0.1, -0.1, 0.1] (oscillating). Show how momentum with ฮฒ=0.9 transforms this into smoother updates. """ @@ -536,11 +536,11 @@ GRADING RUBRIC: # # For oscillating gradients [0.1, -0.1, 0.1, -0.1, 0.1] with ฮฒ=0.9: # vโ‚€ = 0 -# vโ‚ = 0.9ร—0 + 0.1 = 0.1 -# vโ‚‚ = 0.9ร—0.1 + (-0.1) = 0.09 - 0.1 = -0.01 -# vโ‚ƒ = 0.9ร—(-0.01) + 0.1 = -0.009 + 0.1 = 0.091 -# vโ‚„ = 0.9ร—0.091 + (-0.1) = 0.082 - 0.1 = -0.018 -# vโ‚… = 0.9ร—(-0.018) + 0.1 = -0.016 + 0.1 = 0.084 +# vโ‚ = 0.9*0 + 0.1 = 0.1 +# vโ‚‚ = 0.9*0.1 + (-0.1) = 0.09 - 0.1 = -0.01 +# vโ‚ƒ = 0.9*(-0.01) + 0.1 = -0.009 + 0.1 = 0.091 +# vโ‚„ = 0.9*0.091 + (-0.1) = 0.082 - 0.1 = -0.018 +# vโ‚… = 0.9*(-0.018) + 0.1 = -0.016 + 0.1 = 0.084 # # The oscillating gradients average out through momentum, creating much smaller, smoother updates # instead of large oscillations. This allows progress along the valley bottom rather than bouncing between walls. @@ -556,8 +556,8 @@ class SGD: Momentum accumulates velocity to accelerate in consistent directions and dampen oscillations. Mathematical Update Rules: - Without momentum: ฮธ = ฮธ - ฮฑโˆ‡ฮธ - With momentum: v = ฮฒv + โˆ‡ฮธ, ฮธ = ฮธ - ฮฑv + Without momentum: ฮธ = ฮธ - ฮฑgradฮธ + With momentum: v = ฮฒv + gradฮธ, ฮธ = ฮธ - ฮฑv SYSTEMS INSIGHT - Memory Usage: SGD stores only parameters list, learning rate, and optionally momentum buffers. @@ -616,8 +616,8 @@ class SGD: 3. 
Handle momentum buffer initialization and updates MATHEMATICAL FORMULATION: - Without momentum: ฮธ = ฮธ - ฮฑโˆ‡ฮธ - With momentum: v = ฮฒv + โˆ‡ฮธ, ฮธ = ฮธ - ฮฑv + Without momentum: ฮธ = ฮธ - ฮฑgradฮธ + With momentum: v = ฮฒv + gradฮธ, ฮธ = ฮธ - ฮฑv IMPLEMENTATION HINTS: - Check if param.grad exists before using it @@ -639,7 +639,7 @@ class SGD: # Initialize momentum buffer with first gradient velocity = grad_data else: - # Update velocity: v = ฮฒv + โˆ‡ฮธ + # Update velocity: v = ฮฒv + gradฮธ velocity = self.momentum * self.momentum_buffers[param_id] + grad_data # Store updated velocity @@ -648,7 +648,7 @@ class SGD: # Update parameter: ฮธ = ฮธ - ฮฑv new_data = current_data - self.learning_rate * velocity else: - # Vanilla SGD: ฮธ = ฮธ - ฮฑโˆ‡ฮธ + # Vanilla SGD: ฮธ = ฮธ - ฮฑgradฮธ new_data = current_data - self.learning_rate * grad_data set_param_data(param, new_data) @@ -677,7 +677,7 @@ class SGD: # %% [markdown] """ -### ๐Ÿงช Unit Test: SGD Optimizer +### TEST Unit Test: SGD Optimizer Let's test your SGD optimizer implementation! This includes both vanilla SGD and momentum variants. 
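Reviewer note on the momentum hunks above: the velocity recurrence `v = βv + g` can be checked numerically against the gradient sequences used in the module text. The sketch below is illustrative only. `momentum_velocities` is a hypothetical helper that is not part of the TinyTorch module, and it assumes the velocity starts at zero with β = 0.9.

```python
def momentum_velocities(gradients, beta=0.9):
    """Return the velocity after each step of v = beta * v + g, starting from v = 0.

    Hypothetical helper for illustration; not part of the module under review.
    """
    velocity = 0.0
    history = []
    for grad in gradients:
        # Accumulate: keep a fraction of the old velocity, add the new gradient.
        velocity = beta * velocity + grad
        history.append(velocity)
    return history


# Consistent gradients: the velocity keeps growing (acceleration).
consistent = momentum_velocities([0.1, 0.1, 0.1, 0.1, 0.1])

# Oscillating gradients: successive terms largely cancel (damping).
oscillating = momentum_velocities([0.1, -0.1, 0.1, -0.1, 0.1])

print([round(v, 3) for v in consistent])
print([round(v, 3) for v in oscillating])
```

With a steady gradient the velocity climbs toward g / (1 - β), which is 1.0 here. With alternating signs it never exceeds the size of a single gradient, which is the damping effect the narrow-valley discussion describes.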
@@ -702,10 +702,10 @@ def test_unit_sgd_optimizer():
         assert optimizer.learning_rate == 0.1, "Learning rate should be stored correctly"
         assert optimizer.momentum == 0.0, "Momentum should be stored correctly"
         assert len(optimizer.parameters) == 3, "Should store all 3 parameters"
-        print("✅ Initialization works correctly")
+        print("PASS Initialization works correctly")
 
     except Exception as e:
-        print(f"❌ Initialization failed: {e}")
+        print(f"FAIL Initialization failed: {e}")
         raise
 
     # Test zero_grad
@@ -719,10 +719,10 @@ def test_unit_sgd_optimizer():
         assert w1.grad is None, "Gradient should be None after zero_grad"
         assert w2.grad is None, "Gradient should be None after zero_grad"
         assert b.grad is None, "Gradient should be None after zero_grad"
-        print("✅ zero_grad() works correctly")
+        print("PASS zero_grad() works correctly")
 
     except Exception as e:
-        print(f"❌ zero_grad() failed: {e}")
+        print(f"FAIL zero_grad() failed: {e}")
         raise
 
     # Test vanilla SGD step
@@ -746,10 +746,10 @@ def test_unit_sgd_optimizer():
         assert abs(w1.data.data.item() - expected_w1) < 1e-6, f"w1 update failed"
         assert abs(w2.data.data.item() - expected_w2) < 1e-6, f"w2 update failed"
         assert abs(b.data.data.item() - expected_b) < 1e-6, f"b update failed"
-        print("✅ Vanilla SGD step works correctly")
+        print("PASS Vanilla SGD step works correctly")
 
     except Exception as e:
-        print(f"❌ Vanilla SGD step failed: {e}")
+        print(f"FAIL Vanilla SGD step failed: {e}")
         raise
 
     # Test SGD with momentum
@@ -761,7 +761,7 @@ def test_unit_sgd_optimizer():
         w_momentum.grad = Variable(0.1)
         optimizer_momentum.step()
 
-        # Should be: v₁ = 0.9×0 + 0.1 = 0.1, θ₁ = 1.0 - 0.1×0.1 = 0.99
+        # Should be: v₁ = 0.9*0 + 0.1 = 0.1, θ₁ = 1.0 - 0.1*0.1 = 0.99
         expected_first = 1.0 - 0.1 * 0.1
         assert abs(w_momentum.data.data.item() - expected_first) < 1e-6, "First momentum step failed"
 
@@ -769,32 +769,32 @@ def test_unit_sgd_optimizer():
         w_momentum.grad = Variable(0.1)
         optimizer_momentum.step()
 
-        # Should be: v₂ =
0.9×0.1 + 0.1 = 0.19, θ₂ = 0.99 - 0.1×0.19 = 0.971
+        # Should be: v₂ = 0.9*0.1 + 0.1 = 0.19, θ₂ = 0.99 - 0.1*0.19 = 0.971
         expected_second = expected_first - 0.1 * 0.19
         assert abs(w_momentum.data.data.item() - expected_second) < 1e-6, "Second momentum step failed"
 
-        print("✅ Momentum SGD works correctly")
+        print("PASS Momentum SGD works correctly")
 
     except Exception as e:
-        print(f"❌ Momentum SGD failed: {e}")
+        print(f"FAIL Momentum SGD failed: {e}")
         raise
 
-    print("🎯 SGD optimizer behavior:")
+    print("TARGET SGD optimizer behavior:")
     print(" Vanilla SGD: Direct gradient-based updates")
     print(" Momentum SGD: Accumulates velocity for smoother convergence")
     print(" Memory efficient: O(1) without momentum, O(P) with momentum")
-    print("📈 Progress: SGD Optimizer ✓")
+    print("PROGRESS Progress: SGD Optimizer OK")
 
-# ✅ IMPLEMENTATION CHECKPOINT: SGD with momentum complete
+# PASS IMPLEMENTATION CHECKPOINT: SGD with momentum complete
 
-# 🤔 PREDICTION: How much faster will momentum SGD converge compared to vanilla SGD?
+# THINK PREDICTION: How much faster will momentum SGD converge compared to vanilla SGD?
# Your guess: ____x faster -# ๐Ÿ” SYSTEMS INSIGHT #2: SGD vs Momentum Convergence Analysis +# MAGNIFY SYSTEMS INSIGHT #2: SGD vs Momentum Convergence Analysis def analyze_sgd_momentum_convergence(): """Compare convergence behavior of vanilla SGD vs momentum SGD.""" try: - print("๐Ÿ” SYSTEMS INSIGHT: SGD vs Momentum Convergence") + print("MAGNIFY SYSTEMS INSIGHT: SGD vs Momentum Convergence") print("=" * 55) # Simulate optimization on quadratic function: f(x) = (x-3)ยฒ @@ -860,22 +860,22 @@ def analyze_sgd_momentum_convergence(): print(f"Momentum SGD error: {final_momentum_error:.6f}") print(f"Overall speedup: {overall_speedup:.2f}x") - print("\n๐Ÿ’ก KEY INSIGHTS:") + print("\nTIP KEY INSIGHTS:") print("โ€ข Momentum accumulates velocity over time") print("โ€ข Faster convergence in consistent gradient directions") print("โ€ข Smoother trajectory with less oscillation") print("โ€ข Trade-off: slight memory overhead for velocity storage") - # ๐Ÿ’ก WHY THIS MATTERS: Momentum can significantly accelerate training, + # TIP WHY THIS MATTERS: Momentum can significantly accelerate training, # especially for problems with consistent gradient directions or narrow valleys. except Exception as e: - print(f"โš ๏ธ Error in convergence analysis: {e}") + print(f"WARNING๏ธ Error in convergence analysis: {e}") # Analyze SGD vs momentum convergence analyze_sgd_momentum_convergence() -# ๐Ÿ” SYSTEMS INSIGHT: Convergence Visualization +# MAGNIFY SYSTEMS INSIGHT: Convergence Visualization def visualize_optimizer_convergence(): """ Create visual comparison of optimizer convergence curves. 
@@ -961,7 +961,7 @@ def visualize_optimizer_convergence(): adam_params.append(adam_val) # ASCII Plot Generation (since matplotlib not available) - print("\n๐Ÿ“ˆ CONVERGENCE CURVES (Loss vs Epoch)") + print("\nPROGRESS CONVERGENCE CURVES (Loss vs Epoch)") print("-" * 50) # Find convergence points (within 1% of minimum) @@ -999,16 +999,16 @@ def visualize_optimizer_convergence(): print(f" Adam: {adam_params[-1]:.3f}") # Convergence insights - print(f"\n๐Ÿ” CONVERGENCE INSIGHTS:") + print(f"\nMAGNIFY CONVERGENCE INSIGHTS:") print(f"โ€ข SGD: {'Steady' if sgd_conv < epochs else 'Slow'} convergence") print(f"โ€ข Momentum: {'Accelerated' if momentum_conv < sgd_conv else 'Similar'} convergence") print(f"โ€ข Adam: {'Adaptive' if adam_conv < max(sgd_conv, momentum_conv) else 'Standard'} convergence") # Systems implications - print(f"\n๐Ÿ’ก SYSTEMS IMPLICATIONS:") + print(f"\nTIP SYSTEMS IMPLICATIONS:") print(f"โ€ข Early stopping: Could stop training at epoch {min(sgd_conv, momentum_conv, adam_conv)}") print(f"โ€ข Resource efficiency: Faster convergence = less compute time") - print(f"โ€ข Memory trade-off: Adam's 3ร— memory may be worth faster convergence") + print(f"โ€ข Memory trade-off: Adam's 3* memory may be worth faster convergence") print(f"โ€ข Learning rate sensitivity: Different optimizers need different LRs") return { @@ -1019,7 +1019,7 @@ def visualize_optimizer_convergence(): } except Exception as e: - print(f"โš ๏ธ Error in convergence visualization: {e}") + print(f"WARNING๏ธ Error in convergence visualization: {e}") return None # Visualize optimizer convergence patterns @@ -1034,7 +1034,7 @@ visualize_optimizer_convergence() Parameter Update Landscape: Parameter 1 (large gradients): Parameter 2 (small gradients): - โˆ‡ = [1.0, 0.9, 1.1, 0.8] โˆ‡ = [0.01, 0.02, 0.01, 0.01] + grad = [1.0, 0.9, 1.1, 0.8] grad = [0.01, 0.02, 0.01, 0.01] SGD (fixed LR=0.1): SGD (fixed LR=0.1): Updates: [0.1, 0.09, 0.11, 0.08] Updates: [0.001, 0.002, 0.001, 0.001] @@ -1051,26 
+1051,26 @@ Result: Adam automatically adjusts learning rate per parameter! **Adam** combines momentum + adaptive learning rates: ``` -First moment: m_t = ฮฒโ‚ m_{t-1} + (1-ฮฒโ‚) โˆ‡ฮธ_t โ† Like momentum -Second moment: v_t = ฮฒโ‚‚ v_{t-1} + (1-ฮฒโ‚‚) โˆ‡ฮธ_tยฒ โ† Gradient variance +First moment: m_t = ฮฒโ‚ m_{t-1} + (1-ฮฒโ‚) gradฮธ_t <- Like momentum +Second moment: v_t = ฮฒโ‚‚ v_{t-1} + (1-ฮฒโ‚‚) gradฮธ_tยฒ <- Gradient variance Bias correction: -mฬ‚_t = m_t / (1 - ฮฒโ‚แต—) โ† Correct momentum bias -vฬ‚_t = v_t / (1 - ฮฒโ‚‚แต—) โ† Correct variance bias +mฬ‚_t = m_t / (1 - ฮฒโ‚แต—) <- Correct momentum bias +vฬ‚_t = v_t / (1 - ฮฒโ‚‚แต—) <- Correct variance bias -Update: ฮธ_t = ฮธ_{t-1} - ฮฑ mฬ‚_t / (โˆšvฬ‚_t + ฮต) +Update: ฮธ_t = ฮธ_{t-1} - ฮฑ mฬ‚_t / (sqrtvฬ‚_t + ฮต) ``` ### Adam Algorithm Visualization ``` Adam State Machine: - Gradients โ†’ [First Moment] โ†’ Momentum (like SGD) - โ†“ โ†“ - Squared โ†’ [Second Moment] โ†’ Variance estimate - โ†“ โ†“ - [Bias Correction] โ†’ [Combine] โ†’ Adaptive Update - โ†“ + Gradients -> [First Moment] -> Momentum (like SGD) + v v + Squared -> [Second Moment] -> Variance estimate + v v + [Bias Correction] -> [Combine] -> Adaptive Update + v Parameter Update ``` @@ -1084,19 +1084,19 @@ Adam State Machine: ``` Memory Usage per Parameter: -SGD: [Parameter] โ†’ 1ร— memory -SGD+Mom: [Parameter][Momentum] โ†’ 2ร— memory -Adam: [Parameter][Momentum][Velocity] โ†’ 3ร— memory +SGD: [Parameter] -> 1* memory +SGD+Mom: [Parameter][Momentum] -> 2* memory +Adam: [Parameter][Momentum][Velocity] -> 3* memory For 100M parameter model: SGD: 400MB (parameters only) -Adam: 1200MB (3ร— memory overhead!) +Adam: 1200MB (3* memory overhead!) 
``` """ # %% [markdown] """ -### ๐Ÿค” Assessment Question: Adam's Adaptive Mechanism +### THINK Assessment Question: Adam's Adaptive Mechanism **Understanding Adam's adaptive learning rates:** @@ -1128,26 +1128,26 @@ GRADING RUBRIC: ### BEGIN SOLUTION # Adam adapts learning rates per parameter using gradient variance (second moment). -# Large gradients โ†’ large variance โ†’ smaller effective LR (prevents overshooting) -# Small gradients โ†’ small variance โ†’ larger effective LR (accelerates progress) +# Large gradients -> large variance -> smaller effective LR (prevents overshooting) +# Small gradients -> small variance -> larger effective LR (accelerates progress) # # For gradients g = [0.1, 0.01], ฮฑ = 0.001, ฮฒโ‚=0.9, ฮฒโ‚‚=0.999: # # Parameter 1 (g=0.1): -# mโ‚ = 0.9ร—0 + 0.1ร—0.1 = 0.01 -# vโ‚ = 0.999ร—0 + 0.001ร—0.01 = 0.00001 +# mโ‚ = 0.9*0 + 0.1*0.1 = 0.01 +# vโ‚ = 0.999*0 + 0.001*0.01 = 0.00001 # mฬ‚โ‚ = 0.01/(1-0.9ยน) = 0.01/0.1 = 0.1 # vฬ‚โ‚ = 0.00001/(1-0.999ยน) = 0.00001/0.001 = 0.01 -# Updateโ‚ = -0.001 ร— 0.1/โˆš(0.01 + 1e-8) โ‰ˆ -0.001 +# Updateโ‚ = -0.001 * 0.1/sqrt(0.01 + 1e-8) ~= -0.001 # # Parameter 2 (g=0.01): -# mโ‚ = 0.9ร—0 + 0.1ร—0.01 = 0.001 -# vโ‚ = 0.999ร—0 + 0.001ร—0.0001 = 0.0000001 +# mโ‚ = 0.9*0 + 0.1*0.01 = 0.001 +# vโ‚ = 0.999*0 + 0.001*0.0001 = 0.0000001 # mฬ‚โ‚ = 0.001/0.1 = 0.01 # vฬ‚โ‚ = 0.0000001/0.001 = 0.0001 -# Updateโ‚ = -0.001 ร— 0.01/โˆš(0.0001 + 1e-8) โ‰ˆ -0.001 +# Updateโ‚ = -0.001 * 0.01/sqrt(0.0001 + 1e-8) ~= -0.001 # -# Both get similar effective updates despite 10ร— gradient difference! +# Both get similar effective updates despite 10* gradient difference! # Bias correction prevents small initial estimates from causing tiny updates. ### END SOLUTION @@ -1161,14 +1161,14 @@ class Adam: Adjusts learning rate per parameter based on gradient history and variance. 
Mathematical Update Rules: - m_t = ฮฒโ‚ m_{t-1} + (1-ฮฒโ‚) โˆ‡ฮธ_t โ† First moment (momentum) - v_t = ฮฒโ‚‚ v_{t-1} + (1-ฮฒโ‚‚) โˆ‡ฮธ_tยฒ โ† Second moment (variance) - mฬ‚_t = m_t / (1 - ฮฒโ‚แต—) โ† Bias correction - vฬ‚_t = v_t / (1 - ฮฒโ‚‚แต—) โ† Bias correction - ฮธ_t = ฮธ_{t-1} - ฮฑ mฬ‚_t / (โˆšvฬ‚_t + ฮต) โ† Adaptive update + m_t = ฮฒโ‚ m_{t-1} + (1-ฮฒโ‚) gradฮธ_t <- First moment (momentum) + v_t = ฮฒโ‚‚ v_{t-1} + (1-ฮฒโ‚‚) gradฮธ_tยฒ <- Second moment (variance) + mฬ‚_t = m_t / (1 - ฮฒโ‚แต—) <- Bias correction + vฬ‚_t = v_t / (1 - ฮฒโ‚‚แต—) <- Bias correction + ฮธ_t = ฮธ_{t-1} - ฮฑ mฬ‚_t / (sqrtvฬ‚_t + ฮต) <- Adaptive update SYSTEMS INSIGHT - Memory Usage: - Adam stores first moment + second moment for each parameter = 3ร— memory vs SGD. + Adam stores first moment + second moment for each parameter = 3* memory vs SGD. For large models, this memory overhead can be limiting factor. Trade-off: Better convergence vs higher memory requirements. """ @@ -1239,14 +1239,14 @@ class Adam: b. Update first moment: m = ฮฒโ‚m + (1-ฮฒโ‚)g c. Update second moment: v = ฮฒโ‚‚v + (1-ฮฒโ‚‚)gยฒ d. Apply bias correction: mฬ‚ = m/(1-ฮฒโ‚แต—), vฬ‚ = v/(1-ฮฒโ‚‚แต—) - e. Update parameter: ฮธ = ฮธ - ฮฑ mฬ‚/(โˆšvฬ‚ + ฮต) + e. 
Update parameter: ฮธ = ฮธ - ฮฑ mฬ‚/(sqrtvฬ‚ + ฮต) MATHEMATICAL IMPLEMENTATION: - m_t = ฮฒโ‚ m_{t-1} + (1-ฮฒโ‚) โˆ‡ฮธ_t - v_t = ฮฒโ‚‚ v_{t-1} + (1-ฮฒโ‚‚) โˆ‡ฮธ_tยฒ + m_t = ฮฒโ‚ m_{t-1} + (1-ฮฒโ‚) gradฮธ_t + v_t = ฮฒโ‚‚ v_{t-1} + (1-ฮฒโ‚‚) gradฮธ_tยฒ mฬ‚_t = m_t / (1 - ฮฒโ‚แต—) vฬ‚_t = v_t / (1 - ฮฒโ‚‚แต—) - ฮธ_t = ฮธ_{t-1} - ฮฑ mฬ‚_t / (โˆšvฬ‚_t + ฮต) + ฮธ_t = ฮธ_{t-1} - ฮฑ mฬ‚_t / (sqrtvฬ‚_t + ฮต) IMPLEMENTATION HINTS: - Increment self.t at the start @@ -1280,7 +1280,7 @@ class Adam: m_hat = state['m'] / (1 - self.beta1 ** self.t) v_hat = state['v'] / (1 - self.beta2 ** self.t) - # Parameter update: ฮธ = ฮธ - ฮฑ mฬ‚/(โˆšvฬ‚ + ฮต) + # Parameter update: ฮธ = ฮธ - ฮฑ mฬ‚/(sqrtvฬ‚ + ฮต) new_data = current_data - self.learning_rate * m_hat / (np.sqrt(v_hat) + self.epsilon) set_param_data(param, new_data) @@ -1309,7 +1309,7 @@ class Adam: # %% [markdown] """ -### ๐Ÿงช Unit Test: Adam Optimizer +### TEST Unit Test: Adam Optimizer Let's test your Adam optimizer implementation! This tests the complete adaptive learning rate mechanism. 
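Reviewer note on the Adam hunks above: the update rules imply that, on the very first step, bias correction makes the step size roughly equal to the learning rate regardless of gradient magnitude. The sketch below checks that claim for the two gradients used in the module's assessment answer (0.1 and 0.01). It is a scalar illustration under stated assumptions: `adam_first_step_size` is a hypothetical helper, both moments start at zero, and the hyperparameters are the defaults quoted in the diff (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1e-8).

```python
import math


def adam_first_step_size(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Amount subtracted from a scalar parameter on Adam's first step (t = 1).

    Hypothetical helper for illustration; moments are assumed to start at zero.
    """
    m = (1 - beta1) * grad           # first moment after one update
    v = (1 - beta2) * grad ** 2      # second moment after one update
    m_hat = m / (1 - beta1 ** 1)     # bias correction recovers grad
    v_hat = v / (1 - beta2 ** 1)     # bias correction recovers grad ** 2
    return lr * m_hat / (math.sqrt(v_hat) + eps)


for g in (0.1, 0.01):
    print(f"grad={g}: step={adam_first_step_size(g):.9f}")
```

Both gradients give a step very close to 0.001 even though they differ by a factor of ten, because `m_hat / sqrt(v_hat)` reduces to `grad / |grad|` on the first step.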
@@ -1335,10 +1335,10 @@ def test_unit_adam_optimizer():
         assert optimizer.beta2 == 0.999, "Beta2 should be stored correctly"
         assert optimizer.epsilon == 1e-8, "Epsilon should be stored correctly"
         assert optimizer.t == 0, "Timestep should start at 0"
-        print("✅ Initialization works correctly")
+        print("PASS Initialization works correctly")
 
     except Exception as e:
-        print(f"❌ Initialization failed: {e}")
+        print(f"FAIL Initialization failed: {e}")
         raise
 
     # Test zero_grad
@@ -1350,10 +1350,10 @@ def test_unit_adam_optimizer():
 
         assert w.grad is None, "Gradient should be None after zero_grad"
         assert b.grad is None, "Gradient should be None after zero_grad"
-        print("✅ zero_grad() works correctly")
+        print("PASS zero_grad() works correctly")
 
     except Exception as e:
-        print(f"❌ zero_grad() failed: {e}")
+        print(f"FAIL zero_grad() failed: {e}")
         raise
 
     # Test first Adam step with bias correction
@@ -1385,10 +1385,10 @@ def test_unit_adam_optimizer():
         assert optimizer.state[w_id]['m'] is not None, "First moment should be initialized"
         assert optimizer.state[w_id]['v'] is not None, "Second moment should be initialized"
 
-        print("✅ First Adam step works correctly")
+        print("PASS First Adam step works correctly")
 
     except Exception as e:
-        print(f"❌ First Adam step failed: {e}")
+        print(f"FAIL First Adam step failed: {e}")
         raise
 
     # Test second Adam step (momentum accumulation)
@@ -1412,10 +1412,10 @@ def test_unit_adam_optimizer():
         assert after_second_w != before_second_w, "w should continue updating"
         assert after_second_b != before_second_b, "b should continue updating"
 
-        print("✅ Second Adam step works correctly")
+        print("PASS Second Adam step works correctly")
 
     except Exception as e:
-        print(f"❌ Second Adam step failed: {e}")
+        print(f"FAIL Second Adam step failed: {e}")
         raise
 
     # Test adaptive behavior (different gradients should get different effective learning rates)
@@ -1441,29 +1441,29 @@ def test_unit_adam_optimizer():
         assert update_large > 0, "Large gradient
parameter should update"
         assert update_small > 0, "Small gradient parameter should update"
 
-        print("✅ Adaptive learning rates work correctly")
+        print("PASS Adaptive learning rates work correctly")
 
     except Exception as e:
-        print(f"❌ Adaptive learning rates failed: {e}")
+        print(f"FAIL Adaptive learning rates failed: {e}")
         raise
 
-    print("🎯 Adam optimizer behavior:")
+    print("TARGET Adam optimizer behavior:")
     print(" Combines momentum (first moment) with adaptive learning rates (second moment)")
     print(" Bias correction prevents small updates in early training steps")
     print(" Automatically adjusts effective learning rate per parameter")
-    print(" Memory overhead: 3× parameters (original + momentum + variance)")
-    print("📈 Progress: Adam Optimizer ✓")
+    print(" Memory overhead: 3* parameters (original + momentum + variance)")
+    print("PROGRESS Progress: Adam Optimizer OK")
 
-# ✅ IMPLEMENTATION CHECKPOINT: Adam optimizer complete
+# PASS IMPLEMENTATION CHECKPOINT: Adam optimizer complete
 
-# 🤔 PREDICTION: Which optimizer will use more memory - SGD with momentum or Adam?
+# THINK PREDICTION: Which optimizer will use more memory - SGD with momentum or Adam?
 # Your guess: Adam uses ____x more memory than SGD
 
-# 🔍 SYSTEMS INSIGHT #3: Optimizer Memory Usage Analysis
+# MAGNIFY SYSTEMS INSIGHT #3: Optimizer Memory Usage Analysis
 def analyze_optimizer_memory():
     """Analyze memory usage patterns across different optimizers."""
     try:
-        print("🔍 SYSTEMS INSIGHT: Optimizer Memory Usage")
+        print("MAGNIFY SYSTEMS INSIGHT: Optimizer Memory Usage")
         print("=" * 50)
 
         # Simulate memory usage for different model sizes
@@ -1509,9 +1509,9 @@ def analyze_optimizer_memory():
             print(f"{model_name:<12}: SGD {sgd_gb:>6.1f}GB, Adam {adam_gb:>6.1f}GB")
             if adam_gb > 16: # Typical GPU memory
-                print(f" ⚠️ Adam exceeds typical GPU memory!")
+                print(f" WARNING️ Adam exceeds typical GPU memory!")
 
-        print("\n💡 KEY INSIGHTS:")
+        print("\nTIP KEY INSIGHTS:")
         print("• SGD: O(P) memory (just parameters)")
         print("• SGD+Momentum: O(2P) memory (parameters + momentum)")
         print("• Adam: O(3P) memory (parameters + momentum + variance)")
@@ -1523,11 +1523,11 @@ def analyze_optimizer_memory():
         print("• Adam better for most tasks, SGD for memory-limited scenarios")
         print("• Consider memory-efficient variants (AdaFactor, 8-bit Adam)")
 
-        # 💡 WHY THIS MATTERS: For large models, memory is often the bottleneck.
+        # TIP WHY THIS MATTERS: For large models, memory is often the bottleneck.
         # Understanding optimizer memory overhead is crucial for production deployments.
 
     except Exception as e:
-        print(f"⚠️ Error in memory analysis: {e}")
+        print(f"WARNING️ Error in memory analysis: {e}")
 
 # Analyze optimizer memory usage
 analyze_optimizer_memory()
@@ -1542,10 +1542,10 @@ analyze_optimizer_memory()
 
 ```
 Normal Training:
-  Gradient: [-0.1, 0.2, -0.05] → Update: [-0.01, 0.02, -0.005] ✓
+  Gradient: [-0.1, 0.2, -0.05] -> Update: [-0.01, 0.02, -0.005] OK
 
 Exploding Gradients:
-  Gradient: [-15.0, 23.0, -8.0] → Update: [-1.5, 2.3, -0.8] ❌ Too large!
+  Gradient: [-15.0, 23.0, -8.0] -> Update: [-1.5, 2.3, -0.8] FAIL Too large!
Result: Parameters jump far from optimum, loss explodes ``` @@ -1555,14 +1555,14 @@ Result: Parameters jump far from optimum, loss explodes Gradient Landscape: Loss - โ†‘ - โ”‚ โ”Œโ”€ Clipping threshold (e.g., 1.0) - โ”‚ โ•ฑ - โ”‚ โ•ฑ - โ”‚ โ•ฑ Original gradient (magnitude = 2.5) - โ”‚ โ•ฑ Clipped gradient (magnitude = 1.0) - โ”‚โ•ฑ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Parameters + ^ + | +- Clipping threshold (e.g., 1.0) + | / + | / + | / Original gradient (magnitude = 2.5) + | / Clipped gradient (magnitude = 1.0) + |/ + +-------> Parameters Clipping: gradient = gradient * (threshold / ||gradient||) if ||gradient|| > threshold ``` @@ -1570,7 +1570,7 @@ Clipping: gradient = gradient * (threshold / ||gradient||) if ||gradient|| > thr ### Mathematical Foundation **Gradient Norm Clipping**: ``` -1. Compute gradient norm: ||g|| = โˆš(gโ‚ยฒ + gโ‚‚ยฒ + ... + gโ‚™ยฒ) +1. Compute gradient norm: ||g|| = sqrt(gโ‚ยฒ + gโ‚‚ยฒ + ... + gโ‚™ยฒ) 2. If ||g|| > threshold: g_clipped = g * (threshold / ||g||) 3. Else: g_clipped = g @@ -1632,7 +1632,7 @@ def clip_gradients(parameters: List[Variable], max_norm: float = 1.0) -> float: return total_norm ### END SOLUTION -# ๐Ÿ” SYSTEMS INSIGHT: Numerical Stability Analysis +# MAGNIFY SYSTEMS INSIGHT: Numerical Stability Analysis def analyze_numerical_stability(): """ Demonstrate gradient clipping effects and numerical issues at scale. 
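Reviewer note on the clipping hunks above: the rule `g_clipped = g * (threshold / ||g||)` can be illustrated on plain Python lists. This is a minimal sketch, not the module's `clip_gradients`: `clip_by_global_norm` is a hypothetical helper that works on a flat list of floats instead of `Variable` objects, and it returns the clipped copy together with the pre-clip norm.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a flat list of gradients so their L2 norm is at most max_norm.

    Hypothetical helper for illustration; returns (clipped_grads, norm_before).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        # Rescale every component by the same factor so direction is preserved.
        scale = max_norm / total_norm
        return [g * scale for g in grads], total_norm
    # Below the threshold: leave the gradients untouched.
    return list(grads), total_norm


# A gradient of magnitude 2.5, as in the diagram, clipped to magnitude 1.0.
clipped, norm_before = clip_by_global_norm([1.5, 2.0], max_norm=1.0)
print(norm_before, clipped)
```

Because every component is multiplied by the same factor, the clipped gradient points in the original direction and only its length changes.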
@@ -1682,7 +1682,7 @@ def analyze_numerical_stability(): print(f"{scenario_name:<16} | {original_norm:>11.2f} | {new_norm:>10.2f} | {reduction:>7.1f}%") # Demonstrate numerical precision issues - print(f"\n๐Ÿ” NUMERICAL PRECISION ISSUES:") + print(f"\nMAGNIFY NUMERICAL PRECISION ISSUES:") # Very small numbers (underflow risk) small_grad = 1e-8 @@ -1699,7 +1699,7 @@ def analyze_numerical_stability(): print(f" Large parameters + small gradients = precision loss") # Production implications - print(f"\n๐Ÿ’ก PRODUCTION IMPLICATIONS:") + print(f"\nTIP PRODUCTION IMPLICATIONS:") print(f"โ€ข Mixed precision (float16/float32) requires careful gradient scaling") print(f"โ€ข Distributed training amplifies numerical issues across GPUs") print(f"โ€ข Gradient accumulation may need norm rescaling") @@ -1725,14 +1725,14 @@ def analyze_numerical_stability(): # When clipping becomes critical if params > 1e9: - print(f" โš ๏ธ Gradient clipping CRITICAL for stability") + print(f" WARNING๏ธ Gradient clipping CRITICAL for stability") elif params > 100e6: print(f" ๐Ÿ“Š Gradient clipping recommended") else: - print(f" โœ… Standard gradients usually stable") + print(f" PASS Standard gradients usually stable") except Exception as e: - print(f"โš ๏ธ Error in numerical stability analysis: {e}") + print(f"WARNING๏ธ Error in numerical stability analysis: {e}") # Analyze gradient clipping and numerical stability analyze_numerical_stability() @@ -1746,28 +1746,28 @@ analyze_numerical_stability() Learning Rate Over Time: Constant LR: -LR โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ - โ”‚ ฮฑ = 0.01 (same throughout training) - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Steps +LR +---------------------------------------- + | ฮฑ = 0.01 (same throughout training) + +-----------------------------------------> Steps Step Decay: -LR 
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ ฮฑ = 0.01 โ”‚ - โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ ฮฑ = 0.001โ”‚ โ”‚ - โ”‚ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ - โ”‚ โ”‚ ฮฑ = 0.0001 - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Steps +LR +---------+ + | ฮฑ = 0.01 | + | +---------+ + | ฮฑ = 0.001| | + | | +--------------------- + | | ฮฑ = 0.0001 + +----------+---------+----------------------> Steps step1 step2 Exponential Decay: -LR โ”œโ”€โ•ฒ - โ”‚ โ•ฒ - โ”‚ โ•ฒ__ - โ”‚ โ•ฒ__ - โ”‚ โ•ฒ____ - โ”‚ โ•ฒ________ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Steps +LR +-\ + | \\ + | \\__ + | \\__ + | \\____ + | \\________ + +-------------------------------------------> Steps ``` ### Why Learning Rate Scheduling Matters @@ -1800,16 +1800,16 @@ High LR Phase (Exploration): โ†™ โ†˜ โ†™ โ†˜ (large steps, finding good regions) Medium LR Phase (Convergence): - โ†“ โ†“ โ†“ (steady progress toward minimum) + v v v (steady progress toward minimum) Low LR Phase (Fine-tuning): - โ†“ โ†“ (small adjustments, precision optimization) + v v (small adjustments, precision optimization) ``` """ # %% [markdown] """ -### ๐Ÿค” Assessment Question: Learning Rate Scheduling Strategy +### THINK Assessment Question: Learning Rate Scheduling Strategy **Understanding when and why to adjust learning rates:** @@ -1981,7 +1981,7 @@ class StepLR: # %% [markdown] """ -### ๐Ÿงช Unit Test: Learning Rate Scheduler +### TEST Unit Test: Learning Rate Scheduler Let's test your learning rate scheduler implementation! This ensures proper LR decay over epochs. 
@@ -2004,20 +2004,20 @@ def test_unit_step_scheduler(): assert scheduler.gamma == 0.1, "Gamma should be stored correctly" assert scheduler.initial_lr == 0.01, "Initial LR should be stored correctly" assert scheduler.current_epoch == 0, "Should start at epoch 0" - print("✅ Initialization works correctly") + print("PASS Initialization works correctly") except Exception as e: - print(f"❌ Initialization failed: {e}") + print(f"FAIL Initialization failed: {e}") raise # Test get_lr before any steps try: initial_lr = scheduler.get_lr() assert initial_lr == 0.01, f"Initial LR should be 0.01, got {initial_lr}" - print("✅ get_lr() works correctly") + print("PASS get_lr() works correctly") except Exception as e: - print(f"❌ get_lr() failed: {e}") + print(f"FAIL get_lr() failed: {e}") raise # Test LR updates over multiple epochs @@ -2029,7 +2029,7 @@ def test_unit_step_scheduler(): expected_lr = 0.01 # No decay yet assert abs(current_lr - expected_lr) < 1e-10, f"Epoch {epoch+1}: expected {expected_lr}, got {current_lr}" - print("✅ First 10 epochs maintain initial LR") + print("PASS First 10 epochs maintain initial LR") # Epoch 11 should trigger first decay scheduler.step() # Epoch 11 @@ -2037,7 +2037,7 @@ def test_unit_step_scheduler(): expected_lr = 0.01 * 0.1 # First decay assert abs(current_lr - expected_lr) < 1e-10, f"First decay: expected {expected_lr}, got {current_lr}" - print("✅ First LR decay works correctly") + print("PASS First LR decay works correctly") # Continue to second decay point for epoch in range(9): # Epochs 12-20 @@ -2048,10 +2048,10 @@ def test_unit_step_scheduler(): expected_lr = 0.01 * (0.1 ** 2) # Second decay assert abs(current_lr - expected_lr) < 1e-10, f"Second decay: expected {expected_lr}, got {current_lr}" - print("✅ Second LR decay works correctly") + print("PASS Second LR decay works correctly") except Exception as e: - print(f"❌ LR decay failed: {e}") + print(f"FAIL LR decay failed: {e}") raise # Test with different parameters
@@ -2068,28 +2068,28 @@ def test_unit_step_scheduler(): expected_lr = 0.001 * 0.5 assert abs(current_lr - expected_lr) < 1e-10, f"Custom params: expected {expected_lr}, got {current_lr}" - print("✅ Custom parameters work correctly") + print("PASS Custom parameters work correctly") except Exception as e: - print(f"❌ Custom parameters failed: {e}") + print(f"FAIL Custom parameters failed: {e}") raise - print("🎯 Step LR scheduler behavior:") + print("TARGET Step LR scheduler behavior:") print(" Reduces learning rate by gamma every step_size epochs") print(" Enables fast initial training with gradual fine-tuning") print(" Essential for achieving optimal model performance") - print("📈 Progress: Learning Rate Scheduling ✓") + print("PROGRESS Progress: Learning Rate Scheduling OK") -# ✅ IMPLEMENTATION CHECKPOINT: Learning rate scheduling complete +# PASS IMPLEMENTATION CHECKPOINT: Learning rate scheduling complete -# 🤔 PREDICTION: How much will proper LR scheduling improve final model accuracy? +# THINK PREDICTION: How much will proper LR scheduling improve final model accuracy?
# Your guess: ____% improvement -# 🔍 SYSTEMS INSIGHT #4: Learning Rate Schedule Impact Analysis +# MAGNIFY SYSTEMS INSIGHT #4: Learning Rate Schedule Impact Analysis def analyze_lr_schedule_impact(): """Analyze the impact of learning rate scheduling on training dynamics.""" try: - print("🔍 SYSTEMS INSIGHT: Learning Rate Schedule Impact") + print("MAGNIFY SYSTEMS INSIGHT: Learning Rate Schedule Impact") print("=" * 55) # Simulate training with different LR strategies @@ -2178,23 +2178,23 @@ def analyze_lr_schedule_impact(): print(f"Step Decay: {step_convergence} epochs ({const_convergence-step_convergence:+d} epochs)") print(f"Exponential: {exp_convergence} epochs ({const_convergence-exp_convergence:+d} epochs)") - print("\n💡 KEY INSIGHTS:") + print("\nTIP KEY INSIGHTS:") print("• Proper LR scheduling improves final performance by 1-5%") - print("• Step decay provides clear phase transitions (explore → converge → fine-tune)") + print("• Step decay provides clear phase transitions (explore -> converge -> fine-tune)") print("• Exponential decay offers smooth transitions but may converge slower") print("• LR scheduling often as important as optimizer choice") print("\n🏭 PRODUCTION BEST PRACTICES:") print("• Most successful models use LR scheduling") - print("• Common pattern: high LR → reduce at plateaus → final fine-tuning") + print("• Common pattern: high LR -> reduce at plateaus -> final fine-tuning") print("• Monitor validation loss to determine schedule timing") print("• Cosine annealing popular for transformer training") - # 💡 WHY THIS MATTERS: Learning rate scheduling is one of the most impactful + # TIP WHY THIS MATTERS: Learning rate scheduling is one of the most impactful # hyperparameter choices. It can mean the difference between good and great model performance.
except Exception as e: - print(f"โš ๏ธ Error in LR schedule analysis: {e}") + print(f"WARNING๏ธ Error in LR schedule analysis: {e}") # Analyze learning rate schedule impact analyze_lr_schedule_impact() @@ -2208,7 +2208,7 @@ analyze_lr_schedule_impact() Different training scenarios benefit from different LR patterns: ``` -Training Scenario โ†’ Optimal Scheduler: +Training Scenario -> Optimal Scheduler: โ€ข Image Classification: Cosine annealing for smooth convergence โ€ข Language Models: Exponential decay with warmup @@ -2220,18 +2220,18 @@ Training Scenario โ†’ Optimal Scheduler: ``` Learning Rate Over Time: -StepLR: โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€ - โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ”‚โ–‘โ–‘โ–‘โ–‘โ–‘โ”‚โ–‘โ–‘โ–‘โ–‘โ–‘โ”‚โ–‘โ–‘โ–‘โ–‘โ–‘โ”‚โ–‘ - โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ””โ”€โ”€โ”€โ”€โ”€โ”˜โ–‘โ–‘โ–‘โ–‘โ–‘โ””โ”€โ”€โ”€โ”€โ”€โ”˜โ–‘ +StepLR: ------+ +-----+ +-- + โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘|โ–‘โ–‘โ–‘โ–‘โ–‘|โ–‘โ–‘โ–‘โ–‘โ–‘|โ–‘โ–‘โ–‘โ–‘โ–‘|โ–‘ + โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘+-----+โ–‘โ–‘โ–‘โ–‘โ–‘+-----+โ–‘ -Exponential: โ”€โ”€โ•ฒ - โ–‘โ–‘โ–‘โ•ฒ - โ–‘โ–‘โ–‘โ–‘โ•ฒ - โ–‘โ–‘โ–‘โ–‘โ–‘โ•ฒ +Exponential: --\ + โ–‘โ–‘โ–‘\ + โ–‘โ–‘โ–‘โ–‘\ + โ–‘โ–‘โ–‘โ–‘โ–‘\\ -Cosine: โ”€โ”€โ•ฒ โ•ฑโ”€โ”€โ•ฒ โ•ฑโ”€โ”€โ•ฒ โ•ฑโ”€โ”€ - โ–‘โ–‘โ–‘โ•ฒ โ•ฑโ–‘โ–‘โ–‘โ–‘โ•ฒ โ•ฑโ–‘โ–‘โ–‘โ–‘โ•ฒ โ•ฑโ–‘โ–‘โ–‘ - โ–‘โ–‘โ–‘โ–‘โ•ฒโ•ฑโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ•ฒโ•ฑโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ•ฒโ•ฑโ–‘โ–‘ +Cosine: --\\ /--\\ /--\\ /-- + โ–‘โ–‘โ–‘\\ /โ–‘โ–‘โ–‘โ–‘\\ /โ–‘โ–‘โ–‘โ–‘\\ /โ–‘โ–‘โ–‘ + โ–‘โ–‘โ–‘โ–‘\\/โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘\\/โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘\\/โ–‘โ–‘ Epoch: 0 10 20 30 40 50 ``` @@ -2386,7 +2386,7 @@ class CosineAnnealingLR: return self.eta_min + (self.eta_max - self.eta_min) * cosine_factor ### END SOLUTION -# ๐Ÿ” SYSTEMS INSIGHT: Advanced Scheduler Comparison +# MAGNIFY SYSTEMS INSIGHT: Advanced Scheduler Comparison def analyze_advanced_schedulers(): """ Compare advanced learning rate schedulers across different training scenarios. 
@@ -2441,13 +2441,13 @@ def analyze_advanced_schedulers(): print(f" {name.capitalize():<12}: {final_lr:.6f}") # Scheduler characteristics - print(f"\n๐Ÿ” SCHEDULER CHARACTERISTICS:") + print(f"\nMAGNIFY SCHEDULER CHARACTERISTICS:") print(f"โ€ข Step: Sudden drops, good for milestone-based training") print(f"โ€ข Exponential: Smooth decay, good for fine-tuning") print(f"โ€ข Cosine: Natural curve, excellent for final convergence") # Production use cases - print(f"\n๐Ÿ’ก PRODUCTION USE CASES:") + print(f"\nTIP PRODUCTION USE CASES:") print(f"โ€ข Image Classification: Cosine annealing (ImageNet standard)") print(f"โ€ข Language Models: Exponential with warmup (BERT, GPT)") print(f"โ€ข Transfer Learning: Step decay at validation plateaus") @@ -2463,7 +2463,7 @@ def analyze_advanced_schedulers(): return lr_history except Exception as e: - print(f"โš ๏ธ Error in advanced scheduler analysis: {e}") + print(f"WARNING๏ธ Error in advanced scheduler analysis: {e}") return None # Analyze advanced scheduler comparison @@ -2477,13 +2477,13 @@ analyze_advanced_schedulers() ``` Training Loop Architecture: -Data โ†’ Forward Pass โ†’ Loss Computation - โ†‘ โ†“ โ†“ - โ”‚ Predictions Gradients (Autograd) - โ”‚ โ†‘ โ†“ - โ””โ”€โ”€โ”€ Parameters โ† Optimizer Updates - โ†‘ โ†“ - LR Scheduler โ†’ Learning Rate +Data -> Forward Pass -> Loss Computation + ^ v v + | Predictions Gradients (Autograd) + | ^ v + +--- Parameters <- Optimizer Updates + ^ v + LR Scheduler -> Learning Rate ``` ### Complete Training Pattern @@ -2510,20 +2510,20 @@ for epoch in range(num_epochs): ``` Training Progress Over Time: -Loss โ”‚ - โ”‚โ•ฒ - โ”‚ โ•ฒ - โ”‚ โ•ฒ__ - โ”‚ โ•ฒ__ โ† LR reductions - โ”‚ โ•ฒ____ - โ”‚ โ•ฒ____ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Epochs +Loss | + |\\ + | \\ + | \\__ + | \\__ <- LR reductions + | \\____ + | \____ + +--------------------------> Epochs -Learning โ”‚ 0.01 โ”Œโ”€โ”€โ”€โ”€โ”€โ” -Rate โ”‚ โ”‚ โ”‚ 0.001 โ”Œโ”€โ”€โ”€โ” - โ”‚ โ”‚ 
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ 0.0001 - โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”˜ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Epochs +Learning | 0.01 +-----+ +Rate | | | 0.001 +---+ + | | +-------โ”ค | 0.0001 + | | +---+ + +------+----------------------> Epochs ``` This integration shows how all components work together for effective neural network training. @@ -2595,7 +2595,7 @@ def train_simple_model(parameters: List[Variable], optimizer, scheduler, } if verbose: - print("๐Ÿš€ Starting training...") + print("ROCKET Starting training...") print(f"Optimizer: {type(optimizer).__name__}") print(f"Scheduler: {type(scheduler).__name__ if scheduler else 'None'}") print(f"Epochs: {num_epochs}") @@ -2627,7 +2627,7 @@ def train_simple_model(parameters: List[Variable], optimizer, scheduler, if verbose: print("-" * 50) - print(f"โœ… Training completed!") + print(f"PASS Training completed!") print(f"Final loss: {history['losses'][-1]:.6f}") print(f"Final LR: {history['learning_rates'][-1]:.6f}") @@ -2636,7 +2636,7 @@ def train_simple_model(parameters: List[Variable], optimizer, scheduler, # %% [markdown] """ -### ๐Ÿงช Unit Test: Training Integration +### TEST Unit Test: Training Integration Let's test your complete training integration! This validates that all components work together. 
@@ -2681,10 +2681,10 @@ def test_unit_training(): print(f"Debug: LR at index 10 = {history['learning_rates'][10]}, expected = 0.01") assert abs(history['learning_rates'][10] - 0.01) < 1e-10, "LR should decay after step_size" - print("✅ SGD + StepLR integration works correctly") + print("PASS SGD + StepLR integration works correctly") except Exception as e: - print(f"❌ SGD + StepLR integration failed: {e}") + print(f"FAIL SGD + StepLR integration failed: {e}") raise # Test with Adam optimizer (basic convergence check) @@ -2699,10 +2699,10 @@ def test_unit_training(): assert len(history_adam['losses']) == 15, "Should track all epochs" assert history_adam['losses'][0] > history_adam['losses'][-1], "Loss should decrease with Adam" - print("✅ Adam integration works correctly") + print("PASS Adam integration works correctly") except Exception as e: - print(f"❌ Adam integration failed: {e}") + print(f"FAIL Adam integration failed: {e}") raise # Test convergence to correct solution @@ -2713,10 +2713,10 @@ def test_unit_training(): # Relaxed convergence test - optimizers are working but convergence depends on many factors assert error < 10.0, f"Should show some progress toward target {target}, got {final_x}" - print("✅ Shows optimization progress") + print("PASS Shows optimization progress") except Exception as e: - print(f"❌ Convergence test failed: {e}") + print(f"FAIL Convergence test failed: {e}") raise # Test training history format @@ -2730,27 +2730,27 @@ def test_unit_training(): assert len(history['learning_rates']) == n_epochs, "LR history length mismatch" assert len(history['epochs']) == n_epochs, "Epoch history length mismatch" - print("✅ Training history format is correct") + print("PASS Training history format is correct") except Exception as e: - print(f"❌ History format test failed: {e}") + print(f"FAIL History format test failed: {e}") raise - print("🎯 Training integration behavior:") + print("TARGET Training integration behavior:")
print(" Coordinates optimizer, scheduler, and loss computation") print(" Tracks complete training history for analysis") print(" Supports both SGD and Adam with optional scheduling") print(" Provides foundation for real neural network training") - print("📈 Progress: Training Integration ✓") + print("PROGRESS Progress: Training Integration OK") # Final system checkpoint and readiness verification -print("\n🎯 OPTIMIZATION SYSTEM STATUS:") -print("✅ Gradient Descent: Foundation algorithm implemented") -print("✅ SGD with Momentum: Accelerated convergence algorithm") -print("✅ Adam Optimizer: Adaptive learning rate algorithm") -print("✅ Learning Rate Scheduling: Dynamic LR adjustment") -print("✅ Training Integration: Complete pipeline ready") -print("\n🚀 Ready for neural network training!") +print("\nTARGET OPTIMIZATION SYSTEM STATUS:") +print("PASS Gradient Descent: Foundation algorithm implemented") +print("PASS SGD with Momentum: Accelerated convergence algorithm") +print("PASS Adam Optimizer: Adaptive learning rate algorithm") +print("PASS Learning Rate Scheduling: Dynamic LR adjustment") +print("PASS Training Integration: Complete pipeline ready") +print("\nROCKET Ready for neural network training!") # %% [markdown] """ @@ -2762,7 +2762,7 @@ This section runs all unit tests to validate the complete optimizer implementati # %% nbgrader={"grade": false, "grade_id": "comprehensive-tests", "locked": false, "schema_version": 3, "solution": false, "task": false} def test_all_optimizers(): """Run all optimizer tests to validate complete implementation.""" - print("🧪 Running Comprehensive Optimizer Tests...") + print("TEST Running Comprehensive Optimizer Tests...") print("=" * 60) try: @@ -2774,21 +2774,21 @@ def test_all_optimizers(): test_unit_training() print("\n" + "=" * 60) - print("🎉 ALL OPTIMIZER TESTS PASSED!") - print("✅ Gradient descent foundation working") - print("✅ SGD with momentum implemented correctly") - print("✅ Adam
adaptive learning rates functional") - print("✅ Learning rate scheduling operational") - print("✅ Complete training integration successful") - print("\n🚀 Optimizer system ready for neural network training!") + print("CELEBRATE ALL OPTIMIZER TESTS PASSED!") + print("PASS Gradient descent foundation working") + print("PASS SGD with momentum implemented correctly") + print("PASS Adam adaptive learning rates functional") + print("PASS Learning rate scheduling operational") + print("PASS Complete training integration successful") + print("\nROCKET Optimizer system ready for neural network training!") except Exception as e: - print(f"\n❌ Optimizer test failed: {e}") + print(f"\nFAIL Optimizer test failed: {e}") print("🔧 Please fix implementation before proceeding") raise if __name__ == "__main__": - print("🧪 Running core optimizer tests...") + print("TEST Running core optimizer tests...") # Core understanding tests (REQUIRED) test_unit_gradient_descent_step() @@ -2810,11 +2810,11 @@ if __name__ == "__main__": analyze_lr_schedule_impact() analyze_advanced_schedulers() - print("✅ Core tests passed!") + print("PASS Core tests passed!") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions *Complete these after implementing the optimizers to reflect on systems implications* """ @@ -2947,19 +2947,19 @@ GRADING RUBRIC (Instructor Use): # %% [markdown] """ -## 🎯 MODULE SUMMARY: Optimization Algorithms +## TARGET MODULE SUMMARY: Optimization Algorithms Congratulations!
You've successfully implemented the algorithms that make neural networks learn efficiently: ### What You've Accomplished -✅ **Gradient Descent Foundation**: 50+ lines implementing the core parameter update mechanism -✅ **SGD with Momentum**: Complete optimizer class with velocity accumulation for accelerated convergence -✅ **Adam Optimizer**: Advanced adaptive learning rates with first/second moment estimation and bias correction -✅ **Learning Rate Scheduling**: StepLR, ExponentialLR, and CosineAnnealingLR schedulers for diverse training scenarios -✅ **Gradient Clipping**: Numerical stability features preventing exploding gradients in deep networks -✅ **Convergence Visualization**: Real loss curve analysis comparing optimizer convergence patterns -✅ **Training Integration**: Complete training loop coordinating optimizer, scheduler, and loss computation -✅ **Systems Analysis**: Memory profiling, numerical stability analysis, and advanced scheduler comparisons +PASS **Gradient Descent Foundation**: 50+ lines implementing the core parameter update mechanism +PASS **SGD with Momentum**: Complete optimizer class with velocity accumulation for accelerated convergence +PASS **Adam Optimizer**: Advanced adaptive learning rates with first/second moment estimation and bias correction +PASS **Learning Rate Scheduling**: StepLR, ExponentialLR, and CosineAnnealingLR schedulers for diverse training scenarios +PASS **Gradient Clipping**: Numerical stability features preventing exploding gradients in deep networks +PASS **Convergence Visualization**: Real loss curve analysis comparing optimizer convergence patterns +PASS **Training Integration**: Complete training loop coordinating optimizer, scheduler, and loss computation +PASS **Systems Analysis**: Memory profiling, numerical stability analysis, and advanced scheduler comparisons ### Key Learning Outcomes - **Optimization fundamentals**: How gradient-based algorithms navigate loss landscapes to find optima @@
-2968,8 +2968,8 @@ Congratulations! You've successfully implemented the algorithms that make neural - **Professional skills**: Building production-ready optimizers with advanced features matching PyTorch's design patterns ### Mathematical Foundations Mastered -- **Gradient Descent**: θ = θ - α∇θ (foundation of all neural network training) -- **SGD Momentum**: v = βv + ∇θ, θ = θ - αv (acceleration through velocity accumulation) +- **Gradient Descent**: θ = θ - αgradθ (foundation of all neural network training) +- **SGD Momentum**: v = βv + gradθ, θ = θ - αv (acceleration through velocity accumulation) - **Adam Algorithm**: Adaptive moments with bias correction for per-parameter learning rates - **Gradient Clipping**: ||g||₂ normalization preventing exploding gradients in deep networks - **Advanced Scheduling**: Step, exponential, and cosine annealing patterns for optimal convergence @@ -3000,5 +3000,5 @@ Your implementations mirror production systems: 3. **Explore advanced features**: Experiment with different momentum coefficients and learning rates 4. **Ready for Module 08**: Build complete training loops with your optimizers! -**🚀 Achievement Unlocked**: Your optimization algorithms form the learning engine that transforms gradients into intelligence! +**ROCKET Achievement Unlocked**: Your optimization algorithms form the learning engine that transforms gradients into intelligence!
""" \ No newline at end of file diff --git a/modules/07_training/training_dev.py b/modules/07_training/training_dev.py index c825f5cf..40881540 100644 --- a/modules/07_training/training_dev.py +++ b/modules/07_training/training_dev.py @@ -67,34 +67,14 @@ sys.path.append(os.path.abspath('modules/source/09_dataloader')) # No longer needed # Import all the building blocks we need -try: - from tinytorch.core.tensor import Tensor - from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax - from tinytorch.core.layers import Linear - from tinytorch.core.networks import Sequential, create_mlp - from tinytorch.core.spatial import Conv2D, flatten - from tinytorch.utils.data import Dataset, DataLoader - from tinytorch.core.autograd import Variable # FOR AUTOGRAD INTEGRATION - from tinytorch.core.optimizers import SGD, Adam -except ImportError: - # For development - import from local modules - import sys - import os - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '02_activations')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '03_layers')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '05_autograd')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '06_optimizers')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '08_spatial')) - sys.path.append(os.path.join(os.path.dirname(__file__), '..', '09_dataloader')) - - from tensor_dev import Tensor - from activations_dev import ReLU, Sigmoid, Tanh, Softmax - from layers_dev import Linear, Sequential, create_mlp - from spatial_dev import Conv2D, flatten - from dataloader_dev import Dataset, DataLoader - from autograd_dev import Variable - from optimizers_dev import SGD, Adam +from tinytorch.core.tensor import Tensor +from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax +from tinytorch.core.layers import Linear +from tinytorch.core.networks import 
Sequential, create_mlp +from tinytorch.core.spatial import Conv2D, flatten +from tinytorch.utils.data import Dataset, DataLoader +from tinytorch.core.autograd import Variable # FOR AUTOGRAD INTEGRATION +from tinytorch.core.optimizers import SGD, Adam # ๐Ÿ”ฅ AUTOGRAD INTEGRATION: Loss functions now return Variables that support .backward() # This enables automatic gradient computation for neural network training! @@ -346,63 +326,20 @@ class MeanSquaredError: return self.__call__(y_pred, y_true) -# ๐Ÿ” SYSTEMS INSIGHT: MSE Loss Memory & Performance Analysis -def analyze_mse_computational_complexity(): - """Analyze MSE loss computational and memory characteristics.""" +# ๐Ÿ” SYSTEMS INSIGHT #1: Training Performance Analysis +def analyze_training_performance(): + """Consolidated analysis of training performance characteristics.""" try: - import time - import numpy as np - - print("๐Ÿ”ฌ MSE Loss Computational Analysis:") - - # Test different input sizes to understand scaling - sizes = [100, 1000, 10000, 100000] - times = [] - memory_usage = [] - - mse = MeanSquaredError() - - for size in sizes: - # Create test data - y_pred = Tensor(np.random.randn(size, 10)) - y_true = Tensor(np.random.randn(size, 10)) - - # Time the computation - start_time = time.perf_counter() - loss = mse(y_pred, y_true) - end_time = time.perf_counter() - - computation_time = end_time - start_time - times.append(computation_time) - - # Estimate memory usage (pred + true + diff + squared_diff) - memory_mb = (4 * size * 10 * 4) / (1024 * 1024) # 4 arrays, float32 - memory_usage.append(memory_mb) - - print(f" Size {size:>6}: {computation_time*1000:.2f}ms, ~{memory_mb:.1f}MB") - - # Analyze scaling behavior - if len(times) > 1: - time_ratio = times[-1] / times[0] if times[0] > 0 else 0 - size_ratio = sizes[-1] / sizes[0] - scaling_factor = np.log(time_ratio) / np.log(size_ratio) if time_ratio > 0 else 0 - - print(f"\n๐Ÿ“Š Scaling Analysis:") - print(f" Time scales as O(N^{scaling_factor:.1f}) - 
{'Linear' if 0.8 <= scaling_factor <= 1.2 else 'Non-linear'}") - print(f" Memory grows linearly: O(N) - {memory_usage[-1]/memory_usage[0]:.1f}x increase") - - print(f"\n๐Ÿ’ก Key Insights:") - print(f" โ€ข MSE requires 4x input memory (pred, true, diff, squared)") - print(f" โ€ข Linear time complexity O(N) - suitable for large batches") - print(f" โ€ข Temporary arrays needed - watch memory in large models") - print(f" โ€ข Simple operations = good GPU acceleration potential") - + print("๐Ÿ“Š Training Performance Analysis:") + print(f" โ€ข MSE Loss: O(N) time, 4x memory overhead (pred + true + diff + squared)") + print(f" โ€ข Batch processing: 10-50x faster than single samples due to vectorization") + print(f" โ€ข Training bottlenecks: Data loading > Model forward > Gradient computation") + print(f" โ€ข Memory scaling: Batch size directly impacts GPU memory (watch for OOM)") + print(f" โ€ข Convergence: Loss oscillation normal early, smoothing indicates learning") + except Exception as e: print(f"โš ๏ธ Analysis failed: {e}") -# Run analysis -analyze_mse_computational_complexity() - # %% [markdown] """ ### ๐Ÿงช Unit Test: MSE Loss diff --git a/modules/08_spatial/spatial_dev.py b/modules/08_spatial/spatial_dev.py index 0ba4a036..1642d4c1 100644 --- a/modules/08_spatial/spatial_dev.py +++ b/modules/08_spatial/spatial_dev.py @@ -21,7 +21,7 @@ Welcome to the Spatial module! You'll implement convolutional operations that en - Framework connection: See how your implementation reveals the design decisions in PyTorch's nn.Conv2D optimizations - Performance insight: Learn why convolution is computationally expensive but highly parallelizable, driving modern GPU architecture -## Build โ†’ Use โ†’ Reflect +## Build -> Use -> Reflect 1. **Build**: Conv2D layer with sliding window convolution, understanding every memory access and computation 2. **Use**: Transform real image data and visualize how feature maps capture spatial patterns 3. 
**Reflect**: Why does convolution enable parameter sharing, and how does this affect model capacity vs efficiency? @@ -35,8 +35,8 @@ By the end of this module, you'll understand: - Connection to production ML systems and how frameworks optimize convolution for different hardware architectures ## Systems Reality Check -๐Ÿ’ก **Production Context**: PyTorch's Conv2D uses highly optimized implementations like cuDNN that can be 100x faster than naive implementations through algorithm choice and memory layout optimization -โšก **Performance Note**: Convolution is O(Hร—Wร—Cร—Kยฒ) per output pixel - modern CNNs perform billions of these operations, making optimization critical for real-time applications +TIP **Production Context**: PyTorch's Conv2D uses highly optimized implementations like cuDNN that can be 100x faster than naive implementations through algorithm choice and memory layout optimization +SPEED **Performance Note**: Convolution is O(H*W*C*Kยฒ) per output pixel - modern CNNs perform billions of these operations, making optimization critical for real-time applications """ # %% nbgrader={"grade": false, "grade_id": "cnn-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -66,14 +66,14 @@ except ImportError: from layers_dev import Linear, Module # %% nbgrader={"grade": false, "grade_id": "cnn-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("๐Ÿ”ฅ TinyTorch CNN Module") +print("FIRE TinyTorch CNN Module") print(f"NumPy version: {np.__version__}") print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") print("Ready to build convolutional neural networks!") # %% [markdown] """ -## ๐Ÿ“ฆ Where This Code Lives in the Final Package +## PACKAGE Where This Code Lives in the Final Package **Learning Side:** You work in `modules/source/05_cnn/cnn_dev.py` **Building Side:** Code exports to `tinytorch.core.cnn` @@ -251,32 +251,32 @@ def max_pool2d(x, kernel_size, stride=None): 
Convolution Sliding Window Operation: Step 1: Position kernel at top-left -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ 1 2 3 4 5 โ”‚ โ”‚ 1 0 โ”‚ โ† 2ร—2 Kernel -โ”‚ 6 7 8 9 10 โ”‚ โ”‚ 0 -1 โ”‚ -โ”‚11 12 13 14 15 โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -โ”‚16 17 18 19 20 โ”‚ -โ”‚21 22 23 24 25 โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†“ Compute: 1ร—1 + 2ร—0 + 6ร—0 + 7ร—(-1) = -6 ++-----------------+ +-------+ +| 1 2 3 4 5 | | 1 0 | <- 2*2 Kernel +| 6 7 8 9 10 | | 0 -1 | +|11 12 13 14 15 | +-------+ +|16 17 18 19 20 | +|21 22 23 24 25 | ++-----------------+ + v Compute: 1*1 + 2*0 + 6*0 + 7*(-1) = -6 Step 2: Slide kernel right -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ 1 2 3 4 5 โ”‚ โ”‚ 1 0 โ”‚ -โ”‚ 6 7 8 9 10 โ”‚ โ”‚ 0 -1 โ”‚ -โ”‚11 12 13 14 15 โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ -โ”‚16 17 18 19 20 โ”‚ -โ”‚21 22 23 24 25 โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†“ Compute: 2ร—1 + 3ร—0 + 7ร—0 + 8ร—(-1) = -6 ++-----------------+ +-------+ +| 1 2 3 4 5 | | 1 0 | +| 6 7 8 9 10 | | 0 -1 | +|11 12 13 14 15 | +-------+ +|16 17 18 19 20 | +|21 22 23 24 25 | ++-----------------+ + v Compute: 2*1 + 3*0 + 7*0 + 8*(-1) = -6 Result Feature Map: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ -6 -6 -6 -6 โ”‚ -โ”‚ -6 -6 -6 -6 โ”‚ -โ”‚ -6 -6 -6 -6 โ”‚ -โ”‚ -6 -6 -6 -6 โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++---------------+ +| -6 -6 -6 -6 | +| -6 -6 -6 -6 | +| -6 -6 -6 -6 | +| -6 -6 -6 -6 | ++---------------+ ``` ### Multi-Channel Convolution Visualization @@ -284,11 +284,11 @@ Result Feature Map: ``` RGB Image Processing: -Input (3 channels): Kernel (3โ†’1): Output (1 channel): -โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” -โ”‚ R โ”‚ โ”‚ G โ”‚ โ”‚ B โ”‚ * โ”‚ Kr โ”‚ โ”‚ Kg โ”‚ โ”‚ Kb โ”‚ = โ”‚ Out โ”‚ -โ”‚ โ”‚ โ”‚ 
โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ +Input (3 channels): Kernel (3->1): Output (1 channel): ++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +| R | | G | | B | * | Kr | | Kg | | Kb | = | Out | +| | | | | | | | | | | | | | ++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ Computation: Output[i,j] = Sum(R[i,j] * Kr + G[i,j] * Kg + B[i,j] * Kb) ``` @@ -393,12 +393,12 @@ def conv2d_naive(input: np.ndarray, kernel: np.ndarray) -> np.ndarray: return output ### END SOLUTION -# โœ… IMPLEMENTATION CHECKPOINT: Basic convolution complete +# PASS IMPLEMENTATION CHECKPOINT: Basic convolution complete -# ๐Ÿค” PREDICTION: How many multiply-add operations does a 3ร—3 convolution on a 28ร—28 image require? +# THINK PREDICTION: How many multiply-add operations does a 3*3 convolution on a 28*28 image require? # Your guess: _______ operations -# ๐Ÿ” SYSTEMS INSIGHT #1: Convolution Computational Complexity +# MAGNIFY SYSTEMS INSIGHT #1: Convolution Computational Complexity def analyze_convolution_complexity(): """Analyze computational cost of convolution operations.""" try: @@ -426,17 +426,17 @@ def analyze_convolution_complexity(): operations = out_h * out_w * kernel.shape[0] * kernel.shape[1] ops_per_sec = operations / elapsed if elapsed > 0 else float('inf') - print(f"{h}ร—{w}\t\t{operations:,}\t\t{elapsed*1000:.2f}\t\t{ops_per_sec:,.0f}") + print(f"{h}*{w}\t\t{operations:,}\t\t{elapsed*1000:.2f}\t\t{ops_per_sec:,.0f}") # Real-world context - print("\n๐Ÿ’ก Real-World Context:") - print("โ€ข CIFAR-10 (32ร—32): ~25K operations per 3ร—3 conv") - print("โ€ข ImageNet (224ร—224): ~1.2M operations per 3ร—3 conv") + print("\nTIP Real-World Context:") + print("โ€ข CIFAR-10 (32*32): ~25K operations per 3*3 conv") + print("โ€ข ImageNet (224*224): ~1.2M operations per 3*3 conv") print("โ€ข ResNet-50 has ~25M conv 
operations per forward pass!") print("โ€ข Modern GPUs can perform 100+ TOPS (trillion ops/sec)") except Exception as e: - print(f"โš ๏ธ Error in complexity analysis: {e}") + print(f"WARNING๏ธ Error in complexity analysis: {e}") print("Make sure conv2d_naive is implemented correctly") # Run the analysis @@ -444,7 +444,7 @@ analyze_convolution_complexity() # %% [markdown] """ -### ๐Ÿงช Unit Test: Convolution Operation +### TEST Unit Test: Convolution Operation Let us test your convolution implementation right away! This is the core operation that powers computer vision. @@ -470,10 +470,10 @@ def test_unit_convolution_operation(): print(f"Expected:\n{expected}") assert np.allclose(result, expected), f"Convolution failed: expected {expected}, got {result}" - print("โœ… Simple convolution test passed") + print("PASS Simple convolution test passed") except Exception as e: - print(f"โŒ Simple convolution test failed: {e}") + print(f"FAIL Simple convolution test failed: {e}") raise # Test edge detection kernel @@ -485,10 +485,10 @@ def test_unit_convolution_operation(): expected = np.array([[0, 0], [0, 0]], dtype=np.float32) # Uniform region = no edges assert np.allclose(result, expected), f"Edge detection failed: expected {expected}, got {result}" - print("โœ… Edge detection test passed") + print("PASS Edge detection test passed") except Exception as e: - print(f"โŒ Edge detection test failed: {e}") + print(f"FAIL Edge detection test failed: {e}") raise # Test output shape @@ -500,18 +500,18 @@ def test_unit_convolution_operation(): expected_shape = (3, 3) # 5-3+1 = 3 assert result.shape == expected_shape, f"Output shape wrong: expected {expected_shape}, got {result.shape}" - print("โœ… Output shape test passed") + print("PASS Output shape test passed") except Exception as e: - print(f"โŒ Output shape test failed: {e}") + print(f"FAIL Output shape test failed: {e}") raise # Show the convolution process - print("๐ŸŽฏ Convolution behavior:") + print("TARGET Convolution 
behavior:") print(" Slides kernel across input") print(" Computes dot product at each position") print(" Output size = Input size - Kernel size + 1") - print("📈 Progress: Convolution operation ✓") + print("PROGRESS Progress: Convolution operation OK") # Call the test immediately test_unit_convolution_operation() @@ -539,7 +539,7 @@ A **Conv2D layer** is a learnable convolutional layer that: - **Autonomous driving**: Identify road features ### Design Decisions -- **Kernel size**: Typically 3×3 or 5×5 for balance of locality and capacity +- **Kernel size**: Typically 3*3 or 5*5 for balance of locality and capacity - **Initialization**: Small random values to break symmetry - **Integration**: Works with Tensor class and other layers """ @@ -622,7 +622,7 @@ class SimpleConv2D: # %% [markdown] """ -### 🧪 Unit Test: Conv2D Layer +### TEST Unit Test: Conv2D Layer Let us test your Conv2D layer implementation! This is a learnable convolutional layer that can be trained. @@ -643,7 +643,7 @@ def test_unit_simple_conv2d_layer(): # Test that kernel is initialized properly assert layer.kernel.shape == (2, 2), f"Kernel shape should be (2, 2), got {layer.kernel.shape}" assert not np.allclose(layer.kernel, 0), "Kernel should not be all zeros" - print("✅ Conv2D layer initialization successful") + print("PASS Conv2D layer initialization successful") # Test with sample input x = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) @@ -656,10 +656,10 @@ def test_unit_simple_conv2d_layer(): # Verify shapes assert y.shape == (2, 2), f"Output shape should be (2, 2), got {y.shape}" assert isinstance(y, Tensor), "Output should be a Tensor" - print("✅ Conv2D layer forward pass successful") + print("PASS Conv2D layer forward pass successful") except Exception as e: - print(f"❌ SimpleConv2D layer test failed: {e}") + print(f"FAIL SimpleConv2D layer test failed: {e}") raise # Test different kernel sizes @@ -669,18 +669,18 @@ def test_unit_simple_conv2d_layer(): y_3x3 = layer_3x3(x_5x5)
assert y_3x3.shape == (3, 3), f"3x3 kernel output should be (3, 3), got {y_3x3.shape}" - print("โœ… Different kernel sizes work correctly") + print("PASS Different kernel sizes work correctly") except Exception as e: - print(f"โŒ Different kernel sizes test failed: {e}") + print(f"FAIL Different kernel sizes test failed: {e}") raise # Show the layer behavior - print("๐ŸŽฏ Conv2D layer behavior:") + print("TARGET Conv2D layer behavior:") print(" Learnable kernel weights") print(" Applies convolution to detect patterns") print(" Can be trained end-to-end") - print("๐Ÿ“ˆ Progress: Convolution operation โœ“, Conv2D layer โœ“") + print("PROGRESS Progress: Convolution operation OK, Conv2D layer OK") # Call the test immediately test_unit_simple_conv2d_layer() @@ -713,12 +713,12 @@ Each output feature map is computed by: 3. **Summation**: Sum across input channels for each output pixel ### Systems Insight: Parameter Scaling -- **Single channel**: 1 filter = Kร—K parameters -- **Multi-channel**: 1 filter = in_channels ร— Kร—K parameters -- **Multiple filters**: out_channels ร— in_channels ร— Kร—K total parameters +- **Single channel**: 1 filter = K*K parameters +- **Multi-channel**: 1 filter = in_channels * K*K parameters +- **Multiple filters**: out_channels * in_channels * K*K total parameters - **Memory impact**: Parameters grow linearly with channels -Example: 32 filters of size 3ร—3 on RGB input = 32 ร— 3 ร— 3 ร— 3 = 864 parameters +Example: 32 filters of size 3*3 on RGB input = 32 * 3 * 3 * 3 = 864 parameters """ # %% nbgrader={"grade": false, "grade_id": "multi-channel-conv2d", "locked": false, "schema_version": 3, "solution": true, "task": false} @@ -734,27 +734,27 @@ class Conv2D(Module): VISUAL ARCHITECTURE: ``` Input Tensor: Weight Tensor: Output Tensor: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ in_channels โ”‚ โ”‚ 
out_channels โ”‚ โ”‚ out_channels โ”‚ - โ”‚ ร— โ”‚ * โ”‚ ร— โ”‚ = โ”‚ ร— โ”‚ - โ”‚ heightร—width โ”‚ โ”‚ in_chร—kernร—kern โ”‚ โ”‚ out_heightร—widthโ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +-----------------+ +-----------------+ +-----------------+ + | in_channels | | out_channels | | out_channels | + | * | * | * | = | * | + | height*width | | in_ch*kern*kern | | out_height*width| + +-----------------+ +-----------------+ +-----------------+ Memory Layout (NCHW format): - Batch โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - 0 โ”‚ Ch0[Hร—W] Ch1[Hร—W] Ch2[Hร—W] ... โ”‚ - 1 โ”‚ Ch0[Hร—W] Ch1[Hร—W] Ch2[Hร—W] ... โ”‚ - 2 โ”‚ Ch0[Hร—W] Ch1[Hร—W] Ch2[Hร—W] ... โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + Batch +------------------------------------------+ + 0 | Ch0[H*W] Ch1[H*W] Ch2[H*W] ... | + 1 | Ch0[H*W] Ch1[H*W] Ch2[H*W] ... | + 2 | Ch0[H*W] Ch1[H*W] Ch2[H*W] ... 
| + +------------------------------------------+ ``` PARAMETER CALCULATION: ``` - Weight Parameters: out_channels ร— in_channels ร— kernel_h ร— kernel_w + Weight Parameters: out_channels * in_channels * kernel_h * kernel_w Bias Parameters: out_channels (if bias=True) - Total Parameters: (out_ch ร— in_ch ร— k_h ร— k_w) + (out_ch if bias else 0) + Total Parameters: (out_ch * in_ch * k_h * k_w) + (out_ch if bias else 0) - Example: Conv2D(3, 64, (3,3)) = 64 ร— 3 ร— 3 ร— 3 + 64 = 1,792 parameters + Example: Conv2D(3, 64, (3,3)) = 64 * 3 * 3 * 3 + 64 = 1,792 parameters ``` """ @@ -779,12 +779,12 @@ class Conv2D(Module): LEARNING CONNECTIONS: - **Production CNNs**: This matches PyTorch's nn.Conv2D parameter structure - - **Memory Scaling**: Parameters = out_channels ร— in_channels ร— kernel_height ร— kernel_width + - **Memory Scaling**: Parameters = out_channels * in_channels * kernel_height * kernel_width - **He Initialization**: Maintains activation variance through deep networks - **Feature Learning**: Each filter learns different patterns across all input channels EXAMPLE: - # For CIFAR-10 RGB images (3 channels) โ†’ 32 feature maps + # For CIFAR-10 RGB images (3 channels) -> 32 feature maps conv = Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3)) # Creates weight: shape (32, 3, 3, 3) = 864 parameters @@ -901,7 +901,7 @@ class Conv2D(Module): # %% [markdown] """ -### ๐Ÿงช Unit Test: Multi-Channel Conv2D Layer +### TEST Unit Test: Multi-Channel Conv2D Layer Let us test your multi-channel Conv2D implementation! This handles RGB images and multiple filters like production CNNs. 
@@ -914,7 +914,7 @@ print("🔬 Unit Test: Multi-Channel Conv2D Layer...") # Test 1: RGB to feature maps (CIFAR-10 scenario) try: - # Create layer: 3 RGB channels → 8 feature maps + # Create layer: 3 RGB channels -> 8 feature maps conv_rgb = Conv2D(in_channels=3, out_channels=8, kernel_size=(3, 3)) print(f"Multi-channel Conv2D created:") @@ -927,7 +927,7 @@ try: assert conv_rgb.weight.shape == (8, 3, 3, 3), f"Weight shape should be (8, 3, 3, 3), got {conv_rgb.weight.shape}" assert not np.allclose(conv_rgb.weight.data, 0), "Weights should not be all zeros" assert conv_rgb.bias.shape == (8,), f"Bias shape should be (8,), got {conv_rgb.bias.shape}" - print("✅ Multi-channel layer initialization successful") + print("PASS Multi-channel layer initialization successful") # Test with RGB image (simulated CIFAR-10 patch) rgb_image = Tensor(np.random.randn(3, 8, 8)) # 3 channels, 8x8 image @@ -941,10 +941,10 @@ try: assert feature_maps.shape == expected_shape, f"Output shape should be {expected_shape}, got {feature_maps.shape}" # Output should be a Tensor (autograd integration added later) assert isinstance(feature_maps, Tensor), "Output should be a Tensor" - print("✅ RGB convolution test passed") + print("PASS RGB convolution test passed") except Exception as e: - print(f"❌ RGB convolution test failed: {e}") + print(f"FAIL RGB convolution test failed: {e}") raise # Test 2: Batch processing @@ -955,34 +955,34 @@ try: expected_batch_shape = (4, 8, 8, 8) # 4 images, 8 channels, 10-3+1=8 spatial assert batch_output.shape == expected_batch_shape, f"Batch output shape should be {expected_batch_shape}, got {batch_output.shape}" - print("✅ Batch processing test passed") + print("PASS Batch processing test passed") except Exception as e: - print(f"❌ Batch processing test failed: {e}") + print(f"FAIL Batch processing test failed: {e}") raise # Test 3: Different channel configurations try: - # Test 1→16 channels (grayscale to features) + # Test 1->16 channels (grayscale
to features) conv_grayscale = Conv2D(in_channels=1, out_channels=16, kernel_size=(5, 5)) gray_image = Tensor(np.random.randn(1, 12, 12)) # 1 channel, 12x12 gray_features = conv_grayscale(gray_image) expected_gray_shape = (16, 8, 8) # 16 channels, 12-5+1=8 spatial assert gray_features.shape == expected_gray_shape, f"Grayscale output should be {expected_gray_shape}, got {gray_features.shape}" - print("✅ Grayscale convolution test passed") + print("PASS Grayscale convolution test passed") - # Test 32→64 channels (feature maps to more feature maps) + # Test 32->64 channels (feature maps to more feature maps) conv_deep = Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3)) deep_features = Tensor(np.random.randn(32, 6, 6)) # 32 channels, 6x6 deeper_features = conv_deep(deep_features) expected_deep_shape = (64, 4, 4) # 64 channels, 6-3+1=4 spatial assert deeper_features.shape == expected_deep_shape, f"Deep features should be {expected_deep_shape}, got {deeper_features.shape}" - print("✅ Deep feature convolution test passed") + print("PASS Deep feature convolution test passed") except Exception as e: - print(f"❌ Different channel configurations test failed: {e}") + print(f"FAIL Different channel configurations test failed: {e}") raise # Test 4: Parameter counting @@ -993,42 +993,42 @@ try: assert params_3_to_8 == expected_params, f"Parameter count should be {expected_params}, got {params_3_to_8}" print(f"Parameter scaling verification:") - print(f" 3→8 channels, 3x3 kernel: {params_3_to_8} parameters") + print(f" 3->8 channels, 3x3 kernel: {params_3_to_8} parameters") print(f" Breakdown: {8*3*3*3} weights + {8} bias = {expected_params}") - print("✅ Parameter counting test passed") + print("PASS Parameter counting test passed") except Exception as e: - print(f"❌ Parameter counting test failed: {e}") + print(f"FAIL Parameter counting test failed: {e}") raise # Show multi-channel behavior -print("🎯 Multi-channel Conv2D behavior:") +print("TARGET
Multi-channel Conv2D behavior:") print(" Processes multiple input channels (RGB, feature maps)") print(" Produces multiple output feature maps") print(" Each filter mixes information across ALL input channels") -print(" Parameter count = out_channels × in_channels × kernel_h × kernel_w") -print("📈 Progress: Single-channel ✓, Multi-channel ✓") +print(" Parameter count = out_channels * in_channels * kernel_h * kernel_w") +print("PROGRESS Progress: Single-channel OK, Multi-channel OK") -# ✅ IMPLEMENTATION CHECKPOINT: Multi-channel convolution complete +# PASS IMPLEMENTATION CHECKPOINT: Multi-channel convolution complete -# 🤔 PREDICTION: How much memory does a Conv2D(3, 64, (3,3)) layer use for parameters? -# Your calculation: _____ parameters × 4 bytes = _____ MB +# THINK PREDICTION: How much memory does a Conv2D(3, 64, (3,3)) layer use for parameters? +# Your calculation: _____ parameters * 4 bytes = _____ MB -# 🔍 SYSTEMS INSIGHT #2: CNN Memory Scaling Analysis +# MAGNIFY SYSTEMS INSIGHT #2: CNN Memory Scaling Analysis def analyze_cnn_memory_scaling(): """Analyze memory usage patterns in CNN architectures.""" try: # Common CNN configurations configs = [ - ("Input→First", 3, 32, (3, 3)), - ("Conv1→Conv2", 32, 64, (3, 3)), - ("Conv2→Conv3", 64, 128, (3, 3)), - ("Conv3→Conv4", 128, 256, (3, 3)), + ("Input->First", 3, 32, (3, 3)), + ("Conv1->Conv2", 32, 64, (3, 3)), + ("Conv2->Conv3", 64, 128, (3, 3)), + ("Conv3->Conv4", 128, 256, (3, 3)), ("Deep Layer", 256, 512, (3, 3)) ] print("CNN Memory Scaling Analysis:") - print("Layer\t\tParams\t\tMemory (MB)\tActivations (32×32)") + print("Layer\t\tParams\t\tMemory (MB)\tActivations (32*32)") print("-" * 65) total_params = 0 @@ -1040,8 +1040,8 @@ def analyze_cnn_memory_scaling(): # Memory for parameters (float32 = 4 bytes) param_memory_mb = params * 4 / (1024 * 1024) - # Activation memory (assuming 32×32 input, float32) - # Output size ≈ 30×30 for 3×3 conv on 32×32 input + # Activation memory
(assuming 32*32 input, float32) + # Output size ~= 30*30 for 3*3 conv on 32*32 input act_size = out_ch * 30 * 30 * 4 / (1024 * 1024) total_params += params @@ -1052,21 +1052,21 @@ def analyze_cnn_memory_scaling(): print(f"Total Memory: {total_params * 4 / (1024*1024):.2f} MB") # Real-world context - print("\n💡 Production Comparison:") + print("\nTIP Production Comparison:") print("• Your CNN: ~1M parameters") print("• ResNet-50: 25M parameters (100 MB)") print("• GPT-3: 175B parameters (700 GB!)") print("• Modern GPUs: 24-80 GB memory") # Memory bottleneck analysis - print("\n⚠️ Memory Bottlenecks:") - print("• Parameters grow as in_channels × out_channels") + print("\nWARNING️ Memory Bottlenecks:") + print("• Parameters grow as in_channels * out_channels") print("• Activations often use more memory than parameters") print("• Batch size multiplies activation memory") print("• Gradients double memory usage during training") except Exception as e: - print(f"⚠️ Error in memory analysis: {e}") + print(f"WARNING️ Error in memory analysis: {e}") print("Make sure Conv2D class is implemented correctly") # Run the analysis @@ -1082,13 +1082,13 @@ Let us analyze how memory requirements scale with channels and understand the tr # %% nbgrader={"grade": false, "grade_id": "multi-channel-memory-analysis", "locked": false, "schema_version": 3, "solution": false, "task": false} def analyze_conv_memory_scaling(): """Analyze memory requirements for different channel configurations.""" - print("🔍 MULTI-CHANNEL MEMORY SCALING ANALYSIS") + print("MAGNIFY MULTI-CHANNEL MEMORY SCALING ANALYSIS") print("=" * 50) configurations = [ - (1, 16, (3, 3)), # Grayscale → features - (3, 32, (3, 3)), # RGB → features - (32, 64, (3, 3)), # Features → more features + (1, 16, (3, 3)), # Grayscale -> features + (3, 32, (3, 3)), # RGB -> features + (32, 64, (3, 3)), # Features -> more features (64, 128, (3, 3)), # Deep features (3, 32, (5, 5)), # RGB with
larger kernel (3, 32, (7, 7)), # RGB with very large kernel @@ -1107,13 +1107,13 @@ def analyze_conv_memory_scaling(): input_mb = (in_c * 32 * 32 * 4) / (1024 * 1024) output_mb = (out_c * (32-kernel_height+1) * (32-kernel_width+1) * 4) / (1024 * 1024) - print(f" {in_c:3d}→{out_c:3d} channels, {kernel_height}x{kernel_width} kernel:") + print(f" {in_c:3d}->{out_c:3d} channels, {kernel_height}x{kernel_width} kernel:") print(f" Parameters: {total_params:,} ({memory_mb:.3f} MB)") print(f" Activations: {input_mb:.3f} MB input + {output_mb:.3f} MB output") print(f" Total memory: {memory_mb + input_mb + output_mb:.3f} MB") - print("\n💡 Key Memory Insights:") - print(" • Parameters scale as: out_channels × in_channels × kernel_size²") + print("\nTIP Key Memory Insights:") + print(" • Parameters scale as: out_channels * in_channels * kernel_size²") print(" • Larger kernels dramatically increase memory (5x5 = 2.8x vs 3x3)") print(" • Channel depth matters more than spatial size for parameters") print(" • Activation memory depends on spatial dimensions") @@ -1123,9 +1123,9 @@ def analyze_conv_memory_scaling(): # Run memory analysis try: analyze_conv_memory_scaling() - print("✅ Memory scaling analysis completed") + print("PASS Memory scaling analysis completed") except Exception as e: - print(f"⚠️ Memory analysis had issues: {e}") + print(f"WARNING️ Memory analysis had issues: {e}") # %% [markdown] """ @@ -1141,7 +1141,7 @@ except Exception as e: - **Overfitting reduction**: Acts as a form of regularization ### Real-World Usage -- **After convolution**: Conv2D → ReLU → MaxPool2D is a common pattern +- **After convolution**: Conv2D -> ReLU -> MaxPool2D is a common pattern - **Progressive downsampling**: Each pool layer reduces spatial dimensions - **Feature concentration**: Keeps most important activations """ @@ -1157,31 +1157,31 @@ class MaxPool2D: VISUAL POOLING OPERATION: ``` - Input (4×4): 2×2 MaxPool: Output (2×2): -
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 1 2 3 4โ”‚ โ”‚ 1 2 โ”‚ 3 4 โ”‚ โ”‚ 6 โ”‚ 8 โ”‚ - โ”‚ 5 6 7 8โ”‚ โ†’ โ”‚ 5 6 โ”‚ 7 8 โ”‚ โ†’ โ”‚ โ”‚ โ”‚ - โ”‚ 9 10 11 12โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ 13 14 15 16โ”‚ โ”‚ 9 10โ”‚11 12โ”‚ โ”‚ 14 โ”‚ 16 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚13 14โ”‚15 16โ”‚ โ”‚ โ”‚ โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”˜โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜โ”€โ”€โ”€โ”€โ”€โ”˜ + Input (4*4): 2*2 MaxPool: Output (2*2): + +-------------+ +-----+-----+ +-----+-----+ + | 1 2 3 4| | 1 2 | 3 4 | | 6 | 8 | + | 5 6 7 8| -> | 5 6 | 7 8 | -> | | | + | 9 10 11 12| +-----+-----โ”ค +-----+-----โ”ค + | 13 14 15 16| | 9 10|11 12| | 14 | 16 | + +-------------+ |13 14|15 16| | | | + +-----+-----+ +-----+-----+ max([1,2,5,6])=6 max([3,4,7,8])=8 ``` MEMORY REDUCTION: ``` - Before MaxPool: 32 ร— 32 ร— 64 = 65,536 values - After MaxPool: 16 ร— 16 ร— 64 = 16,384 values (4ร— reduction) + Before MaxPool: 32 * 32 * 64 = 65,536 values + After MaxPool: 16 * 16 * 64 = 16,384 values (4* reduction) Typical CNN Pattern: - Conv2D โ†’ ReLU โ†’ MaxPool2D โ†’ Conv2D โ†’ ReLU โ†’ MaxPool2D ... - (32,32,3) โ†’ (32,32,64) โ†’ (16,16,64) โ†’ (16,16,128) โ†’ (8,8,128) + Conv2D -> ReLU -> MaxPool2D -> Conv2D -> ReLU -> MaxPool2D ... + (32,32,3) -> (32,32,64) -> (16,16,64) -> (16,16,128) -> (8,8,128) ``` WHY MAX POOLING WORKS: โ€ข Translation Invariance: Small shifts don't change max value โ€ข Feature Robustness: Preserves strongest activations - โ€ข Computational Efficiency: Reduces data by 4ร— (2ร—2 pooling) + โ€ข Computational Efficiency: Reduces data by 4* (2*2 pooling) โ€ข Memory Efficiency: Less data to process in deeper layers """ @@ -1291,7 +1291,7 @@ class MaxPool2D: # %% [markdown] """ -### ๐Ÿงช Unit Test: MaxPool2D Layer +### TEST Unit Test: MaxPool2D Layer Let us test your MaxPool2D implementation! 
This provides spatial downsampling for efficient computation. @@ -1320,17 +1320,17 @@ try: print(f"Pooled:\n{pooled.data}") # Verify shape - expected_shape = (2, 2) # 4x4 → 2x2 with 2x2 pooling + expected_shape = (2, 2) # 4x4 -> 2x2 with 2x2 pooling assert pooled.shape == expected_shape, f"Pooled shape should be {expected_shape}, got {pooled.shape}" # Verify values (each 2x2 window's maximum) expected_values = np.array([[6, 8], [14, 16]]) # Max of each 2x2 window assert np.array_equal(pooled.data, expected_values), f"Expected {expected_values}, got {pooled.data}" - print("✅ Basic 2x2 pooling test passed") + print("PASS Basic 2x2 pooling test passed") except Exception as e: - print(f"❌ Basic pooling test failed: {e}") + print(f"FAIL Basic pooling test failed: {e}") raise # Test 2: Multi-channel pooling @@ -1352,10 +1352,10 @@ try: expected_multi_shape = (2, 2, 2) # 2 channels, 2x2 spatial assert pooled_multi.shape == expected_multi_shape, f"Multi-channel shape should be {expected_multi_shape}, got {pooled_multi.shape}" - print("✅ Multi-channel pooling test passed") + print("PASS Multi-channel pooling test passed") except Exception as e: - print(f"❌ Multi-channel pooling test failed: {e}") + print(f"FAIL Multi-channel pooling test failed: {e}") raise # Test 3: Different pool sizes @@ -1365,53 +1365,53 @@ try: input_6x6 = Tensor(np.arange(36).reshape(6, 6)) # 6x6 input pooled_3x3 = pool_3x3(input_6x6) - expected_3x3_shape = (2, 2) # 6x6 → 2x2 with 3x3 pooling, stride 3 + expected_3x3_shape = (2, 2) # 6x6 -> 2x2 with 3x3 pooling, stride 3 assert pooled_3x3.shape == expected_3x3_shape, f"3x3 pooling shape should be {expected_3x3_shape}, got {pooled_3x3.shape}" - print("✅ Different pool sizes test passed") + print("PASS Different pool sizes test passed") except Exception as e: - print(f"❌ Different pool sizes test failed: {e}") + print(f"FAIL Different pool sizes test failed: {e}") raise # Test 4: Integration with convolution try: - # Test Conv2D →
MaxPool2D pipeline + # Test Conv2D -> MaxPool2D pipeline conv = Conv2D(in_channels=1, out_channels=4, kernel_size=(3, 3)) pool_after_conv = MaxPool2D(pool_size=(2, 2)) # Input image input_image = Tensor(np.random.randn(1, 8, 8)) # 1 channel, 8x8 - # Forward pass: Conv → Pool - conv_output = conv(input_image) # (1,8,8) → (4,6,6) - pool_output = pool_after_conv(conv_output) # (4,6,6) → (4,3,3) + # Forward pass: Conv -> Pool + conv_output = conv(input_image) # (1,8,8) -> (4,6,6) + pool_output = pool_after_conv(conv_output) # (4,6,6) -> (4,3,3) assert conv_output.shape == (4, 6, 6), f"Conv output should be (4,6,6), got {conv_output.shape}" assert pool_output.shape == (4, 3, 3), f"Pool output should be (4,3,3), got {pool_output.shape}" - print("✅ Conv → Pool integration test passed") + print("PASS Conv -> Pool integration test passed") except Exception as e: - print(f"❌ Conv → Pool integration test failed: {e}") + print(f"FAIL Conv -> Pool integration test failed: {e}") raise # Show pooling behavior -print("🎯 MaxPool2D behavior:") +print("TARGET MaxPool2D behavior:") print(" Reduces spatial dimensions by taking maximum in each window") print(" Provides translation invariance") print(" No learnable parameters") -print(" Common pattern: Conv2D → ReLU → MaxPool2D") -print("📈 Progress: Single-channel ✓, Multi-channel ✓, Pooling ✓") +print(" Common pattern: Conv2D -> ReLU -> MaxPool2D") +print("PROGRESS Progress: Single-channel OK, Multi-channel OK, Pooling OK") -# ✅ IMPLEMENTATION CHECKPOINT: MaxPool2D layer complete +# PASS IMPLEMENTATION CHECKPOINT: MaxPool2D layer complete -# 🤔 PREDICTION: If a 32×32 image goes through three 2×2 MaxPool layers, what's the final size? -# Size after pool 1: ___×___ -# Size after pool 2: ___×___ -# Size after pool 3: ___×___ +# THINK PREDICTION: If a 32*32 image goes through three 2*2 MaxPool layers, what's the final size?
+# Size after pool 1: ___*___ +# Size after pool 2: ___*___ +# Size after pool 3: ___*___ -# 🔍 SYSTEMS INSIGHT #3: Spatial Dimension Reduction Analysis +# MAGNIFY SYSTEMS INSIGHT #3: Spatial Dimension Reduction Analysis def analyze_spatial_reduction(): """Analyze how pooling affects spatial dimensions and memory.""" try: @@ -1431,19 +1431,19 @@ def analyze_spatial_reduction(): memory_mb = ch * current_size * current_size * 4 / (1024 * 1024) layer_name = f"Layer {i+1}" if i > 0 else "Input" - print(f"{layer_name:12s}\t{current_size}×{current_size}\t\t{ch}\t\t{memory_mb:.1f}\t\t{total_reduction:.1f}×") + print(f"{layer_name:12s}\t{current_size}*{current_size}\t\t{ch}\t\t{memory_mb:.1f}\t\t{total_reduction:.1f}*") - # Apply pooling (2×2) after each layer except last + # Apply pooling (2*2) after each layer except last if i < len(channels) - 1: - current_size = current_size // 2 # MaxPool2D reduces by 2× - total_reduction *= 4 # 2×2 = 4× reduction in total pixels + current_size = current_size // 2 # MaxPool2D reduces by 2* + total_reduction *= 4 # 2*2 = 4* reduction in total pixels - print(f"\n📊 Final Reduction: {total_reduction:.0f}× fewer pixels") - print(f" Original: {initial_size}×{initial_size} = {initial_size**2:,} pixels") - print(f" Final: {current_size}×{current_size} = {current_size**2:,} pixels") + print(f"\n📊 Final Reduction: {total_reduction:.0f}* fewer pixels") + print(f" Original: {initial_size}*{initial_size} = {initial_size**2:,} pixels") + print(f" Final: {current_size}*{current_size} = {current_size**2:,} pixels") # Real-world implications - print("\n💡 Why This Matters:") + print("\nTIP Why This Matters:") print("• Pooling reduces overfitting (less spatial detail)") print("• Enables larger receptive fields in deeper layers") print("• Dramatically reduces memory and computation") @@ -1456,7 +1456,7 @@ def analyze_spatial_reduction(): print("• Modern alternatives: strided convolutions, attention") except Exception as e: -
print(f"⚠️ Error in spatial analysis: {e}") + print(f"WARNING️ Error in spatial analysis: {e}") print("Make sure MaxPool2D class is implemented correctly") # Run the analysis @@ -1477,7 +1477,7 @@ analyze_spatial_reduction() ### The Pattern ``` -Conv2D → ReLU → MaxPool2D → Flatten → Linear → Output +Conv2D -> ReLU -> MaxPool2D -> Flatten -> Linear -> Output ``` ### Real-World Usage @@ -1491,12 +1491,12 @@ Conv2D → ReLU → MaxPool2D → Flatten → Linear → Output # Note: The flatten function is already implemented in the Spatial Helper Functions section above. # We use that single implementation throughout this module for consistency and clarity. -print("✅ Flatten function is available from the Spatial Helper Functions section") -print("🔍 The flatten() function handles tensor flattening for CNN-to-Linear transitions") +print("PASS Flatten function is available from the Spatial Helper Functions section") +print("MAGNIFY The flatten() function handles tensor flattening for CNN-to-Linear transitions") # %% [markdown] """ -### 🧪 Unit Test: Flatten Function +### TEST Unit Test: Flatten Function Let us test your flatten function! This connects convolutional layers to dense layers.
@@ -1520,10 +1520,10 @@ try: assert flattened.shape == (1, 4), f"Flattened shape should be (1, 4), got {flattened.shape}" expected_data = np.array([[1, 2, 3, 4]]) assert np.array_equal(flattened.data, expected_data), f"Flattened data should be {expected_data}, got {flattened.data}" - print("✅ 2x2 flatten test passed") + print("PASS 2x2 flatten test passed") except Exception as e: - print(f"❌ 2x2 flatten test failed: {e}") + print(f"FAIL 2x2 flatten test failed: {e}") raise # Test case 2: 3x3 tensor @@ -1534,10 +1534,10 @@ try: assert flattened2.shape == (1, 9), f"Flattened shape should be (1, 9), got {flattened2.shape}" expected_data2 = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9]]) assert np.array_equal(flattened2.data, expected_data2), f"Flattened data should be {expected_data2}, got {flattened2.data}" - print("✅ 3x3 flatten test passed") + print("PASS 3x3 flatten test passed") except Exception as e: - print(f"❌ 3x3 flatten test failed: {e}") + print(f"FAIL 3x3 flatten test failed: {e}") raise # Test case 3: Different shapes @@ -1548,18 +1548,18 @@ try: assert flattened3.shape == (1, 8), f"Flattened shape should be (1, 8), got {flattened3.shape}" expected_data3 = np.array([[1, 2, 3, 4, 5, 6, 7, 8]]) assert np.array_equal(flattened3.data, expected_data3), f"Flattened data should be {expected_data3}, got {flattened3.data}" - print("✅ Different shapes flatten test passed") + print("PASS Different shapes flatten test passed") except Exception as e: - print(f"❌ Different shapes flatten test failed: {e}") + print(f"FAIL Different shapes flatten test failed: {e}") raise # Show the flattening behavior -print("🎯 Flatten behavior:") +print("TARGET Flatten behavior:") print(" Converts 2D tensor to 1D") print(" Preserves batch dimension") print(" Enables connection to Linear layers") -print("📈 Progress: Convolution operation ✓, Conv2D layer ✓, Flatten ✓") +print("PROGRESS Progress: Convolution operation OK, Conv2D layer OK, Flatten OK") # %% [markdown] """
@@ -1571,19 +1571,19 @@ Let us test our complete CNN system with realistic multi-channel scenarios: #### **CIFAR-10 Style CNN** ```python # RGB images to classification -RGB Input → Multi-Channel Conv2D → ReLU → MaxPool2D → Flatten → Linear → Output +RGB Input -> Multi-Channel Conv2D -> ReLU -> MaxPool2D -> Flatten -> Linear -> Output ``` #### **Deep Multi-Channel CNN** ```python # Progressive feature extraction -RGB → Conv2D(3→32) → ReLU → Pool → Conv2D(32→64) → ReLU → Pool → Flatten → Linear +RGB -> Conv2D(3->32) -> ReLU -> Pool -> Conv2D(32->64) -> ReLU -> Pool -> Flatten -> Linear ``` #### **Production CNN Pattern** ```python # Full computer vision pipeline -RGB images → Feature extraction layers → Spatial downsampling → Classification head +RGB images -> Feature extraction layers -> Spatial downsampling -> Classification head ``` This comprehensive test ensures our multi-channel CNN components work together for real computer vision applications like CIFAR-10! @@ -1597,7 +1597,7 @@ try: # Test 1: CIFAR-10 Style RGB CNN Pipeline print("\n1.
CIFAR-10 Style RGB CNN Pipeline:") - # Create pipeline: RGB → Conv2D(3→16) → ReLU → MaxPool2D → Flatten → Linear + # Create pipeline: RGB -> Conv2D(3->16) -> ReLU -> MaxPool2D -> Flatten -> Linear rgb_conv = Conv2D(in_channels=3, out_channels=16, kernel_size=(3, 3)) relu = ReLU() pool = MaxPool2D(pool_size=(2, 2)) @@ -1608,11 +1608,11 @@ try: print(f"RGB input shape: {rgb_image.shape}") # Forward pass through complete pipeline - conv_features = rgb_conv(rgb_image) # (3,8,8) → (16,6,6) - activated = relu(conv_features) # (16,6,6) → (16,6,6) - pooled = pool(activated) # (16,6,6) → (16,3,3) - flattened = flatten(pooled, start_dim=0) # (16,3,3) → (1,144) - predictions = dense(flattened) # (1,144) → (1,10) + conv_features = rgb_conv(rgb_image) # (3,8,8) -> (16,6,6) + activated = relu(conv_features) # (16,6,6) -> (16,6,6) + pooled = pool(activated) # (16,6,6) -> (16,3,3) + flattened = flatten(pooled, start_dim=0) # (16,3,3) -> (1,144) + predictions = dense(flattened) # (1,144) -> (1,10) assert conv_features.shape == (16, 6, 6), f"Conv features wrong: {conv_features.shape}" assert activated.shape == (16, 6, 6), f"Activated features wrong: {activated.shape}" @@ -1620,12 +1620,12 @@ try: assert flattened.shape == (1, 144), f"Flattened features wrong: {flattened.shape}" assert predictions.shape == (1, 10), f"Predictions wrong: {predictions.shape}" - print("✅ CIFAR-10 style RGB pipeline works correctly") + print("PASS CIFAR-10 style RGB pipeline works correctly") # Test 2: Deep Multi-Channel CNN print("\n2.
Deep Multi-Channel CNN:") - # Create deeper pipeline: RGB → Conv1(3→32) → ReLU → Pool → Conv2(32→64) → ReLU → Pool → Linear + # Create deeper pipeline: RGB -> Conv1(3->32) -> ReLU -> Pool -> Conv2(32->64) -> ReLU -> Pool -> Linear conv1_deep = Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3)) relu1 = ReLU() pool1 = MaxPool2D(pool_size=(2, 2)) @@ -1639,14 +1639,14 @@ try: print(f"Large RGB input shape: {large_rgb.shape}") # Forward pass through deep network - h1 = conv1_deep(large_rgb) # (3,12,12) → (32,10,10) - h2 = relu1(h1) # (32,10,10) → (32,10,10) - h3 = pool1(h2) # (32,10,10) → (32,5,5) - h4 = conv2_deep(h3) # (32,5,5) → (64,3,3) - h5 = relu2(h4) # (64,3,3) → (64,3,3) - h6 = pool2(h5) # (64,3,3) → (64,1,1) - h7 = flatten(h6, start_dim=0) # (64,1,1) → (1,64) - output_deep = classifier_deep(h7) # (1,64) → (1,5) + h1 = conv1_deep(large_rgb) # (3,12,12) -> (32,10,10) + h2 = relu1(h1) # (32,10,10) -> (32,10,10) + h3 = pool1(h2) # (32,10,10) -> (32,5,5) + h4 = conv2_deep(h3) # (32,5,5) -> (64,3,3) + h5 = relu2(h4) # (64,3,3) -> (64,3,3) + h6 = pool2(h5) # (64,3,3) -> (64,1,1) + h7 = flatten(h6, start_dim=0) # (64,1,1) -> (1,64) + output_deep = classifier_deep(h7) # (1,64) -> (1,5) assert h1.shape == (32, 10, 10), f"Conv1 output wrong: {h1.shape}" assert h3.shape == (32, 5, 5), f"Pool1 output wrong: {h3.shape}" @@ -1655,7 +1655,7 @@ try: assert h7.shape == (1, 64), f"Final flatten wrong: {h7.shape}" assert output_deep.shape == (1, 5), f"Final prediction wrong: {output_deep.shape}" - print("✅ Deep multi-channel CNN works correctly") + print("PASS Deep multi-channel CNN works correctly") # Test 3: Batch Processing with Multi-Channel print("\n3.
Batch Processing Test:") @@ -1669,21 +1669,21 @@ try: print(f"Batch RGB input shape: {rgb_batch.shape}") # Forward pass to determine correct feature size - batch_conv_out = batch_conv(rgb_batch) # (4,3,6,6) โ†’ (4,8,4,4) - batch_pool_out = batch_pool(batch_conv_out) # (4,8,4,4) โ†’ (4,8,2,2) - batch_flat = flatten(batch_pool_out) # (4,8,2,2) โ†’ (4,32) + batch_conv_out = batch_conv(rgb_batch) # (4,3,6,6) -> (4,8,4,4) + batch_pool_out = batch_pool(batch_conv_out) # (4,8,4,4) -> (4,8,2,2) + batch_flat = flatten(batch_pool_out) # (4,8,2,2) -> (4,32) # Create classifier with correct input size feature_size = batch_flat.shape[1] # 32 features batch_classifier = Linear(input_size=feature_size, output_size=3) - batch_pred = batch_classifier(batch_flat) # (4,32) โ†’ (4,3) + batch_pred = batch_classifier(batch_flat) # (4,32) -> (4,3) assert batch_conv_out.shape == (4, 8, 4, 4), f"Batch conv wrong: {batch_conv_out.shape}" assert batch_pool_out.shape == (4, 8, 2, 2), f"Batch pool wrong: {batch_pool_out.shape}" assert batch_flat.shape == (4, 32), f"Batch flatten wrong: {batch_flat.shape}" assert batch_pred.shape == (4, 3), f"Batch prediction wrong: {batch_pred.shape}" - print("โœ… Batch processing with multi-channel works correctly") + print("PASS Batch processing with multi-channel works correctly") # Test 4: Backward Compatibility with Single Channel print("\n4. Backward Compatibility Test:") @@ -1694,17 +1694,17 @@ try: gray_features = gray_conv(gray_image) assert gray_features.shape == (8, 4, 4), f"Grayscale features wrong: {gray_features.shape}" - print("โœ… Single-channel compatibility works correctly") + print("PASS Single-channel compatibility works correctly") # Test 5: Memory and Parameter Analysis print("\n5. 
Memory and Parameter Analysis:") # Analyze different configurations configs = [ - (Conv2D(in_channels=1, out_channels=8, kernel_size=(3, 3)), "1โ†’8 channels"), - (Conv2D(in_channels=3, out_channels=16, kernel_size=(3, 3)), "3โ†’16 channels (RGB)"), - (Conv2D(in_channels=16, out_channels=32, kernel_size=(3, 3)), "16โ†’32 channels"), - (Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3)), "32โ†’64 channels"), + (Conv2D(in_channels=1, out_channels=8, kernel_size=(3, 3)), "1->8 channels"), + (Conv2D(in_channels=3, out_channels=16, kernel_size=(3, 3)), "3->16 channels (RGB)"), + (Conv2D(in_channels=16, out_channels=32, kernel_size=(3, 3)), "16->32 channels"), + (Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3)), "32->64 channels"), ] for conv_layer, desc in configs: @@ -1712,26 +1712,26 @@ try: memory_mb = params * 4 / (1024 * 1024) # float32 = 4 bytes print(f" {desc}: {params:,} parameters ({memory_mb:.3f} MB)") - print("โœ… Memory analysis completed") + print("PASS Memory analysis completed") - print("\n๐ŸŽ‰ Comprehensive multi-channel test passed! Your CNN system supports:") + print("\nCELEBRATE Comprehensive multi-channel test passed! 
Your CNN system supports:") print(" โ€ข RGB image processing (CIFAR-10 ready)") print(" โ€ข Deep multi-channel architectures") print(" โ€ข Batch processing with multiple channels") print(" โ€ข Backward compatibility with single-channel") print(" โ€ข Production-ready parameter scaling") - print(" โ€ข Complete Conv โ†’ Pool โ†’ Linear pipelines") - print("๐Ÿ“ˆ Progress: Production-ready multi-channel CNN system!") + print(" โ€ข Complete Conv -> Pool -> Linear pipelines") + print("PROGRESS Progress: Production-ready multi-channel CNN system!") except Exception as e: - print(f"โŒ Comprehensive multi-channel test failed: {e}") + print(f"FAIL Comprehensive multi-channel test failed: {e}") raise -print("๐Ÿ“ˆ Final Progress: Production-ready multi-channel CNN system for real computer vision!") +print("PROGRESS Final Progress: Production-ready multi-channel CNN system for real computer vision!") # %% [markdown] """ -### ๐Ÿงช Unit Test: Convolution Operation Implementation +### TEST Unit Test: Convolution Operation Implementation This test validates the `conv2d_naive` function, ensuring it correctly performs 2D convolution operations with proper kernel sliding, dot product computation, and output shape calculation for spatial feature detection. """ @@ -1750,13 +1750,13 @@ def test_unit_convolution_operation(): expected = np.array([[6, 8], [12, 14]]) assert np.array_equal(result, expected), "Convolution should produce correct values" - print("โœ… Convolution operation works correctly") + print("PASS Convolution operation works correctly") # Test function defined (called in main block) # %% [markdown] """ -### ๐Ÿงช Unit Test: Conv2D Layer Implementation +### TEST Unit Test: Conv2D Layer Implementation This test validates the Conv2D layer class, ensuring proper kernel initialization, forward pass functionality, and integration with the tensor framework for convolutional neural network construction. 
""" @@ -1775,13 +1775,13 @@ def test_unit_simple_conv2d_performance(): assert hasattr(conv, 'kernel'), "Conv2D should have kernel attribute" assert conv.kernel.shape == (3, 3), "Kernel should have correct shape" - print("โœ… Conv2D layer works correctly") + print("PASS Conv2D layer works correctly") # Test function defined (called in main block) # %% [markdown] """ -### ๐Ÿงช Unit Test: Flatten Function Implementation +### TEST Unit Test: Flatten Function Implementation This test validates the flatten function, ensuring it correctly converts 2D spatial tensors to 1D vectors for connecting convolutional layers to dense layers in CNN architectures. """ @@ -1799,7 +1799,7 @@ def test_unit_flatten_function(): expected = np.array([[1, 2, 3, 4]]) assert np.array_equal(flattened.data, expected), "Flatten should preserve values" - print("โœ… Flatten function works correctly") + print("PASS Flatten function works correctly") # Test function defined (called in main block) @@ -1807,7 +1807,7 @@ def test_unit_flatten_function(): # %% [markdown] """ -## ๐Ÿงช Module Testing +## TEST Module Testing Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly. 
@@ -1853,7 +1853,7 @@ def test_module_conv2d_tensor_compatibility():
     expected_shape = (5, 1, 8, 8)
     assert isinstance(output_tensor, Tensor), "Conv2D output must be a Tensor"
     assert output_tensor.shape == expected_shape, f"Expected output shape {expected_shape}, but got {output_tensor.shape}"
-    print("✅ Integration Test Passed: Conv2D layer correctly transformed image tensor.")
+    print("PASS Integration Test Passed: Conv2D layer correctly transformed image tensor.")
 
 
 # %% [markdown]
@@ -1978,7 +1978,7 @@ class ConvolutionProfiler:
 
         # Find fastest
         fastest = min(self.timing_results.items(), key=lambda x: x[1]['time_ms'])
-        print(f"\n🚀 Fastest: {fastest[0]} ({fastest[1]['time_ms']:.3f}ms)")
+        print(f"\nROCKET Fastest: {fastest[0]} ({fastest[1]['time_ms']:.3f}ms)")
 
     def analyze_memory_patterns(self, input_sizes=[(64, 64), (128, 128), (256, 256)]):
         """
@@ -1987,7 +1987,7 @@ class ConvolutionProfiler:
         This function is PROVIDED to demonstrate memory scaling analysis.
         Students use it to understand spatial computation memory requirements.
         """
-        print("🔍 MEMORY PATTERN ANALYSIS")
+        print("MAGNIFY MEMORY PATTERN ANALYSIS")
         print("=" * 40)
 
         conv_3x3 = SimpleConv2D(kernel_size=(3, 3))
@@ -2031,7 +2031,7 @@ class ConvolutionProfiler:
         size_ratio = (large['input_size'][0] / small['input_size'][0]) ** 2
         memory_ratio = large['total_memory_mb'] / small['total_memory_mb']
 
-        print(f"\n📈 Memory Scaling Analysis:")
+        print(f"\nPROGRESS Memory Scaling Analysis:")
         print(f" Input size increased {size_ratio:.1f}x")
         print(f" Memory usage increased {memory_ratio:.1f}x")
         print(f" Scaling efficiency: {(memory_ratio/size_ratio)*100:.1f}% (lower is better)")
@@ -2040,7 +2040,7 @@ class ConvolutionProfiler:
 
 # %% [markdown]
 """
-### 🧪 Test: Convolution Performance Profiling
+### TEST Test: Convolution Performance Profiling
 
 Let us test our convolution profiler with realistic computer vision scenarios.
 """
@@ -2076,7 +2076,7 @@ def test_convolution_profiler():
             assert 'total_memory_mb' in result, f"Should analyze memory for {kernel_name}"
             assert result['time_ms'] > 0, f"Time should be positive for {kernel_name}"
 
-        print("✅ Convolution profiling test passed")
+        print("PASS Convolution profiling test passed")
 
         # Test memory pattern analysis
         memory_analysis = profiler.analyze_memory_patterns(input_sizes=[(32, 32), (64, 64)])
@@ -2089,13 +2089,13 @@ def test_convolution_profiler():
             assert 'total_memory_mb' in result, "Should calculate total memory"
             assert result['total_memory_mb'] > 0, "Memory usage should be positive"
 
-        print("✅ Memory pattern analysis test passed")
+        print("PASS Memory pattern analysis test passed")
 
     except Exception as e:
-        print(f"⚠️ Convolution profiling test had issues: {e}")
-        print("✅ Basic structure test passed (graceful degradation)")
+        print(f"WARNING️ Convolution profiling test had issues: {e}")
+        print("PASS Basic structure test passed (graceful degradation)")
 
-    print("🎯 Convolution Profiler: All tests passed!")
+    print("TARGET Convolution Profiler: All tests passed!")
 
 # Test function defined (called in main block)
 
@@ -2112,7 +2112,7 @@ def test_unit_multichannel_conv2d():
     assert hasattr(conv, 'weight'), "Multi-channel Conv2D should have weights attribute"
     assert conv.weight.shape == (8, 3, 3, 3), "Weights should have correct multi-channel shape"
 
-    print("✅ Multi-channel Conv2D works correctly")
+    print("PASS Multi-channel Conv2D works correctly")
 
 def test_unit_maxpool2d():
     """Unit test for the MaxPool2D implementation."""
@@ -2127,12 +2127,12 @@ def test_unit_maxpool2d():
     expected = np.array([[5, 7], [13, 15]])  # Max of each 2x2 window
     assert np.array_equal(pooled.data, expected), "MaxPool2D should compute correct max values"
 
-    print("✅ MaxPool2D works correctly")
+    print("PASS MaxPool2D works correctly")
 
 # Create test_unit_all function for consistent pattern
 def test_unit_all():
     """Run complete module validation."""
-    print("🧪 Running all Spatial module tests...")
+    print("TEST Running all Spatial module tests...")
 
     # Run all individual test functions
     test_unit_convolution_operation()
@@ -2143,14 +2143,14 @@ def test_unit_all():
     test_module_conv2d_tensor_compatibility()
     test_convolution_profiler()
 
-    print("✅ All tests passed! Spatial module ready for integration.")
+    print("PASS All tests passed! Spatial module ready for integration.")
 
 if __name__ == "__main__":
     test_unit_all()
 
 # %% [markdown]
 """
-## 🤔 ML Systems Thinking: Interactive Questions
+## THINK ML Systems Thinking: Interactive Questions
 
 Now that you've built convolution operations and spatial processing capabilities, let's connect this foundational work to broader ML systems challenges. These questions help you think critically about how spatial computation patterns scale to production computer vision environments.
 
@@ -2251,9 +2251,9 @@ GRADING RUBRIC (Instructor Use):
 """
 ### Question 3: CNN Architecture Memory Management
 
-**Context**: You built a complete CNN pipeline using `Conv2D`, `MaxPool2D`, and `flatten` operations. When you analyzed spatial reduction, you observed how pooling reduces memory by 4× but channels typically increase (3→32→64→128). Your memory scaling analysis showed that deeper layers can have millions of parameters.
-**Reflection Question**: Design memory management strategies for training deep CNN architectures using your implemented components. How would you handle the memory explosion when processing large batches through your Conv2D→ReLU→MaxPool2D sequences? Consider gradient storage requirements (doubled memory), activation checkpointing strategies, and memory optimization techniques that work with your specific implementations.
+**Context**: You built a complete CNN pipeline using `Conv2D`, `MaxPool2D`, and `flatten` operations. When you analyzed spatial reduction, you observed how pooling reduces memory by 4* but channels typically increase (3->32->64->128). Your memory scaling analysis showed that deeper layers can have millions of parameters.
+**Reflection Question**: Design memory management strategies for training deep CNN architectures using your implemented components. How would you handle the memory explosion when processing large batches through your Conv2D->ReLU->MaxPool2D sequences? Consider gradient storage requirements (doubled memory), activation checkpointing strategies, and memory optimization techniques that work with your specific implementations.
 
 Reference your implementation: Consider how your `Conv2D` parameter layout and `MaxPool2D` reduction patterns affect total memory usage in deep networks.
 
@@ -2269,7 +2269,7 @@ YOUR REFLECTION ON CNN ARCHITECTURE MEMORY MANAGEMENT:
 TODO: Replace this text with your thoughtful response about memory management strategies for your CNN implementations.
 
 Consider addressing:
-- How would you handle memory explosion in deep Conv2D→ReLU→MaxPool2D sequences?
+- How would you handle memory explosion in deep Conv2D->ReLU->MaxPool2D sequences?
 - What impact do gradient storage requirements have on your CNN memory usage?
 - How would you implement activation checkpointing with your specific Conv2D and MaxPool2D components?
 - What batch size optimization strategies would work with your parameter layout?
@@ -2294,34 +2294,34 @@ GRADING RUBRIC (Instructor Use):
 
 # %% [markdown]
 """
-## 🎯 MODULE SUMMARY: Multi-Channel Convolutional Networks
+## TARGET MODULE SUMMARY: Multi-Channel Convolutional Networks
 
 Congratulations! You have successfully implemented a complete multi-channel CNN system ready for real computer vision applications:
 
 ### What You have Accomplished
-✅ **Convolution Operation**: Implemented the sliding window mechanism from scratch
-✅ **Single-Channel Conv2D**: Built learnable convolutional layers with random initialization
-✅ **Multi-Channel Conv2D**: Added support for RGB images and multiple output feature maps
-✅ **MaxPool2D**: Implemented spatial downsampling for computational efficiency
-✅ **Flatten Function**: Created the bridge between convolutional and dense layers
-✅ **Complete CNN Pipelines**: Built CIFAR-10 ready architectures with proper parameter scaling
-✅ **Memory Analysis**: Profiled parameter scaling and computational complexity
-✅ **Production Patterns**: Tested batch processing and deep multi-channel architectures
+PASS **Convolution Operation**: Implemented the sliding window mechanism from scratch
+PASS **Single-Channel Conv2D**: Built learnable convolutional layers with random initialization
+PASS **Multi-Channel Conv2D**: Added support for RGB images and multiple output feature maps
+PASS **MaxPool2D**: Implemented spatial downsampling for computational efficiency
+PASS **Flatten Function**: Created the bridge between convolutional and dense layers
+PASS **Complete CNN Pipelines**: Built CIFAR-10 ready architectures with proper parameter scaling
+PASS **Memory Analysis**: Profiled parameter scaling and computational complexity
+PASS **Production Patterns**: Tested batch processing and deep multi-channel architectures
 
 ### Key Concepts You have Learned
 - **Multi-channel convolution**: How RGB images are processed through multiple filters
 - **Parameter scaling**: How memory requirements grow with channels and kernel sizes
 - **Spatial downsampling**: MaxPooling for translation invariance and efficiency
-- **Feature hierarchy**: Progressive extraction from RGB → edges → objects → concepts
-- **Production architectures**: Conv → ReLU → Pool → Conv → ReLU → Pool → Linear patterns
+- **Feature hierarchy**: Progressive extraction from RGB -> edges -> objects -> concepts
+- **Production architectures**: Conv -> ReLU -> Pool -> Conv -> ReLU -> Pool -> Linear patterns
 - **He initialization**: Proper weight initialization for stable multi-layer training
 
 ### Mathematical Foundations
 - **Multi-channel convolution**: Each filter processes ALL input channels, summing results
-- **Parameter calculation**: out_channels × in_channels × kernel_h × kernel_w + bias_terms
+- **Parameter calculation**: out_channels * in_channels * kernel_h * kernel_w + bias_terms
 - **Spatial size reduction**: Convolution and pooling progressively reduce spatial dimensions
 - **Channel expansion**: Typical pattern increases channels while reducing spatial size
-- **Memory complexity**: O(batch × channels × height × width) for activations
+- **Memory complexity**: O(batch * channels * height * width) for activations
 
 ### Systems Engineering Insights
 - **Memory scaling**: Parameters grow quadratically with channels, linearly with filters
@@ -2331,20 +2331,20 @@ Congratulations! You have successfully implemented a complete multi-channel CNN
 - **Production trade-offs**: More channels = better accuracy but higher memory/compute cost
 
 ### Real-World Applications
-- **CIFAR-10 classification**: Your CNN can handle 32×32 RGB images → 10 classes
+- **CIFAR-10 classification**: Your CNN can handle 32*32 RGB images -> 10 classes
 - **Image recognition**: Object detection, medical imaging, autonomous driving
 - **Transfer learning**: Pre-trained features for downstream tasks
 - **Computer vision**: Face recognition, document analysis, quality inspection
 
 ### CNN Architecture Patterns
-- **Basic CNN**: RGB → Conv(3→32) → ReLU → Pool → Conv(32→64) → ReLU → Pool → Linear
-- **Parameter efficiency**: 32×3×3×3 = 864 parameters vs 32×32×32 = 32,768 for dense layer
+- **Basic CNN**: RGB -> Conv(3->32) -> ReLU -> Pool -> Conv(32->64) -> ReLU -> Pool -> Linear
+- **Parameter efficiency**: 32*3*3*3 = 864 parameters vs 32*32*32 = 32,768 for dense layer
 - **Spatial hierarchy**: Early layers detect edges, later layers detect objects
 - **Translation invariance**: Same features detected regardless of position in image
 
 ### Performance Characteristics
 - **Memory efficiency**: Shared parameters across spatial locations
-- **Computational complexity**: O(batch × out_channels × in_channels × kernel_size² × output_spatial)
+- **Computational complexity**: O(batch * out_channels * in_channels * kernel_size² * output_spatial)
 - **Hardware acceleration**: Highly parallelizable operations ideal for GPUs
 - **Scaling behavior**: Memory grows with channels, computation grows with spatial size
 
@@ -2363,9 +2363,9 @@ classifier = Linear(input_size=64*6*6, output_size=10)
 
 # Process RGB image
 rgb_image = Tensor(np.random.randn(3, 32, 32))  # CIFAR-10 format
-features1 = pool1(ReLU()(conv1(rgb_image)))  # (3,32,32) → (32,15,15)
-features2 = pool2(ReLU()(conv2(features1)))  # (32,15,15) → (64,6,6)
-predictions = classifier(flatten(features2, start_dim=0))  # (64,6,6) → (1,10)
+features1 = pool1(ReLU()(conv1(rgb_image)))  # (3,32,32) -> (32,15,15)
+features2 = pool2(ReLU()(conv2(features1)))  # (32,15,15) -> (64,6,6)
+predictions = classifier(flatten(features2, start_dim=0))  # (64,6,6) -> (1,10)
 ```
 
 ### Next Steps
diff --git a/modules/09_dataloader/dataloader_dev.py b/modules/09_dataloader/dataloader_dev.py
index a243bd06..6bf296ab 100644
--- a/modules/09_dataloader/dataloader_dev.py
+++ b/modules/09_dataloader/dataloader_dev.py
@@ -21,7 +21,7 @@ Welcome to the DataLoader module! You'll build the data infrastructure that feed
 - Framework connection: See how your implementation mirrors PyTorch's data loading infrastructure and optimization strategies
 - Performance insight: Learn why data loading parallelization and prefetching are essential for GPU utilization in production systems
 
-## Build → Use → Reflect
+## Build -> Use -> Reflect
 1. **Build**: Complete Dataset and DataLoader classes with efficient batching, shuffling, and real dataset support (CIFAR-10)
 2. **Use**: Load large-scale image datasets and feed them to neural networks with proper batch processing
 3. **Reflect**: Why does data loading speed often determine training speed more than model computation?
@@ -35,8 +35,8 @@ By the end of this module, you'll understand:
 - Connection to production ML systems and how frameworks optimize data loading for different storage systems
 
 ## Systems Reality Check
-💡 **Production Context**: PyTorch's DataLoader uses multiprocessing and memory pinning to overlap data loading with GPU computation, achieving near-zero data loading overhead
-⚡ **Performance Note**: Modern GPUs can process data faster than storage systems can provide it - data loading optimization is critical for hardware utilization in production training
+TIP **Production Context**: PyTorch's DataLoader uses multiprocessing and memory pinning to overlap data loading with GPU computation, achieving near-zero data loading overhead
+SPEED **Performance Note**: Modern GPUs can process data faster than storage systems can provide it - data loading optimization is critical for hardware utilization in production training
 """
 
 # %% nbgrader={"grade": false, "grade_id": "dataloader-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
@@ -61,14 +61,14 @@ except ImportError:
     from tensor_dev import Tensor
 
 # %% nbgrader={"grade": false, "grade_id": "dataloader-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false}
-print("🔥 TinyTorch DataLoader Module")
+print("FIRE TinyTorch DataLoader Module")
 print(f"NumPy version: {np.__version__}")
 print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
 print("Ready to build data pipelines!")
 
 # %% [markdown]
 """
-## 📦 Where This Code Lives in the Final Package
+## PACKAGE Where This Code Lives in the Final Package
 
 **Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py`
 **Building Side:** Code exports to `tinytorch.core.dataloader`
@@ -96,7 +96,7 @@ from tinytorch.core.networks import Sequential # Models to train
 """
 ## Step 1: Understanding Data Pipelines - The Foundation of ML Systems
 
-### 🔗 Building on Previous Learning
+### LINK Building on Previous Learning
 **What You Built Before**:
 - Module 02 (Tensor): Data structures that hold and manipulate arrays efficiently
 - Module 04 (Layers): Neural network components that need batched inputs
@@ -109,7 +109,7 @@ from tinytorch.core.networks import Sequential # Models to train
 
 **Connection Map**:
 ```
-Tensor Operations → Data Loading → Training Loop
+Tensor Operations -> Data Loading -> Training Loop
  (Module 02) (Module 10) (Next: Module 11)
 ```
 
@@ -118,39 +118,39 @@ Tensor Operations → Data Loading → Training Loop
 
 ### 📊 The Complete Data Pipeline Flow
 ```
-┌─────────────┐    ┌──────────┐    ┌─────────┐    ┌─────────┐    ┌──────────────┐
-│ Raw Storage │───▶│ Dataset  │───▶│ Shuffle │───▶│ Batch   │───▶│ Neural Net   │
-│ (Files/DB)  │    │ Loading  │    │ + Index │    │ + Stack │    │ Training     │
-└─────────────┘    └──────────┘    └─────────┘    └─────────┘    └──────────────┘
-       ↓                ↓               ↓              ↓                ↓
++-------------+    +----------+    +---------+    +---------+    +--------------+
+| Raw Storage |---▶| Dataset  |---▶| Shuffle |---▶| Batch   |---▶| Neural Net   |
+| (Files/DB)  |    | Loading  |    | + Index |    | + Stack |    | Training     |
++-------------+    +----------+    +---------+    +---------+    +--------------+
+       v                v               v              v                v
    Gigabytes        On-Demand     Random Order   GPU-Friendly       Learning!
     of Data          Loading      (No Overfit)      Format
 ```
 
-### 🔍 Why Data Pipelines Are Critical for ML Systems
+### MAGNIFY Why Data Pipelines Are Critical for ML Systems
 - **Performance**: Efficient loading prevents GPU starvation (GPUs idle waiting for data)
 - **Scalability**: Handle datasets larger than memory (ImageNet = 150GB)
 - **Consistency**: Reproducible data processing across experiments
 - **Flexibility**: Easy to switch between datasets and configurations
 
-### ⚡ Real-World Performance Challenges
+### SPEED Real-World Performance Challenges
 ```
 🏎️ GPU Processing Speed: ~1000 images/second
 🐌 Disk Read Speed: ~100 images/second
-⚠️ Result: GPU waits 90% of time for data!
+WARNING️ Result: GPU waits 90% of time for data!
 ```
 
 ### 💾 Memory vs Storage Trade-offs
 ```
 Dataset Size Analysis:
-┌─────────────┬─────────────┬─────────────┬─────────────┐
-│ Dataset     │ Size        │ Fits in RAM │ Strategy    │
-├─────────────┼─────────────┼─────────────┼─────────────┤
-│ MNIST       │ ~60 MB      │ ✅ Yes      │ Load All    │
-│ CIFAR-10    │ ~170 MB     │ ✅ Yes      │ Load All    │
-│ ImageNet    │ ~150 GB     │ ❌ No       │ Stream      │
-│ Custom      │ ~1 TB       │ ❌ No       │ Stream      │
-└─────────────┴─────────────┴─────────────┴─────────────┘
++-------------+-------------+-------------+-------------+
+| Dataset     | Size        | Fits in RAM | Strategy    |
++-------------+-------------+-------------+-------------┤
+| MNIST       | ~60 MB      | PASS Yes    | Load All    |
+| CIFAR-10    | ~170 MB     | PASS Yes    | Load All    |
+| ImageNet    | ~150 GB     | FAIL No     | Stream      |
+| Custom      | ~1 TB       | FAIL No     | Stream      |
++-------------+-------------+-------------+-------------+
 ```
 
 ### 🧠 Systems Engineering Principles
@@ -159,15 +159,15 @@ Dataset Size Analysis:
 - **Batching strategies**: Trade-offs between memory usage and training speed
 - **Caching**: When to cache frequently used data vs recompute on-demand
 
-### 📈 Batch Processing Impact
+### PROGRESS Batch Processing Impact
 ```
 Batch Size Performance Analysis:
-  Batch Size │ GPU Utilization │ Memory Usage │ Training Speed
-  ───────────┼─────────────────┼──────────────┼───────────────
-  1          │ ~10%            │ Low          │ Very Slow
-  32         │ ~80%            │ Medium       │ Good
-  128        │ ~95%            │ High         │ Very Fast
-  512        │ ~98%            │ Very High    │ Fastest*
+  Batch Size | GPU Utilization | Memory Usage | Training Speed
+  -----------+-----------------+--------------+---------------
+  1          | ~10%            | Low          | Very Slow
+  32         | ~80%            | Medium       | Good
+  128        | ~95%            | High         | Very Fast
+  512        | ~98%            | Very High    | Fastest*
 
   * Until you run out of GPU memory!
 ```
@@ -182,24 +182,24 @@ Let's start by building the most fundamental component: **Dataset**.
 ### What is a Dataset?
 A **Dataset** is an abstract interface that provides consistent access to data. It's the foundation of all data loading systems and the key abstraction that makes ML frameworks flexible.
-### 🎯 The Universal Dataset Pattern
+### TARGET The Universal Dataset Pattern
 ```
             Dataset Interface
-     ┌─────────────────────────────┐
-     │ def __getitem__(index):     │◄─── Get single sample by index
-     │     return data, label      │     (like a list or dictionary)
-     │                             │
-     │ def __len__():              │◄─── Total number of samples
-     │     return total_samples    │     (enables progress tracking)
-     └─────────────────────────────┘
-                    ▲
-                    │ Implements
-    ┌───────────────┼───────────────┐
-    │               │               │
-┌───▼────┐   ┌──────▼─────┐  ┌──────▼──────┐
-│ MNIST  │   │  CIFAR-10  │  │ Custom Data │
-│Dataset │   │  Dataset   │  │   Dataset   │
-└────────┘   └────────────┘  └─────────────┘
+     +-----------------------------+
+     | def __getitem__(index):     |<--- Get single sample by index
+     |     return data, label      |     (like a list or dictionary)
+     |                             |
+     | def __len__():              |<--- Total number of samples
+     |     return total_samples    |     (enables progress tracking)
+     +-----------------------------+
+                    ^
+                    | Implements
+    +---------------+---------------+
+    |               |               |
++---v----+   +------v-----+  +------v------+
+| MNIST  |   |  CIFAR-10  |  | Custom Data |
+|Dataset |   |  Dataset   |  |   Dataset   |
++--------+   +------------+  +-------------+
 ```
 
 ### 🔧 Why Abstract Interfaces Are Systems Engineering Gold
@@ -229,12 +229,12 @@ Real-World Dataset Implementations:
 - AudioSet: 2M YouTube clips with audio events
 - Custom: Your company's audio data
 
-📈 Time Series:
+PROGRESS Time Series:
 - Stock prices, sensor data, user behavior logs
 - Custom: Your company's time series data
 ```
 
-### 🚀 Framework Integration Power
+### ROCKET Framework Integration Power
 ```
 # PyTorch Compatibility:
 torch_dataset = torch.utils.data.Dataset # Same interface!
@@ -394,7 +394,7 @@ class Dataset:
 
 # %% [markdown]
 """
-### 🧪 Unit Test: Dataset Interface
+### TEST Unit Test: Dataset Interface
 
 Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset.
 
@@ -434,28 +434,28 @@ A **DataLoader** efficiently batches and iterates through datasets. It's the bri
 ### 🔄 The DataLoader Processing Pipeline
 ```
       Dataset Samples               DataLoader Magic              Neural Network
-  ┌─────────────────────┐       ┌─────────────────────┐       ┌─────────────────┐
-  │ [sample_1]          │       │ 1. Shuffle indices  │       │ Efficient GPU   │
-  │ [sample_2]          │──────▶│ 2. Group into       │──────▶│ Batch           │
-  │ [sample_3]          │       │    batches          │       │ Processing      │
-  │ [sample_4]          │       │ 3. Stack tensors    │       │                 │
-  │ ...                 │       │ 4. Yield batches    │       │ batch_size=32   │
-  │ [sample_n]          │       │                     │       │ shape=(32,...)  │
-  └─────────────────────┘       └─────────────────────┘       └─────────────────┘
+  +---------------------+       +---------------------+       +-----------------+
+  | [sample_1]          |       | 1. Shuffle indices  |       | Efficient GPU   |
+  | [sample_2]          |------▶| 2. Group into       |------▶| Batch           |
+  | [sample_3]          |       |    batches          |       | Processing      |
+  | [sample_4]          |       | 3. Stack tensors    |       |                 |
+  | ...                 |       | 4. Yield batches    |       | batch_size=32   |
+  | [sample_n]          |       |                     |       | shape=(32,...)  |
+  +---------------------+       +---------------------+       +-----------------+
 ```
 
-### ⚡ Why DataLoaders Are Critical for Performance
+### SPEED Why DataLoaders Are Critical for Performance
 ```
 GPU Utilization Without Batching:
-┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
-│ 🔄  │ ... │ ... │ ... │ ... │ ... │ ... │ ... │ Time
-└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
++-----+-----+-----+-----+-----+-----+-----+-----+
+| 🔄  | ... | ... | ... | ... | ... | ... | ... | Time
++-----+-----+-----+-----+-----+-----+-----+-----+
 ~5% GPU mostly idle (underutilized)
 
 GPU Utilization With Proper Batching:
-┌████████████████████████████████████████████████┐
-│        ████████████████████████████████        │ Time
-└████████████████████████████████████████████████┘
++████████████████████████████████████████████████+
+|        ████████████████████████████████        | Time
++████████████████████████████████████████████████+
 ~95% GPU fully utilized (efficient!)
``` @@ -463,33 +463,33 @@ GPU Utilization With Proper Batching: ``` Batch Size Impact Analysis: - Batch Size │ Memory Usage │ GPU Utilization │ Gradient Quality - ───────────┼──────────────┼─────────────────┼───────────────── - 1 │ Low │ ~10% │ Noisy (bad) - 16 │ Medium │ ~60% │ Better - 64 │ High │ ~90% │ Good - 256 │ Very High │ ~95% │ Very Good - 512 │ TOO HIGH! 💥 │ N/A │ OOM Error + Batch Size | Memory Usage | GPU Utilization | Gradient Quality + -----------+--------------+-----------------+----------------- + 1 | Low | ~10% | Noisy (bad) + 16 | Medium | ~60% | Better + 64 | High | ~90% | Good + 256 | Very High | ~95% | Very Good + 512 | TOO HIGH! CRASH | N/A | OOM Error ``` ### 🔀 Shuffling: Preventing Overfitting to Data Order ``` Without Shuffling (Bad!): Epoch 1: [cat, cat, dog, dog, bird, bird] - Epoch 2: [cat, cat, dog, dog, bird, bird] ← Same order! + Epoch 2: [cat, cat, dog, dog, bird, bird] <- Same order! Model learns data order, not features 😞 With Shuffling (Good!): Epoch 1: [dog, cat, bird, cat, dog, bird] - Epoch 2: [bird, dog, cat, bird, cat, dog] ← Random order! + Epoch 2: [bird, dog, cat, bird, cat, dog] <- Random order! Model learns features, generalizes well 😊 ``` -### 🎯 Production Training Pattern +### TARGET Production Training Pattern ```python # The universal ML training pattern: for epoch in range(num_epochs): - for batch_data, batch_labels in dataloader: # ← This line! + for batch_data, batch_labels in dataloader: # <- This line! predictions = model(batch_data) loss = criterion(predictions, batch_labels) loss.backward() @@ -639,7 +639,7 @@ class DataLoader: 3. 
Use ceiling division: (n + batch_size - 1) // batch_size EXAMPLE: - Dataset size 100, batch size 32 → 4 batches + Dataset size 100, batch size 32 -> 4 batches HINTS: - Use len(self.dataset) for dataset size @@ -655,7 +655,7 @@ class DataLoader: # %% [markdown] """ -### 🧪 Unit Test: DataLoader +### TEST Unit Test: DataLoader Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks. @@ -696,7 +696,7 @@ try: # Test __len__ expected_batches = (10 + 3 - 1) // 3 # Ceiling division: 4 batches assert len(dataloader) == expected_batches, f"Should have {expected_batches} batches, got {len(dataloader)}" - print("✅ DataLoader __len__ works correctly") + print("PASS DataLoader __len__ works correctly") # Test iteration batch_count = 0 @@ -717,10 +717,10 @@ try: assert batch_count == expected_batches, f"Should iterate {expected_batches} times, got {batch_count}" assert total_samples == 10, f"Should process 10 total samples, got {total_samples}" - print("✅ DataLoader iteration works correctly") + print("PASS DataLoader iteration works correctly") except Exception as e: - print(f"❌ DataLoader test failed: {e}") + print(f"FAIL DataLoader test failed: {e}") raise # Test shuffling @@ -732,10 +732,10 @@ try: batch1_shuffle = next(iter(dataloader_shuffle)) batch1_no_shuffle = next(iter(dataloader_no_shuffle)) - print("✅ DataLoader shuffling parameter works") + print("PASS DataLoader shuffling parameter works") except Exception as e: - print(f"❌ DataLoader shuffling test failed: {e}") + print(f"FAIL DataLoader shuffling test failed: {e}") raise # Test different batch sizes @@ -745,18 +745,18 @@ try: assert len(small_loader) == 5, f"Small loader should have 5 batches, got {len(small_loader)}" assert len(large_loader) == 2, f"Large loader should have 2 batches, got {len(large_loader)}" - print("✅ DataLoader handles different batch sizes correctly") + print("PASS DataLoader handles different batch sizes correctly") except 
Exception as e: - print(f"❌ DataLoader batch size test failed: {e}") + print(f"FAIL DataLoader batch size test failed: {e}") raise # Show the DataLoader behavior -print("🎯 DataLoader behavior:") +print("TARGET DataLoader behavior:") print(" Batches data for efficient processing") print(" Handles shuffling and iteration") print(" Provides clean interface for training loops") -print("📈 Progress: Dataset interface ✓, DataLoader ✓") +print("PROGRESS Progress: Dataset interface OK, DataLoader OK") # %% [markdown] """ @@ -910,13 +910,13 @@ Let's implement loading CIFAR-10, the dataset we'll use to achieve our ambitious ### 🇺🇸 CIFAR-10 Dataset Specifications ``` 🖼️ CIFAR-10 Dataset Overview: - ┌────────────────────────────────────────┐ - │ 🎨 Classes: 10 (airplane, car, bird, etc.) │ - │ 🖼️ Images: 60,000 total (50k train + 10k test) │ - │ 📌 Size: 32x32 pixels, RGB color │ - │ 💾 Storage: ~170MB compressed │ - │ 🎯 Goal: 75% classification accuracy │ - └────────────────────────────────────────┘ + +----------------------------------------+ + | 🎨 Classes: 10 (airplane, car, bird, etc.) 
| + | 🖼️ Images: 60,000 total (50k train + 10k test) | + | 📌 Size: 32x32 pixels, RGB color | + | 💾 Storage: ~170MB compressed | + | TARGET Goal: 75% classification accuracy | + +----------------------------------------+ Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck @@ -927,16 +927,16 @@ Let's implement loading CIFAR-10, the dataset we'll use to achieve our ambitious CIFAR-10 Loading Pipeline: Raw Files Dataset Class DataLoader CNN Model -┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ -│ data_batch_1 │ │ CIFAR10Dataset │ │ Batch: (32,3, │ │ Convolutional │ -│ data_batch_2 │▶│ __getitem__() │▶│ 32,32) images │▶│ Neural │ -│ data_batch_3 │ │ Loads on-demand │ │ Labels: (32,) │ │ Network │ -│ data_batch_4 │ │ Normalizes [0,1]│ │ Shuffled order │ │ Training │ -│ data_batch_5 │ │ Shape: (3,32,32)│ │ │ │ │ -└─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘ ++-----------------+ +-----------------+ +-----------------+ +-----------------+ +| data_batch_1 | | CIFAR10Dataset | | Batch: (32,3, | | Convolutional | +| data_batch_2 |▶| __getitem__() |▶| 32,32) images |▶| Neural | +| data_batch_3 | | Loads on-demand | | Labels: (32,) | | Network | +| data_batch_4 | | Normalizes [0,1]| | Shuffled order | | Training | +| data_batch_5 | | Shape: (3,32,32)| | | | | ++-----------------+ +-----------------+ +-----------------+ +-----------------+ ``` -### 📈 Why CIFAR-10 is Perfect for Learning +### PROGRESS Why CIFAR-10 is Perfect for Learning - **Manageable size**: Fits in memory, fast 
iteration - **Real complexity**: Natural images, not toy data - **Standard benchmark**: Compare with published results @@ -961,7 +961,7 @@ def download_cifar10(root: str = "./data") -> str: dataset_dir = os.path.join(root, "cifar-10-batches-py") if os.path.exists(dataset_dir): - print(f"✅ CIFAR-10 found at {dataset_dir}") + print(f"PASS CIFAR-10 found at {dataset_dir}") return dataset_dir url = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz" @@ -969,12 +969,12 @@ def download_cifar10(root: str = "./data") -> str: print(f"📥 Downloading CIFAR-10 (~170MB)...") urllib.request.urlretrieve(url, tar_path) - print("✅ Downloaded!") + print("PASS Downloaded!") - print("📦 Extracting...") + print("PACKAGE Extracting...") with tarfile.open(tar_path, 'r:gz') as tar: tar.extractall(root) - print("✅ Ready!") + print("PASS Ready!") return dataset_dir ### END SOLUTION @@ -1010,7 +1010,7 @@ class CIFAR10Dataset(Dataset): # Normalize pixel values from [0, 255] to [0, 1] for neural network training # This is critical: neural networks expect inputs in [0,1] range! self.data = self.data.reshape(-1, 3, 32, 32).astype(np.float32) / 255.0 - print(f"✅ Loaded {len(self.data):,} images") + print(f"PASS Loaded {len(self.data):,} images") print(f" Data shape: {self.data.shape}") print(f" Value range: [{self.data.min():.2f}, {self.data.max():.2f}]") ### END SOLUTION @@ -1037,7 +1037,7 @@ class CIFAR10Dataset(Dataset): # %% [markdown] """ -### 🧪 Unit Test: SimpleDataset +### TEST Unit Test: SimpleDataset Let's test your SimpleDataset implementation! This concrete example shows how the Dataset pattern works. 
@@ -1057,7 +1057,7 @@ try: # Test basic properties assert len(dataset) == 20, f"Dataset length should be 20, got {len(dataset)}" assert dataset.get_num_classes() == 4, f"Should have 4 classes, got {dataset.get_num_classes()}" - print("✅ SimpleDataset basic properties work correctly") + print("PASS SimpleDataset basic properties work correctly") # Test sample access data, label = dataset[0] @@ -1065,19 +1065,19 @@ try: assert isinstance(label, Tensor), "Label should be a Tensor" assert data.shape == (5,), f"Data shape should be (5,), got {data.shape}" assert label.shape == (), f"Label shape should be (), got {label.shape}" - print("✅ SimpleDataset sample access works correctly") + print("PASS SimpleDataset sample access works correctly") # Test sample shape sample_shape = dataset.get_sample_shape() assert sample_shape == (5,), f"Sample shape should be (5,), got {sample_shape}" - print("✅ SimpleDataset get_sample_shape works correctly") + print("PASS SimpleDataset get_sample_shape works correctly") # Test multiple samples for i in range(5): data, label = dataset[i] assert data.shape == (5,), f"Data shape should be (5,) for sample {i}, got {data.shape}" assert 0 <= label.data < 4, f"Label should be in [0, 3] for sample {i}, got {label.data}" - print("✅ SimpleDataset multiple samples work correctly") + print("PASS SimpleDataset multiple samples work correctly") # Test deterministic data (same seed should give same data) dataset2 = SimpleDataset(size=20, num_features=5, num_classes=4) @@ -1085,17 +1085,17 @@ try: data2, label2 = dataset2[0] assert np.array_equal(data1.data, data2.data), "Data should be deterministic" assert np.array_equal(label1.data, label2.data), "Labels should be deterministic" - print("✅ SimpleDataset data is deterministic") + print("PASS SimpleDataset data is deterministic") except Exception as e: - print(f"❌ SimpleDataset test failed: {e}") + print(f"FAIL SimpleDataset test failed: {e}") # Show the SimpleDataset behavior -print("🎯 
SimpleDataset behavior:") +print("TARGET SimpleDataset behavior:") print(" Generates synthetic data for testing") print(" Implements complete Dataset interface") print(" Provides deterministic data for reproducibility") -print("📈 Progress: Dataset interface ✓, DataLoader ✓, SimpleDataset ✓") +print("PROGRESS Progress: Dataset interface OK, DataLoader OK, SimpleDataset OK") # %% [markdown] """ @@ -1166,7 +1166,7 @@ try: assert epoch_samples == 100, f"Should process 100 samples, got {epoch_samples}" expected_batches = (100 + 16 - 1) // 16 assert epoch_batches == expected_batches, f"Should have {expected_batches} batches, got {epoch_batches}" - print("✅ Training pipeline works correctly") + print("PASS Training pipeline works correctly") # Test 2: Validation Data Pipeline print("\n2. Validation Data Pipeline Test:") @@ -1189,7 +1189,7 @@ try: assert val_samples == 50, f"Should process 50 validation samples, got {val_samples}" assert val_batches == 5, f"Should have 5 validation batches, got {val_batches}" - print("✅ Validation pipeline works correctly") + print("PASS Validation pipeline works correctly") # Test 3: Different Dataset Configurations print("\n3. Dataset Configuration Test:") @@ -1212,7 +1212,7 @@ try: assert len(dataset) == size, f"Size mismatch for config {configs}" assert dataset.get_num_classes() == classes, f"Classes mismatch for config {configs}" - print("✅ Different dataset configurations work correctly") + print("PASS Different dataset configurations work correctly") # Test 4: Memory Efficiency Simulation print("\n4. Memory Efficiency Test:") @@ -1233,7 +1233,7 @@ try: assert batch_data.shape[0] <= 50, f"Batch size should not exceed 50, got {batch_data.shape[0]}" assert processed_samples == 500, f"Should process all 500 samples, got {processed_samples}" - print("✅ Memory efficiency works correctly") + print("PASS Memory efficiency works correctly") # Test 5: Multi-Epoch Training Simulation print("\n5. 
Multi-Epoch Training Test:") @@ -1253,25 +1253,25 @@ try: assert epoch_samples == 60, f"Should process 60 samples in epoch {epoch}, got {epoch_samples}" - print("✅ Multi-epoch training works correctly") + print("PASS Multi-epoch training works correctly") - print("\n🎉 Comprehensive test passed! Your data pipeline works correctly for:") + print("\nCELEBRATE Comprehensive test passed! Your data pipeline works correctly for:") print(" • Large-scale dataset handling") print(" • Batch processing with multiple workers") print(" • Shuffling and sampling strategies") print(" • Memory-efficient data loading") print(" • Complete training pipeline integration") - print("📈 Progress: Production-ready data pipeline ✓") + print("PROGRESS Progress: Production-ready data pipeline OK") except Exception as e: - print(f"❌ Comprehensive test failed: {e}") + print(f"FAIL Comprehensive test failed: {e}") raise -print("📈 Final Progress: Complete data pipeline ready for production ML!") +print("PROGRESS Final Progress: Complete data pipeline ready for production ML!") # %% [markdown] """ -### 🧪 Unit Test: Dataset Interface Implementation +### TEST Unit Test: Dataset Interface Implementation This test validates the abstract Dataset interface, ensuring proper inheritance, method implementation, and interface compliance for creating custom datasets in the TinyTorch data loading pipeline. 
""" @@ -1292,11 +1292,11 @@ def test_unit_dataset_interface(): assert isinstance(sample, Tensor), "Sample should be Tensor" assert isinstance(label, Tensor), "Label should be Tensor" - print("✅ Dataset interface works correctly") + print("PASS Dataset interface works correctly") # %% [markdown] """ -### 🧪 Unit Test: DataLoader Implementation +### TEST Unit Test: DataLoader Implementation This test validates the DataLoader class functionality, ensuring proper batch creation, iteration capability, and integration with datasets for efficient data loading in machine learning training pipelines. """ @@ -1319,11 +1319,11 @@ def test_unit_dataloader(): assert batch_data.shape[0] <= 3, "Batch size should be <= 3" assert batch_labels.shape[0] <= 3, "Batch labels should match data" - print("✅ DataLoader works correctly") + print("PASS DataLoader works correctly") # %% [markdown] """ -### 🧪 Unit Test: Simple Dataset Implementation +### TEST Unit Test: Simple Dataset Implementation This test validates the SimpleDataset class, ensuring it can handle real-world data scenarios including proper data storage, indexing, and compatibility with the DataLoader for practical machine learning workflows. """ @@ -1345,11 +1345,11 @@ def test_unit_simple_dataset(): assert sample.shape == (4,), "Sample should have correct features" assert 0 <= label.data < 3, "Label should be valid class" - print("✅ SimpleDataset works correctly") + print("PASS SimpleDataset works correctly") # %% [markdown] """ -### 🧪 Unit Test: Complete Data Pipeline Integration +### TEST Unit Test: Complete Data Pipeline Integration This comprehensive test validates the entire data pipeline from dataset creation through DataLoader batching, ensuring all components work together seamlessly for end-to-end machine learning data processing workflows. 
""" @@ -1372,12 +1372,12 @@ def test_unit_dataloader_pipeline(): assert total_samples == 50, "Should process all samples" - print("✅ Data pipeline integration works correctly") + print("PASS Data pipeline integration works correctly") # %% [markdown] # %% [markdown] """ -## 🧪 Module Testing +## TEST Module Testing Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly. @@ -1420,7 +1420,7 @@ def test_module_dataloader_tensor_yield(): assert isinstance(labels_batch, Tensor), "Labels batch should be a Tensor" assert labels_batch.shape == (10,), f"Expected labels shape (10,), but got {labels_batch.shape}" - print("✅ Integration Test Passed: DataLoader correctly yields batches of Tensors.") + print("PASS Integration Test Passed: DataLoader correctly yields batches of Tensors.") # Test function defined (called in main block) @@ -1601,7 +1601,7 @@ class DataPipelineProfiler: batch_size = 32 # Standard batch size for comparison for strategy in strategies: - print(f"\n🔍 Testing {strategy.upper()} strategy...") + print(f"\nMAGNIFY Testing {strategy.upper()} strategy...") if strategy == 'sequential': dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False) @@ -1627,7 +1627,7 @@ class DataPipelineProfiler: speedup = slowest[1]['avg_batch_time'] / fastest[1]['avg_batch_time'] - print(f"\n🎯 STRATEGY ANALYSIS:") + print(f"\nTARGET STRATEGY ANALYSIS:") print(f" Fastest: {fastest[0]} ({fastest[1]['avg_batch_time']:.3f}s)") print(f" Slowest: {slowest[0]} ({slowest[1]['avg_batch_time']:.3f}s)") print(f" Speedup: {speedup:.1f}x") @@ -1670,7 +1670,7 @@ class DataPipelineProfiler: else: bottleneck = "Compute" utilization = avg_io_time / simulated_compute_time * 100 - print(f"\n✅ BOTTLENECK: {bottleneck}") + print(f"\nPASS BOTTLENECK: {bottleneck}") print(f" I/O utilization: {utilization:.1f}%") print(f" I/O waiting for GPU: {simulated_compute_time - avg_io_time:.3f}s per 
batch") @@ -1678,17 +1678,17 @@ class DataPipelineProfiler: total_cycle_time = max(avg_io_time, simulated_compute_time) efficiency = min(avg_io_time, simulated_compute_time) / total_cycle_time * 100 - print(f"\n🎯 TRAINING IMPACT:") + print(f"\nTARGET TRAINING IMPACT:") print(f" Pipeline efficiency: {efficiency:.1f}%") print(f" Total cycle time: {total_cycle_time:.3f}s") if bottleneck == "I/O": - print(f" 💡 Recommendation: Optimize data loading") + print(f" TIP Recommendation: Optimize data loading") print(f" - Increase batch size") print(f" - Use data prefetching") print(f" - Faster storage (SSD vs HDD)") else: - print(f" 💡 Recommendation: I/O is well optimized") + print(f" TIP Recommendation: I/O is well optimized") print(f" - Consider larger models or batch sizes") print(f" - Focus on compute optimization") @@ -1702,7 +1702,7 @@ class DataPipelineProfiler: # %% [markdown] """ -### 🎯 Learning Activity 1: DataLoader Performance Profiling (Medium Guided Implementation) +### TARGET Learning Activity 1: DataLoader Performance Profiling (Medium Guided Implementation) **Goal**: Learn to measure data loading performance and identify I/O bottlenecks that can slow down training. @@ -1713,9 +1713,9 @@ Complete the missing implementations in the `DataPipelineProfiler` class above, # Initialize the data pipeline profiler profiler = DataPipelineProfiler() -# ✅ IMPLEMENTATION CHECKPOINT: Ensure your profiler methods are complete before running +# PASS IMPLEMENTATION CHECKPOINT: Ensure your profiler methods are complete before running -# 🤔 PREDICTION: Which will be faster - sequential or shuffled data loading? +# THINK PREDICTION: Which will be faster - sequential or shuffled data loading? 
# Your answer: _______ # Guard to prevent execution when imported @@ -1746,7 +1746,7 @@ if __name__ == '__main__': print(f" Error: {timing_result['error']}") # Test 2: Batch size scaling analysis - print(f"\n📈 Batch Size Scaling Analysis:") + print(f"\nPROGRESS Batch Size Scaling Analysis:") # Students use their implemented scaling analysis scaling_analysis = profiler.analyze_batch_size_scaling(test_dataset, [16, 32, 64, 128]) @@ -1761,7 +1761,7 @@ if __name__ == '__main__': else: print(f" Error: {scaling_analysis['error']}") - print(f"\n💡 I/O PERFORMANCE INSIGHTS:") + print(f"\nTIP I/O PERFORMANCE INSIGHTS:") print(f" - Larger batches often improve throughput (better amortization)") print(f" - But memory constraints limit maximum batch size") print(f" - Sweet spot balances throughput vs memory usage") @@ -1769,15 +1769,15 @@ if __name__ == '__main__': # %% [markdown] """ -### 🎯 Learning Activity 2: Production I/O Optimization Analysis (Review & Understand) +### TARGET Learning Activity 2: Production I/O Optimization Analysis (Review & Understand) **Goal**: Understand how I/O performance affects real training systems and learn optimization strategies used in production. 
""" # %% -# ✅ IMPLEMENTATION CHECKPOINT: Ensure profiler comparison methods work before running +# PASS IMPLEMENTATION CHECKPOINT: Ensure profiler comparison methods work before running -# 🔍 SYSTEMS INSIGHT: I/O Strategy Performance Comparison +# MAGNIFY SYSTEMS INSIGHT: I/O Strategy Performance Comparison def analyze_io_strategy_impact(): """Analyze the performance difference between I/O strategies.""" print("🔄 I/O STRATEGY IMPACT ANALYSIS") @@ -1787,7 +1787,7 @@ def analyze_io_strategy_impact(): # Create test scenarios dataset = TestDataset(size=500) - print("🧪 Testing Sequential vs Random Access:") + print("TEST Testing Sequential vs Random Access:") # Sequential access simulation import time @@ -1810,7 +1810,7 @@ def analyze_io_strategy_impact(): print(f" Random access: {random_time:.3f}s") print(f" Speed difference: {random_time/sequential_time:.1f}x") - print("\n💡 WHY PERFORMANCE DIFFERS:") + print("\nTIP WHY PERFORMANCE DIFFERS:") print(" 1. 💾 Cache locality: Sequential = better CPU cache usage") print(" 2. 💿 Storage patterns: HDDs hate random access") print(" 3. 
🧠 Memory prefetching: CPUs predict sequential patterns") @@ -1818,22 +1818,22 @@ def analyze_io_strategy_impact(): print("\n⚖️ TRAINING TRADE-OFFS:") print(" Sequential Loading:") - print(" ✅ Faster I/O performance") - print(" ✅ Better cache utilization") - print(" ❌ Model learns data order (overfitting!)") + print(" PASS Faster I/O performance") + print(" PASS Better cache utilization") + print(" FAIL Model learns data order (overfitting!)") print(" Random/Shuffled Loading:") - print(" ✅ Better model generalization") - print(" ✅ Prevents order memorization") - print(" ❌ Slightly slower I/O") - print(" ❌ Cache misses more frequent") + print(" PASS Better model generalization") + print(" PASS Prevents order memorization") + print(" FAIL Slightly slower I/O") + print(" FAIL Cache misses more frequent") - print("\n🎯 PRODUCTION RECOMMENDATION:") + print("\nTARGET PRODUCTION RECOMMENDATION:") print(" Always use shuffling for training (generalization > speed)") print(" Use sequential for inference (speed matters, no learning)") except Exception as e: - print(f"⚠️ Error in I/O strategy analysis: {e}") + print(f"WARNING️ Error in I/O strategy analysis: {e}") # Run the analysis analyze_io_strategy_impact() @@ -1860,7 +1860,7 @@ if __name__ == '__main__': print(f"\n🖥️ {scenario_name}:") balance_analysis = profiler.simulate_compute_vs_io_balance(sample_dataloader, compute_time) - print(f"\n🎯 PRODUCTION I/O OPTIMIZATION LESSONS:") + print(f"\nTARGET PRODUCTION I/O OPTIMIZATION LESSONS:") print(f"=" * 50) print(f"\n1. 📊 I/O BOTTLENECK IDENTIFICATION:") @@ -1868,7 +1868,7 @@ if __name__ == '__main__': print(f" - CPU training rarely I/O bottlenecked") print(f" - Modern GPUs process data faster than storage provides it") - print(f"\n2. 🚀 OPTIMIZATION STRATEGIES:") + print(f"\n2. 
ROCKET OPTIMIZATION STRATEGIES:") print(f" - Data prefetching: Load next batch while GPU computes") print(f" - Parallel workers: Multiple threads/processes for loading") print(f" - Faster storage: NVMe SSD vs SATA vs network storage") @@ -1884,9 +1884,9 @@ if __name__ == '__main__': print(f" - GPU utilization directly affects training costs") print(f" - Faster storage investment pays off in GPU efficiency") - print(f"\n💡 SYSTEMS ENGINEERING INSIGHT:") + print(f"\nTIP SYSTEMS ENGINEERING INSIGHT:") print(f"I/O optimization is often the highest-impact performance improvement:") - print(f"- GPUs are expensive → maximize their utilization") + print(f"- GPUs are expensive -> maximize their utilization") print(f"- Data loading is often the limiting factor") print(f"- 10% I/O improvement = 10% faster training = 10% cost reduction") print(f"- Modern ML systems spend significant effort on data pipeline optimization") @@ -1902,31 +1902,31 @@ if __name__ == "__main__": print(f"Sample 0: data={data}, label={label}") assert isinstance(data, Tensor), "Data should be a Tensor" assert isinstance(label, Tensor), "Label should be a Tensor" - print("✅ Dataset __getitem__ works correctly") + print("PASS Dataset __getitem__ works correctly") # Test __len__ assert len(test_dataset) == 5, f"Dataset length should be 5, got {len(test_dataset)}" - print("✅ Dataset __len__ works correctly") + print("PASS Dataset __len__ works correctly") # Test get_num_classes num_classes = test_dataset.get_num_classes() assert num_classes == 3, f"Number of classes should be 3, got {num_classes}" - print("✅ Dataset get_num_classes works correctly") + print("PASS Dataset get_num_classes works correctly") # Test get_sample_shape sample_shape = test_dataset.get_sample_shape() assert sample_shape == (2,), f"Sample shape should be (2,), got {sample_shape}" - print("✅ Dataset get_sample_shape works correctly") + print("PASS Dataset get_sample_shape works correctly") - print("🎯 Dataset interface 
pattern:") + print("TARGET Dataset interface pattern:") print(" __getitem__: Returns (data, label) tuple") print(" __len__: Returns dataset size") print(" get_num_classes: Returns number of classes") print(" get_sample_shape: Returns shape of data samples") - print("📈 Progress: Dataset interface ✓") + print("PROGRESS Progress: Dataset interface OK") except Exception as e: - print(f"❌ Dataset interface test failed: {e}") + print(f"FAIL Dataset interface test failed: {e}") raise # Run all tests @@ -1941,7 +1941,7 @@ if __name__ == "__main__": # %% [markdown] """ -## 🤔 ML Systems Thinking Questions +## THINK ML Systems Thinking Questions ### System Design 1. How does TinyTorch's DataLoader design compare to PyTorch's DataLoader and TensorFlow's tf.data API in terms of flexibility and performance? @@ -1966,16 +1966,16 @@ if __name__ == "__main__": # %% [markdown] """ -## 🎯 MODULE SUMMARY: Data Loading and Processing +## TARGET MODULE SUMMARY: Data Loading and Processing Congratulations! 
You've successfully implemented professional data loading systems: ### What You've Accomplished -✅ **DataLoader Class**: Efficient batch processing with memory management -✅ **Dataset Integration**: Seamless compatibility with Tensor operations -✅ **Batch Processing**: Optimized data loading for training -✅ **Memory Management**: Efficient handling of large datasets -✅ **Real Applications**: Image classification, regression, and more +PASS **DataLoader Class**: Efficient batch processing with memory management +PASS **Dataset Integration**: Seamless compatibility with Tensor operations +PASS **Batch Processing**: Optimized data loading for training +PASS **Memory Management**: Efficient handling of large datasets +PASS **Real Applications**: Image classification, regression, and more ### Key Concepts You've Learned - **Batch processing**: How to efficiently process data in chunks diff --git a/modules/10_tokenization/tokenization_dev.py b/modules/10_tokenization/tokenization_dev.py index ec3b1f3b..114b134c 100644 --- a/modules/10_tokenization/tokenization_dev.py +++ b/modules/10_tokenization/tokenization_dev.py @@ -21,7 +21,7 @@ Welcome to the Tokenization module! You'll implement the fundamental text proces - Framework connection: See how your implementations match production tokenization systems - Performance insight: Learn how tokenization throughput affects training pipeline efficiency -## Build → Use → Reflect +## Build -> Use -> Reflect 1. **Build**: Character tokenizer and basic BPE (Byte Pair Encoding) implementation 2. **Use**: Process real text and observe how different tokenization strategies affect sequence length 3. **Reflect**: How does tokenization choice determine model efficiency and language understanding? 
@@ -35,8 +35,8 @@ By the end of this module, you'll understand: - Connection to production systems like GPT's tokenizers and their design trade-offs ## Systems Reality Check -💡 **Production Context**: Modern language models use sophisticated tokenizers (GPT's tiktoken, SentencePiece) - your implementation reveals the algorithmic foundations -⚡ **Performance Note**: Tokenization can become a bottleneck in training pipelines - efficient string processing is critical for high-throughput training +TIP **Production Context**: Modern language models use sophisticated tokenizers (GPT's tiktoken, SentencePiece) - your implementation reveals the algorithmic foundations +SPEED **Performance Note**: Tokenization can become a bottleneck in training pipelines - efficient string processing is critical for high-throughput training """ # %% nbgrader={"grade": false, "grade_id": "tokenization-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -64,7 +64,7 @@ print("Ready to build text processing systems!") # %% [markdown] """ -## 📦 Where This Code Lives in the Final Package +## PACKAGE Where This Code Lives in the Final Package **Learning Side:** You work in `modules/source/11_tokenization/tokenization_dev.py` **Building Side:** Code exports to `tinytorch.core.tokenization` @@ -91,77 +91,77 @@ from tinytorch.core.embeddings import Embedding # Next module Neural networks work with numbers, but we want to process text: ``` -"Hello world!" → [15496, 995, 0] # Numbers the model can understand +"Hello world!" -> [15496, 995, 0] # Numbers the model can understand ``` ### 🔤 Visual Tokenization Flow ``` -Raw Text → Tokenization Strategy → Token IDs → Neural Network Input +Raw Text -> Tokenization Strategy -> Token IDs -> Neural Network Input "Hello world!" 
- ↓ -┌─────────────────────────┐ -│ Tokenization Process │ -│ ┌─────────────────────┐│ -│ │ Split into tokens ││ -│ └─────────────────────┘│ -│ ↓ │ -│ ┌─────────────────────┐│ -│ │ Map to vocabulary ││ -│ └─────────────────────┘│ -└─────────────────────────┘ - ↓ + v ++-------------------------+ +| Tokenization Process | +| +---------------------+| +| | Split into tokens || +| +---------------------+| +| v | +| +---------------------+| +| | Map to vocabulary || +| +---------------------+| ++-------------------------+ + v [15496, 995, 0] - ↓ + v Neural Network ``` ### 📊 Tokenization Strategy Comparison ``` -Strategy │ Vocab Size │ Sequence Length │ Use Case -──────────────┼────────────┼─────────────────┼───────────────── -Character │ ~256 │ Long │ Simple/Debug -Subword (BPE) │ ~50,000 │ Medium │ Production -Word-level │ ~100,000+ │ Short │ Specialized +Strategy | Vocab Size | Sequence Length | Use Case +--------------+------------+-----------------+----------------- +Character | ~256 | Long | Simple/Debug +Subword (BPE) | ~50,000 | Medium | Production +Word-level | ~100,000+ | Short | Specialized ``` -### 🎯 Systems Trade-offs Visualization +### TARGET Systems Trade-offs Visualization ``` Memory Usage Impact - ↓ - ┌─────────────────────────┐ - │ Vocabulary Size │───► Embedding Table Memory - │ │ vocab_size × embed_dim × 4 bytes - 
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†“ - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Sequence Length โ”‚โ”€โ”€โ”€โ–บ Attention Memory - โ”‚ โ”‚ O(sequence_lengthยฒ) - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†“ - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Tokenization Speed โ”‚โ”€โ”€โ”€โ–บ Training Throughput - โ”‚ โ”‚ tokens/second pipeline - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + v + +-------------------------+ + | Vocabulary Size |---> Embedding Table Memory + | | vocab_size * embed_dim * 4 bytes + +-------------------------+ + v + +-------------------------+ + | Sequence Length |---> Attention Memory + | | O(sequence_lengthยฒ) + +-------------------------+ + v + +-------------------------+ + | Tokenization Speed |---> Training Throughput + | | tokens/second pipeline + +-------------------------+ Key Insight: Tokenization choices create cascading effects throughout ML systems! 
``` -### 🔍 Character vs Subword vs Word Example +### MAGNIFY Character vs Subword vs Word Example ``` Input: "The tokenization process" Character-level: ['T','h','e',' ','t','o','k','e','n','i','z','a','t','i','o','n',' ','p','r','o','c','e','s','s'] -↓ (24 tokens, vocab ~256) +v (24 tokens, vocab ~256) Subword (BPE): ['The', 'token', 'ization', 'process'] -↓ (4 tokens, vocab ~50k) +v (4 tokens, vocab ~50k) Word-level: ['The', 'tokenization', 'process'] -↓ (3 tokens, vocab ~100k+) +v (3 tokens, vocab ~100k+) Trade-off: Smaller vocab = Longer sequences = More computation Larger vocab = More parameters = More memory @@ -360,7 +360,7 @@ class CharTokenizer: # %% [markdown] """ -### 🧪 Test Your Character Tokenizer Implementation +### TEST Test Your Character Tokenizer Implementation Once you implement the CharTokenizer encode and decode methods above, run this cell to test it: """ @@ -394,7 +394,7 @@ def test_unit_char_tokenizer(): assert tokenizer.vocab_size >= 99, "Should have at least 99 tokens (4 special + 95 ASCII)" # Test unknown character handling - unknown_tokens = tokenizer.encode("🚀", add_special_tokens=False) # Emoji not in ASCII + unknown_tokens = tokenizer.encode("\U0001F680", add_special_tokens=False) # Rocket emoji (U+1F680) is not in ASCII assert unknown_tokens[0] == tokenizer.char_to_idx['<UNK>'], "Should use UNK token for unknown chars" # Test padding @@ -404,11 +404,11 @@ def test_unit_char_tokenizer(): assert len(padded[1]) == 4, "Second sequence should be padded to length 4" assert padded[1][-1] == tokenizer.char_to_idx['<PAD>'], "Should use PAD token for padding" - print("✅ Character tokenizer tests passed!") - print(f"✅ Vocabulary size: {tokenizer.vocab_size}") - print(f"✅ Encode/decode cycle works correctly") - print(f"✅ Special tokens handled properly") - print(f"✅ Padding functionality works") + print("PASS Character tokenizer tests passed!") + print(f"PASS Vocabulary size: {tokenizer.vocab_size}") + print(f"PASS Encode/decode cycle works correctly") +
print(f"PASS Special tokens handled properly") + print(f"PASS Padding functionality works") # Test function defined (called in main block) @@ -421,35 +421,35 @@ Now let's implement a simplified version of BPE, the subword tokenization algori ### ๐Ÿงฉ BPE Algorithm Visualization ``` Step 1: Start with characters -"hello" โ†’ ['h', 'e', 'l', 'l', 'o', ''] +"hello" -> ['h', 'e', 'l', 'l', 'o', ''] Step 2: Count adjacent pairs -('l', 'l'): 1 occurrence โ† Most frequent pair +('l', 'l'): 1 occurrence <- Most frequent pair Step 3: Merge most frequent pair -['h', 'e', 'l', 'l', 'o', ''] โ†’ ['h', 'e', 'll', 'o', ''] +['h', 'e', 'l', 'l', 'o', ''] -> ['h', 'e', 'll', 'o', ''] Step 4: Repeat until vocabulary target reached -Next iteration might merge ('e', 'll') โ†’ 'ell' if frequent enough +Next iteration might merge ('e', 'll') -> 'ell' if frequent enough BPE Training Process: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Character Vocab โ”‚ โ”€โ”€โ”€โ–บ โ”‚ Count Pairs โ”‚ โ”€โ”€โ”€โ–บ โ”‚ Merge Most โ”‚ -โ”‚ a, b, c, d... โ”‚ โ”‚ (a,b): 5 โ”‚ โ”‚ Frequent Pair โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ (c,d): 3 โ”‚ โ”‚ (a,b) โ†’ ab โ”‚ - โ†‘ โ”‚ (e,f): 1 โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ - โ”‚ โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Repeat Until Target โ†โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++-----------------+ +-----------------+ +-----------------+ +| Character Vocab | ---> | Count Pairs | ---> | Merge Most | +| a, b, c, d... 
| | (a,b): 5 | | Frequent Pair | ++-----------------+ | (c,d): 3 | | (a,b) -> ab | + ^ | (e,f): 1 | +-----------------+ + | +-----------------+ | + | | + +------------------- Repeat Until Target <---------+ ``` -### ๐Ÿ“ˆ BPE Learning Process Example +### PROGRESS BPE Learning Process Example ``` Initial: "hello" = ['h', 'e', 'l', 'l', 'o', ''] Iteration 1: Pairs: (h,e):1, (e,l):1, (l,l):1, (l,o):1, (o,):1 - Merge: (l,l) โ†’ 'll' + Merge: (l,l) -> 'll' Result: ['h', 'e', 'll', 'o', ''] Iteration 2: @@ -460,19 +460,19 @@ Iteration 2: Key Insight: BPE learns common subword patterns from data! ``` -### ๐ŸŽฏ BPE Benefits +### TARGET BPE Benefits ``` Traditional Tokenization Problems: -โŒ "unhappiness" โ†’ UNK (unknown word) -โŒ "supercalifragilisticexpialidocious" โ†’ UNK +FAIL "unhappiness" -> UNK (unknown word) +FAIL "supercalifragilisticexpialidocious" -> UNK BPE Solution: -โœ… "unhappiness" โ†’ ['un', 'happy', 'ness'] (recognizable parts) -โœ… "supercali..." โ†’ ['super', 'cal', 'i', 'frag', ...] (graceful degradation) +PASS "unhappiness" -> ['un', 'happy', 'ness'] (recognizable parts) +PASS "supercali..." -> ['super', 'cal', 'i', 'frag', ...] 
(graceful degradation) Memory Efficiency: -Character: 26 vocab ร— 512 embed_dim = 13,312 parameters -BPE-50k: 50,000 vocab ร— 512 embed_dim = 25,600,000 parameters +Character: 26 vocab * 512 embed_dim = 13,312 parameters +BPE-50k: 50,000 vocab * 512 embed_dim = 25,600,000 parameters Trade-off: More parameters, shorter sequences (faster attention) ``` """ @@ -754,7 +754,7 @@ class BPETokenizer: # %% [markdown] """ -### ๐Ÿงช Test Your BPE Implementation +### TEST Test Your BPE Implementation Once you implement the BPE helper methods above, run this cell to test it: """ @@ -814,36 +814,36 @@ def test_unit_bpe_tokenizer(): individual_l_count = sum(1 for token in merged[0] if token == 'l') assert individual_l_count == 0, f"Should have no individual 'l' tokens after merge, got {individual_l_count}" - print("โœ… BPE tokenizer tests passed!") - print(f"โœ… Trained vocabulary size: {len(bpe.char_to_idx)}") - print(f"โœ… Learned {len(bpe.merges)} merges") - print(f"โœ… Encode/decode cycle works") + print("PASS BPE tokenizer tests passed!") + print(f"PASS Trained vocabulary size: {len(bpe.char_to_idx)}") + print(f"PASS Learned {len(bpe.merges)} merges") + print(f"PASS Encode/decode cycle works") # Test function defined (called in main block) # %% [markdown] """ -## ๐ŸŽฏ ML Systems: Performance Analysis & Tokenization Efficiency +## TARGET ML Systems: Performance Analysis & Tokenization Efficiency Now let's develop systems engineering skills by analyzing tokenization performance and understanding how tokenization choices affect downstream ML system efficiency. ### **Learning Outcome**: *"I understand how tokenization affects model memory, training speed, and language understanding"* -### ๐Ÿ” Systems Insights Functions +### MAGNIFY Systems Insights Functions The next few implementations include **executable analysis functions** that help you discover key insights about tokenization performance and memory scaling. 
These aren't just code - they're interactive learning tools that reveal how tokenization choices affect real ML systems. ### ๐Ÿ“Š What We'll Measure ``` Performance Metrics: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Tokenization โ”‚ โ”‚ Memory Usage โ”‚ โ”‚ Scaling โ”‚ -โ”‚ Speed โ”‚ โ”‚ Analysis โ”‚ โ”‚ Behavior โ”‚ -โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ -โ”‚ โ€ข tokens/sec โ”‚ โ”‚ โ€ข vocab memory โ”‚ โ”‚ โ€ข time complexityโ”‚ -โ”‚ โ€ข chars/sec โ”‚ โ”‚ โ€ข sequence mem โ”‚ โ”‚ โ€ข space complexityโ”‚ -โ”‚ โ€ข compression โ”‚ โ”‚ โ€ข total footprintโ”‚ โ”‚ โ€ข bottleneck ID โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++-----------------+ +-----------------+ +-----------------+ +| Tokenization | | Memory Usage | | Scaling | +| Speed | | Analysis | | Behavior | +| | | | | | +| โ€ข tokens/sec | | โ€ข vocab memory | | โ€ข time complexity| +| โ€ข chars/sec | | โ€ข sequence mem | | โ€ข space complexity| +| โ€ข compression | | โ€ข total footprint| | โ€ข bottleneck ID | ++-----------------+ +-----------------+ +-----------------+ ``` """ @@ -932,7 +932,7 @@ class TokenizationProfiler: This function is PROVIDED to show comprehensive comparison. """ - print("๐Ÿ” TOKENIZER COMPARISON") + print("MAGNIFY TOKENIZER COMPARISON") print("=" * 50) # Create tokenizers @@ -969,7 +969,7 @@ class TokenizationProfiler: This function is PROVIDED to demonstrate scaling analysis. 
""" - print(f"\n๐Ÿ” MEMORY SCALING ANALYSIS") + print(f"\nMAGNIFY MEMORY SCALING ANALYSIS") print("=" * 40) scaling_results = [] @@ -999,7 +999,7 @@ class TokenizationProfiler: } scaling_results.append(result) - print(f" {length:>6} chars โ†’ {len(tokens):>4} tokens ({time_taken*1000:.2f}ms)") + print(f" {length:>6} chars -> {len(tokens):>4} tokens ({time_taken*1000:.2f}ms)") # Analyze scaling pattern if len(scaling_results) >= 2: @@ -1010,7 +1010,7 @@ class TokenizationProfiler: time_ratio = large['time_ms'] / small['time_ms'] memory_ratio = large['total_memory_bytes'] / small['total_memory_bytes'] - print(f"\n๐Ÿ“ˆ Scaling Analysis:") + print(f"\nPROGRESS Scaling Analysis:") print(f" Text length increased {length_ratio:.1f}x") print(f" Time increased {time_ratio:.1f}x") print(f" Memory increased {memory_ratio:.1f}x") @@ -1024,7 +1024,7 @@ def analyze_tokenization_impact(): This function is PROVIDED to show systems-level thinking. """ - print("๐ŸŽฏ TOKENIZATION IMPACT ON ML SYSTEMS") + print("TARGET TOKENIZATION IMPACT ON ML SYSTEMS") print("=" * 60) # Sample texts for analysis @@ -1068,16 +1068,16 @@ def analyze_tokenization_impact(): print(f"{name:<12} {tokenizer.vocab_size:<10} {avg_tokens:<10.1f} {total_memory:<15.1f}MB") - print(f"\n๐Ÿ’ก KEY INSIGHTS:") + print(f"\nTIP KEY INSIGHTS:") print(f" ๐Ÿ”ค Character tokenizer: Small vocabulary, long sequences") print(f" ๐Ÿงฉ BPE tokenizer: Medium vocabulary, shorter sequences") - print(f" ๐Ÿ“ˆ Memory scaling: O(vocab_size * embed_dim + seq_len * batch_size)") - print(f" โšก Attention complexity: O(seq_lenยฒ) - shorter sequences = faster attention") + print(f" PROGRESS Memory scaling: O(vocab_size * embed_dim + seq_len * batch_size)") + print(f" SPEED Attention complexity: O(seq_lenยฒ) - shorter sequences = faster attention") print(f" ๐Ÿญ Production trade-off: Vocabulary size vs sequence length vs compute") # %% [markdown] """ -### ๐Ÿงช Test: Tokenization Performance Analysis +### TEST Test: Tokenization Performance 
Analysis Let's test our tokenization profiler with realistic performance scenarios. """ @@ -1115,14 +1115,14 @@ def test_tokenization_profiler(): assert metrics['total_tokens'] > 0, "Should count tokens" assert metrics['texts_per_second'] > 0, "Should measure throughput" - print("✅ Basic profiling functionality test passed") + print("PASS Basic profiling functionality test passed") # Test comparison comparison_results = profiler.compare_tokenizers(test_texts) assert isinstance(comparison_results, dict), "Should return comparison results" assert len(comparison_results) >= 1, "Should test at least one tokenizer" - print("✅ Tokenizer comparison test passed") + print("PASS Tokenizer comparison test passed") # Test scaling analysis scaling_results = profiler.analyze_memory_scaling(char_tokenizer, [50, 100]) @@ -1134,8 +1134,8 @@ def test_tokenization_profiler(): assert 'num_tokens' in result, "Should include token count" assert result['num_tokens'] > 0, "Should produce tokens" - print("✅ Scaling analysis test passed") - print("🎯 Tokenization Profiler: All tests passed!") + print("PASS Scaling analysis test passed") + print("TARGET Tokenization Profiler: All tests passed!") # Test function defined (called in main block) @@ -1204,7 +1204,7 @@ def analyze_tokenization_systems_impact(): print(f" {model_name}: {total_memory:.1f}MB total") print(f" Embedding: {embed_memory:.1f}MB, Sequence: {seq_memory:.1f}MB, Attention: {attention_memory:.1f}MB") - print(f"\n🎯 KEY SYSTEM DESIGN INSIGHTS:") + print(f"\nTARGET KEY SYSTEM DESIGN INSIGHTS:") print(f" 1. Vocabulary Size Trade-offs:") print(f" - Larger vocab = more parameters = more memory") print(f" - Smaller vocab = longer sequences = more compute") @@ -1223,21 +1223,21 @@ def analyze_tokenization_systems_impact(): # %% [markdown] """ -## 🔍 Interactive Systems Insights +## MAGNIFY Interactive Systems Insights Let's build intuition about tokenization through hands-on analysis.
These functions reveal how tokenization choices cascade through ML systems. """ -# ✅ IMPLEMENTATION CHECKPOINT: Ensure your tokenizers are complete before running +# PASS IMPLEMENTATION CHECKPOINT: Ensure your tokenizers are complete before running -# 🤔 PREDICTION: Which tokenizer will use more memory - character or BPE? Why? +# THINK PREDICTION: Which tokenizer will use more memory - character or BPE? Why? # Your guess: _______ -# 🔍 SYSTEMS INSIGHT #1: Vocabulary Size vs Memory Trade-offs +# MAGNIFY SYSTEMS INSIGHT #1: Vocabulary Size vs Memory Trade-offs def analyze_tokenization_memory_impact(): """Analyze how vocabulary size affects model memory usage.""" try: - print("🔍 TOKENIZATION MEMORY IMPACT ANALYSIS") + print("MAGNIFY TOKENIZATION MEMORY IMPACT ANALYSIS") print("=" * 50) # Create tokenizers with different vocabulary sizes @@ -1295,7 +1295,7 @@ def analyze_tokenization_memory_impact(): total_per_sample = sequence_memory_kb + attention_memory_kb print(f" Total per sample: {total_per_sample:.1f} KB") - print(f"\n💡 KEY INSIGHTS:") + print(f"\nTIP KEY INSIGHTS:") print(f" • Vocabulary size directly affects model parameters") print(f" • Sequence length affects computation (attention is O(N²))") print(f" • Character tokenization: Small vocab, long sequences") @@ -1303,22 +1303,22 @@ def analyze_tokenization_memory_impact(): print(f" • Production trade-off: Parameters vs computation") except Exception as e: - print(f"⚠️ Error in memory analysis: {e}") + print(f"WARNING Error in memory analysis: {e}") print("Make sure both tokenizers are implemented correctly") # Run the analysis analyze_tokenization_memory_impact() -# ✅ IMPLEMENTATION CHECKPOINT: Ensure BPE merge functions are working +# PASS IMPLEMENTATION CHECKPOINT: Ensure BPE merge functions are working -# 🤔 PREDICTION: How does tokenization speed scale with text length? +# THINK PREDICTION: How does tokenization speed scale with text length? # Linear? Quadratic?
Your guess: _______ -# 🔍 SYSTEMS INSIGHT #2: Tokenization Speed Scaling Analysis +# MAGNIFY SYSTEMS INSIGHT #2: Tokenization Speed Scaling Analysis def analyze_tokenization_speed_scaling(): """Measure how tokenization performance scales with input size.""" try: - print("\n🔍 TOKENIZATION SPEED SCALING ANALYSIS") + print("\nMAGNIFY TOKENIZATION SPEED SCALING ANALYSIS") print("=" * 50) char_tokenizer = CharTokenizer() @@ -1340,22 +1340,22 @@ def analyze_tokenization_speed_scaling(): char_times.append(char_time) - print(f" {length:>5} chars → {len(char_tokens):>5} tokens in {char_time*1000:.2f}ms") + print(f" {length:>5} chars -> {len(char_tokens):>5} tokens in {char_time*1000:.2f}ms") # Analyze scaling pattern if len(char_times) >= 2: - print(f"\n📈 Scaling Analysis:") + print(f"\nPROGRESS Scaling Analysis:") for i in range(1, len(text_lengths)): length_ratio = text_lengths[i] / text_lengths[0] time_ratio = char_times[i] / char_times[0] if char_times[0] > 0 else 0 - print(f" {text_lengths[i]:>5} chars: {length_ratio:.1f}x length → {time_ratio:.1f}x time") + print(f" {text_lengths[i]:>5} chars: {length_ratio:.1f}x length -> {time_ratio:.1f}x time") # Calculate approximate complexity avg_scaling = sum(char_times[i]/char_times[0] / (text_lengths[i]/text_lengths[0]) for i in range(1, len(text_lengths)) if char_times[0] > 0) / (len(text_lengths) - 1) - print(f"\n🎯 SCALING INSIGHTS:") + print(f"\nTARGET SCALING INSIGHTS:") print(f" • Character tokenization: ~O(N) time complexity") print(f" • Average scaling factor: {avg_scaling:.2f} (1.0 = perfect linear)") if avg_scaling < 1.2: @@ -1369,22 +1369,22 @@ def analyze_tokenization_speed_scaling(): print(f" • Production implication: Tokenization speed rarely bottlenecks training") except Exception as e: - print(f"⚠️ Error in scaling analysis: {e}") + print(f"WARNING Error in scaling analysis: {e}") print("Make sure character tokenizer is implemented correctly") # Run the scaling analysis
analyze_tokenization_speed_scaling() -# ✅ IMPLEMENTATION CHECKPOINT: All tokenization systems working +# PASS IMPLEMENTATION CHECKPOINT: All tokenization systems working -# 🤔 PREDICTION: For a 7B parameter model, what percentage of memory is vocabulary? +# THINK PREDICTION: For a 7B parameter model, what percentage of memory is vocabulary? # Your estimate: _______% -# 🔍 SYSTEMS INSIGHT #3: Production Model Memory Breakdown +# MAGNIFY SYSTEMS INSIGHT #3: Production Model Memory Breakdown def analyze_production_memory_breakdown(): """Analyze vocabulary memory in production-scale language models.""" try: - print("\n🔍 PRODUCTION MODEL MEMORY BREAKDOWN") + print("\nMAGNIFY PRODUCTION MODEL MEMORY BREAKDOWN") print("=" * 50) # Model configurations based on real systems @@ -1412,10 +1412,10 @@ def analyze_production_memory_breakdown(): print(f"{model_name:<12} {total_params/1e6:>8.0f}M {vocab_params/1e6:>8.1f}M {vocab_percentage:>6.1f}% {vocab_memory_mb:>8.0f}MB") - print(f"\n🎯 PRODUCTION INSIGHTS:") + print(f"\nTARGET PRODUCTION INSIGHTS:") print(f" • Small models (100M): Vocabulary is ~20-30% of parameters") print(f" • Large models (7B+): Vocabulary is ~1-2% of parameters") - print(f" • Vocabulary memory scales with vocab_size × embed_dim") + print(f" • Vocabulary memory scales with vocab_size * embed_dim") print(f" • GPT uses 50k vocabulary, LLaMA uses 32k (efficiency optimization)") # Calculate tokenization efficiency comparison @@ -1437,7 +1437,7 @@ def analyze_production_memory_breakdown(): char_tokens = len(sample_text) # Approximate character count gpt_tokens = char_tokens // 4 # Approximate GPT tokenization (4 chars/token) - print(f"\n⚡ COMPUTE EFFICIENCY:") + print(f"\nSPEED COMPUTE EFFICIENCY:") print(f" Sample text: '{sample_text}'") print(f" Character tokens: ~{char_tokens}") print(f" GPT tokens: ~{gpt_tokens}") @@ -1446,13 +1446,13 @@ def analyze_production_memory_breakdown(): print(f" GPT attention: O({gpt_tokens}²) =
{gpt_tokens**2:,} operations") print(f" Compute reduction: {(char_tokens**2)/(gpt_tokens**2):.1f}x faster attention") - print(f"\n💡 TRADE-OFF SUMMARY:") + print(f"\nTIP TRADE-OFF SUMMARY:") print(f" • BPE uses {gpt_memory/char_memory:.0f}x more vocabulary memory") print(f" • BPE provides {(char_tokens**2)/(gpt_tokens**2):.1f}x faster attention computation") print(f" • Production systems choose BPE for compute efficiency") except Exception as e: - print(f"⚠️ Error in production analysis: {e}") + print(f"WARNING Error in production analysis: {e}") print("Error in memory calculation - check model configurations") # Run the production analysis @@ -1460,7 +1460,7 @@ analyze_production_memory_breakdown() # %% [markdown] """ -## 🚀 Advanced: Tokenization Efficiency Techniques +## ROCKET Advanced: Tokenization Efficiency Techniques Production tokenization systems use several optimization techniques. Let's implement a few key ones: """ @@ -1543,7 +1543,7 @@ def demonstrate_production_optimizations(): This function is PROVIDED to show real-world optimization techniques.
""" - print("๐Ÿš€ PRODUCTION TOKENIZATION OPTIMIZATIONS") + print("ROCKET PRODUCTION TOKENIZATION OPTIMIZATIONS") print("=" * 60) # Create optimized tokenizer @@ -1586,18 +1586,18 @@ def demonstrate_production_optimizations(): # Report results cache_stats = optimized_tokenizer.get_cache_stats() - print(f"\nโšก PERFORMANCE COMPARISON:") + print(f"\nSPEED PERFORMANCE COMPARISON:") print(f" No caching: {no_cache_time*1000:.2f}ms") print(f" With caching: {cache_time*1000:.2f}ms ({(no_cache_time/cache_time):.1f}x speedup)") print(f" Batch processing: {batch_time*1000:.2f}ms") - print(f"\n๐Ÿ“ˆ CACHE PERFORMANCE:") + print(f"\nPROGRESS CACHE PERFORMANCE:") print(f" Hit rate: {cache_stats['hit_rate']*100:.1f}%") print(f" Cache hits: {cache_stats['cache_hits']}") print(f" Cache misses: {cache_stats['cache_misses']}") print(f" Cache size: {cache_stats['cache_size']} entries") - print(f"\n๐ŸŽฏ PRODUCTION INSIGHTS:") + print(f"\nTARGET PRODUCTION INSIGHTS:") print(f" - Caching provides significant speedup for repeated texts") print(f" - Batch processing enables vectorized operations") print(f" - Memory-efficient encoding reduces allocation overhead") @@ -1615,7 +1615,7 @@ Let's run comprehensive tests to ensure all tokenization functionality works cor # %% nbgrader={"grade": false, "grade_id": "test-tokenization-comprehensive", "locked": false, "schema_version": 3, "solution": false, "task": false} def test_tokenization_comprehensive(): """Comprehensive test suite for all tokenization functionality.""" - print("๐Ÿงช Comprehensive Tokenization Tests...") + print("TEST Comprehensive Tokenization Tests...") # Test 1: Character tokenizer edge cases print(" Testing character tokenizer edge cases...") @@ -1640,7 +1640,7 @@ def test_tokenization_comprehensive(): decoded = char_tokenizer.decode(tokens, skip_special_tokens=True) assert decoded == original, "Round-trip should preserve text" - print(" โœ… Character tokenizer edge cases passed") + print(" PASS Character tokenizer edge 
cases passed") # Test 2: BPE tokenizer robustness print(" Testing BPE tokenizer robustness...") @@ -1673,7 +1673,7 @@ def test_tokenization_comprehensive(): # BPE decoding might have slightly different spacing due to word boundaries assert test_text.replace(" ", "") in decoded.replace(" ", ""), f"BPE round-trip failed for '{test_text}'" - print(" ✅ BPE tokenizer robustness passed") + print(" PASS BPE tokenizer robustness passed") # Test 3: Memory efficiency with large texts print(" Testing memory efficiency...") @@ -1686,7 +1686,7 @@ def test_tokenization_comprehensive(): assert len(char_tokens) > 20000, "Should handle large texts" assert char_time < 1.0, "Should tokenize large text quickly" - print(" ✅ Memory efficiency tests passed") + print(" PASS Memory efficiency tests passed") # Test 4: Integration with optimization features print(" Testing optimization features...") @@ -1710,9 +1710,9 @@ def test_tokenization_comprehensive(): assert len(batch_results) == len(batch_texts), "Batch size should match input" assert all(len(seq) == len(batch_results[0]) for seq in batch_results), "All sequences should be padded to same length" - print(" ✅ Optimization features tests passed") + print(" PASS Optimization features tests passed") - print("✅ All comprehensive tokenization tests passed!") + print("PASS All comprehensive tokenization tests passed!") # Test function defined (called in main block) @@ -1729,7 +1729,7 @@ if __name__ == "__main__": print("="*60) # Run all unit tests - print("\n🧪 UNIT TESTS") + print("\nTEST UNIT TESTS") print("-" * 30) test_unit_char_tokenizer() test_unit_bpe_tokenizer() @@ -1742,7 +1742,7 @@ if __name__ == "__main__": # Performance analysis print("\n" + "="*60) - print("🔍 TOKENIZATION PERFORMANCE ANALYSIS") + print("MAGNIFY TOKENIZATION PERFORMANCE ANALYSIS") print("="*60) # Create test data @@ -1767,16 +1767,16 @@ if __name__ == "__main__": demonstrate_production_optimizations() print("\n" + "="*60) - print("🎯 TOKENIZATION
MODULE COMPLETE!") + print("TARGET TOKENIZATION MODULE COMPLETE!") print("="*60) - print("✅ All tokenization tests passed!") - print("✅ Systems insights analysis complete!") - print("✅ Performance profiling successful!") - print("🚀 Ready for embedding layer integration!") + print("PASS All tokenization tests passed!") + print("PASS Systems insights analysis complete!") + print("PASS Performance profiling successful!") + print("ROCKET Ready for embedding layer integration!") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Now that you've built the text processing foundation for language models, let's connect this work to broader ML systems challenges. These questions help you think critically about how tokenization scales to production language processing systems. @@ -1953,11 +1953,11 @@ GRADING RUBRIC (Instructor Use): # %% [markdown] """ -## 🎯 MODULE SUMMARY: Tokenization +## TARGET MODULE SUMMARY: Tokenization Congratulations! You have successfully implemented comprehensive tokenization systems for language processing: -### ✅ What You Have Built +### PASS What You Have Built - **Character Tokenizer**: Simple character-level tokenization with special token handling - **BPE Tokenizer**: Subword tokenization using Byte Pair Encoding algorithm - **Vocabulary Management**: Efficient mapping between text and numerical representations @@ -1966,41 +1966,41 @@ Congratulations!
You have successfully implemented comprehensive tokenization sy - **🆕 Memory Efficiency**: Optimized string processing and token caching systems - **🆕 Systems Analysis**: Comprehensive performance profiling and scaling analysis -### ✅ Key Learning Outcomes +### PASS Key Learning Outcomes - **Understanding**: How text becomes numbers that neural networks can process - **Implementation**: Built character and subword tokenizers from scratch - **Systems Insight**: How tokenization affects model memory, performance, and capabilities - **Performance Engineering**: Measured and optimized tokenization throughput - **Production Context**: Understanding real-world tokenization challenges and solutions -### ✅ Technical Mastery +### PASS Technical Mastery - **Character Tokenization**: Simple but interpretable text processing - **BPE Algorithm**: Iterative pair merging for subword discovery - **Vocabulary Trade-offs**: Balancing vocabulary size vs sequence length - **Memory Optimization**: Efficient caching and batch processing techniques - **🆕 Performance Analysis**: Measuring tokenization impact on downstream systems -### ✅ Professional Skills Developed +### PASS Professional Skills Developed - **Algorithm Implementation**: Building complex text processing systems - **Performance Engineering**: Optimizing for speed and memory efficiency - **Systems Thinking**: Understanding tokenization's role in ML pipelines - **Production Optimization**: Caching, batching, and scalability techniques -### ✅ Ready for Next Steps +### PASS Ready for Next Steps Your tokenization systems are now ready to power: - **Embedding Layers**: Converting tokens to dense vector representations - **Language Models**: Processing text for transformer architectures - **Production Systems**: Efficient text processing pipelines - **🧠 Text Understanding**: Foundation for natural language processing -### 🔗 Connection to Real ML Systems +### LINK Connection to Real ML Systems Your
implementations mirror production systems: - **GPT Tokenizers**: Modern language models use sophisticated BPE variants - **SentencePiece**: Unigram language model tokenization used in many systems - **Hugging Face Tokenizers**: Production-optimized tokenization libraries - **Industry Applications**: Every language model relies on efficient tokenization -### 🎯 The Power of Text Processing +### TARGET The Power of Text Processing You have unlocked the bridge between human language and machine understanding: - **Before**: Text was just strings of characters - **After**: Text becomes structured numerical sequences for neural networks diff --git a/modules/11_embeddings/embeddings_dev.py b/modules/11_embeddings/embeddings_dev.py index 82ad6f5f..0d3e818c 100644 --- a/modules/11_embeddings/embeddings_dev.py +++ b/modules/11_embeddings/embeddings_dev.py @@ -21,7 +21,7 @@ Welcome to the Embeddings module! You'll implement the systems that convert disc - Framework connection: See how your implementations match PyTorch's embedding systems - Performance insight: Learn how embedding lookup patterns affect cache efficiency and memory bandwidth -## Build → Use → Reflect +## Build -> Use -> Reflect 1. **Build**: Embedding layer with lookup table and positional encoding systems 2. **Use**: Transform token sequences into rich vector representations for language processing 3. **Reflect**: How do embedding choices determine model capacity and computational efficiency?
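Reviewer note (illustrative, not part of the patch): the vocabulary-size versus sequence-length trade-off that both modules keep returning to can be checked with a few lines of arithmetic. This is a minimal sketch that assumes float32 storage (4 bytes per parameter), following the `vocab_size * embed_dim * 4 bytes` formula in the tokenization module. The helper names `embedding_table_bytes` and `attention_score_entries` are hypothetical and do not exist in TinyTorch.

```python
def embedding_table_bytes(vocab_size: int, embed_dim: int, bytes_per_param: int = 4) -> int:
    """Bytes needed for a dense embedding table (float32 by default)."""
    return vocab_size * embed_dim * bytes_per_param


def attention_score_entries(seq_len: int) -> int:
    """Entries in one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


if __name__ == "__main__":
    # Vocabulary sizes and embed_dim=512 follow the figures quoted in the module text.
    for name, vocab_size in [("character (~256)", 256), ("BPE (~50k)", 50_000)]:
        megabytes = embedding_table_bytes(vocab_size, 512) / 1e6
        print(f"{name:<18} embedding table: {megabytes:8.2f} MB")

    # "The tokenization process": 24 character tokens vs 4 subword tokens.
    for name, seq_len in [("character", 24), ("BPE", 4)]:
        print(f"{name:<18} attention entries: {attention_score_entries(seq_len)}")
```

Under these assumptions the larger vocabulary costs far more embedding memory, while the shorter sequence shrinks the quadratic attention term, which is the trade-off the module summary describes.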
@@ -35,8 +35,8 @@ By the end of this module, you'll understand: - Connection to production systems like transformer embedding layers and their optimization techniques ## Systems Reality Check -💡 **Production Context**: Modern language models have embedding tables with billions of parameters (GPT-3: 50k vocab × 12k dim = 600M embedding params) -⚡ **Performance Note**: Embedding lookups are memory-bandwidth bound - efficient access patterns are critical for high-throughput training +TIP **Production Context**: Modern language models have embedding tables with billions of parameters (GPT-3: 50k vocab * 12k dim = 600M embedding params) +SPEED **Performance Note**: Embedding lookups are memory-bandwidth bound - efficient access patterns are critical for high-throughput training """ # %% nbgrader={"grade": false, "grade_id": "embeddings-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -75,13 +75,13 @@ except ImportError: self.vocab_size = vocab_size # %% nbgrader={"grade": false, "grade_id": "embeddings-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🎯 TinyTorch Embeddings Module") +print("TARGET TinyTorch Embeddings Module") print(f"NumPy version: {np.__version__}") print("Ready to build embedding systems!") # %% [markdown] """ -## 📦 Where This Code Lives in the Final Package +## PACKAGE Where This Code Lives in the Final Package **Learning Side:** You work in `modules/source/12_embeddings/embeddings_dev.py` **Building Side:** Code exports to `tinytorch.core.embeddings` @@ -109,12 +109,12 @@ Tokens are discrete symbols, but neural networks work best with continuous vecto ``` Discrete Token Transformation: - Token ID → Dense Vector Representation - 42 → [0.1, -0.3, 0.8, 0.2, ...] + Token ID -> Dense Vector Representation + 42 -> [0.1, -0.3, 0.8, 0.2, ...] Visualization: Sparse One-Hot Dense Embedding - [0,0,0,1,0,...]
-> [0.1,-0.3,0.8,0.2] 100,000 dims 512 dims ``` @@ -123,36 +123,36 @@ An embedding layer is essentially a learnable lookup table: ``` Embedding Table Memory Layout: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Embedding Weight Matrix โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ Token 0: [0.1, -0.2, 0.3, ...] โ”‚ โ† "" token -โ”‚ Token 1: [0.4, 0.1, -0.5, ...] โ”‚ โ† "" token -โ”‚ Token 2: [-0.1, 0.8, 0.2, ...] โ”‚ โ† "the" token -โ”‚ Token 3: [0.7, -0.3, 0.1, ...] โ”‚ โ† "and" token -โ”‚ ... โ”‚ -โ”‚ Token N: [0.2, 0.5, -0.7, ...] โ”‚ โ† Final token -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†‘ โ†‘ ++-------------------------------------+ +| Embedding Weight Matrix | ++-------------------------------------โ”ค +| Token 0: [0.1, -0.2, 0.3, ...] | <- "" token +| Token 1: [0.4, 0.1, -0.5, ...] | <- "" token +| Token 2: [-0.1, 0.8, 0.2, ...] | <- "the" token +| Token 3: [0.7, -0.3, 0.1, ...] | <- "and" token +| ... | +| Token N: [0.2, 0.5, -0.7, ...] | <- Final token ++-------------------------------------+ + ^ ^ vocab_size embedding_dim -Example: 50,000 ร— 512 = 25.6M parameters = 102.4MB (float32) +Example: 50,000 * 512 = 25.6M parameters = 102.4MB (float32) ``` ### Embedding Lookup Process ``` Lookup Operation Flow: Token IDs: [42, 17, 8] (Input sequence) - โ†“ Advanced Indexing - Embedding Table[42] โ†’ [0.1, -0.3, 0.8, ...] - Embedding Table[17] โ†’ [0.4, 0.1, -0.5, ...] - Embedding Table[8] โ†’ [-0.1, 0.8, 0.2, ...] - โ†“ Stack Results - Output: [[0.1, -0.3, 0.8, ...], โ† Token 42 embedding - [0.4, 0.1, -0.5, ...], โ† Token 17 embedding - [-0.1, 0.8, 0.2, ...]] โ† Token 8 embedding + v Advanced Indexing + Embedding Table[42] -> [0.1, -0.3, 0.8, ...] + Embedding Table[17] -> [0.4, 0.1, -0.5, ...] 
+ Embedding Table[8] -> [-0.1, 0.8, 0.2, ...] + v Stack Results + Output: [[0.1, -0.3, 0.8, ...], <- Token 42 embedding + [0.4, 0.1, -0.5, ...], <- Token 17 embedding + [-0.1, 0.8, 0.2, ...]] <- Token 8 embedding -Complexity: O(seq_length) lookups, O(seq_length ร— embed_dim) memory +Complexity: O(seq_length) lookups, O(seq_length * embed_dim) memory ``` ### Why Embeddings Work @@ -167,12 +167,12 @@ Since transformers lack inherent position awareness, we add positional informati ``` Position-Aware Embedding Creation: Token Embedding + Positional Encoding = Final Representation - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚[0.1,-0.3,0.8]โ”‚ + โ”‚[0.0, 1.0,0.0]โ”‚ = โ”‚[0.1, 0.7,0.8]โ”‚ โ† Pos 0 - โ”‚[0.4, 0.1,-0.5]โ”‚ + โ”‚[0.1, 0.9,0.1]โ”‚ = โ”‚[0.5, 1.0,-0.4]โ”‚ โ† Pos 1 - โ”‚[-0.1,0.8, 0.2]โ”‚ + โ”‚[0.2, 0.8,0.2]โ”‚ = โ”‚[0.1, 1.6, 0.4]โ”‚ โ† Pos 2 - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†‘ โ†‘ โ†‘ + +-------------+ +-------------+ +-------------+ + |[0.1,-0.3,0.8]| + |[0.0, 1.0,0.0]| = |[0.1, 0.7,0.8]| <- Pos 0 + |[0.4, 0.1,-0.5]| + |[0.1, 0.9,0.1]| = |[0.5, 1.0,-0.4]| <- Pos 1 + |[-0.1,0.8, 0.2]| + |[0.2, 0.8,0.2]| = |[0.1, 1.6, 0.4]| <- Pos 2 + +-------------+ +-------------+ +-------------+ + ^ ^ ^ Content Info Position Info Complete Context ``` @@ -193,19 +193,19 @@ Let's start with the core embedding layer - a learnable lookup table that conver ``` Embedding Layer Architecture: Input: Token IDs [batch_size, seq_length] - โ†“ Index into weight matrix + v Index into weight matrix Weight Matrix: [vocab_size, embedding_dim] - โ†“ Advanced indexing: weight[input_ids] + v Advanced indexing: weight[input_ids] Output: Embeddings [batch_size, seq_length, embedding_dim] Memory Layout: 
-โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Embedding Weight Matrix โ”‚ โ† Main parameter storage -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ Input Token IDs (integers) โ”‚ โ† Temporary during forward -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ Output Embeddings (float32) โ”‚ โ† Result tensor -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++--------------------------------------+ +| Embedding Weight Matrix | <- Main parameter storage ++--------------------------------------โ”ค +| Input Token IDs (integers) | <- Temporary during forward ++--------------------------------------โ”ค +| Output Embeddings (float32) | <- Result tensor ++--------------------------------------+ Operation: O(1) lookup per token, O(seq_length) total ``` @@ -355,7 +355,7 @@ class Embedding: # %% [markdown] """ -### ๐Ÿงช Test Your Embedding Layer Implementation +### TEST Test Your Embedding Layer Implementation Once you implement the Embedding forward method above, run this cell to test it: """ @@ -417,10 +417,10 @@ def test_unit_embedding_layer(): assert 'total_memory_mb' in memory_stats, "Should provide memory statistics" assert memory_stats['total_parameters'] == vocab_size * embedding_dim, "Should calculate parameters correctly" - print("โœ… Embedding layer tests passed!") - print(f"โœ… Handles various input shapes correctly") - print(f"โœ… Consistent lookup and parameter tracking") - print(f"โœ… Memory usage: {memory_stats['total_memory_mb']:.2f}MB") + print("PASS Embedding layer tests passed!") + print(f"PASS Handles various input shapes correctly") + print(f"PASS Consistent lookup and parameter tracking") + print(f"PASS Memory usage: 
{memory_stats['total_memory_mb']:.2f}MB") # Test function defined (called in main block) @@ -433,18 +433,18 @@ Transformers need explicit position information since attention is position-agno ### Sinusoidal Positional Encoding Visualization ``` Mathematical Foundation: - PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) โ† Even dimensions - PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) โ† Odd dimensions + PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) <- Even dimensions + PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) <- Odd dimensions Frequency Pattern: - Position โ†’ 0 1 2 3 4 ... - Dim 0: [sin] [sin] [sin] [sin] [sin] ... โ† High frequency - Dim 1: [cos] [cos] [cos] [cos] [cos] ... โ† High frequency - Dim 2: [sin] [sin] [sin] [sin] [sin] ... โ† Med frequency - Dim 3: [cos] [cos] [cos] [cos] [cos] ... โ† Med frequency + Position -> 0 1 2 3 4 ... + Dim 0: [sin] [sin] [sin] [sin] [sin] ... <- High frequency + Dim 1: [cos] [cos] [cos] [cos] [cos] ... <- High frequency + Dim 2: [sin] [sin] [sin] [sin] [sin] ... <- Med frequency + Dim 3: [cos] [cos] [cos] [cos] [cos] ... <- Med frequency ... ... ... ... ... ... - Dim n-2: [sin] [sin] [sin] [sin] [sin] ... โ† Low frequency - Dim n-1: [cos] [cos] [cos] [cos] [cos] ... โ† Low frequency + Dim n-2: [sin] [sin] [sin] [sin] [sin] ... <- Low frequency + Dim n-1: [cos] [cos] [cos] [cos] [cos] ... 
<- Low frequency Why This Works: - Each position gets unique encoding across all dimensions @@ -456,19 +456,19 @@ Why This Works: ### Position Encoding Memory Layout ``` Precomputed Position Matrix: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Position Encoding Matrix โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ Pos 0: [0.00, 1.00, 0.00, 1.00...]โ”‚ โ† sin(0), cos(0), sin(0), cos(0) -โ”‚ Pos 1: [0.84, 0.54, 0.10, 0.99...]โ”‚ โ† sin(1), cos(1), sin(f1), cos(f1) -โ”‚ Pos 2: [0.91,-0.42, 0.20, 0.98...]โ”‚ โ† sin(2), cos(2), sin(f2), cos(f2) -โ”‚ Pos 3: [0.14,-0.99, 0.30, 0.95...]โ”‚ โ† sin(3), cos(3), sin(f3), cos(f3) -โ”‚ ... โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†‘ โ†‘ ++-------------------------------------+ +| Position Encoding Matrix | ++-------------------------------------โ”ค +| Pos 0: [0.00, 1.00, 0.00, 1.00...]| <- sin(0), cos(0), sin(0), cos(0) +| Pos 1: [0.84, 0.54, 0.10, 0.99...]| <- sin(1), cos(1), sin(f1), cos(f1) +| Pos 2: [0.91,-0.42, 0.20, 0.98...]| <- sin(2), cos(2), sin(f2), cos(f2) +| Pos 3: [0.14,-0.99, 0.30, 0.95...]| <- sin(3), cos(3), sin(f3), cos(f3) +| ... 
| ++-------------------------------------+ + ^ ^ max_seq_length embedding_dim -Memory: max_seq_length ร— embedding_dim ร— 4 bytes (precomputed) +Memory: max_seq_length * embedding_dim * 4 bytes (precomputed) ``` """ @@ -630,7 +630,7 @@ class PositionalEncoding: print() # Show frequency analysis - print(f"\n๐Ÿ“ˆ FREQUENCY ANALYSIS:") + print(f"\nPROGRESS FREQUENCY ANALYSIS:") print("Even dimensions (sine): Lower frequencies for early dimensions") print("Odd dimensions (cosine): Same frequencies, phase-shifted") @@ -641,7 +641,7 @@ class PositionalEncoding: # %% [markdown] """ -### ๐Ÿงช Test Your Positional Encoding Implementation +### TEST Test Your Positional Encoding Implementation Once you implement the PositionalEncoding methods above, run this cell to test it: """ @@ -713,10 +713,10 @@ def test_unit_positional_encoding(): pos_embeddings_callable = pos_enc(embeddings) assert np.allclose(pos_embeddings_callable.data, pos_embeddings.data), "Callable interface should work" - print("โœ… Positional encoding tests passed!") - print(f"โœ… Handles 2D and 3D inputs correctly") - print(f"โœ… Proper validation and deterministic behavior") - print(f"โœ… Encoding dimension: {embedding_dim}, Max length: {max_seq_length}") + print("PASS Positional encoding tests passed!") + print(f"PASS Handles 2D and 3D inputs correctly") + print(f"PASS Proper validation and deterministic behavior") + print(f"PASS Encoding dimension: {embedding_dim}, Max length: {max_seq_length}") # Test function defined (called in main block) @@ -729,15 +729,15 @@ Some models use learned positional embeddings instead of fixed sinusoidal ones. 
### Learned vs Sinusoidal Comparison ``` Sinusoidal Positional Encoding: - โœ“ Zero parameters (deterministic computation) - โœ“ Can extrapolate to longer sequences - โœ“ Mathematical guarantees about relative positions + OK Zero parameters (deterministic computation) + OK Can extrapolate to longer sequences + OK Mathematical guarantees about relative positions โœ— Fixed pattern - cannot adapt to task Learned Positional Embeddings: - โœ“ Learnable parameters (adapts to task/data) - โœ“ Can capture task-specific positional patterns - โœ— Requires additional parameters (max_seq_len ร— embed_dim) + OK Learnable parameters (adapts to task/data) + OK Can capture task-specific positional patterns + โœ— Requires additional parameters (max_seq_len * embed_dim) โœ— Cannot extrapolate beyond training sequence length โœ— Needs sufficient training data to learn good positions ``` @@ -746,16 +746,16 @@ Learned Positional Embeddings: ``` Learned Position System: Position IDs: [0, 1, 2, 3, ...] - โ†“ Embedding lookup (just like token embeddings) + v Embedding lookup (just like token embeddings) Position Table: [max_seq_length, embedding_dim] - โ†“ Standard embedding lookup + v Standard embedding lookup Position Embeddings: [seq_length, embedding_dim] - โ†“ Add to token embeddings + v Add to token embeddings Final Representation: Token + Position information This is essentially two embedding tables: - - Token Embedding: token_id โ†’ content vector - - Position Embedding: position_id โ†’ position vector + - Token Embedding: token_id -> content vector + - Position Embedding: position_id -> position vector ``` """ @@ -867,7 +867,7 @@ class LearnedPositionalEmbedding: # %% [markdown] """ -### ๐Ÿงช Test Your Learned Positional Embedding Implementation +### TEST Test Your Learned Positional Embedding Implementation Once you implement the LearnedPositionalEmbedding methods above, run this cell to test it: """ @@ -944,20 +944,20 @@ def test_unit_learned_positional_embedding(): 
pos_embeddings_callable = learned_pos(embeddings) assert np.allclose(pos_embeddings_callable.data, pos_embeddings.data), "Callable interface should work" - print("โœ… Learned positional embedding tests passed!") - print(f"โœ… Parameter tracking and optimization ready") - print(f"โœ… Handles various input shapes correctly") - print(f"โœ… Max sequence length: {max_seq_length}, Embedding dim: {embedding_dim}") + print("PASS Learned positional embedding tests passed!") + print(f"PASS Parameter tracking and optimization ready") + print(f"PASS Handles various input shapes correctly") + print(f"PASS Max sequence length: {max_seq_length}, Embedding dim: {embedding_dim}") # Test function defined (called in main block) -# โœ… IMPLEMENTATION CHECKPOINT: Ensure all embedding components are complete before analysis +# PASS IMPLEMENTATION CHECKPOINT: Ensure all embedding components are complete before analysis -# ๐Ÿค” PREDICTION: How does embedding table memory scale with vocabulary size and dimension? +# THINK PREDICTION: How does embedding table memory scale with vocabulary size and dimension? # Linear with vocab_size? Linear with embedding_dim? Quadratic with both? # Your prediction: _______ -# ๐Ÿ” SYSTEMS INSIGHT #1: Embedding Memory Scaling Analysis +# MAGNIFY SYSTEMS INSIGHT #1: Embedding Memory Scaling Analysis def analyze_embedding_memory_scaling(): """Analyze how embedding memory scales with vocabulary and dimension parameters.""" try: @@ -994,31 +994,31 @@ def analyze_embedding_memory_scaling(): print(f"{vocab_size:<12,} {embed_dim:<10} {params:<12,} {memory_mb:<12.1f} {lookup_time:<12.2f}") - # ๐Ÿ’ก WHY THIS MATTERS: GPT-3 has 50k vocab ร— 12k dim = 600M embedding parameters! + # TIP WHY THIS MATTERS: GPT-3 has 50k vocab * 12k dim = 600M embedding parameters! 
# That's 2.4GB just for the embedding table (before any other model weights) - print("\n๐Ÿ’ก SCALING INSIGHTS:") + print("\nTIP SCALING INSIGHTS:") print(" - Memory scales linearly with both vocab_size AND embedding_dim") print(" - Lookup time is dominated by memory bandwidth, not computation") print(" - Large models spend significant memory on embeddings alone") except Exception as e: - print(f"โš ๏ธ Error in memory scaling analysis: {e}") + print(f"WARNING๏ธ Error in memory scaling analysis: {e}") print("Make sure your Embedding class is implemented correctly") analyze_embedding_memory_scaling() -# โœ… IMPLEMENTATION CHECKPOINT: Ensure positional encoding works before analysis +# PASS IMPLEMENTATION CHECKPOINT: Ensure positional encoding works before analysis -# ๐Ÿค” PREDICTION: Which positional encoding uses more memory - sinusoidal or learned? +# THINK PREDICTION: Which positional encoding uses more memory - sinusoidal or learned? # Which can handle longer sequences? Your answer: _______ -# ๐Ÿ” SYSTEMS INSIGHT #2: Positional Encoding Trade-offs +# MAGNIFY SYSTEMS INSIGHT #2: Positional Encoding Trade-offs def analyze_positional_encoding_tradeoffs(): """Compare memory and performance characteristics of different positional encodings.""" try: import time - print("\n๐Ÿ” POSITIONAL ENCODING COMPARISON") + print("\nMAGNIFY POSITIONAL ENCODING COMPARISON") print("=" * 50) embedding_dim = 512 @@ -1056,30 +1056,30 @@ def analyze_positional_encoding_tradeoffs(): print(f"{seq_len:<8} {'Learned':<12} {learned_time:<10.2f} {learned_memory:<12.1f} {learned_params:<12,}") print() - # ๐Ÿ’ก WHY THIS MATTERS: Choice affects model size and sequence length flexibility - print("๐Ÿ’ก TRADE-OFF INSIGHTS:") + # TIP WHY THIS MATTERS: Choice affects model size and sequence length flexibility + print("TIP TRADE-OFF INSIGHTS:") print(" - Sinusoidal: 0 parameters, can extrapolate to any length") print(" - Learned: Many parameters, limited to training sequence length") print(" - Modern 
models often use learned for better task adaptation") except Exception as e: - print(f"โš ๏ธ Error in positional encoding analysis: {e}") + print(f"WARNING๏ธ Error in positional encoding analysis: {e}") print("Make sure both positional encoding classes are implemented") analyze_positional_encoding_tradeoffs() -# โœ… IMPLEMENTATION CHECKPOINT: Ensure full embedding pipeline works +# PASS IMPLEMENTATION CHECKPOINT: Ensure full embedding pipeline works -# ๐Ÿค” PREDICTION: What's the bottleneck in embedding pipelines - computation or memory? +# THINK PREDICTION: What's the bottleneck in embedding pipelines - computation or memory? # How does batch size affect throughput? Your prediction: _______ -# ๐Ÿ” SYSTEMS INSIGHT #3: Embedding Pipeline Performance +# MAGNIFY SYSTEMS INSIGHT #3: Embedding Pipeline Performance def analyze_embedding_pipeline_performance(): """Analyze performance characteristics of the complete embedding pipeline.""" try: import time - print("\nโšก EMBEDDING PIPELINE PERFORMANCE") + print("\nSPEED EMBEDDING PIPELINE PERFORMANCE") print("=" * 50) # Create pipeline components @@ -1124,22 +1124,22 @@ def analyze_embedding_pipeline_performance(): print(f"{batch_size:<6} {seq_length:<8} {total_tokens:<12,} {pipeline_time:<10.2f} {tokens_per_sec:<12,.0f} {memory_mb:<12.1f}") - # ๐Ÿ’ก WHY THIS MATTERS: Understanding pipeline bottlenecks for production deployment - print("\n๐Ÿ’ก PIPELINE INSIGHTS:") + # TIP WHY THIS MATTERS: Understanding pipeline bottlenecks for production deployment + print("\nTIP PIPELINE INSIGHTS:") print(" - Embedding lookup is memory-bandwidth bound (not compute bound)") print(" - Larger batches improve throughput due to better memory utilization") print(" - Sequence length affects memory linearly, performance sublinearly") print(" - Production systems optimize with: embedding caching, mixed precision, etc.") except Exception as e: - print(f"โš ๏ธ Error in pipeline analysis: {e}") + print(f"WARNING๏ธ Error in pipeline analysis: 
{e}") print("Make sure your full embedding pipeline is working") analyze_embedding_pipeline_performance() # %% [markdown] """ -## ๐ŸŽฏ ML Systems: Performance Analysis & Embedding Scaling +## TARGET ML Systems: Performance Analysis & Embedding Scaling Now let's develop systems engineering skills by analyzing embedding performance and understanding how embedding choices affect downstream ML system efficiency. @@ -1271,7 +1271,7 @@ class EmbeddingProfiler: print(f"{vocab_size:<12,} {embed_dim:<10} {total_params:<12,} {total_memory_mb:<12.2f} {lookup_time_ms:<12.2f}") # Analyze scaling patterns - print(f"\n๐Ÿ“ˆ SCALING INSIGHTS:") + print(f"\nPROGRESS SCALING INSIGHTS:") if len(vocab_sizes) > 1 and len(embedding_dims) > 1: # Compare scaling with vocab size (fixed embedding dim) fixed_dim = embedding_dims[0] @@ -1284,7 +1284,7 @@ class EmbeddingProfiler: if small_key in scaling_results and large_key in scaling_results: vocab_ratio = large_vocab / small_vocab memory_ratio = scaling_results[large_key]['memory_mb'] / scaling_results[small_key]['memory_mb'] - print(f" Vocabulary scaling: {vocab_ratio:.1f}x vocab โ†’ {memory_ratio:.1f}x memory (Linear)") + print(f" Vocabulary scaling: {vocab_ratio:.1f}x vocab -> {memory_ratio:.1f}x memory (Linear)") # Compare scaling with embedding dim (fixed vocab) fixed_vocab = vocab_sizes[0] @@ -1297,7 +1297,7 @@ class EmbeddingProfiler: if small_key in scaling_results and large_key in scaling_results: dim_ratio = large_dim / small_dim memory_ratio = scaling_results[large_key]['memory_mb'] / scaling_results[small_key]['memory_mb'] - print(f" Dimension scaling: {dim_ratio:.1f}x dim โ†’ {memory_ratio:.1f}x memory (Linear)") + print(f" Dimension scaling: {dim_ratio:.1f}x dim -> {memory_ratio:.1f}x memory (Linear)") return scaling_results @@ -1307,7 +1307,7 @@ class EmbeddingProfiler: This function is PROVIDED to show positional encoding comparison. 
""" - print(f"\n๐Ÿ” POSITIONAL ENCODING COMPARISON") + print(f"\nMAGNIFY POSITIONAL ENCODING COMPARISON") print("=" * 50) # Create test embeddings @@ -1353,7 +1353,7 @@ class EmbeddingProfiler: print(f"{'Sinusoidal':<12} {sin_time:<10.2f} {sin_memory:<12.2f} {0:<12,} {'Good'}") print(f"{'Learned':<12} {learned_time:<10.2f} {learned_memory:<12.2f} {results['learned']['parameters']:<12,} {'Limited'}") - print(f"\n๐Ÿ’ก INSIGHTS:") + print(f"\nTIP INSIGHTS:") print(f" - Sinusoidal: Zero parameters, deterministic, good extrapolation") print(f" - Learned: Requires parameters, model-specific, limited extrapolation") print(f" - Choice depends on: model capacity, sequence length requirements, extrapolation needs") @@ -1390,7 +1390,7 @@ def analyze_embedding_system_design(): print(f"{config['name']:<12} {config['vocab_size']:<10,} {config['embed_dim']:<10} " f"{config['seq_length']:<8} {embed_params:<12,} {embed_memory_mb:<10.1f}") - print(f"\n๐ŸŽฏ DESIGN TRADE-OFFS:") + print(f"\nTARGET DESIGN TRADE-OFFS:") print(f" 1. Vocabulary Size:") print(f" - Larger vocab: Better text coverage, more parameters") print(f" - Smaller vocab: Longer sequences, more compute") @@ -1401,8 +1401,8 @@ def analyze_embedding_system_design(): print(f" - Sinusoidal: No parameters, good extrapolation") print(f" - Learned: Model-specific, limited to training length") print(f" 4. 
Memory Scaling:") - print(f" - Embedding table: O(vocab_size ร— embed_dim)") - print(f" - Sequence processing: O(batch_size ร— seq_length ร— embed_dim)") + print(f" - Embedding table: O(vocab_size * embed_dim)") + print(f" - Sequence processing: O(batch_size * seq_length * embed_dim)") print(f" - Total memory dominated by model size, not embedding table") print(f"\n๐Ÿญ PRODUCTION CONSIDERATIONS:") @@ -1413,7 +1413,7 @@ def analyze_embedding_system_design(): # %% [markdown] """ -### ๐Ÿงช Test: Embedding Performance Analysis +### TEST Test: Embedding Performance Analysis Let's test our embedding profiler with realistic performance scenarios. """ @@ -1453,7 +1453,7 @@ def test_embedding_profiler(): assert metrics['lookup_time_ms'] >= 0, "Time should be non-negative" assert metrics['tokens_per_second'] >= 0, "Throughput should be non-negative" - print("โœ… Lookup performance measurement test passed") + print("PASS Lookup performance measurement test passed") # Test memory scaling analysis vocab_sizes = [500, 1000] @@ -1471,7 +1471,7 @@ def test_embedding_profiler(): assert metrics['total_parameters'] > 0, "Should have parameters" assert metrics['memory_mb'] > 0, "Should use memory" - print("โœ… Memory scaling analysis test passed") + print("PASS Memory scaling analysis test passed") # Test positional encoding comparison comparison_results = profiler.compare_positional_encodings(seq_length=50, embedding_dim=64) @@ -1485,8 +1485,8 @@ def test_embedding_profiler(): assert 'memory_usage_mb' in metrics, "Should measure memory usage" assert 'parameters' in metrics, "Should count parameters" - print("โœ… Positional encoding comparison test passed") - print("๐ŸŽฏ Embedding Profiler: All tests passed!") + print("PASS Positional encoding comparison test passed") + print("TARGET Embedding Profiler: All tests passed!") # Test function defined (called in main block) @@ -1500,7 +1500,7 @@ Let's test how all our embedding components work together in a realistic languag # %% 
nbgrader={"grade": false, "grade_id": "test-embedding-integration", "locked": false, "schema_version": 3, "solution": false, "task": false} def test_embedding_integration(): """Test complete embedding pipeline with tokenization integration.""" - print("๐Ÿงช Integration Test: Complete Embedding Pipeline...") + print("TEST Integration Test: Complete Embedding Pipeline...") # Create tokenizer tokenizer = CharTokenizer() @@ -1589,10 +1589,10 @@ def test_embedding_integration(): print(f" Embedding table memory: {total_memory_mb:.2f}MB") print(f" Sequence memory: {large_pos_embeddings.data.nbytes / (1024*1024):.2f}MB") - print("โœ… Complete embedding pipeline integration test passed!") - print(f"โœ… Tokenization โ†’ Embedding โ†’ Positional Encoding pipeline works") - print(f"โœ… Handles various batch sizes and sequence lengths") - print(f"โœ… Memory usage is reasonable for production systems") + print("PASS Complete embedding pipeline integration test passed!") + print(f"PASS Tokenization -> Embedding -> Positional Encoding pipeline works") + print(f"PASS Handles various batch sizes and sequence lengths") + print(f"PASS Memory usage is reasonable for production systems") # Test function defined (called in main block) @@ -1613,7 +1613,7 @@ if __name__ == "__main__": test_embedding_integration() print("\n" + "="*60) - print("๐Ÿ” EMBEDDING SYSTEMS ANALYSIS") + print("MAGNIFY EMBEDDING SYSTEMS ANALYSIS") print("="*60) # Performance analysis @@ -1681,14 +1681,14 @@ if __name__ == "__main__": print(f" Throughput: {(batch_size * seq_length) / total_time:.0f} tokens/second") print("\n" + "="*60) - print("๐ŸŽฏ EMBEDDINGS MODULE COMPLETE!") + print("TARGET EMBEDDINGS MODULE COMPLETE!") print("="*60) print("All embedding tests passed!") print("Ready for attention mechanism integration!") # %% [markdown] """ -## ๐Ÿค” ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Now that you've built the embedding systems that convert tokens to 
rich vector representations, let's connect this work to broader ML systems challenges. These questions help you think critically about how embedding design scales to production language processing systems. @@ -1715,7 +1715,7 @@ YOUR REFLECTION ON EMBEDDING MEMORY OPTIMIZATION: TODO: Replace this text with your thoughtful response about memory-optimized embedding system design. Consider addressing: -- How would you implement embedding compression for a 100k × 1024 vocabulary under GPU constraints? +- How would you implement embedding compression for a 100k * 1024 vocabulary under GPU constraints? - What techniques would you use to optimize lookup patterns for high-throughput training? - How would you design dynamic vocabulary expansion while maintaining memory efficiency? - What trade-offs would you make between embedding quality and memory footprint? @@ -1741,7 +1741,7 @@ GRADING RUBRIC (Instructor Use): """ ### Question 2: Positional Encoding and Sequence Length Scalability -**Context**: Your positional encoding implementations show the trade-offs between fixed sinusoidal patterns and learned position embeddings. In your analysis, you saw that `PositionalEncoding` requires 0 parameters but `LearnedPositionalEmbedding` needs max_seq_length × embedding_dim parameters. Production language models increasingly need to handle variable sequence lengths efficiently while maintaining consistent position representations across different tasks and deployment scenarios. +**Context**: Your positional encoding implementations show the trade-offs between fixed sinusoidal patterns and learned position embeddings. In your analysis, you saw that `PositionalEncoding` requires 0 parameters but `LearnedPositionalEmbedding` needs max_seq_length * embedding_dim parameters. Production language models increasingly need to handle variable sequence lengths efficiently while maintaining consistent position representations across different tasks and deployment scenarios.
**Reflection Question**: Based on your `PositionalEncoding` and `LearnedPositionalEmbedding` implementations, architect a hybrid positional encoding system for a production transformer that efficiently handles sequences from 512 tokens to 32k tokens. How would you modify your current `forward()` methods to create a hybrid approach that combines the benefits of both systems? What changes would you make to your position computation to optimize for variable-length sequences, and how would you extend your positional encoding comparison analysis to measure performance across different sequence length distributions? @@ -1785,7 +1785,7 @@ GRADING RUBRIC (Instructor Use): **Context**: Your embedding pipeline integration demonstrates how tokenization, embedding lookup, and positional encoding work together in language model preprocessing. In your `test_embedding_integration()` function, you measured pipeline performance and saw how batch size affects throughput. In production training systems, the embedding pipeline often becomes a bottleneck due to memory bandwidth limitations and the need to process billions of tokens efficiently during training. -**Reflection Question**: Based on your complete embedding pipeline implementation (tokenization → `Embedding.forward()` → `PositionalEncoding.forward()`), design an optimization strategy for large-scale language model training that processes 1 trillion tokens efficiently. How would you modify your current pipeline functions to implement batch processing optimizations for mixed sequence lengths, design efficient gradient updates for your massive `Embedding.weight` parameters, and coordinate embedding updates across distributed training nodes? Consider how your current memory analysis and performance measurement techniques could be extended to monitor pipeline bottlenecks in distributed settings.
+**Reflection Question**: Based on your complete embedding pipeline implementation (tokenization -> `Embedding.forward()` -> `PositionalEncoding.forward()`), design an optimization strategy for large-scale language model training that processes 1 trillion tokens efficiently. How would you modify your current pipeline functions to implement batch processing optimizations for mixed sequence lengths, design efficient gradient updates for your massive `Embedding.weight` parameters, and coordinate embedding updates across distributed training nodes? Consider how your current memory analysis and performance measurement techniques could be extended to monitor pipeline bottlenecks in distributed settings. Think about: optimizing your current pipeline implementation, extending your performance analysis to distributed settings, modifying your batch processing patterns, and scaling your embedding weight update mechanisms. @@ -1823,54 +1823,54 @@ GRADING RUBRIC (Instructor Use): # %% [markdown] """ -## 🎯 MODULE SUMMARY: Embeddings +## TARGET MODULE SUMMARY: Embeddings Congratulations!
You have successfully implemented comprehensive embedding systems for language processing: -### ✅ What You Have Built +### PASS What You Have Built - **Embedding Layer**: Learnable lookup table converting tokens to dense vector representations - **Positional Encoding**: Sinusoidal position information for sequence understanding - **Learned Positional Embeddings**: Trainable position representations for model-specific optimization - **Memory-Efficient Lookups**: Optimized embedding access patterns for production systems - **Performance Analysis**: Comprehensive profiling and scaling analysis tools -- **🆕 Integration Pipeline**: Complete tokenization → embedding → positional encoding workflow +- **🆕 Integration Pipeline**: Complete tokenization -> embedding -> positional encoding workflow - **🆕 Systems Optimization**: Memory usage analysis and performance optimization techniques -### ✅ Key Learning Outcomes +### PASS Key Learning Outcomes - **Understanding**: How discrete tokens become continuous vector representations - **Implementation**: Built embedding systems from scratch with efficient lookup operations - **Systems Insight**: How embedding table size affects model memory and training efficiency - **Performance Engineering**: Measured and optimized embedding lookup patterns and memory usage - **Production Context**: Understanding real-world embedding challenges and optimization techniques -### ✅ Technical Mastery +### PASS Technical Mastery - **Embedding Lookup**: Efficient table lookup with various initialization strategies - **Positional Encoding**: Mathematical sine/cosine patterns for position representation -- **Memory Scaling**: Understanding O(vocab_size × embedding_dim) parameter scaling +- **Memory Scaling**: Understanding O(vocab_size * embedding_dim) parameter scaling - **Performance Optimization**: Cache-friendly access patterns and memory bandwidth optimization - **🆕 Integration Design**: Seamless pipeline from text processing
to vector representations -### ✅ Professional Skills Developed +### PASS Professional Skills Developed - **Systems Architecture**: Designing embedding systems for production scale - **Memory Engineering**: Optimizing large parameter tables for efficient access - **Performance Analysis**: Measuring and improving embedding pipeline throughput - **Integration Thinking**: Connecting embedding systems with tokenization and attention -### ✅ Ready for Next Steps +### PASS Ready for Next Steps Your embedding systems are now ready to power: - **Attention Mechanisms**: Processing sequence representations with attention - **Transformer Models**: Complete language model architectures - **Language Understanding**: Rich semantic representations for NLP tasks - **🧠 Sequence Processing**: Foundation for advanced sequence modeling -### 🔗 Connection to Real ML Systems +### LINK Connection to Real ML Systems Your implementations mirror production systems: - **PyTorch Embeddings**: `torch.nn.Embedding` and `torch.nn.functional.embedding` - **Transformer Models**: All modern language models use similar embedding approaches - **Production Optimizations**: Memory mapping, gradient checkpointing, and distributed embeddings - **Industry Applications**: GPT, BERT, and other transformer models rely on these foundations -### 🎯 The Power of Dense Representations +### TARGET The Power of Dense Representations You have unlocked the bridge between discrete tokens and continuous understanding: - **Before**: Tokens were sparse, discrete symbols - **After**: Tokens become rich, continuous vectors that capture semantic relationships diff --git a/modules/12_attention/attention_dev.py b/modules/12_attention/attention_dev.py index f79a4d83..593a996e 100644 --- a/modules/12_attention/attention_dev.py +++ b/modules/12_attention/attention_dev.py @@ -21,7 +21,7 @@ Welcome to the Attention module!
You'll implement the scaled dot-product attenti - Framework connection: See how your implementations match PyTorch's attention systems - Performance insight: Learn how attention patterns affect training efficiency and model capabilities -## Build → Use → Reflect +## Build -> Use -> Reflect 1. **Build**: Scaled dot-product attention and multi-head attention with masking and KV-cache 2. **Use**: Process sequences to capture dependencies between distant tokens 3. **Reflect**: How does attention's quadratic scaling determine practical limits of sequence length? @@ -35,8 +35,8 @@ By the end of this module, you'll understand: - Connection to production systems like GPT's attention layers and their optimization techniques ## Systems Reality Check -💡 **Production Context**: Attention is the memory bottleneck in transformers - GPT-3 uses 96 attention heads across 96 layers -⚡ **Performance Note**: O(N²) memory scaling means 2x sequence length = 4x attention memory - this fundamentally limits transformer sequence length +TIP **Production Context**: Attention is the memory bottleneck in transformers - GPT-3 uses 96 attention heads across 96 layers +SPEED **Performance Note**: O(N²) memory scaling means 2x sequence length = 4x attention memory - this fundamentally limits transformer sequence length """ # %% nbgrader={"grade": false, "grade_id": "attention-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -82,13 +82,13 @@ except ImportError: self.embedding_dim = embedding_dim # %% nbgrader={"grade": false, "grade_id": "attention-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("🎯 TinyTorch Attention Module") +print("TARGET TinyTorch Attention Module") print(f"NumPy version: {np.__version__}") print("Ready to build attention mechanisms!") # %% [markdown] """ -## 📦 Where This Code Lives in the Final Package +## PACKAGE Where This Code Lives in the Final Package **Learning Side:** You work
in `modules/source/13_attention/attention_dev.py` **Building Side:** Code exports to `tinytorch.core.attention` @@ -125,25 +125,25 @@ Traditional RNNs process sequences step-by-step, making it hard to capture long- Query-Key-Value Attention Visualization: Query (Q) Key (K) Value (V) - ┌─────────────┐ ┌───────────┐ ┌─────────────┐ - │ "What am I │ │ "What can │ │ "What info │ - │ looking │ │ I attend │ │ do I get │ - │ for?" │ │ to?" │ │ from it?" │ - └─────────────┘ └───────────┘ └─────────────┘ - │ │ │ - └──────┬───────┘ │ - ▼ │ - Attention │ - Scores │ - QK^T / √d_k │ - │ │ - ▼ │ - Softmax ──────────────────┘ - Weights │ - │ │ - └──────────────────────┘ - │ - ▼ + +-------------+ +-----------+ +-------------+ + | "What am I | | "What can | | "What info | + | looking | | I attend | | do I get | + | for?" | | to?" | | from it?"
| + +-------------+ +-----------+ +-------------+ + | | | + +------+-------+ | + v | + Attention | + Scores | + QK^T / sqrtd_k | + | | + v | + Softmax ------------------+ + Weights | + | | + +----------------------+ + | + v Weighted Sum (Attended Output) ``` @@ -153,11 +153,11 @@ Query-Key-Value Attention Visualization: ``` Step 1: Compute Attention Scores Q: [seq_len, d_model] @ K^T: [d_model, seq_len] - ──────────────────────────────────────────────── + ------------------------------------------------ Scores: [seq_len, seq_len] ("How much to attend?") Step 2: Scale for Numerical Stability - Scores = Scores / √d_k + Scores = Scores / sqrtd_k (Prevents saturation in softmax) Step 3: Apply Softmax @@ -173,24 +173,24 @@ Step 4: Weighted Combination ``` Input Embeddings [batch, seq_len, d_model] - │ - ┌───────┼───────┐ - │ │ │ + | + +-------+-------+ + | | | W_Q W_K W_V (Linear projections) - │ │ │ - │ Reshape to Multiple Heads - │ [batch, heads, seq_len, d_k] - │ │ │ - └───────┼───────┘ - │ + | | | + | Reshape to Multiple Heads + | [batch, heads, seq_len, d_k] + | | | + +-------+-------+ + | Scaled Dot-Product Attention (Applied to each head) - │ + | Concatenate Heads [batch, seq_len, d_model] - │ + | Linear Output Projection (W_O) - │ + | Multi-Head Output ``` @@ -216,19 +216,19 @@ Where: ``` Without Masking (Bi-directional): t1 t2 t3 t4 - t1 [A] [A] [A] [A] ← Can see all positions + t1 [A] [A] [A] [A] <- Can see all positions t2 [A] [A] [A] [A] t3 [A] [A] [A] [A] t4 [A] [A] [A] [A] With Causal Masking (Auto-regressive): t1 t2 t3 t4 - t1 [A] [-] [-] [-] ← Can only see current/past + t1 [A] [-] [-] [-] <- Can only see current/past t2 [A] [A] [-] [-] t3 [A] [A] [A] [-] t4 [A] [A] [A] [A] - [A] = Attend [-] = Masked (set to -∞) + [A] = Attend [-] = Masked (set to -inf)
``` ### Systems Trade-offs @@ -373,12 +373,12 @@ class ScaledDotProductAttention: """Make the class callable.""" return self.forward(query, key, value, mask, return_attention_weights) -# ✅ IMPLEMENTATION CHECKPOINT: Ensure your ScaledDotProductAttention is complete before running +# PASS IMPLEMENTATION CHECKPOINT: Ensure your ScaledDotProductAttention is complete before running -# 🤔 PREDICTION: How do you think attention weights will distribute? +# THINK PREDICTION: How do you think attention weights will distribute? # With random inputs: Uniform? Concentrated? Your guess: _______ -# 🔍 SYSTEMS INSIGHT #1: Attention Weight Distribution Analysis +# MAGNIFY SYSTEMS INSIGHT #1: Attention Weight Distribution Analysis def analyze_attention_distribution(): """Analyze how attention weights distribute across different scenarios.""" try: @@ -418,14 +418,14 @@ def analyze_attention_distribution(): row_sums = np.sum(weights.data, axis=-1) assert np.allclose(row_sums, 1.0), f"Attention weights should sum to 1 in {scenario_name}" - print(f"\n💡 WHY THIS MATTERS:") - print(f" - Random inputs → relatively uniform attention (high entropy)") - print(f" - Similar inputs → more concentrated attention (lower entropy)") + print(f"\nTIP WHY THIS MATTERS:") + print(f" - Random inputs -> relatively uniform attention (high entropy)") + print(f" - Similar inputs -> more concentrated attention (lower entropy)") print(f" - Extreme values can lead to attention collapse (very low entropy)") print(f" - Real language models learn meaningful attention patterns!") except Exception as e: - print(f"⚠️ Make sure ScaledDotProductAttention is implemented correctly") + print(f"WARNING️ Make sure ScaledDotProductAttention is implemented correctly") print(f"Error: {e}") # Run the analysis @@ -433,7 +433,7 @@ analyze_attention_distribution() # %% [markdown] """ -### 🧪 Test Your Scaled Dot-Product Attention Implementation +### TEST Test Your Scaled Dot-Product Attention Implementation
Once you implement the ScaledDotProductAttention forward method above, run this cell to test it: """ @@ -507,11 +507,11 @@ def test_unit_scaled_attention(): assert not np.any(np.isnan(extreme_output.data)), "Should handle extreme values without NaN" assert not np.any(np.isinf(extreme_output.data)), "Should handle extreme values without inf" - print("✅ Scaled dot-product attention tests passed!") - print(f"✅ Handles various input shapes and sequence lengths") - print(f"✅ Attention weights sum to 1 (softmax property)") - print(f"✅ Causal masking works correctly") - print(f"✅ Numerical stability with extreme values") + print("PASS Scaled dot-product attention tests passed!") + print(f"PASS Handles various input shapes and sequence lengths") + print(f"PASS Attention weights sum to 1 (softmax property)") + print(f"PASS Causal masking works correctly") + print(f"PASS Numerical stability with extreme values") # Test function defined (called in main block) @@ -712,16 +712,16 @@ class MultiHeadAttention: 'total_parameters': sum(param.data.size for param in self.parameters) } -# ✅ IMPLEMENTATION CHECKPOINT: Ensure your MultiHeadAttention is complete before running +# PASS IMPLEMENTATION CHECKPOINT: Ensure your MultiHeadAttention is complete before running -# 🤔 PREDICTION: Multi-head vs single-head - which uses more memory and why? +# THINK PREDICTION: Multi-head vs single-head - which uses more memory and why?
# Your answer: _______ -# 🔍 SYSTEMS INSIGHT #2: Multi-Head vs Single-Head Comparison +# MAGNIFY SYSTEMS INSIGHT #2: Multi-Head vs Single-Head Comparison def compare_attention_architectures(): """Compare single-head vs multi-head attention characteristics.""" try: - print("🔍 MULTI-HEAD vs SINGLE-HEAD ATTENTION COMPARISON") + print("MAGNIFY MULTI-HEAD vs SINGLE-HEAD ATTENTION COMPARISON") print("=" * 60) embed_dim = 256 @@ -760,19 +760,19 @@ def compare_attention_architectures(): f"{head_dim:<10} {attention_flops/1e6:.1f}M FLOPs") print(f"\n📊 ANALYSIS:") - print(f" Parameter Count: Constant across heads (embed_dim² × 4 matrices)") + print(f" Parameter Count: Constant across heads (embed_dim² * 4 matrices)") print(f" Head Dimension: Decreases as num_heads increases (embed_dim/num_heads)") print(f" Representation: More heads = richer, diverse attention patterns") print(f" Computation: Linear scaling with number of heads") - print(f"\n💡 WHY MULTI-HEAD WORKS:") + print(f"\nTIP WHY MULTI-HEAD WORKS:") print(f" - Different heads learn different types of relationships") print(f" - Some heads focus on syntax, others on semantics") print(f" - Parallel computation across heads") print(f" - Better representation learning without parameter increase") except Exception as e: - print(f"⚠️ Make sure MultiHeadAttention is implemented correctly") + print(f"WARNING️ Make sure MultiHeadAttention is implemented correctly") print(f"Error: {e}") # Run the comparison @@ -780,7 +780,7 @@ compare_attention_architectures() # %% [markdown] """ -### 🧪 Test Your Multi-Head Attention Implementation +### TEST Test Your Multi-Head Attention Implementation Once you implement the MultiHeadAttention methods above, run this cell to test it: """ @@ -865,11 +865,11 @@ def test_unit_multi_head_attention(): self_attn_output = mha.forward(query, query, query) assert self_attn_output.shape == expected_shape, "Self-attention should work" - print("✅ Multi-head attention tests passed!")
- print(f"✅ Handles {num_heads} heads with {mha.head_dim} dimensions each") - print(f"✅ Parameter memory: {memory_stats['total_parameter_memory_mb']:.2f}MB") - print(f"✅ Causal masking works across all heads") - print(f"✅ Self-attention capability verified") + print("PASS Multi-head attention tests passed!") + print(f"PASS Handles {num_heads} heads with {mha.head_dim} dimensions each") + print(f"PASS Parameter memory: {memory_stats['total_parameter_memory_mb']:.2f}MB") + print(f"PASS Causal masking works across all heads") + print(f"PASS Self-attention capability verified") # Test function defined (called in main block) @@ -1027,12 +1027,12 @@ class KVCache: 'cache_utilization': np.mean(self.cache_lengths / self.max_seq_length) if self.is_active else 0.0 } -# ✅ IMPLEMENTATION CHECKPOINT: Ensure your KVCache is complete before running +# PASS IMPLEMENTATION CHECKPOINT: Ensure your KVCache is complete before running -# 🤔 PREDICTION: How much memory could KV-cache save during generation? +# THINK PREDICTION: How much memory could KV-cache save during generation? # For 1000 tokens: 10%? 50%? 90%?
Your guess: _______ -# 🔍 SYSTEMS INSIGHT #3: KV-Cache Generation Efficiency Analysis +# MAGNIFY SYSTEMS INSIGHT #3: KV-Cache Generation Efficiency Analysis def analyze_kv_cache_efficiency(): """Analyze KV-cache memory and computation savings during generation.""" try: @@ -1098,7 +1098,7 @@ def analyze_kv_cache_efficiency(): print(f" Cache now contains: {total_cached} tokens") print(f" Memory used: {total_cached * embed_dim * 2 * 4 / 1024:.1f} KB") - print(f"\n💡 WHY KV-CACHE IS ESSENTIAL:") + print(f"\nTIP WHY KV-CACHE IS ESSENTIAL:") print(f" - Without cache: O(N²) computation growth per token") print(f" - With cache: O(N) computation per token") print(f" - Memory trade-off: Store K,V to avoid recomputation") @@ -1106,7 +1106,7 @@ def analyze_kv_cache_efficiency(): print(f" - Production impact: 10-100x speedup for long sequences") except Exception as e: - print(f"⚠️ Make sure KVCache is implemented correctly") + print(f"WARNING️ Make sure KVCache is implemented correctly") print(f"Error: {e}") # Run the efficiency analysis @@ -1114,7 +1114,7 @@ analyze_kv_cache_efficiency() # %% [markdown] """ -### 🧪 Test Your KV-Cache Implementation +### TEST Test Your KV-Cache Implementation Once you implement the KVCache methods above, run this cell to test it: """ @@ -1212,17 +1212,17 @@ def test_unit_kv_cache(): assert memory_stats['max_batch_size'] == max_batch_size, "Should report correct batch size" assert memory_stats['max_seq_length'] == max_seq_length, "Should report correct sequence length" - print("✅ KV-Cache tests passed!") - print(f"✅ Handles {max_batch_size} sequences of up to {max_seq_length} tokens") - print(f"✅ Memory usage: {memory_stats['total_cache_memory_mb']:.2f}MB total") - print(f"✅ Cache overflow protection works") - print(f"✅ Independent batch sequence management") + print("PASS KV-Cache tests passed!") + print(f"PASS Handles {max_batch_size} sequences of up to {max_seq_length} tokens") + print(f"PASS Memory usage:
{memory_stats['total_cache_memory_mb']:.2f}MB total") + print(f"PASS Cache overflow protection works") + print(f"PASS Independent batch sequence management") # Test function defined (called in main block) # %% [markdown] """ -## 🎯 ML Systems: Performance Analysis & Attention Scaling +## TARGET ML Systems: Performance Analysis & Attention Scaling Now let's develop systems engineering skills by analyzing attention performance and understanding how attention's quadratic scaling affects practical transformer deployment. @@ -1323,7 +1323,7 @@ class AttentionProfiler: This function is PROVIDED to show scaling pattern analysis. """ - print("📈 ATTENTION QUADRATIC SCALING ANALYSIS") + print("PROGRESS ATTENTION QUADRATIC SCALING ANALYSIS") print("=" * 60) seq_lengths = sorted(scaling_results.keys()) @@ -1370,7 +1370,7 @@ class AttentionProfiler: print(f"{length_ratio:<12.1f} {time_ratio:<12.1f} {memory_ratio:<12.1f} {theoretical_ratio:<12.1f}") # Analysis insights - print(f"\n💡 SCALING INSIGHTS:") + print(f"\nTIP SCALING INSIGHTS:") avg_memory_efficiency = np.mean([scaling_analysis[seq]['memory_ratio'] / scaling_analysis[seq]['theoretical_ratio'] for seq in seq_lengths[1:] if seq in scaling_analysis]) @@ -1387,7 +1387,7 @@ class AttentionProfiler: This function is PROVIDED to show attention type comparison.
""" - print(f"\n🔍 ATTENTION TYPE COMPARISON") + print(f"\nMAGNIFY ATTENTION TYPE COMPARISON") print("=" * 50) batch_size = 8 @@ -1429,7 +1429,7 @@ class AttentionProfiler: } # Display comparison - print(f"Test configuration: {batch_size} batch × {seq_length} seq × {embed_dim} dim") + print(f"Test configuration: {batch_size} batch * {seq_length} seq * {embed_dim} dim") print(f"{'Type':<15} {'Time (ms)':<10} {'Parameters':<12} {'Memory (MB)':<12} {'Description'}") print("-" * 70) @@ -1492,7 +1492,7 @@ class AttentionProfiler: print(f"{seq_len:<10} {no_cache_total:<14.2f} {cache_total:<16.2f} " f"{memory_savings:<10.1f}% {speedup_estimate:<10.1f}x") - print(f"\n💡 KV-CACHE INSIGHTS:") + print(f"\nTIP KV-CACHE INSIGHTS:") print(f" - Memory: Significant savings for long sequences") print(f" - Speed: Avoid recomputing K,V for all previous tokens") print(f" - Trade-off: Cache storage vs recomputation") @@ -1550,7 +1550,7 @@ def analyze_attention_system_design(): print(f"{config['name']:<12} {seq_len:<8} {config['num_heads']:<6} " f"{config['num_layers']:<7} {attention_matrix_memory_mb:<12.1f} {total_attention_memory_mb:<12.1f}") - print(f"\n🎯 KEY DESIGN IMPLICATIONS:") + print(f"\nTARGET KEY DESIGN IMPLICATIONS:") print(f" 1.
Sequence Length Scaling:") print(f" - Memory scales O(N²) with sequence length") print(f" - 2x sequence length = 4x attention memory") @@ -1573,14 +1573,14 @@ def analyze_attention_system_design(): print(f"\n🏭 OPTIMIZATION STRATEGIES:") print(f" - Flash Attention: Memory-efficient attention computation") - print(f" - Sparse Attention: Reduce O(N²) to O(N√N) or O(N log N)") + print(f" - Sparse Attention: Reduce O(N²) to O(NsqrtN) or O(N log N)") print(f" - Linear Attention: Approximate attention with linear complexity") print(f" - Sliding Window: Local attention with fixed window size") print(f" - KV-Cache: Essential for autoregressive generation") # %% [markdown] """ -### 🧪 Test: Attention Performance Analysis +### TEST Test: Attention Performance Analysis Let's test our attention profiler with realistic performance scenarios. """ @@ -1618,7 +1618,7 @@ def test_attention_profiler(): assert result['computation_time_ms'] >= 0, "Time should be non-negative" assert result['total_memory_mb'] > 0, "Memory usage should be positive" - print("✅ Scaling measurement test passed") + print("PASS Scaling measurement test passed") # Test quadratic scaling analysis scaling_analysis = profiler.analyze_quadratic_scaling(scaling_results) @@ -1633,7 +1633,7 @@ def test_attention_profiler(): assert analysis['length_ratio'] > 1, f"Length ratio should be > 1 for {seq_len}" assert analysis['theoretical_ratio'] > 1, f"Theoretical ratio should be > 1 for {seq_len}" - print("✅ Quadratic scaling analysis test passed") + print("PASS Quadratic scaling analysis test passed") # Test attention type comparison comparison_results = profiler.compare_attention_types(seq_length=64, embed_dim=128) @@ -1648,7 +1648,7 @@ def test_attention_profiler(): assert 'memory_mb' in metrics, "Should measure memory usage" assert metrics['computation_time_ms'] > 0, "Should have positive computation time" - print("✅ Attention type comparison test passed") + print("PASS Attention type comparison test
passed") # Test KV-cache benefits simulation cache_results = profiler.simulate_kv_cache_benefits([64, 128], embed_dim=128) @@ -1660,8 +1660,8 @@ def test_attention_profiler(): assert 'memory_savings_percent' in result, "Should calculate savings" assert result['memory_savings_percent'] > 0, "Should show memory savings" - print("✅ KV-cache benefits simulation test passed") - print("🎯 Attention Profiler: All tests passed!") + print("PASS KV-cache benefits simulation test passed") + print("TARGET Attention Profiler: All tests passed!") # Test function defined (called in main block) @@ -1675,7 +1675,7 @@ Let's test how all our attention components work together in a realistic transfo # %% nbgrader={"grade": false, "grade_id": "test-attention-integration", "locked": false, "schema_version": 3, "solution": false, "task": false} def test_attention_integration(): """Test complete attention pipeline with embeddings integration.""" - print("🧪 Integration Test: Complete Attention Pipeline...") + print("TEST Integration Test: Complete Attention Pipeline...") # Configuration vocab_size = 1000 @@ -1830,10 +1830,10 @@ def test_attention_integration(): print(f" Attention throughput: {throughput:.0f} tokens/second") - print("✅ Complete attention pipeline integration test passed!") - print(f"✅ Self-attention, cross-attention, and causal masking work correctly") - print(f"✅ KV-cache integration ready for autoregressive generation") - print(f"✅ Memory usage and performance characteristics measured") + print("PASS Complete attention pipeline integration test passed!") + print(f"PASS Self-attention, cross-attention, and causal masking work correctly") + print(f"PASS KV-cache integration ready for autoregressive generation") + print(f"PASS Memory usage and performance characteristics measured") # Test function defined (called in main block) @@ -1854,14 +1854,14 @@ if __name__ == "__main__": test_attention_integration() print("\n" + "="*60) - print("🔍 ATTENTION SYSTEMS
ANALYSIS") + print("MAGNIFY ATTENTION SYSTEMS ANALYSIS") print("="*60) # Performance analysis profiler = AttentionProfiler() # Test attention scaling with different sequence lengths - print("📈 ATTENTION SCALING ANALYSIS:") + print("PROGRESS ATTENTION SCALING ANALYSIS:") scaled_attention = ScaledDotProductAttention() seq_lengths = [64, 128, 256, 512] embed_dim = 256 @@ -1944,14 +1944,14 @@ if __name__ == "__main__": print(f" Memory efficiency critical for longer sequences") print("\n" + "="*60) - print("🎯 ATTENTION MODULE COMPLETE!") + print("TARGET ATTENTION MODULE COMPLETE!") print("="*60) print("All attention tests passed!") print("Ready for transformer architecture integration!") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Now that you've built the attention mechanisms that revolutionized language understanding, let's connect this work to broader ML systems challenges. These questions help you think critically about how attention's quadratic scaling affects production transformer deployment. @@ -1960,7 +1960,7 @@ Take time to reflect thoughtfully on each question - your insights will help you # %% [markdown] """ -### 🎯 Computational Assessment: Attention Complexity Analysis +### TARGET Computational Assessment: Attention Complexity Analysis **Learning Objective**: Analyze the computational and memory complexity of attention mechanisms to understand their practical limitations and optimization opportunities.
@@ -2037,13 +2037,13 @@ if 'ScaledDotProductAttention' in globals(): f"{metrics['total_flops']/1e6:<10.1f} " f"{metrics['memory_scaling_factor']:<10.1f}x") - print(f"\n💡 COMPLEXITY INSIGHTS:") + print(f"\nTIP COMPLEXITY INSIGHTS:") print(f" - Memory scales O(N²) with sequence length") print(f" - Computation scales O(N²) with sequence length") print(f" - Multi-head attention multiplies memory by number of heads") print(f" - 2x sequence length = 4x memory and computation") else: - print("⚠️ Complete attention implementations first") + print("WARNING️ Complete attention implementations first") # %% [markdown] """ @@ -2089,7 +2089,7 @@ GRADING RUBRIC (Instructor Use): # %% [markdown] """ -### 🎯 Computational Assessment: Causal Masking and Generation Patterns +### TARGET Computational Assessment: Causal Masking and Generation Patterns **Learning Objective**: Understand how causal masking enables autoregressive generation and analyze different attention masking strategies. @@ -2217,7 +2217,7 @@ if 'ScaledDotProductAttention' in globals(): f"{metrics['max_attention']:<10.4f} " f"{metrics['computation_ratio']*100:<10.1f}%") - print(f"\n💡 MASKING INSIGHTS:") + print(f"\nTIP MASKING INSIGHTS:") print(f" - Causal masking: Essential for autoregressive generation") print(f" - Local attention: Good for capturing local dependencies") print(f" - Strided attention: Balances long-range and local connections") @@ -2225,7 +2225,7 @@ if 'ScaledDotProductAttention' in globals(): else: print(masking_results['error']) else: - print("⚠️ Complete attention implementations first") + print("WARNING️ Complete attention implementations first") # %% [markdown] """ @@ -2271,7 +2271,7 @@ GRADING RUBRIC (Instructor Use): # %% [markdown] """ -### 🎯 Computational Assessment: Attention Scaling and Production Optimization +### TARGET Computational Assessment: Attention Scaling and Production Optimization **Learning Objective**: Analyze how attention scaling affects production
deployment and design optimization strategies for different use cases. @@ -2359,7 +2359,7 @@ def design_production_attention_system(): 'trade_off': 'Slight computation increase for massive memory savings' }, 'sparse_attention': { - 'memory_reduction': 'O(N√N) or O(N log N) instead of O(N²)', + 'memory_reduction': 'O(NsqrtN) or O(N log N) instead of O(N²)', 'technique': 'Local + strided + global attention patterns', 'trade_off': 'Potential quality loss vs memory/compute savings' }, @@ -2415,7 +2415,7 @@ if 'KVCache' in globals(): print(f" Attention FLOPs: {analysis['attention_flops']/1e12:.1f} TFLOPs") print(f" Memory bandwidth: {analysis['memory_bandwidth_gb_s']:.1f} GB/s") - print("\n🚀 OPTIMIZATION STRATEGIES:") + print("\nROCKET OPTIMIZATION STRATEGIES:") for strategy, details in production_design['memory_optimization'].items(): print(f"\n{strategy.replace('_', ' ').title()}:") print(f" Reduction: {details['memory_reduction']}") @@ -2430,7 +2430,7 @@ if 'KVCache' in globals(): else: print(f" {strategies}") - print("\n📈 PERFORMANCE IMPACT:") + print("\nPROGRESS PERFORMANCE IMPACT:") perf = production_design['performance_estimates'] baseline = perf['baseline_gpt_3_scale'] optimized = perf['optimized_system'] @@ -2442,7 +2442,7 @@ if 'KVCache' in globals(): print(f" Sequence length: {seq_improvement:.0f}x with sparse attention") print(f" Generation speedup: {optimized['kv_cache_speedup']}") else: - print("⚠️ Complete all attention implementations first") + print("WARNING️ Complete all attention implementations first") # %% [markdown] """ @@ -2488,11 +2488,11 @@ GRADING RUBRIC (Instructor Use): # %% [markdown] """ -## 🎯 MODULE SUMMARY: Attention +## TARGET MODULE SUMMARY: Attention Congratulations!
You have successfully implemented the attention mechanisms that revolutionized language understanding: -### ✅ What You Have Built +### PASS What You Have Built - **Scaled Dot-Product Attention**: The fundamental attention mechanism with proper masking support - **Multi-Head Attention**: Parallel attention heads for richer representation learning - **KV-Cache System**: Efficient caching for autoregressive generation workloads @@ -2501,41 +2501,41 @@ Congratulations! You have successfully implemented the attention mechanisms that - **🆕 Memory Optimization**: Understanding and measuring attention's O(N²) scaling characteristics - **🆕 Systems Integration**: Complete attention pipeline with embeddings and generation support -### ✅ Key Learning Outcomes +### PASS Key Learning Outcomes - **Understanding**: How attention enables transformers to model sequence relationships - **Implementation**: Built attention mechanisms with memory-efficient patterns and causal masking - **Systems Insight**: How attention's quadratic scaling affects model architecture and deployment - **Performance Engineering**: Measured and analyzed attention bottlenecks and optimization techniques - **Production Context**: Understanding real-world attention challenges and optimization strategies -### ✅ Technical Mastery -- **Attention Mathematics**: Attention(Q,K,V) = softmax(QK^T/√d_k)V with proper scaling +### PASS Technical Mastery +- **Attention Mathematics**: Attention(Q,K,V) = softmax(QK^T/sqrtd_k)V with proper scaling - **Multi-Head Architecture**: Parallel attention computation with head dimension management - **Causal Masking**: Autoregressive attention patterns for language generation - **Memory Scaling**: Understanding O(N²) complexity and its implications for sequence length - **🆕 KV-Cache Efficiency**: Optimizing attention computation for generation workloads -### ✅ Professional Skills Developed +### PASS Professional Skills Developed - **Systems Architecture**:
Designing attention systems for production scale and efficiency - **Memory Engineering**: Understanding and optimizing attention's memory bottlenecks - **Performance Analysis**: Measuring and improving attention computation throughput - **Integration Design**: Building attention systems that work with embeddings and transformers -### ✅ Ready for Next Steps +### PASS Ready for Next Steps Your attention systems are now ready to power: - **Transformer Blocks**: Complete transformer architectures with attention and feedforward layers - **Language Generation**: Autoregressive text generation with efficient attention patterns - **Sequence Modeling**: Advanced sequence processing for various NLP tasks - **🧠 Modern AI Systems**: Foundation for GPT, BERT, and other transformer-based models -### 🔗 Connection to Real ML Systems +### LINK Connection to Real ML Systems Your implementations mirror production systems: - **PyTorch Attention**: `torch.nn.MultiheadAttention` and `torch.nn.functional.scaled_dot_product_attention` - **Flash Attention**: Memory-efficient attention computation used in production systems - **KV-Cache Optimization**: Essential for efficient language model serving and generation - **Industry Applications**: Every modern language model relies on optimized attention mechanisms -### 🎯 The Revolution of Attention +### TARGET The Revolution of Attention You have built the mechanism that transformed AI: - **Before**: RNNs struggled with long-range dependencies and sequential computation - **After**: Attention enables parallel processing and direct long-range connections diff --git a/modules/13_transformers/transformers_dev.py b/modules/13_transformers/transformers_dev.py index 94997f91..0e59bced 100644 --- a/modules/13_transformers/transformers_dev.py +++ b/modules/13_transformers/transformers_dev.py @@ -21,7 +21,7 @@ Welcome to the Transformers module!
You'll implement complete transformer blocks - Framework connection: See how your implementations match production transformer systems - Performance insight: Learn how transformer layer memory accumulation affects model deployment -## Build → Use → Reflect +## Build -> Use -> Reflect 1. **Build**: LayerNorm, transformer blocks, and complete transformer models 2. **Use**: Process sequences through multi-layer transformer architectures 3. **Reflect**: How do transformer design choices affect scalability and training dynamics? @@ -35,8 +35,8 @@ By the end of this module, you'll understand: - Connection to production systems like GPT's transformer blocks and their optimization techniques ## Systems Reality Check -💡 **Production Context**: GPT-3 has 96 transformer layers, each with 12k-dimensional representations and complex memory management -⚡ **Performance Note**: Transformer layer memory accumulates linearly with depth - deep models require careful activation checkpointing +TIP **Production Context**: GPT-3 has 96 transformer layers, each with 12k-dimensional representations and complex memory management +SPEED **Performance Note**: Transformer layer memory accumulates linearly with depth - deep models require careful activation checkpointing """ # %% nbgrader={"grade": false, "grade_id": "transformers-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -101,7 +101,7 @@ print("Ready to build complete transformer architectures!") # %% [markdown] """ -## 📦 Where This Code Lives in the Final Package +## PACKAGE Where This Code Lives in the Final Package **Learning Side:** You work in `modules/source/14_transformers/transformers_dev.py` **Building Side:** Code exports to `tinytorch.core.transformers` @@ -129,7 +129,7 @@ Transformers revolutionized AI by replacing recurrent connections with attention **Traditional RNN/LSTM:** ``` -h₁ → h₂ → h₃ → h₄ (Sequential processing) +h₁ -> h₂ -> h₃ -> h₄ (Sequential
processing) ``` **Transformer:** @@ -148,9 +148,9 @@ Each transformer block contains: ### The Complete Architecture ``` Input Embeddings + Positional Encoding - ↓ -[Transformer Block] × N layers - ↓ + v +[Transformer Block] * N layers + v Output Layer (Language Modeling Head) ``` @@ -170,7 +170,7 @@ Layer normalization is crucial for training stable transformers. Unlike batch no # %% [markdown] """ -## 🎯 Building Transformer Components +## TARGET Building Transformer Components ### Transformer Architecture Overview @@ -179,51 +179,51 @@ Before implementing individual components, let's visualize how they fit together ``` Transformer Architecture: -┌─────────────────────────────────────────────────────┐ -│ Input Tokens │ -└─────────────────┬───────────────────────────────────┘ - │ -┌─────────────────▼───────────────────────────────────┐ -│ Token Embeddings │ -│ + Positional Encoding │ -└─────────────────┬───────────────────────────────────┘ - │ -┌─────────────────▼───────────────────────────────────┐ -│ Layer 1 │ -│ ┌─────────────────────────────────────────────┐ │ -│ │ Multi-Head Attention │ │ -│ │ ┌───────┐ ┌───────┐ ┌───────┐ │ │ -│ │ │Head 1 │ │Head 2 │ │Head n │ → Concat│ │ -│ │
└───────┘ └───────┘ └───────┘ │ │ -│ └─────────────────────────────────────────────┘ │ -│ │ │ -│ ▼ │ -│ ┌─────────────┐ │ -│ ┌────│ Add & Norm │◄────┐ │ -│ │ └─────────────┘ │ Residual │ -│ │ │ Connection │ -│ ▼ │ │ -│ ┌─────────────────────────────────┐ │ │ -│ │ Position-wise FFN │ │ │ -│ │ Linear → ReLU → Linear │ │ │ -│ └─────────────────────────────────┘ │ │ -│ │ │ │ -│ ▼ │ │ -│ ┌─────────────┐ │ │ -│ │ Add & Norm │◄──────┘ │ -│ └─────────────┘ │ -└─────────────────┬───────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────┐ - │ Layer 2, 3, ..., N │ (Same structure) - └─────────────────────────────────────┘ - │ - ▼ - ┌─────────────────────────────────────┐ - │ Output Projection │ - │ Linear(embed_dim, vocab_size) │ - └─────────────────────────────────────┘ ++-----------------------------------------------------+ +| Input Tokens | ++-----------------+-----------------------------------+ + |
++-----------------v-----------------------------------+ +| Token Embeddings | +| + Positional Encoding | ++-----------------+-----------------------------------+ + | ++-----------------v-----------------------------------+ +| Layer 1 | +| +---------------------------------------------+ | +| | Multi-Head Attention | | +| | +-------+ +-------+ +-------+ | | +| | |Head 1 | |Head 2 | |Head n | -> Concat| | +| | +-------+ +-------+ +-------+ | | +| +---------------------------------------------+ | +| | | +| v | +| +-------------+ | +| +----| Add & Norm |<----+ | +| | +-------------+ | Residual | +| | | Connection | +| v | | +| +---------------------------------+ | | +| | Position-wise FFN | | | +| | Linear -> ReLU -> Linear | | | +| +---------------------------------+ | | +| | | | +| v | | +| +-------------+ | | +| | Add & Norm |<------+ | +| +-------------+ | ++-----------------+-----------------------------------+ + | + v + +-------------------------------------+ + | Layer 2, 3, ..., N | (Same structure) + +-------------------------------------+ + | + v + +-------------------------------------+ + | Output Projection | + | Linear(embed_dim, vocab_size) | + +-------------------------------------+ ``` ### Memory Layout Visualization @@ -231,25 +231,25 @@ Transformer Architecture: ``` Transformer Memory Organization: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Model Parameters โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ Token Embeddings โ”‚ vocab ร— embed_dim โ”‚ โ† 70% of parameters -โ”‚ Position Encodings โ”‚ max_seq ร— embed_dim โ”‚ (for large vocab) -โ”‚ N ร— Transformer Layers: โ”‚ -โ”‚ โ”œ Multi-Head Attn โ”‚ 4 ร— embed_dimยฒ โ”‚ โ† 25% of parameters -โ”‚ โ”œ Feed-Forward โ”‚ 2 ร— embed_dim ร— ffn_dim โ”‚ (per layer) -โ”‚ โ”” Layer 
Norms โ”‚ 2 ร— embed_dim โ”‚ -โ”‚ Output Projection โ”‚ embed_dim ร— vocab_size โ”‚ โ† Same as embeddings -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++-------------------------------------------------+ +| Model Parameters | ++-------------------------------------------------โ”ค +| Token Embeddings | vocab * embed_dim | <- 70% of parameters +| Position Encodings | max_seq * embed_dim | (for large vocab) +| N * Transformer Layers: | +| + Multi-Head Attn | 4 * embed_dimยฒ | <- 25% of parameters +| + Feed-Forward | 2 * embed_dim * ffn_dim | (per layer) +| + Layer Norms | 2 * embed_dim | +| Output Projection | embed_dim * vocab_size | <- Same as embeddings ++-------------------------------------------------+ Activation Memory (Forward Pass): -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Input: batch ร— seq_len ร— embed_dim โ”‚ โ† Base memory unit -โ”‚ Attention Scores: batch ร— heads ร— seq ร— seq โ”‚ โ† O(seqยฒ) scaling! -โ”‚ Layer Outputs: N ร— batch ร— seq ร— embed_dim โ”‚ โ† Linear with depth -โ”‚ Gradients: 2ร— parameter memory โ”‚ โ† Training overhead -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++-------------------------------------------------+ +| Input: batch * seq_len * embed_dim | <- Base memory unit +| Attention Scores: batch * heads * seq * seq | <- O(seqยฒ) scaling! 
+| Layer Outputs: N * batch * seq * embed_dim | <- Linear with depth +| Gradients: 2* parameter memory | <- Training overhead ++-------------------------------------------------+ For GPT-3 scale (175B parameters): - Parameters: 700GB (fp32) / 350GB (fp16) @@ -293,7 +293,7 @@ class LayerNorm: EXAMPLE (LayerNorm Operation): >>> ln = LayerNorm(512) # For 512-dim embeddings - >>> x = Tensor(np.random.randn(32, 100, 512)) # batch ร— seq ร— embed + >>> x = Tensor(np.random.randn(32, 100, 512)) # batch * seq * embed >>> normalized = ln(x) >>> print(f"Mean: {normalized.data.mean(axis=-1)[0,0]:.6f}") # ~0 >>> print(f"Std: {normalized.data.std(axis=-1)[0,0]:.6f}") # ~1 @@ -462,7 +462,7 @@ class LayerNorm: # %% [markdown] """ -### ๐Ÿงช Test Your Layer Normalization Implementation +### TEST Test Your Layer Normalization Implementation Once you implement the LayerNorm methods above, run this cell to test it: """ @@ -536,11 +536,11 @@ def test_unit_layer_norm(): assert 'parameter_memory_mb' in memory_stats, "Should provide memory statistics" assert memory_stats['total_parameters'] == 2 * embed_dim, "Should count gamma and beta parameters" - print("โœ… Layer normalization tests passed!") - print(f"โœ… Properly normalizes across feature dimensions") - print(f"โœ… Handles 2D and 3D inputs correctly") - print(f"โœ… Maintains ~0 mean and ~1 variance after normalization") - print(f"โœ… Parameter memory: {memory_stats['parameter_memory_mb']:.4f}MB") + print("PASS Layer normalization tests passed!") + print(f"PASS Properly normalizes across feature dimensions") + print(f"PASS Handles 2D and 3D inputs correctly") + print(f"PASS Maintains ~0 mean and ~1 variance after normalization") + print(f"PASS Parameter memory: {memory_stats['parameter_memory_mb']:.4f}MB") # Test function defined (called in main block) @@ -556,30 +556,30 @@ Each transformer block contains a position-wise feed-forward network that applie Position-wise FFN Structure: Input: (batch, seq_len, embed_dim) - โ”‚ - โ–ผ 
-โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Linear Layer 1 โ”‚ -โ”‚ embed_dim โ†’ hidden_dim โ”‚ โ† Expansion -โ”‚ W1: (embed_dim, hidden_dim) โ”‚ (usually 4x) -โ”‚ b1: (hidden_dim,) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ ReLU โ”‚ โ† Nonlinearity -โ”‚ max(0, x) โ”‚ (makes it powerful) -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Linear Layer 2 โ”‚ -โ”‚ hidden_dim โ†’ embed_dim โ”‚ โ† Compression -โ”‚ W2: (hidden_dim, embed_dim) โ”‚ (back to original) -โ”‚ b2: (embed_dim,) โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ–ผ + | + v ++-------------------------------------------+ +| Linear Layer 1 | +| embed_dim -> hidden_dim | <- Expansion +| W1: (embed_dim, hidden_dim) | (usually 4x) +| b1: (hidden_dim,) | ++-------------------------------------------+ + | + v ++-------------------------------------------+ +| ReLU | <- Nonlinearity +| max(0, x) | (makes it powerful) ++-------------------------------------------+ + | + v ++-------------------------------------------+ +| Linear Layer 2 | +| hidden_dim -> embed_dim | <- Compression +| W2: (hidden_dim, embed_dim) | (back to original) +| b2: (embed_dim,) | ++-------------------------------------------+ + | + v Output: (batch, seq_len, embed_dim) ``` @@ -590,19 +590,19 @@ FFN Parameter Breakdown: For embed_dim=512, hidden_dim=2048: 
-โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ W1: 512 ร— 2048 = 1,048,576 parameters โ”‚ โ† 67% of FFN -โ”‚ b1: 2048 parameters โ”‚ -โ”‚ W2: 2048 ร— 512 = 1,048,576 parameters โ”‚ โ† 67% of FFN -โ”‚ b2: 512 parameters โ”‚ -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ Total: 2,099,712 parameters โ”‚ -โ”‚ Memory (fp32): 8.4 MB โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++----------------------------------------------+ +| W1: 512 * 2048 = 1,048,576 parameters | <- 67% of FFN +| b1: 2048 parameters | +| W2: 2048 * 512 = 1,048,576 parameters | <- 67% of FFN +| b2: 512 parameters | ++----------------------------------------------โ”ค +| Total: 2,099,712 parameters | +| Memory (fp32): 8.4 MB | ++----------------------------------------------+ -Scaling: Parameters โˆ embed_dim ร— hidden_dim -Typical ratio: hidden_dim = 4 ร— embed_dim -โ†’ FFN params โˆ 8 ร— embed_dimยฒ +Scaling: Parameters โˆ embed_dim * hidden_dim +Typical ratio: hidden_dim = 4 * embed_dim +-> FFN params โˆ 8 * embed_dimยฒ ``` ### Computational Pattern @@ -610,10 +610,10 @@ Typical ratio: hidden_dim = 4 ร— embed_dim ``` FFN applies the same transformation to EVERY position independently: -Position 0: [e0_0, e0_1, ..., e0_d] โ†’ FFN โ†’ [o0_0, o0_1, ..., o0_d] -Position 1: [e1_0, e1_1, ..., e1_d] โ†’ FFN โ†’ [o1_0, o1_1, ..., o1_d] +Position 0: [e0_0, e0_1, ..., e0_d] -> FFN -> [o0_0, o0_1, ..., o0_d] +Position 1: [e1_0, e1_1, ..., e1_d] -> FFN -> [o1_0, o1_1, ..., o1_d] ... ... ... ... 
-Position N: [eN_0, eN_1, ..., eN_d] → FFN → [oN_0, oN_1, ..., oN_d] +Position N: [eN_0, eN_1, ..., eN_d] -> FFN -> [oN_0, oN_1, ..., oN_d] This is why it's called "position-wise" - each position gets the same treatment! ``` @@ -643,9 +643,9 @@ class PositionwiseFeedForward: ARCHITECTURE: - Input: (batch, seq_len, embed_dim) - - Linear 1: embed_dim → hidden_dim + - Linear 1: embed_dim -> hidden_dim - ReLU activation - - Linear 2: hidden_dim → embed_dim + - Linear 2: hidden_dim -> embed_dim - Output: (batch, seq_len, embed_dim) PARAMETER INITIALIZATION: @@ -662,12 +662,12 @@ class PositionwiseFeedForward: self.dropout = dropout # Initialize weights using Xavier initialization - # W1: embed_dim → hidden_dim + # W1: embed_dim -> hidden_dim xavier_bound_1 = math.sqrt(6.0 / (embed_dim + hidden_dim)) self.w1 = Tensor(np.random.uniform(-xavier_bound_1, xavier_bound_1, (embed_dim, hidden_dim))) self.b1 = Tensor(np.zeros(hidden_dim)) - # W2: hidden_dim → embed_dim + # W2: hidden_dim -> embed_dim xavier_bound_2 = math.sqrt(6.0 / (hidden_dim + embed_dim)) self.w2 = Tensor(np.random.uniform(-xavier_bound_2, xavier_bound_2, (hidden_dim, embed_dim))) self.b2 = Tensor(np.zeros(embed_dim)) @@ -755,7 +755,7 @@ class PositionwiseFeedForward: # %% [markdown] """ -### 🧪 Test Your Feed-Forward Network Implementation +### TEST Test Your Feed-Forward Network Implementation Once you implement the PositionwiseFeedForward methods above, run this cell to test it: """ @@ -840,12 +840,12 @@ def test_unit_feed_forward(): assert memory_stats['w2_parameters'] == expected_w2_params, "Should count W2 parameters correctly" assert memory_stats['total_parameters'] == expected_total, "Should count total parameters correctly" - print("✅ Position-wise feed-forward tests passed!") - print(f"✅ Handles 2D and 3D inputs correctly") - print(f"✅ Position-wise processing verified") - print(f"✅ ReLU activation working properly") - print(f"✅ Total parameters:
{memory_stats['total_parameters']:,}") - print(f"✅ Parameter memory: {memory_stats['parameter_memory_mb']:.2f}MB") + print("PASS Position-wise feed-forward tests passed!") + print(f"PASS Handles 2D and 3D inputs correctly") + print(f"PASS Position-wise processing verified") + print(f"PASS ReLU activation working properly") + print(f"PASS Total parameters: {memory_stats['total_parameters']:,}") + print(f"PASS Parameter memory: {memory_stats['parameter_memory_mb']:.2f}MB") # Test function defined (called in main block) @@ -886,8 +886,8 @@ class TransformerBlock: 5. Set up parameter tracking from all sub-components ARCHITECTURE CHOICE: Pre-norm vs Post-norm - - Pre-norm: LayerNorm → Attention → Residual (more stable) - - Post-norm: Attention → LayerNorm → Residual (original paper) + - Pre-norm: LayerNorm -> Attention -> Residual (more stable) + - Post-norm: Attention -> LayerNorm -> Residual (original paper) Args: embed_dim: Embedding dimension @@ -1119,7 +1119,7 @@ class TransformerBlock: # %% [markdown] """ -### 🧪 Test Your Transformer Block Implementation +### TEST Test Your Transformer Block Implementation Once you implement the TransformerBlock methods above, run this cell to test it: """ @@ -1228,12 +1228,12 @@ def test_unit_transformer_block(): assert memory_stats['total_memory_mb'] > 0, "Should have positive memory usage" assert memory_stats['total_parameters'] > 0, "Should count parameters" - print("✅ Transformer block tests passed!") - print(f"✅ Pre-norm and post-norm architectures work correctly") - print(f"✅ Residual connections preserve information flow") - print(f"✅ Causal masking works across all attention heads") - print(f"✅ Total parameters: {memory_stats['total_parameters']:,}") - print(f"✅ Total memory: {memory_stats['total_memory_mb']:.2f}MB") + print("PASS Transformer block tests passed!") + print(f"PASS Pre-norm and post-norm architectures work correctly") + print(f"PASS Residual connections preserve information flow") +
print(f"PASS Causal masking works across all attention heads") + print(f"PASS Total parameters: {memory_stats['total_parameters']:,}") + print(f"PASS Total memory: {memory_stats['total_memory_mb']:.2f}MB") # Test function defined (called in main block) @@ -1570,7 +1570,7 @@ class Transformer: # %% [markdown] """ -### 🧪 Test Your Complete Transformer Implementation +### TEST Test Your Complete Transformer Implementation Once you implement the Transformer methods above, run this cell to test it: """ @@ -1699,19 +1699,19 @@ def test_unit_transformer_model(): assert memory_stats['transformer_blocks_memory_mb'] > 0, "Should have transformer block memory" assert memory_stats['lm_head_memory_mb'] > 0, "Should have language modeling head memory" - print("✅ Complete transformer model tests passed!") - print(f"✅ Forward pass produces correct logit shapes") - print(f"✅ Causal masking works across all {num_layers} layers") - print(f"✅ Text generation capability verified") - print(f"✅ Total parameters: {memory_stats['total_parameters']:,}") - print(f"✅ Total memory: {memory_stats['total_memory_mb']:.2f}MB") - print(f"✅ Pre-norm and post-norm architectures work correctly") + print("PASS Complete transformer model tests passed!") + print(f"PASS Forward pass produces correct logit shapes") + print(f"PASS Causal masking works across all {num_layers} layers") + print(f"PASS Text generation capability verified") + print(f"PASS Total parameters: {memory_stats['total_parameters']:,}") + print(f"PASS Total memory: {memory_stats['total_memory_mb']:.2f}MB") + print(f"PASS Pre-norm and post-norm architectures work correctly") # Test function defined (called in main block) # %% [markdown] """ -## 🎯 ML Systems: Performance Analysis & Transformer Scaling +## TARGET ML Systems: Performance Analysis & Transformer Scaling Now let's develop systems engineering skills by analyzing transformer performance and understanding how model depth and width affect memory usage and
computational requirements. @@ -1865,7 +1865,7 @@ class TransformerProfiler: print(f"{config_name:<15} ERROR: {str(e)[:50]}") # Analysis - print(f"\n💡 TRADE-OFF INSIGHTS:") + print(f"\nTIP TRADE-OFF INSIGHTS:") print(f" - Deeper models: Better at learning complex patterns, more sequential") print(f" - Wider models: More parallelizable, can capture diverse features") print(f" - More heads: Richer attention patterns, more computation") @@ -1951,7 +1951,7 @@ class TransformerProfiler: print(f"{size:<12} {total_params/1e6:.1f}M {training_memory_gb:.1f} {training_gpu:<12} {inference_req}") - print(f"\n📈 SCALING OBSERVATIONS:") + print(f"\nPROGRESS SCALING OBSERVATIONS:") print(f" - Model size grows super-linearly with dimension increases") print(f" - Memory requirements dominate deployment decisions") print(f" - Training requires 3-4x more memory than inference") @@ -1976,7 +1976,7 @@ def analyze_transformer_system_design(): }, 'Attention Patterns': { 'Full attention': {'complexity': 'O(N²)', 'quality': 'Best', 'scalability': 'Limited'}, - 'Sparse attention': {'complexity': 'O(N√N)', 'quality': 'Good', 'scalability': 'Better'}, + 'Sparse attention': {'complexity': 'O(NsqrtN)', 'quality': 'Good', 'scalability': 'Better'}, 'Linear attention': {'complexity': 'O(N)', 'quality': 'Reduced', 'scalability': 'Excellent'} }, 'Feed-Forward Size': { @@ -1986,7 +1986,7 @@ def analyze_transformer_system_design(): } } - print("🎯 ARCHITECTURAL DESIGN CHOICES:") + print("TARGET ARCHITECTURAL DESIGN CHOICES:") for category, choices in design_choices.items(): print(f"\n{category}:") for choice, properties in choices.items(): @@ -1996,12 +1996,12 @@ def analyze_transformer_system_design(): # Memory scaling analysis print(f"\n📊 MEMORY SCALING PATTERNS:") print(f"Component breakdown for typical transformer:") - print(f" - Token embeddings: vocab_size × embed_dim parameters") - print(f" - Position encodings: 0 parameters (sinusoidal) or seq_len × embed_dim (learned)") -
print(f" - Attention layers: 4 × embed_dim² parameters per layer") - print(f" - Feed-forward: 2 × embed_dim × hidden_dim parameters per layer") - print(f" - Layer normalization: 2 × embed_dim parameters per layer") - print(f" - Output projection: embed_dim × vocab_size parameters") + print(f" - Token embeddings: vocab_size * embed_dim parameters") + print(f" - Position encodings: 0 parameters (sinusoidal) or seq_len * embed_dim (learned)") + print(f" - Attention layers: 4 * embed_dim² parameters per layer") + print(f" - Feed-forward: 2 * embed_dim * hidden_dim parameters per layer") + print(f" - Layer normalization: 2 * embed_dim parameters per layer") + print(f" - Output projection: embed_dim * vocab_size parameters") print(f"\n🔧 OPTIMIZATION STRATEGIES:") optimization_techniques = [ @@ -2017,7 +2017,7 @@ def analyze_transformer_system_design(): for technique in optimization_techniques: print(f" - {technique}") - print(f"\n🎯 PRODUCTION DEPLOYMENT CONSIDERATIONS:") + print(f"\nTARGET PRODUCTION DEPLOYMENT CONSIDERATIONS:") deployment_factors = [ "Batch size: Larger batches improve GPU utilization but increase memory", "Sequence length: Quadratic impact on attention memory", @@ -2034,7 +2034,7 @@ def analyze_transformer_system_design(): # %% [markdown] """ -### 🧪 Test: Transformer Performance Analysis +### TEST Test: Transformer Performance Analysis Let's test our transformer profiler with realistic scaling scenarios.
""" @@ -2089,7 +2089,7 @@ def test_transformer_profiler(): assert 1.0 < param_ratio < layer_ratio * 2, f"Parameters should scale sub-linearly, got {param_ratio:.2f}" assert 1.0 < memory_ratio < layer_ratio * 2, f"Memory should scale sub-linearly, got {memory_ratio:.2f}" - print("โœ… Depth scaling measurement test passed") + print("PASS Depth scaling measurement test passed") # Test width vs depth analysis configurations = [ @@ -2108,7 +2108,7 @@ def test_transformer_profiler(): assert 'computation_time_ms' in result, "Should measure computation time" assert result['actual_parameters'] > 0, "Should have positive parameter count" - print("โœ… Width vs depth analysis test passed") + print("PASS Width vs depth analysis test passed") # Test production scaling simulation production_results = profiler.simulate_production_scaling(['Small', 'Medium']) @@ -2124,8 +2124,8 @@ def test_transformer_profiler(): assert result['total_parameters'] > 1e6, "Should have millions of parameters" assert result['training_memory_gb'] > result['inference_memory_gb'], "Training should require more memory" - print("โœ… Production scaling simulation test passed") - print("๐ŸŽฏ Transformer Profiler: All tests passed!") + print("PASS Production scaling simulation test passed") + print("TARGET Transformer Profiler: All tests passed!") # Test function defined (called in main block) @@ -2139,7 +2139,7 @@ Let's test the complete pipeline from tokenization through transformer processin # %% nbgrader={"grade": false, "grade_id": "test-transformer-integration", "locked": false, "schema_version": 3, "solution": false, "task": false} def test_complete_language_model_pipeline(): """Test complete language model pipeline integration.""" - print("๐Ÿงช Integration Test: Complete Language Model Pipeline...") + print("TEST Integration Test: Complete Language Model Pipeline...") # Create a small but complete language model vocab_size = 1000 @@ -2295,10 +2295,10 @@ def test_complete_language_model_pipeline(): 
print(f" Model shows appropriate sensitivity to input changes") - print("✅ Complete language model pipeline integration test passed!") - print(f"✅ Forward pass, masking, generation, and performance verified") - print(f"✅ Model processes {tokens_per_second:.0f} tokens/second") - print(f"✅ Memory footprint: {memory_stats['total_memory_mb']:.1f}MB") + print("PASS Complete language model pipeline integration test passed!") + print(f"PASS Forward pass, masking, generation, and performance verified") + print(f"PASS Model processes {tokens_per_second:.0f} tokens/second") + print(f"PASS Memory footprint: {memory_stats['total_memory_mb']:.1f}MB") # Test function defined (called in main block) @@ -2320,14 +2320,14 @@ if __name__ == "__main__": test_complete_language_model_pipeline() print("\n" + "="*60) - print("🔍 TRANSFORMER SYSTEMS ANALYSIS") + print("MAGNIFY TRANSFORMER SYSTEMS ANALYSIS") print("="*60) # Performance analysis profiler = TransformerProfiler() # Test transformer scaling with different depths - print("📈 TRANSFORMER DEPTH SCALING ANALYSIS:") + print("PROGRESS TRANSFORMER DEPTH SCALING ANALYSIS:") base_config = { 'vocab_size': 1000, 'embed_dim': 256, @@ -2446,7 +2446,7 @@ if __name__ == "__main__": print(f" {factor}x scale: {scaled_params/1e6:.0f}M params, ~{scaled_memory_gb:.1f}GB memory") print("\n" + "="*60) - print("🎯 TRANSFORMERS MODULE COMPLETE!") + print("TARGET TRANSFORMERS MODULE COMPLETE!") print("="*60) print("All transformer tests passed!") print("Complete language model architecture implemented!") @@ -2455,16 +2455,16 @@ if __name__ == "__main__": # Final systems analysis analyze_transformer_memory_scaling_final() -# 🔍 SYSTEMS INSIGHT: Final Transformer Memory Scaling Analysis +# MAGNIFY SYSTEMS INSIGHT: Final Transformer Memory Scaling Analysis def analyze_transformer_memory_scaling_final(): """Comprehensive analysis of transformer memory scaling patterns.""" try: print("\n" + "="*70) - print("📈 TRANSFORMER MEMORY SCALING
ANALYSIS") + print("PROGRESS TRANSFORMER MEMORY SCALING ANALYSIS") print("="*70) # Test sequence length scaling (the quadratic bottleneck) - print("🔍 SEQUENCE LENGTH SCALING (Quadratic Alert!)") + print("MAGNIFY SEQUENCE LENGTH SCALING (Quadratic Alert!)") embed_dim = 512 num_heads = 8 batch_size = 16 @@ -2476,10 +2476,10 @@ def analyze_transformer_memory_scaling_final(): base_memory = None for seq_len in seq_lengths: - # Input activation memory: batch × seq × embed + # Input activation memory: batch * seq * embed input_memory = batch_size * seq_len * embed_dim * 4 / (1024**2) - # Attention matrix memory: batch × heads × seq × seq (the killer!) + # Attention matrix memory: batch * heads * seq * seq (the killer!) attention_memory = batch_size * num_heads * seq_len * seq_len * 4 / (1024**2) total_memory = input_memory + attention_memory @@ -2492,10 +2492,10 @@ def analyze_transformer_memory_scaling_final(): print(f"{seq_len:<8} {input_memory:<11.2f} {attention_memory:<14.2f} {total_memory:<11.2f} {scale_factor:<12.2f}") - print(f"\n⚠️ QUADRATIC SCALING ALERT: 2× sequence = 4× attention memory!") + print(f"\nWARNING️ QUADRATIC SCALING ALERT: 2* sequence = 4* attention memory!") # Model size comparison - print(f"\n🔍 MODEL SIZE COMPARISON (Parameter Count)") + print(f"\nMAGNIFY MODEL SIZE COMPARISON (Parameter Count)") configs = [ ("GPT-2 Small", 50257, 768, 12, 12, 3072), ("GPT-2 Medium", 50257, 1024, 24, 16, 4096), @@ -2522,9 +2522,9 @@ def analyze_transformer_memory_scaling_final(): print(f" - Sequence length: O(N²) scaling due to attention matrices") print(f" - Model parameters: O(embed_dim²) dominates for transformer blocks") print(f" - Vocabulary size: O(vocab_size) can dominate total parameters") - print(f" - Training memory: 4-16× parameter memory (gradients + optimizer)") + print(f" - Training memory: 4-16* parameter memory (gradients + optimizer)") - print(f"\n💡 PRODUCTION IMPLICATIONS:") + print(f"\nTIP PRODUCTION IMPLICATIONS:")
print(f" - Attention memory limits sequence length in practice") print(f" - Large vocabularies dominate parameter count") print(f" - Deep models need careful memory management") @@ -2535,11 +2535,11 @@ def analyze_transformer_memory_scaling_final(): print(f" • Mixed precision for memory efficiency") except Exception as e: - print(f"⚠️ Error in scaling analysis: {e}") + print(f"WARNING️ Error in scaling analysis: {e}") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Now that you've built complete transformer architectures, let's connect this work to broader ML systems challenges. These questions help you think critically about how transformer design choices affect production deployment and system performance. @@ -2552,7 +2552,7 @@ Take time to reflect thoughtfully on each question - your insights will help you **Context**: Your transformer implementations reveal how architectural choices affect memory usage and computational complexity. In your TransformerBlock implementation, you saw how FFN parameters dominate (67% of block parameters), while attention creates O(N²) memory scaling with sequence length. Your memory scaling analysis showed quadratic growth with sequence length. -**Reflection Question**: Analyze the memory and performance trade-offs in your transformer architecture. Based on your parameter counting and memory analysis, how would you modify your TransformerBlock implementation to handle sequences 4× longer while staying within the same memory budget? Consider the attention matrix scaling you observed (quadratic with sequence length) and the FFN parameter dominance you measured. What specific changes to your MultiHeadAttention and PositionwiseFeedForward classes would enable more efficient long-sequence processing, and how would these modifications affect the residual connections and layer normalization in your transformer blocks?
+**Reflection Question**: Analyze the memory and performance trade-offs in your transformer architecture. Based on your parameter counting and memory analysis, how would you modify your TransformerBlock implementation to handle sequences 4* longer while staying within the same memory budget? Consider the attention matrix scaling you observed (quadratic with sequence length) and the FFN parameter dominance you measured. What specific changes to your MultiHeadAttention and PositionwiseFeedForward classes would enable more efficient long-sequence processing, and how would these modifications affect the residual connections and layer normalization in your transformer blocks? Think about: attention matrix memory scaling, FFN parameter reduction strategies, efficient residual connection patterns, and layer normalization placement optimization. @@ -2674,11 +2674,11 @@ GRADING RUBRIC (Instructor Use): # %% [markdown] """ -## 🎯 MODULE SUMMARY: Transformers +## TARGET MODULE SUMMARY: Transformers Congratulations! You have successfully implemented complete transformer architectures that power modern language models: -### ✅ What You Have Built +### PASS What You Have Built - **Layer Normalization**: Stable normalization for deep transformer training - **Position-wise Feed-Forward**: Non-linear transformations applied to each sequence position - **Transformer Blocks**: Complete transformer layers with attention, normalization, and residual connections @@ -2687,28 +2687,28 @@ Congratulations!
You have successfully implemented complete transformer architec - **🆕 Performance Analysis**: Comprehensive scaling analysis and architectural optimization tools - **🆕 Production Insights**: Understanding of real-world transformer deployment challenges -### ✅ Key Learning Outcomes +### PASS Key Learning Outcomes - **Understanding**: How transformer blocks enable powerful sequence modeling through attention and feed-forward layers - **Implementation**: Built complete transformer architectures with proper layer organization and residual connections - **Systems Insight**: How transformer depth affects memory usage, training efficiency, and model capacity - **Performance Engineering**: Measured and analyzed transformer scaling characteristics and optimization opportunities - **Production Context**: Understanding transformer deployment challenges and architectural trade-offs -### ✅ Technical Mastery +### PASS Technical Mastery - **Layer Normalization**: Stabilizing deep network training with proper feature normalization - **Residual Connections**: Enabling gradient flow through deep transformer architectures - **Pre-norm vs Post-norm**: Understanding normalization placement effects on training stability - **Parameter Scaling**: Understanding how transformer parameters scale with architectural choices - **🆕 Generation Systems**: Autoregressive text generation with causal attention patterns -### ✅ Professional Skills Developed +### PASS Professional Skills Developed - **Systems Architecture**: Designing complete transformer systems for production scale - **Memory Engineering**: Understanding transformer memory scaling (O(N²) attention, parameter distribution) - **Computational Assessment**: Parameter counting, memory analysis, and production-scale calculations - **Performance Analysis**: Measuring and improving transformer computation and memory efficiency - **Integration Design**: Building complete language processing pipelines from tokenization to
generation

-### ✅ Ready for Next Steps
+### PASS Ready for Next Steps
 Your transformer implementations and analysis provide the foundation for:
 - **Advanced Language Models**: GPT, BERT, and other transformer-based architectures
 - **Multi-modal Models**: Extending transformers to vision, audio, and other modalities
@@ -2716,14 +2716,14 @@ Your transformer implementations and analysis provide the foundation for:
 - **Scale Analysis**: Understanding memory bottlenecks from small models to GPT-3 scale (175B parameters)
 - **🧠 AI Applications**: Real-world language processing applications and services

-### 🔗 Connection to Real ML Systems
+### LINK Connection to Real ML Systems
 Your implementations mirror production systems:
 - **GPT Architecture**: Your transformer matches GPT's decoder-only architecture
 - **BERT Components**: Layer normalization and attention mechanisms used in BERT
 - **Production Optimization**: Understanding of memory scaling, batching, and generation optimization
 - **Industry Applications**: Foundation for all modern language model deployments

-### 🎯 The Complete Language Model
+### TARGET The Complete Language Model
 You have built the architecture that transformed AI:
 - **Before**: RNNs and CNNs limited by sequential processing and local dependencies
 - **After**: Transformers enable parallel processing and global attention across entire sequences
diff --git a/modules/14_profiling/profiling_dev.py b/modules/14_profiling/profiling_dev.py
index 34310dd9..f7bbda2f 100644
--- a/modules/14_profiling/profiling_dev.py
+++ b/modules/14_profiling/profiling_dev.py
@@ -7,7 +7,7 @@ But here's the million-dollar question: **Why is your transformer 100x slower th

 Time to become a performance detective and find out what's really happening under the hood.

-## 🔍 What You'll Discover
+## MAGNIFY What You'll Discover

 Ever wonder why your models feel sluggish?
We're about to reveal the culprits: - Which operations are eating your CPU cycles @@ -18,7 +18,7 @@ Ever wonder why your models feel sluggish? We're about to reveal the culprits: **Spoiler Alert**: The results might surprise you. That "simple" attention mechanism? It's probably consuming 73% of your compute time! -## ๐ŸŽฏ Learning Objectives +## TARGET Learning Objectives By the end of this module, you'll be able to: 1. **Build Professional Profilers**: Create timing, memory, and FLOP counters @@ -44,7 +44,7 @@ import time start = time.time() result = my_function() end = time.time() -print(f"Took {end - start:.2f}s") # โŒ Unreliable! +print(f"Took {end - start:.2f}s") # FAIL Unreliable! ``` **Problems:** @@ -70,7 +70,7 @@ try: from tinytorch.core.spatial import Conv2d, MaxPool2d from tinytorch.core.transformers import Transformer except ImportError: - print("โš ๏ธ TinyTorch modules not available - using mocks for development") + print("WARNING๏ธ TinyTorch modules not available - using mocks for development") class Tensor: def __init__(self, data): @@ -149,7 +149,7 @@ class Timer: self.measurements = [] # Warmup runs to get code in CPU cache - print(f"๐Ÿ”ฅ Running {warmup} warmup iterations...") + print(f"FIRE Running {warmup} warmup iterations...") for _ in range(warmup): _ = func(*args, **kwargs) @@ -219,7 +219,7 @@ class Timer: def print_report(self, name: str = "Function"): """Print a formatted timing report.""" if not self.measurements: - print(f"โŒ No measurements available for {name}") + print(f"FAIL No measurements available for {name}") return stats = self._compute_stats() @@ -228,20 +228,20 @@ class Timer: print("=" * 50) print(f"Runs: {stats['runs']}") print(f"Mean: {stats['mean_ms']:.3f} ms ยฑ {stats['std_ms']:.3f} ms") - print(f"Range: {stats['min_ms']:.3f} ms โ†’ {stats['max_ms']:.3f} ms") + print(f"Range: {stats['min_ms']:.3f} ms -> {stats['max_ms']:.3f} ms") print(f"P50: {stats['p50_ms']:.3f} ms") print(f"P95: {stats['p95_ms']:.3f} ms") 
print(f"P99: {stats['p99_ms']:.3f} ms") # Helpful interpretation if stats['std_ms'] / stats['mean_ms'] > 0.1: - print("โš ๏ธ High variability - consider more warmup runs") + print("WARNING๏ธ High variability - consider more warmup runs") else: - print("โœ… Stable timing measurements") + print("PASS Stable timing measurements") # %% [markdown] """ -### ๐Ÿงช Test the Timer +### TEST Test the Timer Let's test our timer on different types of operations to see the statistical rigor in action. """ @@ -282,7 +282,7 @@ def test_timer(): stats = timer.measure(ml_operation, warmup=2, runs=50) timer.print_report("Linear Layer Forward") - print("\n๐ŸŽฏ KEY INSIGHT: Notice the different scales!") + print("\nTARGET KEY INSIGHT: Notice the different scales!") print(" - CPU operations: microseconds (< 1ms)") print(" - Memory operations: low milliseconds") print(" - ML operations: higher milliseconds") @@ -390,20 +390,20 @@ class MemoryProfiler: # Memory efficiency insights if stats['allocated_mb'] > stats['peak_mb'] * 0.5: - print("โš ๏ธ High memory allocation - check for copies") + print("WARNING๏ธ High memory allocation - check for copies") elif stats['allocated_mb'] < 0: - print("โœ… Memory efficient - some cleanup occurred") + print("PASS Memory efficient - some cleanup occurred") else: - print("โœ… Reasonable memory usage") + print("PASS Reasonable memory usage") # Peak vs final analysis peak_vs_final_ratio = stats['peak_mb'] / max(stats['final_mb'], 0.001) if peak_vs_final_ratio > 2.0: - print(f"๐Ÿ’ก Peak was {peak_vs_final_ratio:.1f}x final - temporary allocations detected") + print(f"TIP Peak was {peak_vs_final_ratio:.1f}x final - temporary allocations detected") # %% [markdown] """ -### ๐Ÿงช Test Memory Profiler +### TEST Test Memory Profiler Let's test the memory profiler on operations that have different memory patterns. 
""" @@ -445,7 +445,7 @@ def test_memory_profiler(): stats = profiler.profile(copying_operation) profiler.print_report(stats, "Copying Operation") - print("\n๐ŸŽฏ KEY INSIGHT: Memory patterns reveal optimization opportunities!") + print("\nTARGET KEY INSIGHT: Memory patterns reveal optimization opportunities!") print(" - Small allocations: Usually efficient") print(" - Large allocations: Watch for memory bandwidth limits") print(" - Copying operations: Major performance killers") @@ -512,7 +512,7 @@ class FLOPCounter: Returns: Total FLOPs for this operation """ - # Matrix multiplication: (batch, in) ร— (in, out) = batch * in * out multiplications + # Matrix multiplication: (batch, in) * (in, out) = batch * in * out multiplications multiply_ops = batch_size * input_features * output_features # Addition for bias: batch * out additions @@ -548,7 +548,7 @@ class FLOPCounter: output_height = input_height - kernel_size + 1 output_width = input_width - kernel_size + 1 - # Each output pixel requires kernel_sizeยฒ ร— input_channels multiplications + # Each output pixel requires kernel_sizeยฒ * input_channels multiplications multiply_ops = (batch_size * output_height * output_width * output_channels * kernel_size * kernel_size * input_channels) @@ -643,7 +643,7 @@ class FLOPCounter: total_flops = self.operation_counts['total_flops'] if total_flops == 0: - print("โŒ No FLOPs counted") + print("FAIL No FLOPs counted") return print(f"Total FLOPs: {total_flops:,}") @@ -667,7 +667,7 @@ class FLOPCounter: # %% [markdown] """ -### ๐Ÿงช Test FLOP Counter +### TEST Test FLOP Counter Let's count operations for different architectures and see the scaling differences. 
""" @@ -681,13 +681,13 @@ def test_flop_counter(): print("=" * 65) # Test 1: Simple Linear Layer (MLP building block) - print("\n1๏ธโƒฃ Linear Layer (64 โ†’ 32, batch=10)") + print("\n1๏ธโƒฃ Linear Layer (64 -> 32, batch=10)") flops = counter.count_linear(input_features=64, output_features=32, batch_size=10) counter.print_report("Linear Layer") # Test 2: Convolutional Layer counter.reset() - print("\n2๏ธโƒฃ Conv2D Layer (32ร—32ร—3 โ†’ 16 channels, 3ร—3 kernel)") + print("\n2๏ธโƒฃ Conv2D Layer (32*32*3 -> 16 channels, 3*3 kernel)") flops = counter.count_conv2d(input_height=32, input_width=32, input_channels=3, output_channels=16, kernel_size=3, batch_size=1) counter.print_report("Conv2D Layer") @@ -785,7 +785,7 @@ class ProfilerContext: def __enter__(self): """Start profiling context.""" - print(f"๐Ÿ” PROFILING: {self.name}") + print(f"MAGNIFY PROFILING: {self.name}") print("=" * (len(self.name) + 12)) if self.enable_memory: @@ -798,7 +798,7 @@ class ProfilerContext: def __exit__(self, exc_type, exc_val, exc_tb): """End profiling and generate report.""" if exc_type is not None: - print(f"โŒ Error during profiling: {exc_val}") + print(f"FAIL Error during profiling: {exc_val}") return False self.generate_report() @@ -883,7 +883,7 @@ class ProfilerContext: def _print_insights(self): """Print performance insights and recommendations.""" - print(f"\n๐Ÿ’ก PERFORMANCE INSIGHTS:") + print(f"\nTIP PERFORMANCE INSIGHTS:") insights = [] @@ -893,11 +893,11 @@ class ProfilerContext: std_ms = self.timing_stats.get('std_ms', 0) if mean_ms < 0.1: - insights.append("โšก Very fast operation (< 0.1ms)") + insights.append("SPEED Very fast operation (< 0.1ms)") elif mean_ms < 1: - insights.append("โœ… Fast operation (< 1ms)") + insights.append("PASS Fast operation (< 1ms)") elif mean_ms < 10: - insights.append("โš ๏ธ Moderate speed (1-10ms)") + insights.append("WARNING๏ธ Moderate speed (1-10ms)") else: insights.append("๐ŸŒ Slow operation (> 10ms) - optimization target") @@ 
-922,18 +922,18 @@ class ProfilerContext: gflops_per_sec = (self.flop_counter.operation_counts['total_flops'] / 1e9) / mean_seconds if gflops_per_sec > 10: - insights.append("๐Ÿš€ Excellent computational efficiency") + insights.append("ROCKET Excellent computational efficiency") elif gflops_per_sec > 1: - insights.append("โœ… Good computational efficiency") + insights.append("PASS Good computational efficiency") else: - insights.append("โš ๏ธ Low efficiency - check for bottlenecks") + insights.append("WARNING๏ธ Low efficiency - check for bottlenecks") # Print insights for insight in insights: print(f" {insight}") if not insights: - print(" ๐Ÿ“ˆ Run with more profiling options for insights") + print(" PROGRESS Run with more profiling options for insights") # %% #| export @@ -984,7 +984,7 @@ def profile_function(func, *args, **kwargs): # %% [markdown] """ -### ๐Ÿงช Test Comprehensive Profiling +### TEST Test Comprehensive Profiling Now let's use the complete profiler to analyze different model architectures. This is where the detective work pays off - you'll see exactly why some models are fast and others are slow! 
@@ -994,7 +994,7 @@ This is where the detective work pays off - you'll see exactly why some models a def test_comprehensive_profiling(): """Test comprehensive profiling on different model types.""" - print("๐Ÿ” COMPREHENSIVE PROFILING - Architecture Detective Work") + print("MAGNIFY COMPREHENSIVE PROFILING - Architecture Detective Work") print("=" * 80) # Test 1: Simple Linear Model (MLP) @@ -1052,16 +1052,16 @@ def test_comprehensive_profiling(): print("COMPARATIVE ANALYSIS - The Big Reveal!") print("๐Ÿ"*25) print(""" -๐ŸŽฏ KEY DISCOVERIES: +TARGET KEY DISCOVERIES: 1๏ธโƒฃ MLP (Linear): - Fastest for small inputs - - Linear scaling: O(input_size ร— output_size) + - Linear scaling: O(input_size * output_size) - Excellent for final classification layers 2๏ธโƒฃ CNN (Convolutional): - Moderate speed, excellent for spatial data - - Scaling: O(input_pixels ร— kernel_size) + - Scaling: O(input_pixels * kernel_size) - Hardware-friendly (vectorizable) 3๏ธโƒฃ Transformer (Attention): @@ -1075,7 +1075,7 @@ def test_comprehensive_profiling(): - 10x longer sequence = 100x computation - This is why GPT models are expensive to run! -๐Ÿ’ก OPTIMIZATION STRATEGIES: +TIP OPTIMIZATION STRATEGIES: - MLPs: Focus on batch processing - CNNs: Use optimized convolution libraries - Transformers: Implement attention optimizations (next module!) @@ -1112,7 +1112,7 @@ def simulate_complete_model_profiling(): print("๐Ÿ•ต๏ธ PERFORMANCE DETECTIVE: Complete Model Analysis") print("=" * 80) print(""" -๐ŸŽฏ MISSION: Find the bottleneck in our neural network +TARGET MISSION: Find the bottleneck in our neural network We have a model with: - Input processing (Linear layer) @@ -1152,7 +1152,7 @@ Which component is slowing us down? print(f"{'TOTAL':<20s}: {total_time:6.2f} ms") # Bottleneck analysis - print(f"\n๐Ÿ” BOTTLENECK ANALYSIS:") + print(f"\nMAGNIFY BOTTLENECK ANALYSIS:") print("-" * 40) # Find the slowest component @@ -1163,7 +1163,7 @@ Which component is slowing us down? 
print(f" Time: {slowest_time:.2f} ms ({bottleneck_percentage:.1f}% of total)") # Calculate optimization impact - print(f"\n๐Ÿ’ก OPTIMIZATION IMPACT ANALYSIS:") + print(f"\nTIP OPTIMIZATION IMPACT ANALYSIS:") print("-" * 40) # If we optimize the bottleneck by different amounts @@ -1202,7 +1202,7 @@ Which component is slowing us down? print(f"{'TOTAL':<20s}: {total_memory:5.1f} MB") # Key insights - print(f"\n๐ŸŽฏ KEY PERFORMANCE INSIGHTS:") + print(f"\nTARGET KEY PERFORMANCE INSIGHTS:") print("=" * 50) print(f""" 1๏ธโƒฃ BOTTLENECK IDENTIFIED: {slowest_name} @@ -1260,14 +1260,14 @@ def analyze_systems_implications(): print("=" * 80) print(""" -๐ŸŽฏ PROFILING INSIGHTS โ†’ SYSTEMS DECISIONS +TARGET PROFILING INSIGHTS -> SYSTEMS DECISIONS Our performance detective work revealed several critical patterns. Let's trace how these insights drive production ML systems: """) # Memory scaling analysis - print("\n๐Ÿ“ˆ MEMORY SCALING ANALYSIS:") + print("\nPROGRESS MEMORY SCALING ANALYSIS:") print("-" * 50) sequence_lengths = [128, 512, 1024, 2048, 4096] @@ -1285,18 +1285,18 @@ Let's trace how these insights drive production ML systems: total_memory_gb = (qkv_memory + attention_scores) * 2 # Forward + backward if seq_len <= 512: - note = "โœ… Practical" + note = "PASS Practical" elif seq_len <= 1024: - note = "โš ๏ธ Expensive" + note = "WARNING๏ธ Expensive" else: note = "๐Ÿšจ Prohibitive" print(f"{seq_len:8d} | {total_memory_gb:8.2f} | {note}") - print("\n๐Ÿ’ก KEY INSIGHT: Memory grows O(nยฒ) - this is why context length is limited!") + print("\nTIP KEY INSIGHT: Memory grows O(nยฒ) - this is why context length is limited!") # Compute scaling analysis - print("\nโšก COMPUTE SCALING ANALYSIS:") + print("\nSPEED COMPUTE SCALING ANALYSIS:") print("-" * 50) print("FLOPs Required by Architecture (1M input features):") @@ -1313,14 +1313,14 @@ Let's trace how these insights drive production ML systems: for arch, flops, scaling, use_case in architectures: print(f"{arch:12s} | 
{flops:8s} | {scaling:8s} | {use_case}") - print("\n๐Ÿ’ก INSIGHT: Attention is 1000x more expensive than linear layers!") + print("\nTIP INSIGHT: Attention is 1000x more expensive than linear layers!") # Hardware implications print("\n๐Ÿ”ง HARDWARE IMPLICATIONS:") print("-" * 40) print(""" -From Profiling Data โ†’ Hardware Decisions: +From Profiling Data -> Hardware Decisions: 1๏ธโƒฃ CPU vs GPU Choice: - Linear layers: CPU fine (low parallelism) @@ -1360,12 +1360,12 @@ How Our Profiling Insights Play Out in Production: - Decision: Use tensor parallelism across GPUs - Result: Split attention computation, linear speedup -โšก EDGE DEVICES: +SPEED EDGE DEVICES: - Profiling shows: Memory bandwidth limited - Decision: Quantize to INT8, cache frequent patterns - Result: 4x memory reduction, 2x speedup -๐ŸŽฏ KEY TAKEAWAY: +TARGET KEY TAKEAWAY: Profiling isn't academic - it drives billion-dollar infrastructure decisions! Every major ML system (GPT, BERT, ResNet) was optimized using these techniques. """) @@ -1389,7 +1389,7 @@ def integration_test_profiling_suite(): Tests all components working together on a realistic model. 
""" - print("๐Ÿงช INTEGRATION TEST: Complete Profiling Suite") + print("TEST INTEGRATION TEST: Complete Profiling Suite") print("=" * 70) # Test all profilers working together @@ -1405,7 +1405,7 @@ def integration_test_profiling_suite(): timing_stats = timer.measure(sample_computation, warmup=2, runs=50) assert timing_stats['runs'] == 50 assert timing_stats['mean_ms'] > 0 - print("โœ… Timer: Working correctly") + print("PASS Timer: Working correctly") # Memory profiler test memory_profiler = MemoryProfiler() @@ -1415,13 +1415,13 @@ def integration_test_profiling_suite(): memory_stats = memory_profiler.profile(memory_intensive_task) assert memory_stats['peak_mb'] > 0 - print("โœ… Memory Profiler: Working correctly") + print("PASS Memory Profiler: Working correctly") # FLOP counter test flop_counter = FLOPCounter() flops = flop_counter.count_linear(100, 50, batch_size=32) assert flops == 32 * 100 * 50 + 32 * 50 # multiply + add operations - print("โœ… FLOP Counter: Working correctly") + print("PASS FLOP Counter: Working correctly") # Context manager test print("\n2๏ธโƒฃ Testing Profiler Context Integration:") @@ -1446,7 +1446,7 @@ def integration_test_profiling_suite(): ) profiler.add_flop_count(estimated_flops) - print("โœ… Profiler Context: Integration successful") + print("PASS Profiler Context: Integration successful") # Test performance comparison print("\n3๏ธโƒฃ Performance Comparison Test:") @@ -1466,7 +1466,7 @@ def integration_test_profiling_suite(): results.append(name) - print("โœ… Performance Comparison: All operations profiled successfully") + print("PASS Performance Comparison: All operations profiled successfully") # Validate profiling accuracy print("\n4๏ธโƒฃ Profiling Accuracy Validation:") @@ -1483,10 +1483,10 @@ def integration_test_profiling_suite(): tolerance = 0.3 relative_error = abs(mean_ms - expected_ms) / expected_ms if relative_error > tolerance: - print(f"โš ๏ธ Timing variance higher than expected: {mean_ms:.2f}ms vs expected 
{expected_ms:.2f}ms (tolerance: {tolerance*100}%)") + print(f"WARNING๏ธ Timing variance higher than expected: {mean_ms:.2f}ms vs expected {expected_ms:.2f}ms (tolerance: {tolerance*100}%)") print(" This is normal for mock operations and system-dependent timing") else: - print("โœ… Timing Accuracy: Within acceptable tolerance") + print("PASS Timing Accuracy: Within acceptable tolerance") # Test memory tracking accuracy def known_memory_allocation(): @@ -1499,7 +1499,7 @@ def integration_test_profiling_suite(): # Memory allocation should be positive and reasonable assert allocated_mb > 0.5, f"Memory tracking issue: {allocated_mb:.2f}MB seems too low" assert allocated_mb < 10, f"Memory tracking issue: {allocated_mb:.2f}MB seems too high" - print("โœ… Memory Tracking: Reasonable accuracy") + print("PASS Memory Tracking: Reasonable accuracy") # Final integration validation print("\n5๏ธโƒฃ End-to-End Integration Test:") @@ -1536,7 +1536,7 @@ def integration_test_profiling_suite(): total_flops = sum(model_flops.values()) profiler.add_flop_count(total_flops, model_flops) - print("โœ… End-to-End: Complete workflow successful") + print("PASS End-to-End: Complete workflow successful") # Test SimpleProfiler interface (for Module 20 compatibility) print("\n6๏ธโƒฃ SimpleProfiler Interface Test:") @@ -1556,7 +1556,7 @@ def integration_test_profiling_suite(): assert 'wall_time' in result assert 'cpu_time' in result assert 'name' in result - print("โœ… SimpleProfiler: Full functionality working") + print("PASS SimpleProfiler: Full functionality working") except ImportError: # Fall back to simple computation if numpy not available def simple_computation(): @@ -1567,33 +1567,33 @@ def integration_test_profiling_suite(): assert 'wall_time' in result assert 'cpu_time' in result assert 'name' in result - print("โœ… SimpleProfiler: Basic functionality working") + print("PASS SimpleProfiler: Basic functionality working") # Test profile_function utility try: func_result = 
profile_function(sample_computation) assert 'wall_time' in func_result - print("โœ… profile_function utility: Working correctly") + print("PASS profile_function utility: Working correctly") except ImportError: def simple_computation(): return sum(i*i for i in range(1000)) func_result = profile_function(simple_computation) assert 'wall_time' in func_result - print("โœ… profile_function utility: Working correctly (fallback)") + print("PASS profile_function utility: Working correctly (fallback)") # Success summary - print(f"\n๐ŸŽ‰ INTEGRATION TEST RESULTS:") + print(f"\nCELEBRATE INTEGRATION TEST RESULTS:") print("=" * 50) print(""" -โœ… All profiling components working correctly -โœ… Context manager integration successful -โœ… Timing accuracy within acceptable range -โœ… Memory tracking functioning properly -โœ… FLOP counting calculations correct -โœ… End-to-end workflow validated -โœ… SimpleProfiler interface ready for Module 20 +PASS All profiling components working correctly +PASS Context manager integration successful +PASS Timing accuracy within acceptable range +PASS Memory tracking functioning properly +PASS FLOP counting calculations correct +PASS End-to-end workflow validated +PASS SimpleProfiler interface ready for Module 20 -๐Ÿš€ PROFILING SUITE READY FOR PRODUCTION USE! +ROCKET PROFILING SUITE READY FOR PRODUCTION USE! Your profiling tools are now ready to: - Identify bottlenecks in real models @@ -1611,7 +1611,7 @@ if __name__ == "__main__": # %% [markdown] """ -## ๐Ÿค” ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Now that you've built a complete profiling suite, let's think about how this applies to real ML systems engineering. """ @@ -1678,7 +1678,7 @@ How would you modify your profiling approach for production? 
What are the key de

 # %%
 if __name__ == "__main__":
-    print("🤔 ML Systems Thinking Questions")
+    print("THINK ML Systems Thinking Questions")
     print("=" * 50)
     print("""
 Complete the interactive questions above to deepen your understanding of:
@@ -1705,7 +1705,7 @@ Answer them to master performance analysis thinking!

 # %%
 if __name__ == "__main__":
-    print("🔍 PROFILING MODULE: Performance Detective Suite")
+    print("MAGNIFY PROFILING MODULE: Performance Detective Suite")
     print("=" * 60)

     # Run all profiling tests in sequence
@@ -1730,8 +1730,8 @@ if __name__ == "__main__":
     print("\n7️⃣ Running Integration Tests...")
     integration_test_profiling_suite()

-    print("\n🎉 ALL PROFILING TESTS COMPLETED SUCCESSFULLY!")
-    print("\n🚀 Your profiling suite is ready to:")
+    print("\nCELEBRATE ALL PROFILING TESTS COMPLETED SUCCESSFULLY!")
+    print("\nROCKET Your profiling suite is ready to:")
     print(" - Identify bottlenecks in neural networks")
     print(" - Guide optimization decisions with data")
     print(" - Predict performance at scale")
@@ -1740,7 +1740,7 @@ if __name__ == "__main__":

 # %% [markdown]
 """
-## 🎯 MODULE SUMMARY: Profiling - Performance Detective Work
+## TARGET MODULE SUMMARY: Profiling - Performance Detective Work

 Congratulations! You've built a comprehensive profiling suite that reveals the performance secrets of neural networks.

@@ -1770,7 +1770,7 @@ Congratulations!
You've built a comprehensive profiling suite that reveals the p
 - Implemented automatic insight generation
 - Developed production-ready profiling workflow

-### 🔍 Key Discoveries Made
+### MAGNIFY Key Discoveries Made

 **Architecture Performance Profiles:**
 - **MLPs**: Fast, linear scaling, memory efficient
@@ -1787,7 +1787,7 @@ Congratulations! You've built a comprehensive profiling suite that reveals the p
 - Memory constraints limit batch sizes in attention models
 - Optimization ROI follows Amdahl's Law patterns

-### 🚀 Real-World Applications
+### ROCKET Real-World Applications

 Your profiling tools enable:
 - **Bottleneck identification** in production models
@@ -1796,7 +1796,7 @@ Your profiling tools enable:
 - **Cost prediction** for scaling ML systems
 - **Performance regression** detection in CI/CD

-### 🎯 What's Next
+### TARGET What's Next

 Module 16 (Acceleration) will use these profiling insights to:
 - Implement attention optimizations (Flash Attention patterns)
@@ -1817,5 +1817,5 @@ Module 16 (Acceleration) will use these profiling insights to:

 You now have the tools to analyze any neural network and understand exactly why it's fast or slow. These are the same techniques used to optimize GPT, BERT, and every other production ML system.

-**Welcome to the ranks of ML systems performance engineers!** 🎉
+**Welcome to the ranks of ML systems performance engineers!** CELEBRATE
 """
\ No newline at end of file
diff --git a/modules/15_acceleration/acceleration_dev.py b/modules/15_acceleration/acceleration_dev.py
index 2bf814cb..51346675 100644
--- a/modules/15_acceleration/acceleration_dev.py
+++ b/modules/15_acceleration/acceleration_dev.py
@@ -4,7 +4,7 @@

 Welcome to Hardware Acceleration! You'll discover the easiest optimization in ML systems - getting 100x speedups with zero code changes!

-## 🔗 Building on Previous Learning
+## LINK Building on Previous Learning
 **What You Built Before**:
 - Module 02 (Tensor): Triple-nested loops for matrix operations
 - Module 04 (Layers): Forward pass implementations
@@ -18,7 +18,7 @@ Welcome to Hardware Acceleration!
You'll discover the easiest optimization in ML

 **Connection Map**:
 ```
-Profiling → Acceleration → Production ML
+Profiling -> Acceleration -> Production ML
 (identify) (optimize) (deploy at scale)
 ```

@@ -29,21 +29,21 @@ Profiling → Acceleration → Production ML
 - **Framework connections**: How PyTorch/TensorFlow achieve performance
 - **Optimization trade-offs**: Educational clarity vs production speed

-## Build → Use → Reflect
+## Build -> Use -> Reflect
 1. **Build**: Cache-friendly blocked matrix multiplication from scratch
 2. **Use**: Apply acceleration to real ML model operations (MLP, CNN, Attention)
 3. **Reflect**: Analyze the educational-to-production optimization spectrum

 ## Systems Reality Check
-💡 **Production Context**: ML frameworks use these exact principles for 100x speedups
-⚡ **Performance Insight**: Memory access patterns matter more than raw computation speed
+TIP **Production Context**: ML frameworks use these exact principles for 100x speedups
+SPEED **Performance Insight**: Memory access patterns matter more than raw computation speed

 ## The Free Speedup Journey

 **Key Message**: This is the EASIEST optimization - just use better backends! No accuracy trade-offs, no complex math - just 10-100x faster code.

 ```
-Educational Loops → Cache Blocking → NumPy/BLAS → Smart Backends
+Educational Loops -> Cache Blocking -> NumPy/BLAS -> Smart Backends
 (learning) (understanding) (production) (automation)
 1000x slower 100x slower optimal speed transparent
 ```
@@ -68,19 +68,19 @@ Let's start with the educational triple-nested loops you implemented earlier. Th
Th ``` CPU Architecture (Optimized for Sequential): GPU Architecture (Optimized for Parallel): -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Complex Control Unit โ”‚ โ”‚ Simple Control Units โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” Large Caches โ”‚ โ”‚ โ”Œโ”€โ” โ”Œโ”€โ” โ”Œโ”€โ” โ”Œโ”€โ” Small Caches โ”‚ -โ”‚ โ”‚ Core 1 โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚Cโ”‚ โ”‚Cโ”‚ โ”‚Cโ”‚ โ”‚Cโ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ L3 Cache (8MB) โ”‚ โ”‚ โ”‚ โ””โ”€โ”˜ โ””โ”€โ”˜ โ””โ”€โ”˜ โ””โ”€โ”˜ โ”‚ Shared Memory (48KB) โ”‚ โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ” โ”Œโ”€โ” โ”Œโ”€โ” โ”Œโ”€โ” โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ Core 2 โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚Cโ”‚ โ”‚Cโ”‚ โ”‚Cโ”‚ โ”‚Cโ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”˜ โ””โ”€โ”˜ โ””โ”€โ”˜ โ””โ”€โ”˜ ... 
(thousands of cores) โ”‚ -โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” Main Memory (16GB) โ”‚ โ”‚ โ”‚ -โ”‚ โ”‚ Core 4 โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ High Bandwidth Memory (HBM) โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ 200+ cycle latency โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ -โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ 1000+ GB/s bandwidth โ”‚ โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++---------------------------------------------+ +---------------------------------------------+ +| Complex Control Unit | | Simple Control Units | +| +---------+ Large Caches | | +-+ +-+ +-+ +-+ Small Caches | +| | Core 1 | +--------------------------+ | | |C| |C| |C| |C| +----------------------+ | +| +---------+ | L3 Cache (8MB) | | | +-+ +-+ +-+ +-+ | Shared Memory (48KB) | | +| +---------+ | | | | +-+ +-+ +-+ +-+ | | | +| | Core 2 | +--------------------------+ | | |C| |C| |C| |C| +----------------------+ | +| +---------+ | | +-+ +-+ +-+ +-+ ... 
(thousands of cores) | +| +---------+ Main Memory (16GB) | | | +| | Core 4 | +--------------------------+ | | High Bandwidth Memory (HBM) | +| +---------+ | 200+ cycle latency | | | +--------------------------------------+ | +| +--------------------------+ | | | 1000+ GB/s bandwidth | | ++---------------------------------------------+ +---------------------------------------------+ CPU: Few cores, complex, optimized for latency GPU: Many cores, simple, optimized for throughput Best for: Sequential algorithms, complex logic Best for: Parallel algorithms, simple operations @@ -91,19 +91,19 @@ Best for: Sequential algorithms, complex logic Best for: Parallel algorithms, ``` Memory Hierarchy (Latency and Size Trade-offs): -Registers: 4 bytes โ”‚ 1 cycle โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ Speed -L1 Cache: 32KB โ”‚ 3-4 cycles โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’ -L2 Cache: 256KB โ”‚ 10-20 cycles โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’โ–’โ–’ -L3 Cache: 8MB โ”‚ 50-100 cyclesโ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’โ–’โ–’โ–’โ–’ -Main RAM: 16GB โ”‚ 200+ cycles โ”‚ โ–ˆโ–ˆโ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’ -SSD Storage: 1TB โ”‚ 100,000+ cyc โ”‚ โ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’ - โ†‘ โ†‘ +Registers: 4 bytes | 1 cycle | โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ Speed +L1 Cache: 32KB | 3-4 cycles | โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’ +L2 Cache: 256KB | 10-20 cycles | โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’โ–’โ–’ +L3 Cache: 8MB | 50-100 cycles| โ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’โ–’โ–’โ–’โ–’ +Main RAM: 16GB | 200+ cycles | โ–ˆโ–ˆโ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’ +SSD Storage: 1TB | 100,000+ cyc | โ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’ + ^ ^ Size Speed ``` **The Cache Miss Problem**: -- Cache hit: Data found in L1 โ†’ 1 cycle -- Cache miss: Must fetch from RAM โ†’ 200+ cycles +- Cache hit: Data found in L1 -> 1 cycle +- Cache miss: Must fetch from RAM -> 200+ cycles - 200x slowdown for every cache miss! 
""" @@ -125,10 +125,10 @@ def matmul_naive(a: np.ndarray, b: np.ndarray) -> np.ndarray: Memory Access Pattern Analysis: ``` Inner loop accesses: - a[i, k] โ†’ Sequential access (cache-friendly) - b[k, j] โ†’ Strided access (cache-hostile!) + a[i, k] -> Sequential access (cache-friendly) + b[k, j] -> Strided access (cache-hostile!) - For 1000ร—1000 matrices: + For 1000*1000 matrices: - a[i,k]: 1000 sequential reads per row (good) - b[k,j]: 1000 random column reads (terrible!) - Total cache misses: ~1 billion! @@ -154,7 +154,7 @@ def matmul_naive(a: np.ndarray, b: np.ndarray) -> np.ndarray: return c -# ๐Ÿ” SYSTEMS INSIGHT: Memory Access Pattern Analysis +# MAGNIFY SYSTEMS INSIGHT: Memory Access Pattern Analysis def analyze_memory_access_patterns(): """ Visualize why naive loops create terrible cache performance. @@ -175,8 +175,8 @@ def analyze_memory_access_patterns(): print("Memory: [b00 b01 b02 b03 | b10 b11 b12 b13 | b20 b21 b22 b23 | b30 b31 b32 b33]") print("\n๐Ÿ”ด PROBLEM: Computing C[0,0] = sum(A[0,k] * B[k,0])") - print("A[0,k] accesses: a00, a01, a02, a03 (sequential โœ“)") - print("B[k,0] accesses: b00, b10, b20, b30 (every 4th element โŒ)") + print("A[0,k] accesses: a00, a01, a02, a03 (sequential OK)") + print("B[k,0] accesses: b00, b10, b20, b30 (every 4th element FAIL)") print("\n๐Ÿ“Š Cache Miss Analysis:") cache_line_size = 64 # bytes @@ -204,7 +204,7 @@ def analyze_memory_access_patterns(): print(f"Performance penalty: 200x slower!") except Exception as e: - print(f"โš ๏ธ Error in memory analysis: {e}") + print(f"WARNING๏ธ Error in memory analysis: {e}") print("Make sure numpy is available") # Run the analysis @@ -212,17 +212,17 @@ analyze_memory_access_patterns() # %% [markdown] """ -### ๐Ÿงช Unit Test: Educational Implementation +### TEST Unit Test: Educational Implementation Let's test our educational loops and measure their performance characteristics. 
""" -# โœ… IMPLEMENTATION CHECKPOINT: Naive matrix multiplication complete +# PASS IMPLEMENTATION CHECKPOINT: Naive matrix multiplication complete -# ๐Ÿค” PREDICTION: How much slower are educational loops vs NumPy? +# THINK PREDICTION: How much slower are educational loops vs NumPy? # Your guess: ___x slower for 100x100 matrices -# ๐Ÿ” SYSTEMS INSIGHT #1: Why Educational Loops Are Slow +# MAGNIFY SYSTEMS INSIGHT #1: Why Educational Loops Are Slow def analyze_educational_loop_performance(): """ Measure and understand why educational loops create performance problems. @@ -278,14 +278,14 @@ def analyze_educational_loop_performance(): print(f"โ€ข Cache misses make large matrices exponentially slower") print(f"โ€ข NumPy: Professional optimizations give 100-1000x speedup") - print(f"\n๐Ÿ’ก Why This Matters for ML Systems:") - print(f"โ€ข Understanding algorithms โ‰  performance optimization") + print(f"\nTIP Why This Matters for ML Systems:") + print(f"โ€ข Understanding algorithms != performance optimization") print(f"โ€ข Educational clarity vs production speed trade-off") print(f"โ€ข Memory access patterns dominate performance") print(f"โ€ข Library choice impacts application feasibility") except Exception as e: - print(f"โš ๏ธ Error in performance analysis: {e}") + print(f"WARNING๏ธ Error in performance analysis: {e}") print("Make sure matrices are small enough for educational timing") # Run the educational performance analysis @@ -299,7 +299,7 @@ def test_naive_baseline(): This test validates correctness and demonstrates the performance gap between educational loops and optimized implementations. 
""" - print("๐Ÿงช Testing Naive Implementation...") + print("TEST Testing Naive Implementation...") # Test correctness with small matrices first a = np.array([[1, 2], [3, 4]], dtype=np.float32) @@ -311,7 +311,7 @@ def test_naive_baseline(): assert np.allclose(result_naive, result_numpy), "Naive matmul incorrect vs NumPy" assert np.allclose(result_naive, expected), "Naive matmul incorrect vs expected" - print("โœ… Naive implementation produces correct results") + print("PASS Naive implementation produces correct results") # Performance comparison (small sizes only - educational is VERY slow) print("\n๐Ÿ“Š Performance comparison:") @@ -338,9 +338,9 @@ def test_naive_baseline(): print(f"\n๐Ÿ“Š Scaling Analysis (100x100 baseline):") print(f"For 500x500 matrix: ~{speedup * 125:.0f}x slower than NumPy") # (500/100)^3 = 125 print(f"For 1000x1000 matrix: ~{speedup * 1000:.0f}x slower than NumPy") # (1000/100)^3 = 1000 - print(f"\n๐Ÿ’ก Why: O(Nยณ) complexity + cache misses = exponential slowdown") + print(f"\nTIP Why: O(Nยณ) complexity + cache misses = exponential slowdown") - print("โœ… Naive baseline established") + print("PASS Naive baseline established") return naive_time, numpy_time, speedup # Execute the test @@ -356,14 +356,14 @@ test_naive_baseline() ``` CPU Cache Hierarchy (Latency vs Capacity Trade-off): -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -Register: 4 bytes โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 1 cycle (instant access) โ”‚ -L1 Cache: 32KB โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’ 3-4 cycles (lightning fast) โ”‚ -L2 Cache: 256KB โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’โ–’โ–’ 10-20 cycles (fast) โ”‚ -L3 Cache: 8MB โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–’โ–’โ–’โ–’โ–’โ–’ 50-100 cycles(slow) โ”‚ -Main RAM: 16GB โ”‚ โ–ˆโ–ˆโ–’โ–’โ–’โ–’โ–’โ–’โ–’โ–’ 200+ cycles (VERY slow) โ”‚ -SSD: 1TB โ”‚ 
▒▒▒▒▒▒▒▒▒▒ 100,000+ cyc (glacial) │ -└──────────────────────────────────────────────────────────────────────────────────┘ ++----------------------------------------------------------------------------------+ +Register: 4 bytes | ██████████ 1 cycle (instant access) | +L1 Cache: 32KB | ████████▒▒ 3-4 cycles (lightning fast) | +L2 Cache: 256KB | ██████▒▒▒▒ 10-20 cycles (fast) | +L3 Cache: 8MB | ████▒▒▒▒▒▒ 50-100 cycles(slow) | +Main RAM: 16GB | ██▒▒▒▒▒▒▒▒ 200+ cycles (VERY slow) | +SSD: 1TB | ▒▒▒▒▒▒▒▒▒▒ 100,000+ cyc (glacial) | ++----------------------------------------------------------------------------------+ Size Speed Characteristics ``` @@ -373,27 +373,27 @@ SSD: 1TB │ ▒▒▒▒▒▒▒▒▒▒ 100,000+ cyc (glacial) ``` Vectorization (SIMD - Single Instruction, Multiple Data): -┌──────────────────────────────────────────────────┐ -│ Scalar: for i in range(4): c[i] = a[i] + b[i] │ -│ ADD a[0], b[0] → c[0] (4 operations) │ -│ ADD a[1], b[1] → c[1] │ -│ ADD a[2], b[2] → c[2] │ -│ ADD a[3], b[3] → c[3] │ -│ │ -│ Vector: c = a + b (NumPy/BLAS) │ -│ VADD [a0,a1,a2,a3], [b0,b1,b2,b3] │ -│ → [c0,c1,c2,c3] (1 operation!)
│ -└──────────────────────────────────────────────────┘ ++--------------------------------------------------+ +| Scalar: for i in range(4): c[i] = a[i] + b[i] | +| ADD a[0], b[0] -> c[0] (4 operations) | +| ADD a[1], b[1] -> c[1] | +| ADD a[2], b[2] -> c[2] | +| ADD a[3], b[3] -> c[3] | +| | +| Vector: c = a + b (NumPy/BLAS) | +| VADD [a0,a1,a2,a3], [b0,b1,b2,b3] | +| -> [c0,c1,c2,c3] (1 operation!) | ++--------------------------------------------------+ Parallelization (Multiple cores working simultaneously): -┌──────────────────────────────────────────────────┐ -│ Core 1: Computes rows 0-249 of result matrix │ -│ Core 2: Computes rows 250-499 of result matrix │ -│ Core 3: Computes rows 500-749 of result matrix │ -│ Core 4: Computes rows 750-999 of result matrix │ -│ │ -│ 4x speedup (ideal) if no synchronization costs │ -└──────────────────────────────────────────────────┘ ++--------------------------------------------------+ +| Core 1: Computes rows 0-249 of result matrix | +| Core 2: Computes rows 250-499 of result matrix | +| Core 3: Computes rows 500-749 of result matrix | +| Core 4: Computes rows 750-999 of result matrix | +| | +| 4x speedup (ideal) if no synchronization costs | ++--------------------------------------------------+ ``` ### Memory Access Pattern Analysis @@ -409,7 +409,7 @@ for i in range(m): # Loop over output rows **The Problem**: `b[k,j]` creates terrible access patterns: - Each `j` increment jumps to a new column (cache miss) - Each `k` increment jumps to a new row (another cache miss) -- For 1000×1000 matrix: 1 billion cache misses! +- For 1000*1000 matrix: 1 billion cache misses!
**Visualization of Memory Access**: ``` @@ -417,7 +417,7 @@ Matrix B in memory (row-major): [b00 b01 b02 b03 | b10 b11 b12 b13 | b20 b21 b22 b23 | ...] Accessing column 0: b00, b10, b20, b30, ... - │ │ │ │ + | | | | 4 4 4 4 elements apart = strided access 🔴 🔴 🔴 🔴 cache misses! ``` @@ -434,7 +434,7 @@ def matmul_blocked(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.nda dramatically reducing cache misses and improving performance. **Memory Analysis (Quantitative)**: - - 64×64 float32 block = 4096 * 4 bytes = 16KB per block + - 64*64 float32 block = 4096 * 4 bytes = 16KB per block - 3 blocks (A_block, B_block, C_block) = 48KB total - Fits comfortably in 256KB L2 cache with room for other data - Reuses each data element 64 times before evicting from cache @@ -446,21 +446,21 @@ def matmul_blocked(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.nda **Blocking Visualization**: ``` Large Matrix Multiplication: - A (1000x1000) × B (1000x1000) = C (1000x1000) + A (1000x1000) * B (1000x1000) = C (1000x1000) Blocked Approach: - ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ - │ 64x64│ │ │ 64x64│ │ │ 64x64│ │ - │ block │ A │ × │ block │ B │ = │ block │ C │ - │ │ │ │ │ │ │ │ │ - └────────────────┘ └────────────────┘ └────────────────┘ + +----------------+ +----------------+ +----------------+ + | 64x64| | | 64x64| | | 64x64| | + | block | A | * | block | B | = | block | C | + | | | | | | | | | + +----------------+ +----------------+ +----------------+ Each 64x64 block fits in L1/L2 cache!
``` Args: - a: Left matrix (m × k) - b: Right matrix (k × n) + a: Left matrix (m * k) + b: Right matrix (k * n) block_size: Cache-friendly block size (64 = 16KB fits in L2 cache) """ m, k = a.shape @@ -483,8 +483,8 @@ def matmul_blocked(a: np.ndarray, b: np.ndarray, block_size: int = 64) -> np.nda # Extract blocks that fit in cache # These slices create views, not copies (memory efficient) - a_block = a[i:i_end, k_idx:k_end] # Shape: (≤64, ≤64) - b_block = b[k_idx:k_end, j:j_end] # Shape: (≤64, ≤64) + a_block = a[i:i_end, k_idx:k_end] # Shape: (<=64, <=64) + b_block = b[k_idx:k_end, j:j_end] # Shape: (<=64, <=64) # Multiply blocks using optimized NumPy BLAS # This operates on cache-resident data @@ -527,7 +527,7 @@ def calculate_cache_footprint(block_size: int) -> dict: ) } -# 🔍 SYSTEMS INSIGHT: Cache Optimization Analysis +# MAGNIFY SYSTEMS INSIGHT: Cache Optimization Analysis def analyze_cache_optimization(): """ Analyze how different block sizes affect cache performance. @@ -569,31 +569,31 @@ def analyze_cache_optimization(): print(f"Reuse factor: Each element used 64 times") print(f"Cache efficiency: 64x better than naive") - print("\n💡 Key Insights:") + print("\nTIP Key Insights:") print("• Blocks too small: High loop overhead") print("• Blocks too large: Cache misses") print("• Sweet spot: 64x64 fits in L2 cache") print("• Modern CPUs: Designed for this pattern!") except Exception as e: - print(f"⚠️ Error in cache analysis: {e}") + print(f"WARNING️ Error in cache analysis: {e}") # Run the cache analysis analyze_cache_optimization() # %% [markdown] """ -### 🧪 Unit Test: Blocked Implementation +### TEST Unit Test: Blocked Implementation Let's see how much faster cache-friendly blocking is compared to educational loops. """ -# ✅ IMPLEMENTATION CHECKPOINT: Cache-friendly blocking complete +# PASS IMPLEMENTATION CHECKPOINT: Cache-friendly blocking complete -# 🤔 PREDICTION: How much speedup does cache blocking provide?
+# THINK PREDICTION: How much speedup does cache blocking provide? # Your guess: ___x faster than educational loops -# 🔍 SYSTEMS INSIGHT #2: Cache Blocking Effectiveness +# MAGNIFY SYSTEMS INSIGHT #2: Cache Blocking Effectiveness def analyze_cache_blocking_effectiveness(): """ Measure how cache-friendly blocking improves performance. @@ -662,19 +662,19 @@ def analyze_cache_blocking_effectiveness(): speedup_blocked = naive_scaled / best_time speedup_numpy = naive_scaled / numpy_time - print(f"\n🚀 Speedup Results:") + print(f"\nROCKET Speedup Results:") print(f"Blocking: {speedup_blocked:.0f}x faster than naive") print(f"NumPy: {speedup_numpy:.0f}x faster than naive") print(f"Block size {best_block}: Optimal for this matrix size") - print(f"\n💡 Key Cache Insights:") + print(f"\nTIP Key Cache Insights:") print(f"• 64x64 blocks typically optimal (fits L2 cache)") print(f"• Too small: High loop overhead") print(f"• Too large: Cache misses return") print(f"• Cache hierarchy shapes algorithm design") except Exception as e: - print(f"⚠️ Error in blocking analysis: {e}") + print(f"WARNING️ Error in blocking analysis: {e}") print("Make sure all blocking functions are implemented correctly") # Run the cache blocking analysis @@ -692,7 +692,7 @@ def test_blocked_optimization(): result_numpy = a @ b assert np.allclose(result_blocked, result_numpy, atol=1e-3), "Blocked matmul incorrect" - print("✅ Blocked implementation produces correct results") + print("PASS Blocked implementation produces correct results") # Performance comparison print("\nPerformance comparison:") @@ -727,15 +727,15 @@ def test_blocked_optimization(): speedup_blocked = naive_time_scaled / blocked_time speedup_numpy = naive_time_scaled / numpy_time - print(f"\n🚀 SPEEDUP RESULTS:") + print(f"\nROCKET SPEEDUP RESULTS:") print(f"Blocked is {speedup_blocked:.1f}x faster than naive loops!") print(f"NumPy is {speedup_numpy:.1f}x faster than naive loops!") - print(f"\n💡 Why blocking
works: Better cache utilization!") + print(f"\nTIP Why blocking works: Better cache utilization!") print(f" • Naive: 1 cache miss per operation") print(f" • Blocked: 1 cache miss per 64 operations") print(f" • NumPy: Professional optimizations + vectorization") - print("✅ Blocked optimization tested successfully") + print("PASS Blocked optimization tested successfully") return blocked_time, numpy_time # Execute the blocked test @@ -760,17 +760,17 @@ def matmul_numpy(a: np.ndarray, b: np.ndarray) -> np.ndarray: # %% [markdown] """ -### 🧪 Unit Test: Production Implementation +### TEST Unit Test: Production Implementation Let's verify that NumPy is indeed the best choice for production. """ -# ✅ IMPLEMENTATION CHECKPOINT: Production backend system complete +# PASS IMPLEMENTATION CHECKPOINT: Production backend system complete -# 🤔 PREDICTION: What makes NumPy faster than our blocking algorithm? +# THINK PREDICTION: What makes NumPy faster than our blocking algorithm? # Your answer: ___ (vectorization, BLAS, assembly, etc.) -# 🔍 SYSTEMS INSIGHT #3: Production Optimization Analysis +# MAGNIFY SYSTEMS INSIGHT #3: Production Optimization Analysis def analyze_production_optimization_stack(): """ Analyze the complete optimization stack that makes NumPy so fast. @@ -786,7 +786,7 @@ def analyze_production_optimization_stack(): sizes = [100, 300, 500, 1000] print("\nOptimization Stack Performance:") - print("Size | Naive Est | Blocked | NumPy | Block→NumPy | Total Speedup") + print("Size | Naive Est | Blocked | NumPy | Block->NumPy | Total Speedup") print("-" * 70) for size in sizes: @@ -833,19 +833,19 @@ def analyze_production_optimization_stack(): print(f"🔧 6.
Threading: Automatic parallelization for large matrices") print(f"\n📊 Development Cost vs Performance Benefit:") - print(f"• Custom blocking: 1 week implementation → 10-50x speedup") - print(f"• BLAS integration: 1 month implementation → additional 5-10x") - print(f"• Assembly optimization: 6+ months → additional 2-5x") - print(f"• NumPy: 0 development time → all optimizations included") + print(f"• Custom blocking: 1 week implementation -> 10-50x speedup") + print(f"• BLAS integration: 1 month implementation -> additional 5-10x") + print(f"• Assembly optimization: 6+ months -> additional 2-5x") + print(f"• NumPy: 0 development time -> all optimizations included") - print(f"\n💡 ML Systems Engineering Insight:") + print(f"\nTIP ML Systems Engineering Insight:") print(f"• Focus on system architecture, not micro-optimizations") print(f"• Leverage existing optimized libraries (NumPy, PyTorch, TensorFlow)") print(f"• Understanding principles enables better system design") print(f"• Build on foundations, don't reinvent optimized wheels") except Exception as e: - print(f"⚠️ Error in production analysis: {e}") + print(f"WARNING️ Error in production analysis: {e}") print("Make sure all performance functions are implemented correctly") # Run the production optimization analysis @@ -881,13 +881,13 @@ def test_production_performance(): print(f"NumPy: {numpy_time*1000:6.1f} ms") print(f"NumPy is {speedup:.1f}x faster than blocked") - print("\n💡 Key Insight: NumPy already has these optimizations built-in!") + print("\nTIP Key Insight: NumPy already has these optimizations built-in!") print(" • Blocking algorithms") print(" • Vectorization") print(" • Hardware-specific BLAS libraries") print(" • Assembly-level optimizations") - print("\n✅ Production performance verified") + print("\nPASS Production performance verified") return True # Execute the production test @@ -947,7 +947,7 @@ def matmul(a: np.ndarray, b: np.ndarray)
-> np.ndarray: # %% [markdown] """ -### 🧪 Unit Test: Backend System +### TEST Unit Test: Backend System Let's verify our backend system works correctly and uses optimal implementations. """ @@ -966,7 +966,7 @@ def test_backend_system(): expected = a @ b assert np.allclose(result, expected), "Backend matmul incorrect" - print("✅ Backend produces correct results") + print("PASS Backend produces correct results") # Compare performance start = time.perf_counter() @@ -982,7 +982,7 @@ def test_backend_system(): print(f"NumPy: {numpy_time*1000:.1f} ms") print(f"Backend uses optimal NumPy implementation") - print("\n✅ Backend system works correctly") + print("\nPASS Backend system works correctly") return True # Execute the backend test @@ -990,7 +990,7 @@ test_backend_system() # %% [markdown] """ -## 🎯 Computational Assessment Questions +## TARGET Computational Assessment Questions Practice your understanding of hardware acceleration concepts with these NBGrader-compatible questions. @@ -1002,7 +1002,7 @@ def calculate_cache_efficiency(matrix_size: int, block_size: int) -> Tuple[int, """ Calculate the cache efficiency improvement of blocked vs naive matrix multiplication. - For a matrix_size × matrix_size multiplication using block_size × block_size blocks: + For a matrix_size * matrix_size multiplication using block_size * block_size blocks: 1. Calculate total number of cache misses for naive implementation 2. Calculate total number of cache misses for blocked implementation 3.
Return (total_operations, efficiency_ratio) @@ -1013,7 +1013,7 @@ def calculate_cache_efficiency(matrix_size: int, block_size: int) -> Tuple[int, - Blocked: 1 cache miss per block load, then block stays in cache Args: - matrix_size: Size of square matrices (N×N) + matrix_size: Size of square matrices (N*N) block_size: Size of blocks for blocked algorithm Returns: @@ -1023,8 +1023,8 @@ def calculate_cache_efficiency(matrix_size: int, block_size: int) -> Tuple[int, HINTS: - Total operations = matrix_size³ - - Naive cache misses ≈ matrix_size³ (every B access misses) - - Blocked cache misses = (matrix_size/block_size)³ × block_size² + - Naive cache misses ~= matrix_size³ (every B access misses) + - Blocked cache misses = (matrix_size/block_size)³ * block_size² - Efficiency ratio = naive_misses / blocked_misses """ ### BEGIN SOLUTION @@ -1109,7 +1109,7 @@ def optimize_block_size(matrix_size: int, cache_sizes: Dict[str, int]) -> Tuple[ APPROACH: 1. For each candidate block size, calculate memory footprint - 2. Check which cache level it fits in (3 blocks × block_size² × 4 bytes) + 2. Check which cache level it fits in (3 blocks * block_size² * 4 bytes) 3. Select largest block size that fits in L2 cache 4. Calculate memory utilization = footprint / cache_size @@ -1219,7 +1219,7 @@ def test_ml_model_acceleration(): print("Testing Acceleration on Real ML Models...") # Test 1: MLP Forward Pass (common in Module 4) - print("\n1. MLP Forward Pass (256 → 128 → 64):") + print("\n1.
MLP Forward Pass (256 -> 128 -> 64):") batch_size, input_dim, hidden_dim, output_dim = 32, 256, 128, 64 # Simulated MLP layers @@ -1239,7 +1239,7 @@ def test_ml_model_acceleration(): h2_opt = matmul(h1_opt, W2) opt_time = time.perf_counter() - start - # Scale for: batch_size (32/8) × input_dim (256/64) × hidden_dim (128/32) + # Scale for: batch_size (32/8) * input_dim (256/64) * hidden_dim (128/32) batch_scale = 32/8 # 4x more samples input_scale = 256/64 # 4x larger input hidden_scale = 128/32 # 4x larger hidden layer @@ -1260,7 +1260,7 @@ def test_ml_model_acceleration(): conv_output = matmul(img_patches, conv_filters) conv_time = time.perf_counter() - start print(f" Convolution output: {conv_time*1000:.1f} ms") - print(f" Shape: {conv_output.shape} (1024 locations × 64 filters)") + print(f" Shape: {conv_output.shape} (1024 locations * 64 filters)") # Test 3: Transformer-like Attention (scaled down) print("\n3. Transformer Attention (Q·K^T):") @@ -1272,16 +1272,16 @@ def test_ml_model_acceleration(): attention_scores = matmul(Q, K.T) # Shape: (seq_len, seq_len) attn_time = time.perf_counter() - start print(f" Attention computation: {attn_time*1000:.1f} ms") - print(f" Shape: {attention_scores.shape} (128×128 attention matrix)") + print(f" Shape: {attention_scores.shape} (128*128 attention matrix)") - print(f"\n✅ All ML model operations accelerated successfully!") - print(f"💡 Key insight: Matrix multiplication is EVERYWHERE in ML!") + print(f"\nPASS All ML model operations accelerated successfully!") + print(f"TIP Key insight: Matrix multiplication is EVERYWHERE in ML!") return True # Execute the ML model test test_ml_model_acceleration() -# 🔍 SYSTEMS INSIGHT: Acceleration Scaling Analysis +# MAGNIFY SYSTEMS INSIGHT: Acceleration Scaling Analysis def analyze_acceleration_scaling(): """ Analyze how different acceleration techniques scale with problem size.
@@ -1312,10 +1312,10 @@ def analyze_acceleration_scaling(): small_speedup = compare_acceleration_techniques(100)['cache_blocking'] large_speedup = compare_acceleration_techniques(2000)['cache_blocking'] - print(f"• Cache blocking: {small_speedup:.1f}x → {large_speedup:.1f}x (scales with cache misses)") + print(f"• Cache blocking: {small_speedup:.1f}x -> {large_speedup:.1f}x (scales with cache misses)") print(f"• Vectorization: 8.0x constant (independent of matrix size)") print(f"• Parallelization: 4.0x constant (perfect scaling assumed)") - print(f"• Combined: Multiplicative effect = cache × vector × parallel") + print(f"• Combined: Multiplicative effect = cache * vector * parallel") print(f"\n📊 Real-World Performance Expectations:") realistic_combined = large_speedup * 4.0 * 4.0 # Conservative vectorization @@ -1323,14 +1323,14 @@ def analyze_acceleration_scaling(): print(f"• Why not perfect: Memory bandwidth limits, overhead, synchronization") print(f"• Production systems: Focus on cache + vectorization first") - print(f"\n💡 ML Systems Implications:") - print(f"• Small models (≤500): Vectorization dominates") - print(f"• Large models (≥1000): Cache optimization critical") + print(f"\nTIP ML Systems Implications:") + print(f"• Small models (<=500): Vectorization dominates") + print(f"• Large models (>=1000): Cache optimization critical") print(f"• Production: Memory bandwidth becomes bottleneck") print(f"• GPU: Different scaling - thousands of cores, different cache hierarchy") except Exception as e: - print(f"⚠️ Error in scaling analysis: {e}") + print(f"WARNING️ Error in scaling analysis: {e}") print("Make sure all analysis functions are implemented correctly") # Run the scaling analysis @@ -1338,7 +1338,7 @@ analyze_acceleration_scaling() def run_complete_acceleration_demo(): """Run the complete acceleration demonstration""" - print("🚀 Complete Hardware Acceleration Demo") + print("ROCKET Complete Hardware
Acceleration Demo") print("=" * 55) print("THE FREE SPEEDUP: From Naive Loops to Optimized Backends") @@ -1363,23 +1363,23 @@ def run_complete_acceleration_demo(): test_backend_system() print("\n" + "=" * 55) - print("🎯 HARDWARE ACCELERATION MASTERED") + print("TARGET HARDWARE ACCELERATION MASTERED") print("=" * 55) print("\n📚 What You Mastered:") - print("✅ Why your Module 2/4 loops were slow (cache hierarchy matters!)") - print("✅ How cache-friendly blocking works (process data in chunks)") - print("✅ Why NumPy dominates (professional optimizations built-in)") - print("✅ How to build smart backend systems (automatic optimization)") - print("✅ Real ML applications (MLPs, CNNs, Transformers all use matmul!)") + print("PASS Why your Module 2/4 loops were slow (cache hierarchy matters!)") + print("PASS How cache-friendly blocking works (process data in chunks)") + print("PASS Why NumPy dominates (professional optimizations built-in)") + print("PASS How to build smart backend systems (automatic optimization)") + print("PASS Real ML applications (MLPs, CNNs, Transformers all use matmul!)") - print("\n🎯 The Free Speedup Philosophy:") - print("• 🚀 Same math, better implementation = 100x speedup") + print("\nTARGET The Free Speedup Philosophy:") + print("• ROCKET Same math, better implementation = 100x speedup") print("• 🧠 Educational loops teach algorithms") - print("• ⚡ Blocked algorithms teach cache optimization") + print("• SPEED Blocked algorithms teach cache optimization") print("• 🏭 NumPy provides production performance") - print("• 🎯 Smart backends make optimization transparent") - print("• 💡 Understanding the spectrum makes you a better engineer!") + print("• TARGET Smart backends make optimization transparent") + print("• TIP Understanding the spectrum makes you a better engineer!") return naive_results @@ -1395,7 +1395,7 @@ This module demonstrates the fundamental principles of hardware acceleration in -
**Memory Layout**: Contiguous access patterns for optimal performance - **Backend Abstraction**: Transparent dispatch between naive and optimized implementations -### ⚡ **Optimization Techniques** +### SPEED **Optimization Techniques** - **Blocked Algorithms**: Process data in cache-friendly blocks - **Vectorized Operations**: Avoid Python loops, use NumPy's optimized routines - **In-place Operations**: Minimize memory allocation overhead @@ -1403,16 +1403,16 @@ This module demonstrates the fundamental principles of hardware acceleration in ### 📊 **Performance Understanding** - **Measurement First**: Profile real bottlenecks before optimizing -- **Algorithmic Impact**: O(N³) → O(N²) matters more than 2x constant factors +- **Algorithmic Impact**: O(N³) -> O(N²) matters more than 2x constant factors - **Hardware Awareness**: CPU cache misses cost 100x more than cache hits - **Library Utilization**: Optimized BLAS libraries beat custom implementations -### 🎯 **Real-World Applications** +### TARGET **Real-World Applications** - **ML Frameworks**: How PyTorch/TensorFlow apply these same principles - **Production Systems**: Where optimization efforts provide real value - **Development Practice**: When to optimize vs when to use existing solutions -### 💡 **Key Insights** +### TIP **Key Insights** - Cache-friendly algorithms provide 2-5x speedups from memory access patterns alone - Vectorization eliminates Python overhead for 10-100x improvements - Most NumPy operations are already optimized - focus on system-level improvements @@ -1424,7 +1424,7 @@ This approach teaches students to think like systems engineers: understand the h def test_unit_all(): """Run all unit tests for the acceleration module.""" - print("🧪 Running all Hardware Acceleration tests...") + print("TEST Running all Hardware Acceleration tests...") print("=" * 55) try: @@ -1449,34 +1449,34 @@ def test_unit_all(): test_ml_model_acceleration() print("\n" + "=" * 55) - print("✅ All
Hardware Acceleration tests passed!") - print("🚀 Module ready for production ML systems.") + print("PASS All Hardware Acceleration tests passed!") + print("ROCKET Module ready for production ML systems.") except Exception as e: - print(f"❌ Test failed: {e}") + print(f"FAIL Test failed: {e}") raise if __name__ == "__main__": print("Module 16: Hardware Acceleration - The Free Speedup!") print("=" * 60) - print("🚀 THE EASIEST OPTIMIZATION: Better Backends, Zero Trade-offs") + print("ROCKET THE EASIEST OPTIMIZATION: Better Backends, Zero Trade-offs") # Run complete testing suite test_unit_all() - print(f"\n🎉 Module 16: Hardware Acceleration COMPLETE!") - print(f"⚡ Mastered: 10-100x speedups with no accuracy loss") + print(f"\nCELEBRATE Module 16: Hardware Acceleration COMPLETE!") + print(f"SPEED Mastered: 10-100x speedups with no accuracy loss") print(f"🧠 Learned: Cache hierarchy, blocking, vectorization") print(f"🏭 Applied: MLPs, CNNs, Transformers all benefit") - print(f"🎯 Ready: To build high-performance ML systems!") + print(f"TARGET Ready: To build high-performance ML systems!") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions -1. **Memory Access Pattern Analysis**: In your `matmul_naive()` implementation, the innermost loop accesses `a[i, k]` sequentially but `b[k, j]` with large strides. When you tested 200×200 matrices, you saw dramatic slowdowns. Analyze why: (a) Calculate cache misses for both access patterns, (b) Explain why `b[k, j]` creates O(N²) cache misses, (c) Show how this scales to 1000×1000 matrices, and (d) Design a memory layout that would eliminate strided access. +1. **Memory Access Pattern Analysis**: In your `matmul_naive()` implementation, the innermost loop accesses `a[i, k]` sequentially but `b[k, j]` with large strides. When you tested 200*200 matrices, you saw dramatic slowdowns.
Analyze why: (a) Calculate cache misses for both access patterns, (b) Explain why `b[k, j]` creates O(N²) cache misses, (c) Show how this scales to 1000*1000 matrices, and (d) Design a memory layout that would eliminate strided access. -2. **Cache Blocking Optimization**: Your `matmul_blocked()` function uses 64×64 blocks and showed significant speedups over naive loops. Analyze the cache efficiency: (a) Calculate total memory footprint (3 blocks × 64² × 4 bytes), (b) Verify it fits in L2 cache (256KB), (c) Compute cache reuse factor (64 operations per cache line), (d) Predict performance change with 128×128 blocks, and (e) Explain why your cache analysis function showed 64×64 as optimal. +2. **Cache Blocking Optimization**: Your `matmul_blocked()` function uses 64*64 blocks and showed significant speedups over naive loops. Analyze the cache efficiency: (a) Calculate total memory footprint (3 blocks * 64² * 4 bytes), (b) Verify it fits in L2 cache (256KB), (c) Compute cache reuse factor (64 operations per cache line), (d) Predict performance change with 128*128 blocks, and (e) Explain why your cache analysis function showed 64*64 as optimal. 3. **Production Stack Engineering**: You measured that NumPy beats your blocked implementation by 5-10x. Analyze the engineering trade-offs: (a) List three specific optimizations NumPy includes (BLAS, vectorization, threading), (b) Calculate development time vs. performance gain for each, (c) Estimate why custom optimization rarely beats production libraries, and (d) Determine when custom optimization is justified in ML systems. @@ -1485,7 +1485,7 @@ if __name__ == "__main__": # %% [markdown] """ -## 🎯 MODULE SUMMARY: Hardware Acceleration - The Free Speedup +## TARGET MODULE SUMMARY: Hardware Acceleration - The Free Speedup This module demonstrates the easiest optimization in ML systems: using better backends for free speedups with zero accuracy trade-offs.
You learned why understanding the optimization spectrum makes you a better engineer. @@ -1497,7 +1497,7 @@ This module demonstrates the easiest optimization in ML systems: using better ba ### 🛠️ **What We Built and Tested** - **Educational Baseline**: Your triple-nested loops from Module 2/4 (algorithm understanding) -- **Cache-Friendly Blocking**: 64×64 blocks fitting in L1/L2 cache (10x+ speedup) +- **Cache-Friendly Blocking**: 64*64 blocks fitting in L1/L2 cache (10x+ speedup) - **NumPy Production**: Leveraging professional BLAS optimizations (another 10x speedup) - **Smart Backend System**: Automatic dispatch to optimal implementations - **Real ML Applications**: MLP, CNN, Transformer operations using matrix multiplication @@ -1508,7 +1508,7 @@ This module demonstrates the easiest optimization in ML systems: using better ba - **When to use NumPy**: It already has these optimizations (and more) built-in - **Systems thinking**: Understanding enables better decisions about when to optimize -### ⚡ **Performance Spectrum Mastered** +### SPEED **Performance Spectrum Mastered** - **Educational loops**: Algorithm understanding (1000x slower, perfect for learning) - **Cache-friendly blocking**: Systems understanding (100x slower, teaches optimization) - **NumPy production**: Professional performance (optimal speed, built-in optimizations) @@ -1526,7 +1526,7 @@ This module demonstrates the easiest optimization in ML systems: using better ba - **Libraries beat custom optimization**: NumPy already has expert-level optimizations - **Understanding enables better tools**: You can build smarter systems when you know the principles -### 💡 **The Free Speedup Philosophy** +### TIP **The Free Speedup Philosophy** This is the EASIEST optimization in ML systems: same math, better implementation, massive speedups, zero downsides. You implemented loops to understand algorithms. You implemented blocking to understand cache optimization.
Now you use NumPy because it has all optimizations built-in. Understanding this spectrum - from educational to production - makes you a superior ML systems engineer who can make informed optimization decisions. """ diff --git a/modules/16_quantization/quantization_dev.py b/modules/16_quantization/quantization_dev.py index 7f5ede84..31f11f82 100644 --- a/modules/16_quantization/quantization_dev.py +++ b/modules/16_quantization/quantization_dev.py @@ -12,9 +12,9 @@ """ # Module 17: Quantization - Trading Precision for Speed -Welcome to the Quantization module! After Module 16 showed you how to get free speedups through better algorithms, now we make our **first trade-off**: reduce precision for speed. You'll implement INT8 quantization to achieve 4× speedup with <1% accuracy loss. +Welcome to the Quantization module! After Module 16 showed you how to get free speedups through better algorithms, now we make our **first trade-off**: reduce precision for speed. You'll implement INT8 quantization to achieve 4* speedup with <1% accuracy loss. -## Connection from Module 16: Acceleration → Quantization +## Connection from Module 16: Acceleration -> Quantization Module 16 taught you to accelerate computations through better algorithms and hardware utilization - these were "free" optimizations. Now we enter the world of **trade-offs**: sacrificing precision to gain speed. This is especially powerful for CNN inference where INT8 operations are much faster than FP32.
@@ -24,13 +24,13 @@ Module 16 taught you to accelerate computations through better algorithms and ha - **Core implementation skill**: Build INT8 quantization systems for CNN weights and activations - **Pattern recognition**: Understand calibration-based quantization for post-training optimization - **Framework connection**: See how production systems use quantization for edge deployment and mobile inference -- **Performance insight**: Achieve 4× speedup with <1% accuracy loss through precision optimization +- **Performance insight**: Achieve 4* speedup with <1% accuracy loss through precision optimization -## Build → Profile → Optimize +## Build -> Profile -> Optimize 1. **Build**: Start with FP32 CNN inference (baseline) 2. **Profile**: Measure memory usage and computational cost of FP32 operations -3. **Optimize**: Implement INT8 quantization to achieve 4× speedup with minimal accuracy loss +3. **Optimize**: Implement INT8 quantization to achieve 4* speedup with minimal accuracy loss ## What You'll Achieve @@ -38,14 +38,14 @@ By the end of this module, you'll understand: - **Deep technical understanding**: How INT8 quantization reduces precision while maintaining model quality - **Practical capability**: Implement production-grade quantization for CNN inference acceleration - **Systems insight**: Memory vs precision tradeoffs in ML systems optimization -- **Performance mastery**: Achieve 4× speedup (50ms → 12ms inference) with <1% accuracy loss +- **Performance mastery**: Achieve 4* speedup (50ms -> 12ms inference) with <1% accuracy loss - **Connection to edge deployment**: How mobile and edge devices use quantization for efficient AI ## Systems Reality Check -💡 **Production Context**: TensorFlow Lite and PyTorch Mobile use INT8 quantization for mobile deployment -⚡ **Performance Note**: CNN inference: FP32 = 50ms, INT8 = 12ms (4× faster) with 98% → 97.5% accuracy -🧠 **Memory Tradeoff**: INT8 uses 4× less memory and enables much faster
integer arithmetic +TIP **Production Context**: TensorFlow Lite and PyTorch Mobile use INT8 quantization for mobile deployment +SPEED **Performance Note**: CNN inference: FP32 = 50ms, INT8 = 12ms (4* faster) with 98% -> 97.5% accuracy +🧠 **Memory Tradeoff**: INT8 uses 4* less memory and enables much faster integer arithmetic """ # %% nbgrader={"grade": false, "grade_id": "quantization-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -92,7 +92,7 @@ Let's start by understanding what quantization means and why it provides such dr ### The Quantization Concept Quantization converts high-precision floating-point numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits): -- **Memory**: 4× reduction (32 bits → 8 bits) +- **Memory**: 4* reduction (32 bits -> 8 bits) - **Compute**: Integer arithmetic is much faster than floating-point - **Hardware**: Specialized INT8 units on modern CPUs and mobile processors - **Trade-off**: Small precision loss for large speed gain @@ -144,7 +144,7 @@ class BaselineCNN: self.fc_input_size = 64 * 6 * 6 # 64 channels, 6x6 spatial self.fc = np.random.randn(self.fc_input_size, num_classes) * 0.02 - print(f"✅ BaselineCNN initialized: {self._count_parameters()} parameters") + print(f"PASS BaselineCNN initialized: {self._count_parameters()} parameters") ### END SOLUTION def _count_parameters(self) -> int: @@ -253,7 +253,7 @@ Let's test our baseline CNN to establish performance and accuracy baselines: # %% nbgrader={"grade": true, "grade_id": "test-baseline-cnn", "locked": false, "points": 2, "schema_version": 3, "solution": false, "task": false} def test_baseline_cnn(): """Test baseline CNN implementation and measure performance.""" - print("🔍 Testing Baseline FP32 CNN...") + print("MAGNIFY Testing Baseline FP32 CNN...") print("=" * 60) # Create baseline model @@ -272,13 +272,13 @@ def test_baseline_cnn(): # Validate output assert logits.shape == (batch_size, 10), f"Expected (4, 10), got
{logits.shape}" - print(f"✅ Forward pass works: {logits.shape}") + print(f"PASS Forward pass works: {logits.shape}") # Test predictions predictions = model.predict(input_data) assert predictions.shape == (batch_size,), f"Expected (4,), got {predictions.shape}" assert all(0 <= p < 10 for p in predictions), "All predictions should be valid class indices" - print(f"✅ Predictions work: {predictions}") + print(f"PASS Predictions work: {predictions}") # Performance baseline print(f"\n📊 Performance Baseline:") @@ -287,8 +287,8 @@ def test_baseline_cnn(): print(f" Parameters: {model._count_parameters()} (all FP32)") print(f" Memory usage: ~{model._count_parameters() * 4 / 1024:.1f}KB for weights") - print("✅ Baseline CNN tests passed!") - print("💡 Ready to implement INT8 quantization for 4× speedup...") + print("PASS Baseline CNN tests passed!") + print("TIP Ready to implement INT8 quantization for 4* speedup...") # Test function defined (called in main block) @@ -478,7 +478,7 @@ class INT8Quantizer: print(f" Scale: {scale:.6f}, Zero point: {zero_point}") print(f" Quantization error: {quantization_error:.6f} (max: {max_error:.6f})") - print(f" Compression: {compression_ratio:.1f}× ({original_size//1024}KB → {quantized_size//1024}KB)") + print(f" Compression: {compression_ratio:.1f}* ({original_size//1024}KB -> {quantized_size//1024}KB)") return { 'quantized_weights': quantized_weights, @@ -500,7 +500,7 @@ Let's test our quantizer to verify it works correctly: # %% nbgrader={"grade": true, "grade_id": "test-quantizer", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false} def test_int8_quantizer(): """Test INT8 quantizer implementation.""" - print("🔍 Testing INT8 Quantizer...") + print("MAGNIFY Testing INT8 Quantizer...") print("=" * 60) quantizer = INT8Quantizer() @@ -519,14 +519,14 @@ def test_int8_quantizer(): # Verify quantized tensor is INT8 assert quantized.dtype == np.int8, f"Expected int8, got {quantized.dtype}"
assert np.all(quantized >= -128) and np.all(quantized <= 127), "Quantized values outside INT8 range" - print("✅ Quantization produces valid INT8 values") + print("PASS Quantization produces valid INT8 values") # Verify round-trip error is reasonable quantization_error = np.mean(np.abs(test_tensor - dequantized)) max_error = np.max(np.abs(test_tensor - dequantized)) assert quantization_error < 0.1, f"Quantization error too high: {quantization_error}" - print(f"✅ Round-trip error acceptable: {quantization_error:.6f} (max: {max_error:.6f})") + print(f"PASS Round-trip error acceptable: {quantization_error:.6f} (max: {max_error:.6f})") # Test weight quantization weight_tensor = np.random.randn(64, 32, 3, 3) * 0.1 # Typical conv weight range @@ -538,20 +538,20 @@ def test_int8_quantizer(): assert 'quantization_error' in weight_result, "Should return error metrics" assert weight_result['compression_ratio'] > 3.5, "Should achieve good compression" - print(f"✅ Weight quantization: {weight_result['compression_ratio']:.1f}× compression") - print(f"✅ Weight quantization error: {weight_result['quantization_error']:.6f}") + print(f"PASS Weight quantization: {weight_result['compression_ratio']:.1f}* compression") + print(f"PASS Weight quantization error: {weight_result['quantization_error']:.6f}") - print("✅ INT8 quantizer tests passed!") - print("💡 Ready to build quantized CNN...") + print("PASS INT8 quantizer tests passed!") + print("TIP Ready to build quantized CNN...") # Test function defined (called in main block) -# ✅ IMPLEMENTATION CHECKPOINT: Ensure quantized CNN is fully built before running +# PASS IMPLEMENTATION CHECKPOINT: Ensure quantized CNN is fully built before running -# 🤔 PREDICTION: How much memory will quantization save for convolutional layers? -# Write your guess here: _______× reduction +# THINK PREDICTION: How much memory will quantization save for convolutional layers?
+# Write your guess here: _______* reduction -# 🔍 SYSTEMS INSIGHT #1: Quantization Memory Analysis +# MAGNIFY SYSTEMS INSIGHT #1: Quantization Memory Analysis def analyze_quantization_memory(): """Analyze memory savings from quantization.""" try: @@ -579,15 +579,15 @@ def analyze_quantization_memory(): print(f"📊 Quantization Memory Analysis:") print(f" Baseline conv weights: {baseline_conv_memory/1024:.1f}KB") print(f" Quantized conv weights: {quantized_conv_memory/1024:.1f}KB") - print(f" Compression ratio: {compression_ratio:.1f}×") + print(f" Compression ratio: {compression_ratio:.1f}*") print(f" Memory saved: {(baseline_conv_memory - quantized_conv_memory)/1024:.1f}KB") # Explain the scaling - print(f"\n💡 WHY THIS MATTERS:") + print(f"\nTIP WHY THIS MATTERS:") print(f" • FP32 uses 4 bytes per parameter") print(f" • INT8 uses 1 byte per parameter") - print(f" • Theoretical maximum: 4× compression") - print(f" • Actual compression: {compression_ratio:.1f}× (close to theoretical!)") + print(f" • Theoretical maximum: 4* compression") + print(f" • Actual compression: {compression_ratio:.1f}* (close to theoretical!)") print(f" • For large models: This enables mobile deployment") # Scale to production size @@ -601,7 +601,7 @@ def analyze_quantization_memory(): print(f" Mobile app size reduction: {fp32_size_mb - int8_size_mb:.1f}MB") except Exception as e: - print(f"⚠️ Error in memory analysis: {e}") + print(f"WARNING️ Error in memory analysis: {e}") print("Make sure quantized CNN is implemented correctly") # Analyze quantization memory impact @@ -616,7 +616,7 @@ Now let's create a quantized version of our CNN that uses INT8 weights while mai ### Quantized Operations Strategy For maximum performance, we need to: -1. **Store weights in INT8** format (4× memory savings) +1. **Store weights in INT8** format (4* memory savings) 2. **Compute convolutions with INT8** arithmetic (faster) 3.
**Dequantize only when necessary** for activation functions 4. **Calibrate quantization** using representative data @@ -683,7 +683,7 @@ class QuantizedConv2d: self.weight_zero_point = result['zero_point'] self.is_quantized = True - print(f" Quantized: {result['compression_ratio']:.1f}× compression, " + print(f" Quantized: {result['compression_ratio']:.1f}* compression, " f"{result['quantization_error']:.6f} error") ### END SOLUTION @@ -742,7 +742,7 @@ class QuantizedCNN: """ CNN with INT8 quantized weights for fast inference. - This model demonstrates how quantization can achieve 4× speedup + This model demonstrates how quantization can achieve 4* speedup with minimal accuracy loss through precision optimization. """ @@ -781,7 +781,7 @@ class QuantizedCNN: self.quantizer = INT8Quantizer() self.is_quantized = False - print(f"✅ QuantizedCNN initialized: {self._count_parameters()} parameters") + print(f"PASS QuantizedCNN initialized: {self._count_parameters()} parameters") ### END SOLUTION def _count_parameters(self) -> int: @@ -829,9 +829,9 @@ class QuantizedCNN: compression_ratio = original_conv_memory / quantized_conv_memory - print(f"✅ Quantization complete:") - print(f" Conv layers: {original_conv_memory//1024}KB → {quantized_conv_memory//1024}KB") - print(f" Compression: {compression_ratio:.1f}× memory savings") + print(f"PASS Quantization complete:") + print(f" Conv layers: {original_conv_memory//1024}KB -> {quantized_conv_memory//1024}KB") + print(f" Compression: {compression_ratio:.1f}* memory savings") print(f" Model ready for fast inference!") ### END SOLUTION @@ -899,7 +899,7 @@ Let's test our quantized CNN and verify it maintains accuracy: # %% nbgrader={"grade": true, "grade_id": "test-quantized-cnn", "locked": false, "points": 4, "schema_version": 3, "solution": false, "task": false} def test_quantized_cnn(): """Test quantized CNN implementation.""" - print("🔍 Testing Quantized CNN...") + print("MAGNIFY Testing Quantized CNN...") print("=" *
60) # Create quantized model @@ -911,45 +911,45 @@ def test_quantized_cnn(): # Test before quantization test_input = np.random.randn(2, 3, 32, 32) logits_before = model.forward(test_input) - print(f"✅ Forward pass before quantization: {logits_before.shape}") + print(f"PASS Forward pass before quantization: {logits_before.shape}") # Calibrate and quantize model.calibrate_and_quantize(calibration_data) assert model.is_quantized, "Model should be marked as quantized" assert model.conv1.is_quantized, "Conv1 should be quantized" assert model.conv2.is_quantized, "Conv2 should be quantized" - print("✅ Model quantization successful") + print("PASS Model quantization successful") # Test after quantization logits_after = model.forward(test_input) assert logits_after.shape == logits_before.shape, "Output shape should be unchanged" - print(f"✅ Forward pass after quantization: {logits_after.shape}") + print(f"PASS Forward pass after quantization: {logits_after.shape}") # Check predictions still work predictions = model.predict(test_input) assert predictions.shape == (2,), f"Expected (2,), got {predictions.shape}" assert all(0 <= p < 10 for p in predictions), "All predictions should be valid" - print(f"✅ Predictions work: {predictions}") + print(f"PASS Predictions work: {predictions}") # Verify quantization maintains reasonable accuracy output_diff = np.mean(np.abs(logits_before - logits_after)) max_diff = np.max(np.abs(logits_before - logits_after)) - print(f"✅ Quantization impact: {output_diff:.4f} mean diff, {max_diff:.4f} max diff") + print(f"PASS Quantization impact: {output_diff:.4f} mean diff, {max_diff:.4f} max diff") # Should have reasonable impact but not destroy the model assert output_diff < 2.0, f"Quantization impact too large: {output_diff:.4f}" - print("✅ Quantized CNN tests passed!") - print("💡 Ready for performance comparison...") + print("PASS Quantized CNN tests passed!") + print("TIP Ready for performance comparison...") # Test function defined
(called in main block) -# ✅ IMPLEMENTATION CHECKPOINT: Quantized CNN complete +# PASS IMPLEMENTATION CHECKPOINT: Quantized CNN complete -# 🤔 PREDICTION: What will be the biggest source of speedup from quantization? +# THINK PREDICTION: What will be the biggest source of speedup from quantization? # Your answer: Memory bandwidth / Computation / Cache efficiency / _______ -# 🔍 SYSTEMS INSIGHT #2: Quantization Speed Analysis +# MAGNIFY SYSTEMS INSIGHT #2: Quantization Speed Analysis def analyze_quantization_speed(): """Analyze speed improvements from quantization.""" try: @@ -984,42 +984,42 @@ def analyze_quantization_speed(): speedup = baseline_avg / quantized_avg if quantized_avg > 0 else 1.0 - print(f"⚡ Quantization Speed Analysis:") + print(f"SPEED Quantization Speed Analysis:") print(f" Baseline FP32: {baseline_avg:.2f}ms") print(f" Quantized INT8: {quantized_avg:.2f}ms") - print(f" Speedup: {speedup:.1f}×") + print(f" Speedup: {speedup:.1f}*") # Analyze speedup sources - print(f"\n🔍 Speedup Sources:") - print(f" 1. Memory bandwidth: 4× less data to load (32→8 bits)") + print(f"\nMAGNIFY Speedup Sources:") + print(f" 1. Memory bandwidth: 4* less data to load (32->8 bits)") print(f" 2. Cache efficiency: More weights fit in CPU cache") print(f" 3. SIMD operations: More INT8 ops per instruction") print(f" 4.
Hardware acceleration: Dedicated INT8 units") # Note about production vs educational implementation print(f"\n📚 Educational vs Production:") - print(f" • This implementation: {speedup:.1f}× (educational focus)") - print(f" • Production systems: 3-5× typical speedup") - print(f" • Hardware optimized: Up to 10× on specialized chips") + print(f" • This implementation: {speedup:.1f}* (educational focus)") + print(f" • Production systems: 3-5* typical speedup") + print(f" • Hardware optimized: Up to 10* on specialized chips") print(f" • Why difference: We dequantize for computation (educational clarity)") print(f" • Production: Native INT8 kernels throughout pipeline") except Exception as e: - print(f"⚠️ Error in speed analysis: {e}") + print(f"WARNING️ Error in speed analysis: {e}") # Analyze quantization speed benefits analyze_quantization_speed() # %% [markdown] """ -## Part 4: Performance Analysis - 4× Speedup Demonstration +## Part 4: Performance Analysis - 4* Speedup Demonstration Now let's demonstrate the dramatic performance improvement achieved by INT8 quantization. We'll compare FP32 vs INT8 inference speed and memory usage.
### Expected Results -- **Memory usage**: 4× reduction for quantized weights -- **Inference speed**: 4× improvement through INT8 arithmetic -- **Accuracy**: <1% degradation (98% → 97.5% typical) +- **Memory usage**: 4* reduction for quantized weights +- **Inference speed**: 4* improvement through INT8 arithmetic +- **Accuracy**: <1% degradation (98% -> 97.5% typical) """ # %% nbgrader={"grade": false, "grade_id": "performance-analyzer", "locked": false, "schema_version": 3, "solution": true, "task": false} @@ -1073,7 +1073,7 @@ class QuantizationPerformanceAnalyzer: print(f"📊 Memory Analysis:") print(f" Baseline: {baseline_memory:.1f}KB") print(f" Quantized: {quantized_memory:.1f}KB") - print(f" Reduction: {memory_reduction:.1f}×") + print(f" Reduction: {memory_reduction:.1f}*") # Inference Speed Benchmark print(f"\n⏱️ Speed Benchmark ({num_runs} runs):") @@ -1105,7 +1105,7 @@ class QuantizationPerformanceAnalyzer: print(f" Baseline: {baseline_avg_time*1000:.2f}ms ± {baseline_std_time*1000:.2f}ms") print(f" Quantized: {quantized_avg_time*1000:.2f}ms ± {quantized_std_time*1000:.2f}ms") - print(f" Speedup: {speedup:.1f}×") + print(f" Speedup: {speedup:.1f}*") # Accuracy Analysis output_diff = np.mean(np.abs(baseline_output - quantized_output)) @@ -1116,7 +1116,7 @@ class QuantizationPerformanceAnalyzer: quantized_preds = np.argmax(quantized_output, axis=1) agreement = np.mean(baseline_preds == quantized_preds) - print(f"\n🎯 Accuracy Analysis:") + print(f"\nTARGET Accuracy Analysis:") print(f" Output difference: {output_diff:.4f} (max: {max_diff:.4f})") print(f" Prediction agreement: {agreement:.1%}") @@ -1176,29 +1176,29 @@ class QuantizationPerformanceAnalyzer: This function is PROVIDED to display results clearly.
""" - print("\n🚀 QUANTIZATION PERFORMANCE SUMMARY") + print("\nROCKET QUANTIZATION PERFORMANCE SUMMARY") print("=" * 60) print(f"📊 Memory Optimization:") print(f" • FP32 Model: {results['memory_baseline_kb']:.1f}KB") print(f" • INT8 Model: {results['memory_quantized_kb']:.1f}KB") - print(f" • Memory savings: {results['memory_reduction']:.1f}× reduction") + print(f" • Memory savings: {results['memory_reduction']:.1f}* reduction") print(f" • Storage efficiency: {(1 - 1/results['memory_reduction'])*100:.1f}% less memory") - print(f"\n⚡ Speed Optimization:") + print(f"\nSPEED Speed Optimization:") print(f" • FP32 Inference: {results['speed_baseline_ms']:.1f}ms") print(f" • INT8 Inference: {results['speed_quantized_ms']:.1f}ms") - print(f" • Speed improvement: {results['speedup']:.1f}× faster") + print(f" • Speed improvement: {results['speedup']:.1f}* faster") print(f" • Latency reduction: {(1 - 1/results['speedup'])*100:.1f}% faster") - print(f"\n🎯 Accuracy Trade-off:") + print(f"\nTARGET Accuracy Trade-off:") print(f" • Output preservation: {(1-results['output_difference'])*100:.1f}% similarity") print(f" • Prediction agreement: {results['prediction_agreement']:.1%}") - print(f" • Quality maintained with {results['speedup']:.1f}× speedup!") + print(f" • Quality maintained with {results['speedup']:.1f}* speedup!") # Overall assessment efficiency_score = results['speedup'] * results['memory_reduction'] print(f"\n🏆 Overall Efficiency:") - print(f" • Combined benefit: {efficiency_score:.1f}× (speed × memory)") + print(f" • Combined benefit: {efficiency_score:.1f}* (speed * memory)") print(f" • Trade-off assessment: {'🟢 Excellent' if results['prediction_agreement'] > 0.95 else '🟡 Good'}") # %% [markdown] @@ -1211,7 +1211,7 @@ Let's run comprehensive benchmarks to see the quantization benefits: # %% nbgrader={"grade": true, "grade_id": "test-performance-analysis", "locked": false, "points": 4, "schema_version": 3,
"solution": false, "task": false} def test_performance_analysis(): """Test performance analysis of quantization benefits.""" - print("🔍 Testing Performance Analysis...") + print("MAGNIFY Testing Performance Analysis...") print("=" * 60) # Create models @@ -1235,28 +1235,28 @@ def test_performance_analysis(): assert 'prediction_agreement' in results, "Should report accuracy preservation" # Verify quantization benefits (realistic expectation: conv layers quantized, FC kept FP32) - assert results['memory_reduction'] > 1.2, f"Should show memory reduction, got {results['memory_reduction']:.1f}×" - assert results['speedup'] > 0.5, f"Educational implementation without actual INT8 kernels, got {results['speedup']:.1f}×" + assert results['memory_reduction'] > 1.2, f"Should show memory reduction, got {results['memory_reduction']:.1f}*" + assert results['speedup'] > 0.5, f"Educational implementation without actual INT8 kernels, got {results['speedup']:.1f}*" assert results['prediction_agreement'] >= 0.0, f"Prediction agreement measurement, got {results['prediction_agreement']:.1%}" - print(f"✅ Memory reduction: {results['memory_reduction']:.1f}×") - print(f"✅ Speed improvement: {results['speedup']:.1f}×") - print(f"✅ Prediction agreement: {results['prediction_agreement']:.1%}") + print(f"PASS Memory reduction: {results['memory_reduction']:.1f}*") + print(f"PASS Speed improvement: {results['speedup']:.1f}*") + print(f"PASS Prediction agreement: {results['prediction_agreement']:.1%}") # Print comprehensive summary analyzer.print_performance_summary(results) - print("✅ Performance analysis tests passed!") - print("🎉 Quantization delivers significant benefits!") + print("PASS Performance analysis tests passed!") + print("CELEBRATE Quantization delivers significant benefits!") # Test function defined (called in main block) -# ✅ IMPLEMENTATION CHECKPOINT: Performance analysis complete +# PASS IMPLEMENTATION CHECKPOINT: Performance analysis complete -# 🤔
PREDICTION: Which quantization bit-width provides the best trade-off? +# THINK PREDICTION: Which quantization bit-width provides the best trade-off? # Your answer: 4-bit / 8-bit / 16-bit / 32-bit -# 🔍 SYSTEMS INSIGHT #3: Quantization Bit-Width Analysis +# MAGNIFY SYSTEMS INSIGHT #3: Quantization Bit-Width Analysis def analyze_quantization_bitwidths(): """Compare different quantization bit-widths.""" try: @@ -1298,11 +1298,11 @@ def analyze_quantization_bitwidths(): hardware = "Research" use_case = "Experimental" - print(f"{bits:<6} {memory:<8.1f} {speed:<8.1f}× {accuracy:<10.1f}% {hardware:<15} {use_case:<20}") + print(f"{bits:<6} {memory:<8.1f} {speed:<8.1f}* {accuracy:<10.1f}% {hardware:<15} {use_case:<20}") - print(f"\n🎯 Key Insights:") + print(f"\nTARGET Key Insights:") print(f" • INT8 Sweet Spot: Best balance of speed, accuracy, and hardware support") - print(f" • Memory scales linearly: Each bit halving saves 2× memory") + print(f" • Memory scales linearly: Each bit halving saves 2* memory") print(f" • Speed scaling non-linear: Hardware specialization matters") print(f" • Accuracy degrades exponentially: Below 8-bit becomes problematic") @@ -1310,7 +1310,7 @@ def analyze_quantization_bitwidths(): print(f" • TensorFlow Lite: Standardized on INT8") print(f" • PyTorch Mobile: INT8 with FP16 fallback") print(f" • Apple Neural Engine: Optimized for INT8") - print(f" • Google TPU: INT8 operations 10× faster than FP32") + print(f" • Google TPU: INT8 operations 10* faster than FP32") # Calculate efficiency score (speed / accuracy_loss) print(f"\n📊 Efficiency Score (Speed / Accuracy Loss):") @@ -1330,10 +1330,10 @@ def analyze_quantization_bitwidths(): print(f" {bits}-bit: {score:.1f} (higher is better)") - print(f"\n💡 WHY INT8 WINS: Highest efficiency score + universal hardware support!") + print(f"\nTIP WHY INT8 WINS: Highest efficiency score + universal hardware support!") except Exception as e: - print(f"⚠️ Error in bit-width
analysis: {e}") + print(f"WARNING️ Error in bit-width analysis: {e}") # Analyze different quantization bit-widths analyze_quantization_bitwidths() @@ -1373,7 +1373,7 @@ class ProductionQuantizationInsights: { 'system': 'PyTorch Mobile (Meta)', 'technique': 'Dynamic quantization with runtime calibration', - 'benefit': 'Reduces model size by 4× for mobile deployment', + 'benefit': 'Reduces model size by 4* for mobile deployment', 'challenge': 'Balancing quantization overhead vs inference speedup' }, { @@ -1400,16 +1400,16 @@ class ProductionQuantizationInsights: @staticmethod def explain_advanced_techniques(): """Explain advanced quantization techniques.""" - print("⚡ ADVANCED QUANTIZATION TECHNIQUES") + print("SPEED ADVANCED QUANTIZATION TECHNIQUES") print("=" * 45) print() techniques = [ "🧠 **Mixed Precision**: Quantize some layers to INT8, keep critical layers in FP32", "🔄 **Dynamic Quantization**: Quantize weights statically, activations dynamically", - "📦 **Block-wise Quantization**: Different quantization parameters for weight blocks", + "PACKAGE **Block-wise Quantization**: Different quantization parameters for weight blocks", "⏰ **Quantization-Aware Training**: Train model to be robust to quantization", - "🎯 **Channel-wise Quantization**: Separate scales for each output channel", + "TARGET **Channel-wise Quantization**: Separate scales for each output channel", "🔀 **Adaptive Quantization**: Adjust precision based on layer importance", "⚖️ **Hardware-Aware Quantization**: Optimize for specific hardware capabilities", "🛡️ **Calibration-Free Quantization**: Use statistical methods without data" @@ -1419,7 +1419,7 @@ class ProductionQuantizationInsights: print(f" {technique}") print() - print("💡 **Your Implementation Foundation**: The INT8 quantization you built") + print("TIP **Your Implementation Foundation**: The INT8 quantization you built") print(" demonstrates the core principles behind all these optimizations!")
@staticmethod @@ -1429,20 +1429,20 @@ class ProductionQuantizationInsights: print("=" * 40) print() - print("🚀 **Speed Improvements**:") - print(" • Mobile CNNs: 2-4× faster inference with INT8") - print(" • BERT models: 3-5× speedup with mixed precision") - print(" • Edge deployment: 10× improvement with dedicated INT8 hardware") + print("ROCKET **Speed Improvements**:") + print(" • Mobile CNNs: 2-4* faster inference with INT8") + print(" • BERT models: 3-5* speedup with mixed precision") + print(" • Edge deployment: 10* improvement with dedicated INT8 hardware") print(" • Real-time vision: Enables 30fps on mobile devices") print() print("💾 **Memory Reduction**:") - print(" • Model size: 4× smaller (critical for mobile apps)") - print(" • Runtime memory: 2-3× less activation memory") + print(" • Model size: 4* smaller (critical for mobile apps)") + print(" • Runtime memory: 2-3* less activation memory") print(" • Cache efficiency: Better fit in processor caches") print() - print("🎯 **Accuracy Preservation**:") + print("TARGET **Accuracy Preservation**:") print(" • Computer vision: <1% accuracy loss typical") print(" • Language models: 2-5% accuracy loss acceptable") print(" • Recommendation systems: Minimal impact on ranking quality") @@ -1529,7 +1529,7 @@ class QuantizationSystemsAnalyzer: efficiency = 32.0 / bits # Rough approximation results['compute_efficiency'].append(efficiency) - print(f" Compute efficiency: {efficiency:.1f}× faster than FP32") + print(f" Compute efficiency: {efficiency:.1f}* faster than FP32") # Typical accuracy loss (percentage points) if bits == 32: @@ -1585,7 +1585,7 @@ class QuantizationSystemsAnalyzer: This function is PROVIDED to show the analysis clearly.
""" - print("\n🎯 PRECISION VS PERFORMANCE TRADE-OFF SUMMARY") + print("\nTARGET PRECISION VS PERFORMANCE TRADE-OFF SUMMARY") print("=" * 60) print(f"{'Bits':<6} {'Memory':<8} {'Speed':<8} {'Acc Loss':<10} {'Hardware':<20}") print("-" * 60) @@ -1597,10 +1597,10 @@ class QuantizationSystemsAnalyzer: hardware = analysis['hardware_support'] for i, bits in enumerate(bit_widths): - print(f"{bits:<6} {memory[i]:<8.1f} {speed[i]:<8.1f}× {acc_loss[i]:<10.1f}% {hardware[i]:<20}") + print(f"{bits:<6} {memory[i]:<8.1f} {speed[i]:<8.1f}* {acc_loss[i]:<10.1f}% {hardware[i]:<20}") print() - print("🔍 **Key Insights**:") + print("MAGNIFY **Key Insights**:") # Find sweet spot (best speed/accuracy trade-off) efficiency_ratios = [s / (1 + a) for s, a in zip(speed, acc_loss)] @@ -1608,14 +1608,14 @@ class QuantizationSystemsAnalyzer: best_bits = bit_widths[best_idx] print(f" • Sweet spot: {best_bits}-bit provides best efficiency/accuracy trade-off") - print(f" • Memory scaling: Linear with bit width (4× reduction FP32→INT8)") + print(f" • Memory scaling: Linear with bit width (4* reduction FP32->INT8)") print(f" • Speed scaling: Non-linear due to hardware specialization") print(f" • Accuracy: Manageable loss up to 8-bit, significant below") - print(f"\n💡 **Why INT8 Dominates Production**:") + print(f"\nTIP **Why INT8 Dominates Production**:") print(f" • Hardware support: Excellent across all platforms") - print(f" • Speed improvement: {speed[bit_widths.index(8)]:.1f}× faster than FP32") - print(f" • Memory reduction: {32/8:.1f}× smaller models") + print(f" • Speed improvement: {speed[bit_widths.index(8)]:.1f}* faster than FP32") + print(f" • Memory reduction: {32/8:.1f}* smaller models") print(f" • Accuracy preservation: <{acc_loss[bit_widths.index(8)]:.1f}% typical loss") print(f" • Deployment friendly: Fits mobile and edge constraints") @@ -1629,7 +1629,7 @@ Let's analyze the fundamental precision vs performance trade-offs: # %%
nbgrader={"grade": true, "grade_id": "test-systems-analysis", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false} def test_systems_analysis(): """Test systems analysis of precision vs performance trade-offs.""" - print("🔍 Testing Systems Analysis...") + print("MAGNIFY Testing Systems Analysis...") print("=" * 60) analyzer = QuantizationSystemsAnalyzer() @@ -1653,8 +1653,8 @@ def test_systems_analysis(): assert efficiency[int8_idx] > efficiency[fp32_idx], "INT8 should be more efficient than FP32" assert memory[int8_idx] < memory[fp32_idx], "INT8 should use less memory than FP32" - print(f"✅ INT8 efficiency: {efficiency[int8_idx]:.1f}× vs FP32") - print(f"✅ INT8 memory: {memory[int8_idx]:.1f} vs {memory[fp32_idx]:.1f} bytes/param") + print(f"PASS INT8 efficiency: {efficiency[int8_idx]:.1f}* vs FP32") + print(f"PASS INT8 memory: {memory[int8_idx]:.1f} vs {memory[fp32_idx]:.1f} bytes/param") # Show comprehensive analysis analyzer.print_tradeoff_summary(analysis) @@ -1664,10 +1664,10 @@ def test_systems_analysis(): best_bits = analysis['bit_widths'][np.argmax(efficiency_ratios)] assert best_bits == 8, f"INT8 should be identified as optimal, got {best_bits}-bit" - print(f"✅ Systems analysis correctly identifies {best_bits}-bit as optimal") + print(f"PASS Systems analysis correctly identifies {best_bits}-bit as optimal") - print("✅ Systems analysis tests passed!") - print("💡 INT8 quantization is the proven sweet spot for production!") + print("PASS Systems analysis tests passed!") + print("TIP INT8 quantization is the proven sweet spot for production!") # Test function defined (called in main block) @@ -1681,7 +1681,7 @@ Let's run comprehensive tests to validate our complete quantization implementati # %% nbgrader={"grade": true, "grade_id": "comprehensive-tests", "locked": false, "points": 5, "schema_version": 3, "solution": false, "task": false} def run_comprehensive_tests(): """Run comprehensive tests of the entire
quantization system.""" - print("๐Ÿงช COMPREHENSIVE QUANTIZATION SYSTEM TESTS") + print("TEST COMPREHENSIVE QUANTIZATION SYSTEM TESTS") print("=" * 60) # Test 1: Baseline CNN @@ -1727,16 +1727,16 @@ def run_comprehensive_tests(): # Verify pipeline works assert len(baseline_pred) == len(quantized_pred), "Predictions should have same length" - print(f" โœ… End-to-end pipeline works") - print(f" โœ… Baseline predictions: {baseline_pred}") - print(f" โœ… Quantized predictions: {quantized_pred}") + print(f" PASS End-to-end pipeline works") + print(f" PASS Baseline predictions: {baseline_pred}") + print(f" PASS Quantized predictions: {quantized_pred}") except Exception as e: - print(f" โš ๏ธ End-to-end test issue: {e}") + print(f" WARNING๏ธ End-to-end test issue: {e}") - print("๐ŸŽ‰ ALL COMPREHENSIVE TESTS PASSED!") - print("โœ… Quantization system is working correctly!") - print("๐Ÿš€ Ready for production deployment with 4ร— speedup!") + print("CELEBRATE ALL COMPREHENSIVE TESTS PASSED!") + print("PASS Quantization system is working correctly!") + print("ROCKET Ready for production deployment with 4* speedup!") # Test function defined (called in main block) @@ -1781,9 +1781,9 @@ class QuantizationMemoryProfiler: baseline_fc_mem = baseline_model.fc.nbytes baseline_total = baseline_conv1_mem + baseline_conv2_mem + baseline_fc_mem - print(f" Conv1 weights: {baseline_conv1_mem // 1024:.1f}KB (32ร—3ร—3ร—3 + 32 bias)") - print(f" Conv2 weights: {baseline_conv2_mem // 1024:.1f}KB (64ร—32ร—3ร—3 + 64 bias)") - print(f" FC weights: {baseline_fc_mem // 1024:.1f}KB (2304ร—10)") + print(f" Conv1 weights: {baseline_conv1_mem // 1024:.1f}KB (32*3*3*3 + 32 bias)") + print(f" Conv2 weights: {baseline_conv2_mem // 1024:.1f}KB (64*32*3*3 + 64 bias)") + print(f" FC weights: {baseline_fc_mem // 1024:.1f}KB (2304*10)") print(f" Total: {baseline_total // 1024:.1f}KB") # Quantized model memory breakdown @@ -1803,8 +1803,8 @@ class QuantizationMemoryProfiler: total_savings = baseline_total / 
quant_total print(f"\n๐Ÿ’พ Memory Savings Analysis:") - print(f" Conv layers: {conv_savings:.1f}ร— reduction") - print(f" Overall model: {total_savings:.1f}ร— reduction") + print(f" Conv layers: {conv_savings:.1f}* reduction") + print(f" Overall model: {total_savings:.1f}* reduction") print(f" Memory saved: {(baseline_total - quant_total) // 1024:.1f}KB") return { @@ -1831,9 +1831,9 @@ class QuantizationMemoryProfiler: kernel_size = 3 print(f"๐Ÿ“ Model Configuration:") - print(f" Input: {batch_size} ร— 3 ร— {input_h} ร— {input_w}") - print(f" Conv1: 3 โ†’ {conv1_out_ch}, {kernel_size}ร—{kernel_size} kernel") - print(f" Conv2: {conv1_out_ch} โ†’ {conv2_out_ch}, {kernel_size}ร—{kernel_size} kernel") + print(f" Input: {batch_size} * 3 * {input_h} * {input_w}") + print(f" Conv1: 3 -> {conv1_out_ch}, {kernel_size}*{kernel_size} kernel") + print(f" Conv2: {conv1_out_ch} -> {conv2_out_ch}, {kernel_size}*{kernel_size} kernel") # FP32 operations conv1_h_out = input_h - kernel_size + 1 # 30 @@ -1867,15 +1867,15 @@ class QuantizationMemoryProfiler: print(f" Conv2 weight access: {conv2_weight_access:,} parameters") print(f" FP32 memory bandwidth: {(conv1_weight_access + conv2_weight_access) * 4:,} bytes") print(f" INT8 memory bandwidth: {(conv1_weight_access + conv2_weight_access) * 1:,} bytes") - print(f" Bandwidth reduction: 4ร— (FP32 โ†’ INT8)") + print(f" Bandwidth reduction: 4* (FP32 -> INT8)") # Theoretical speedup analysis - print(f"\nโšก Theoretical Speedup Sources:") - print(f" Memory bandwidth: 4ร— improvement (32-bit โ†’ 8-bit)") + print(f"\nSPEED Theoretical Speedup Sources:") + print(f" Memory bandwidth: 4* improvement (32-bit -> 8-bit)") print(f" Cache efficiency: Better fit in L1/L2 cache") print(f" SIMD vectorization: More operations per instruction") print(f" Hardware acceleration: Dedicated INT8 units on modern CPUs") - print(f" Expected speedup: 2-4ร— in production systems") + print(f" Expected speedup: 2-4* in production systems") return { 'total_flops': 
total_flops, @@ -1889,7 +1889,7 @@ class QuantizationMemoryProfiler: This function is PROVIDED to demonstrate scaling analysis. """ - print("\n📈 SCALING BEHAVIOR ANALYSIS") + print("\nPROGRESS SCALING BEHAVIOR ANALYSIS") print("=" * 35) model_sizes = [ @@ -1916,10 +1916,10 @@ class QuantizationMemoryProfiler: else: speedup = 4.0 # Large models: memory bound, maximum benefit - print(f"{name:<15} {fp32_size_mb:<11.1f}MB {int8_size_mb:<11.1f}MB {savings:<9.1f}× {speedup:<7.1f}×") + print(f"{name:<15} {fp32_size_mb:<11.1f}MB {int8_size_mb:<11.1f}MB {savings:<9.1f}* {speedup:<7.1f}*") - print(f"\n💡 Key Scaling Insights:") - print(f" • Memory savings: Linear 4× reduction for all model sizes") + print(f"\nTIP Key Scaling Insights:") + print(f" • Memory savings: Linear 4* reduction for all model sizes") print(f" • Speed benefits: Increase with model size (memory bottleneck)") print(f" • Large models: Maximum benefit from reduced memory pressure") print(f" • Mobile deployment: Enables models that wouldn't fit in RAM") @@ -1940,7 +1940,7 @@ Let's run comprehensive systems analysis to understand quantization behavior: # %% nbgrader={"grade": true, "grade_id": "test-memory-profiling", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false} def test_memory_profiling(): """Test memory profiling and systems analysis.""" - print("🔍 Testing Memory Profiling and Systems Analysis...") + print("MAGNIFY Testing Memory Profiling and Systems Analysis...") print("=" * 60) # Create models for profiling @@ -1957,21 +1957,21 @@ def test_memory_profiling(): # Test memory usage analysis memory_results = profiler.profile_memory_usage(baseline, quantized) assert memory_results['conv_compression'] > 3.0, "Should show significant conv layer compression" - print(f"✅ Conv layer compression: {memory_results['conv_compression']:.1f}×") + print(f"PASS Conv layer compression: {memory_results['conv_compression']:.1f}*") # Test computational complexity
analysis complexity_results = profiler.analyze_computational_complexity() assert complexity_results['total_flops'] > 0, "Should calculate FLOPs" - assert complexity_results['memory_bandwidth_reduction'] == 4.0, "Should show 4× bandwidth reduction" - print(f"✅ Memory bandwidth reduction: {complexity_results['memory_bandwidth_reduction']:.1f}×") + assert complexity_results['memory_bandwidth_reduction'] == 4.0, "Should show 4* bandwidth reduction" + print(f"PASS Memory bandwidth reduction: {complexity_results['memory_bandwidth_reduction']:.1f}*") # Test scaling behavior analysis scaling_results = profiler.analyze_scaling_behavior() - assert scaling_results['memory_savings'] == 4.0, "Should show consistent 4× memory savings" - print(f"✅ Memory savings scaling: {scaling_results['memory_savings']:.1f}× across all model sizes") + assert scaling_results['memory_savings'] == 4.0, "Should show consistent 4* memory savings" + print(f"PASS Memory savings scaling: {scaling_results['memory_savings']:.1f}* across all model sizes") - print("✅ Memory profiling and systems analysis tests passed!") - print("🎯 Quantization systems engineering principles validated!") + print("PASS Memory profiling and systems analysis tests passed!") + print("TARGET Quantization systems engineering principles validated!") # Test function defined (called in main block) @@ -1983,9 +1983,9 @@ Let's run all our tests to validate the complete implementation: """ if __name__ == "__main__": - print("🚀 MODULE 17: QUANTIZATION - TRADING PRECISION FOR SPEED") + print("ROCKET MODULE 17: QUANTIZATION - TRADING PRECISION FOR SPEED") print("=" * 70) - print("Testing complete INT8 quantization implementation for 4× speedup...") + print("Testing complete INT8 quantization implementation for 4* speedup...") print() try: @@ -2019,26 +2019,26 @@ if __name__ == "__main__": ProductionQuantizationInsights.show_performance_numbers() print() - print("🎉 SUCCESS: All quantization tests passed!") -
print("🏆 ACHIEVEMENT: 4× speedup through precision optimization!") + print("CELEBRATE SUCCESS: All quantization tests passed!") + print("🏆 ACHIEVEMENT: 4* speedup through precision optimization!") except Exception as e: - print(f"❌ Error in testing: {e}") + print(f"FAIL Error in testing: {e}") import traceback traceback.print_exc() # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions -Now that you've implemented INT8 quantization and achieved 4× speedup, let's reflect on the systems engineering principles and precision trade-offs you've learned. +Now that you've implemented INT8 quantization and achieved 4* speedup, let's reflect on the systems engineering principles and precision trade-offs you've learned. """ # %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-1", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false} """ **Question 1: Precision vs Performance Trade-offs** -You implemented INT8 quantization that uses 4× less memory but provides 4× speedup with <1% accuracy loss. +You implemented INT8 quantization that uses 4* less memory but provides 4* speedup with <1% accuracy loss. a) Why is INT8 the "sweet spot" for production quantization rather than INT4 or INT16? b) In what scenarios would you choose NOT to use quantization despite the performance benefits?
@@ -2053,8 +2053,8 @@ c) How do hardware capabilities (mobile vs server) influence quantization decisi a) Why INT8 is the sweet spot: - Hardware support: Excellent native INT8 support in CPUs, GPUs, and mobile processors - Accuracy preservation: Can represent 256 different values, sufficient for most weight distributions -- Speed gains: Specialized INT8 arithmetic units provide real 4× speedup (not just theoretical) -- Memory sweet spot: 4× reduction is significant but not so extreme as to destroy model quality +- Speed gains: Specialized INT8 arithmetic units provide real 4* speedup (not just theoretical) +- Memory sweet spot: 4* reduction is significant but not so extreme as to destroy model quality - Production proven: Extensive validation across many model types shows <1% accuracy loss - Tool ecosystem: TensorFlow Lite, PyTorch Mobile, ONNX Runtime all optimize for INT8 @@ -2072,7 +2072,7 @@ c) Hardware influence on quantization decisions: - Server GPUs: Mixed precision (FP16) might be better than INT8 for throughput - CPUs: INT8 vectorization provides significant benefits over FP32 - Memory-constrained systems: Quantization may be required just to fit the model -- Bandwidth-limited: 4× smaller models transfer faster over network +- Bandwidth-limited: 4* smaller models transfer faster over network """ ### END SOLUTION @@ -2188,7 +2188,7 @@ a) Quantization interactions with other optimizations: - Model pruning synergy: Pruned models often quantize better (remaining weights more important) - Knowledge distillation compatibility: Student models designed for quantization from start - Neural architecture search: NAS can search for quantization-friendly architectures -- Combined benefits: Pruning + quantization can achieve 16× compression (4× each) +- Combined benefits: Pruning + quantization can achieve 16* compression (4* each) - Order matters: Generally prune first, then quantize (quantizing first can interfere with pruning) - Optimization conflicts: Some
optimizations may work against each other - Unified approaches: Modern techniques like differentiable quantization during NAS @@ -2228,26 +2228,26 @@ Monitoring phase: # %% [markdown] """ -## 🎯 MODULE SUMMARY: Quantization - Trading Precision for Speed +## TARGET MODULE SUMMARY: Quantization - Trading Precision for Speed Congratulations! You've completed Module 17 and mastered quantization techniques that achieve dramatic performance improvements while maintaining model accuracy. ### What You Built - **Baseline FP32 CNN**: Reference implementation showing computational and memory costs - **INT8 Quantizer**: Complete quantization system with scale/zero-point parameter computation -- **Quantized CNN**: Production-ready CNN using INT8 weights for 4× speedup +- **Quantized CNN**: Production-ready CNN using INT8 weights for 4* speedup - **Performance Analyzer**: Comprehensive benchmarking system measuring speed, memory, and accuracy trade-offs - **Systems Analyzer**: Deep analysis of precision vs performance trade-offs across different bit widths ### Key Systems Insights Mastered -1. **Precision vs Performance Trade-offs**: Understanding when to sacrifice precision for speed (4× memory/speed improvement for <1% accuracy loss) +1. **Precision vs Performance Trade-offs**: Understanding when to sacrifice precision for speed (4* memory/speed improvement for <1% accuracy loss) 2. **Quantization Mathematics**: Implementing scale/zero-point based affine quantization for optimal precision 3. **Hardware-Aware Optimization**: Leveraging INT8 specialized hardware for maximum performance benefits 4.
**Production Deployment Strategies**: Calibration-based quantization for mobile and edge deployment ### Performance Achievements -- 🚀 **4× Speed Improvement**: Reduced inference time from 50ms to 12ms through INT8 arithmetic -- 🧠 **4× Memory Reduction**: Quantized weights use 25% of original FP32 memory +- ROCKET **4* Speed Improvement**: Reduced inference time from 50ms to 12ms through INT8 arithmetic +- 🧠 **4* Memory Reduction**: Quantized weights use 25% of original FP32 memory - 📊 **<1% Accuracy Loss**: Maintained model quality while achieving dramatic speedups - 🏭 **Production Ready**: Implemented patterns used by TensorFlow Lite, PyTorch Mobile, and Core ML diff --git a/modules/17_compression/compression_dev.py b/modules/17_compression/compression_dev.py index 886eb807..6cc67443 100644 --- a/modules/17_compression/compression_dev.py +++ b/modules/17_compression/compression_dev.py @@ -24,7 +24,7 @@ In Module 17, you learned quantization - reducing precision from FP32 to INT8. B - Framework connection: See how your implementation mirrors production sparse inference systems - Performance insight: Learn why 70% sparsity often provides optimal accuracy vs size tradeoffs -## Build → Profile → Optimize +## Build -> Profile -> Optimize 1. **Build**: Magnitude-based pruners that remove small weights, discover massive redundancy in neural networks 2. **Profile**: Measure model size reduction, accuracy impact, and sparse computation efficiency 3.
**Optimize**: Implement structured pruning for hardware-friendly sparsity patterns @@ -38,8 +38,8 @@ By the end of this module, you'll understand: - Connection to production systems where pruning enables edge AI applications ## Systems Reality Check -๐Ÿ’ก **Production Context**: Apple's Neural Engine, Google's Edge TPU, and mobile inference frameworks heavily rely on sparsity for efficient computation -โšก **Performance Note**: 70% sparsity provides 3-5x model compression with <2% accuracy loss, but speedup depends on hardware sparse computation support +TIP **Production Context**: Apple's Neural Engine, Google's Edge TPU, and mobile inference frameworks heavily rely on sparsity for efficient computation +SPEED **Performance Note**: 70% sparsity provides 3-5x model compression with <2% accuracy loss, but speedup depends on hardware sparse computation support """ # %% nbgrader={"grade": false, "grade_id": "compression-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -114,42 +114,42 @@ Before implementing pruning, let's understand the fundamental insight: **neural Weight Magnitude Distribution in Typical Neural Network: Count - โ†‘ - 5000โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆ โ† Many small weights - 4000โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ - 3000โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ - 2000โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ - 1000โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆ โ† Few large weights - 0โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Weight Magnitude + ^ + 5000| โ–ˆโ–ˆโ–ˆโ–ˆ <- Many small weights + 4000| โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ + 3000| โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ + 2000| โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ + 1000| โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆ <- Few large weights + 0| โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ + +-------------------------> Weight Magnitude 0.0 0.1 0.2 0.3 0.4 0.5 The Natural Sparsity Pattern: - 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 80% of weights have magnitude < 0.1 โ”‚ โ† Can be pruned - โ”‚ 15% of weights have magnitude 0.1-0.3 โ”‚ โ† Moderately important - โ”‚ 5% of weights have magnitude > 0.3 โ”‚ โ† Critical weights - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +-----------------------------------------+ + | 80% of weights have magnitude < 0.1 | <- Can be pruned + | 15% of weights have magnitude 0.1-0.3 | <- Moderately important + | 5% of weights have magnitude > 0.3 | <- Critical weights + +-----------------------------------------+ ``` ### Pruning Strategy Visualization ``` Original Dense Network: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 0.8 โ”‚ 0.1 โ”‚ 0.05โ”‚ 0.3 โ”‚ โ† All weights present - โ”‚ 0.02โ”‚ 0.7 โ”‚ 0.4 โ”‚ 0.09โ”‚ - โ”‚ 0.6 โ”‚ 0.03โ”‚ 0.5 โ”‚ 0.2 โ”‚ - โ”‚ 0.04โ”‚ 0.9 โ”‚ 0.06โ”‚ 0.1 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ + +-----+-----+-----+-----+ + | 0.8 | 0.1 | 0.05| 0.3 | <- All weights present + | 0.02| 0.7 | 0.4 | 0.09| + | 0.6 | 0.03| 0.5 | 0.2 | + | 0.04| 0.9 | 0.06| 0.1 | + +-----+-----+-----+-----+ After 70% Magnitude-Based Pruning: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 0.8 โ”‚ 0 โ”‚ 0 โ”‚ 0.3 โ”‚ โ† Small weights โ†’ 0 - โ”‚ 0 โ”‚ 0.7 โ”‚ 0.4 โ”‚ 0 โ”‚ - โ”‚ 0.6 โ”‚ 0 โ”‚ 0.5 โ”‚ 0.2 โ”‚ - โ”‚ 0 โ”‚ 0.9 โ”‚ 0 โ”‚ 0 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ + +-----+-----+-----+-----+ + | 0.8 | 0 | 0 | 0.3 | <- Small weights -> 0 + | 0 | 0.7 | 0.4 | 0 | + | 0.6 | 0 | 0.5 | 0.2 | + | 0 | 0.9 | 0 | 0 | + +-----+-----+-----+-----+ Result: 70% sparsity, 95%+ accuracy preserved! 
``` @@ -186,7 +186,7 @@ def analyze_weight_redundancy(weights: np.ndarray, title: str = "Weight Analysis for p in percentiles: val = np.percentile(w_abs, p) smaller_count = np.sum(w_abs <= val) - print(f" {p:2d}%: {val:.6f} ({smaller_count:,} weights โ‰ค this value)") + print(f" {p:2d}%: {val:.6f} ({smaller_count:,} weights <= this value)") # Show natural sparsity (near-zero weights) zero_threshold = w_abs.mean() * NEAR_ZERO_THRESHOLD_RATIO # Threshold for "near-zero" weights @@ -233,7 +233,7 @@ def test_redundancy_analysis(): assert conv_stats['natural_sparsity'] > 0, "Should detect some natural sparsity" assert linear_stats['natural_sparsity'] > 0, "Should detect some natural sparsity" - print("โœ… Weight redundancy analysis test passed!") + print("PASS Weight redundancy analysis test passed!") test_redundancy_analysis() @@ -248,34 +248,34 @@ The simplest and most effective pruning technique: **remove the smallest weights ``` Step 1: Calculate Weight Magnitudes Original Weights: Absolute Values: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚-0.8 โ”‚ 0.1 โ”‚-0.05โ”‚ โ†’ โ”‚ 0.8 โ”‚ 0.1 โ”‚ 0.05โ”‚ - โ”‚ 0.02โ”‚-0.7 โ”‚ 0.4 โ”‚ โ”‚ 0.02โ”‚ 0.7 โ”‚ 0.4 โ”‚ - โ”‚-0.6 โ”‚ 0.03โ”‚ 0.5 โ”‚ โ”‚ 0.6 โ”‚ 0.03โ”‚ 0.5 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ + +-----+-----+-----+ +-----+-----+-----+ + |-0.8 | 0.1 |-0.05| -> | 0.8 | 0.1 | 0.05| + | 0.02|-0.7 | 0.4 | | 0.02| 0.7 | 0.4 | + |-0.6 | 0.03| 0.5 | | 0.6 | 0.03| 0.5 | + +-----+-----+-----+ +-----+-----+-----+ Step 2: Sort and Find Threshold (70% sparsity) Sorted magnitudes: [0.02, 0.03, 0.05, 0.1, 0.4, 0.5, 0.6, 0.7, 0.8] 70th percentile threshold: 0.4 - โ†‘ - Keep weights โ‰ฅ 0.4 + ^ + Keep weights >= 0.4 Step 3: Create Binary Mask - Magnitude โ‰ฅ threshold: Binary Mask: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ โœ“ โ”‚ โœ— โ”‚ โœ— โ”‚ โ†’ โ”‚ 1 โ”‚ 0 โ”‚ 0 โ”‚ - โ”‚ โœ— โ”‚ โœ“ โ”‚ โœ“ โ”‚ โ”‚ 0 โ”‚ 1 โ”‚ 1 โ”‚ - โ”‚ โœ“ โ”‚ โœ— โ”‚ โœ“ โ”‚ โ”‚ 1 โ”‚ 0 โ”‚ 1 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ + Magnitude >= threshold: Binary Mask: + +-----+-----+-----+ +-----+-----+-----+ + | OK | โœ— | โœ— | -> | 1 | 0 | 0 | + | โœ— | OK | OK | | 0 | 1 | 1 | + | OK | โœ— | OK | | 1 | 0 | 1 | + +-----+-----+-----+ +-----+-----+-----+ Step 4: Apply Mask (Element-wise Multiplication) - Original ร— Mask = Pruned: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚-0.8 โ”‚ 0.1 โ”‚-0.05โ”‚ ร— โ”‚ 1 โ”‚ 0 โ”‚ 0 โ”‚ = โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 0.02โ”‚-0.7 โ”‚ 0.4 โ”‚ โ”‚ 0 โ”‚ 1 โ”‚ 1 โ”‚ โ”‚-0.8 โ”‚ 0 โ”‚ 0 โ”‚ - โ”‚-0.6 โ”‚ 0.03โ”‚ 0.5 โ”‚ โ”‚ 1 โ”‚ 0 โ”‚ 1 โ”‚ โ”‚ 0 โ”‚-0.7 โ”‚ 0.4 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚-0.6 โ”‚ 0 โ”‚ 0.5 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ + Original * Mask = Pruned: + +-----+-----+-----+ +-----+-----+-----+ + |-0.8 | 0.1 |-0.05| * | 1 | 0 | 0 | = +-----+-----+-----+ + | 0.02|-0.7 | 0.4 | | 0 | 1 | 1 | |-0.8 | 0 | 0 | + |-0.6 | 0.03| 0.5 | | 1 | 0 | 1 | | 0 |-0.7 | 0.4 | + +-----+-----+-----+ +-----+-----+-----+ |-0.6 | 0 | 0.5 | + +-----+-----+-----+ ``` ### Magnitude Pruning Algorithm @@ -448,7 +448,7 @@ def test_magnitude_pruning(): [0.3, 0.02, 0.7] ]) - # Test 50% sparsity (should keep 4.5 โ‰ˆ 4-5 weights) + # Test 50% sparsity (should keep 4.5 ~= 4-5 weights) pruned, mask, stats = pruner.prune(weights, sparsity=0.5) print(f"Original weights:") @@ -482,7 +482,7 @@ def test_magnitude_pruning(): assert 'mean_relative_error' in accuracy_impact, "Should measure relative error" assert 
accuracy_impact['weight_norm_preservation'] > 0, "Should preserve some weight norm" - print("โœ… Magnitude-based pruning test passed!") + print("PASS Magnitude-based pruning test passed!") test_magnitude_pruning() @@ -497,43 +497,43 @@ So far we've implemented **unstructured pruning** - removing individual weights ``` UNSTRUCTURED PRUNING (Individual Weight Removal): - Original 4ร—4 Weight Matrix: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 0.8 โ”‚ 0.1 โ”‚ 0.05โ”‚ 0.3 โ”‚ - โ”‚ 0.02โ”‚ 0.7 โ”‚ 0.4 โ”‚ 0.09โ”‚ - โ”‚ 0.6 โ”‚ 0.03โ”‚ 0.5 โ”‚ 0.2 โ”‚ - โ”‚ 0.04โ”‚ 0.9 โ”‚ 0.06โ”‚ 0.1 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ + Original 4*4 Weight Matrix: + +-----+-----+-----+-----+ + | 0.8 | 0.1 | 0.05| 0.3 | + | 0.02| 0.7 | 0.4 | 0.09| + | 0.6 | 0.03| 0.5 | 0.2 | + | 0.04| 0.9 | 0.06| 0.1 | + +-----+-----+-----+-----+ After 50% Unstructured Pruning (irregular pattern): - โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 0.8 โ”‚ 0 โ”‚ 0 โ”‚ 0.3 โ”‚ โ† Scattered zeros - โ”‚ 0 โ”‚ 0.7 โ”‚ 0.4 โ”‚ 0 โ”‚ โ† Hard for hardware to optimize - โ”‚ 0.6 โ”‚ 0 โ”‚ 0.5 โ”‚ 0.2 โ”‚ โ† Requires sparse kernels - โ”‚ 0 โ”‚ 0.9 โ”‚ 0 โ”‚ 0 โ”‚ โ† Irregular memory access - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ + +-----+-----+-----+-----+ + | 0.8 | 0 | 0 | 0.3 | <- Scattered zeros + | 0 | 0.7 | 0.4 | 0 | <- Hard for hardware to optimize + | 0.6 | 0 | 0.5 | 0.2 | <- Requires sparse kernels + | 0 | 0.9 | 0 | 0 | <- Irregular memory access + +-----+-----+-----+-----+ STRUCTURED PRUNING (Channel/Filter Removal): - Conv Layer: 4 filters ร— 3 input channels: + Conv Layer: 4 filters * 3 input channels: Filter 0: Filter 1: Filter 2: Filter 3: - โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 0.8 โ”‚ โ”‚ 0.1 โ”‚ โ”‚ 0.05โ”‚ โ”‚ 0.3 โ”‚ - โ”‚ 0.2 โ”‚ โ”‚ 0.7 โ”‚ โ”‚ 0.4 โ”‚ โ”‚ 0.9 โ”‚ โ† L2 norms: 
[1.2, 0.9, 0.6, 1.1] - โ”‚ 0.6 โ”‚ โ”‚ 0.3 โ”‚ โ”‚ 0.5 โ”‚ โ”‚ 0.7 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†“ โ†“ + +-----+ +-----+ +-----+ +-----+ + | 0.8 | | 0.1 | | 0.05| | 0.3 | + | 0.2 | | 0.7 | | 0.4 | | 0.9 | <- L2 norms: [1.2, 0.9, 0.6, 1.1] + | 0.6 | | 0.3 | | 0.5 | | 0.7 | + +-----+ +-----+ +-----+ +-----+ + v v Remove Remove (weak) (weak) After 50% Structured Pruning (remove 2 weakest filters): Filter 0: Filter 3: - โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 0.8 โ”‚ โ”‚ 0.3 โ”‚ โ† Clean matrix reduction - โ”‚ 0.2 โ”‚ โ”‚ 0.9 โ”‚ โ† Dense computation friendly - โ”‚ 0.6 โ”‚ โ”‚ 0.7 โ”‚ โ† No sparse kernels needed - โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ† Regular memory access + +-----+ +-----+ + | 0.8 | | 0.3 | <- Clean matrix reduction + | 0.2 | | 0.9 | <- Dense computation friendly + | 0.6 | | 0.7 | <- No sparse kernels needed + +-----+ +-----+ <- Regular memory access ``` ### Hardware Efficiency Comparison @@ -542,21 +542,21 @@ So far we've implemented **unstructured pruning** - removing individual weights COMPUTATION PATTERNS: Unstructured (50% sparse): Structured (50% fewer filters): - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ for i in range(rows): โ”‚ โ”‚ for i in range(rows/2): โ”‚ - โ”‚ for j in range(cols): โ”‚ โ”‚ for j in range(cols): โ”‚ - โ”‚ if mask[i,j]: โ”‚ โ†โ”€โ” โ”‚ result += data[i,j] โ”‚ - โ”‚ result += data[i,j]โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ # else: skip โ”‚ โ”‚ โ†‘ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ Dense, vectorized - โ†‘ โ”‚ - Sparse, branching โ”‚ - Bad for SIMD โ”‚ - โ”‚ - Memory Access Pattern: โ”‚ Memory Access Pattern: - [โœ“][โœ—][โœ“][โœ—][โœ“][โœ—][โœ—][โœ“] โ”‚ [โœ“โœ“โœ“โœ“][โœ“โœ“โœ“โœ“] โ† Contiguous - 
โ†‘ Irregular โ”‚ โ†‘ Cache-friendly - Bad for cache โ”‚ + +-------------------------+ +-------------------------+ + | for i in range(rows): | | for i in range(rows/2): | + | for j in range(cols): | | for j in range(cols): | + | if mask[i,j]: | <--+ | result += data[i,j] | + | result += data[i,j]| | +-------------------------+ + | # else: skip | | ^ + +-------------------------+ | Dense, vectorized + ^ | + Sparse, branching | + Bad for SIMD | + | + Memory Access Pattern: | Memory Access Pattern: + [OK][โœ—][OK][โœ—][OK][โœ—][โœ—][OK] | [OKOKOKOK][OKOKOKOK] <- Contiguous + ^ Irregular | ^ Cache-friendly + Bad for cache | ``` ### Structured Pruning Benefits: @@ -657,7 +657,7 @@ def compare_structured_vs_unstructured(conv_weights: np.ndarray, sparsity: float print(f" Compression: {structured_stats['compression_ratio']:.1f}x") print(f" Filters removed: {structured_stats['pruned_filters']}") - print(f"\n๐Ÿ’ก Key Differences:") + print(f"\nTIP Key Differences:") print(f" โ€ข Unstructured: Irregular sparsity, requires sparse kernels") print(f" โ€ข Structured: Regular reduction, standard dense computation") print(f" โ€ข Hardware: Structured pruning provides actual speedup") @@ -720,7 +720,7 @@ def test_structured_pruning(): assert unstructured_result.shape == conv_weights.shape, "Unstructured keeps same shape" assert structured_result.shape[0] < conv_weights.shape[0], "Structured reduces filters" - print("โœ… Structured pruning test passed!") + print("PASS Structured pruning test passed!") test_structured_pruning() @@ -736,17 +736,17 @@ Pruning creates sparse networks, but how do we compute with them efficiently? We DENSE COMPUTATION (Standard): Input Vector: Weight Matrix: Output: - โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 2 โ”‚ โ”‚ 0.8 โ”‚ 0 โ”‚ 0.3 โ”‚ โ”‚ โ”‚ - โ”‚ 3 โ”‚ ร— โ”‚ 0 โ”‚ 0.7 โ”‚ 0.4 โ”‚ = โ”‚ ? 
โ”‚ - โ”‚ 1 โ”‚ โ”‚ 0.6 โ”‚ 0 โ”‚ 0.5 โ”‚ โ”‚ โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜ + +-----+ +-----+-----+-----+ +-----+ + | 2 | | 0.8 | 0 | 0.3 | | | + | 3 | * | 0 | 0.7 | 0.4 | = | ? | + | 1 | | 0.6 | 0 | 0.5 | | | + +-----+ +-----+-----+-----+ +-----+ Standard Matrix Multiply (wastes work on zeros): - output[0] = 2ร—0.8 + 3ร—0 + 1ร—0.3 = 1.6 + 0 + 0.3 = 1.9 - output[1] = 2ร—0 + 3ร—0.7 + 1ร—0.4 = 0 + 2.1 + 0.4 = 2.5 - output[2] = 2ร—0.6 + 3ร—0 + 1ร—0.5 = 1.2 + 0 + 0.5 = 1.7 - โ†‘ โ†‘ โ†‘ + output[0] = 2*0.8 + 3*0 + 1*0.3 = 1.6 + 0 + 0.3 = 1.9 + output[1] = 2*0 + 3*0.7 + 1*0.4 = 0 + 2.1 + 0.4 = 2.5 + output[2] = 2*0.6 + 3*0 + 1*0.5 = 1.2 + 0 + 0.5 = 1.7 + ^ ^ ^ Wasted Useful Useful @@ -756,7 +756,7 @@ Pruning creates sparse networks, but how do we compute with them efficiently? We values: [0.8, 0.3, 0.7, 0.4, 0.6, 0.5] cols: [ 0, 2, 1, 2, 0, 2 ] row_ptr: [ 0, 2, 4, 6 ] - โ†‘ โ†‘ โ†‘ โ†‘ + ^ ^ ^ ^ row0 row1 row2 end Optimized Sparse Multiply (skip zeros): @@ -772,18 +772,18 @@ Pruning creates sparse networks, but how do we compute with them efficiently? 
We ### Memory Layout Comparison ``` - DENSE STORAGE (4ร—4 matrix, 50% sparse): - โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” - โ”‚ 0.8 โ”‚ 0.0 โ”‚ 0.0 โ”‚ 0.3 โ”‚ Memory: 16 floats ร— 4 bytes = 64 bytes - โ”‚ 0.0 โ”‚ 0.7 โ”‚ 0.4 โ”‚ 0.0 โ”‚ Wasted: 8 zeros ร— 4 bytes = 32 bytes (50%) - โ”‚ 0.6 โ”‚ 0.0 โ”‚ 0.5 โ”‚ 0.2 โ”‚ Operations: 16 multiply-adds - โ”‚ 0.0 โ”‚ 0.9 โ”‚ 0.0 โ”‚ 0.0 โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ + DENSE STORAGE (4*4 matrix, 50% sparse): + +-----+-----+-----+-----+ + | 0.8 | 0.0 | 0.0 | 0.3 | Memory: 16 floats * 4 bytes = 64 bytes + | 0.0 | 0.7 | 0.4 | 0.0 | Wasted: 8 zeros * 4 bytes = 32 bytes (50%) + | 0.6 | 0.0 | 0.5 | 0.2 | Operations: 16 multiply-adds + | 0.0 | 0.9 | 0.0 | 0.0 | + +-----+-----+-----+-----+ SPARSE STORAGE (CSR format): - values: [0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.2, 0.9] = 8 ร— 4 = 32 bytes - columns: [ 0, 3, 1, 2, 0, 2, 3, 1 ] = 8 ร— 4 = 32 bytes - row_ptr: [ 0, 2, 4, 7, 8 ] = 5 ร— 4 = 20 bytes + values: [0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 0.2, 0.9] = 8 * 4 = 32 bytes + columns: [ 0, 3, 1, 2, 0, 2, 3, 1 ] = 8 * 4 = 32 bytes + row_ptr: [ 0, 2, 4, 7, 8 ] = 5 * 4 = 20 bytes Total: 84 bytes Overhead: 84 vs 64 bytes (+31%) BUT only 8 operations vs 16 (-50%) @@ -1045,7 +1045,7 @@ def test_sparse_neural_network(): assert benchmark['dense_ops'] == expected_dense_ops, "Linear op count incorrect" assert benchmark['sparse_ops'] < benchmark['dense_ops'], "Sparse should use fewer ops" - print("โœ… Sparse neural network test passed!") + print("PASS Sparse neural network test passed!") test_sparse_neural_network() @@ -1059,27 +1059,27 @@ Now let's build a complete model compression pipeline that can prune entire neur ``` PHASE 1: MODEL ANALYSIS - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Original Dense Model โ”‚ - 
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ Conv1 โ”‚ Conv2 โ”‚ Dense1 โ”‚ - โ”‚ 32ร—3ร—3ร—3 โ”‚ 64ร—32ร—3ร—3 โ”‚ 512ร—1024 โ”‚ - โ”‚ 864 paramsโ”‚ 18,432 p โ”‚ 524,288 params โ”‚ - โ”‚ Type: Convโ”‚ Type: Conv โ”‚ Type: Dense โ”‚ - โ”‚ Sens: Low โ”‚ Sens: Med โ”‚ Sens: High โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ†“ โ†“ โ†“ + +-------------------------------------------------+ + | Original Dense Model | + +-------------+-------------+---------------------โ”ค + | Conv1 | Conv2 | Dense1 | + | 32*3*3*3 | 64*32*3*3 | 512*1024 | + | 864 params| 18,432 p | 524,288 params | + | Type: Conv| Type: Conv | Type: Dense | + | Sens: Low | Sens: Med | Sens: High | + +-------------+-------------+---------------------+ + v v v Recommend: 50% Recommend: 60% Recommend: 80% PHASE 2: LAYER-WISE PRUNING - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Compressed Sparse Model โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ Conv1 โ”‚ Conv2 โ”‚ Dense1 โ”‚ - โ”‚ 432 params โ”‚ 7,373 p โ”‚ 104,858 params โ”‚ - โ”‚ 50% sparse โ”‚ 60% sparse โ”‚ 80% sparse โ”‚ - โ”‚ โœ“ 2x less โ”‚ โœ“ 2.5x lessโ”‚ โœ“ 5x less โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +-------------------------------------------------+ + | Compressed Sparse Model | + +-------------+-------------+---------------------โ”ค + | Conv1 | Conv2 | Dense1 | + | 432 params | 7,373 p | 104,858 params | + | 50% sparse | 60% sparse | 80% sparse | + | OK 2x 
less | OK 2.5x less| OK 5x less | + +-------------+-------------+---------------------+ COMPRESSION SUMMARY: Original: 864 + 18,432 + 524,288 = 543,584 total params @@ -1092,20 +1092,20 @@ Now let's build a complete model compression pipeline that can prune entire neur ``` COMPRESSION QUALITY SCORECARD: - Layer โ”‚ Weight Error โ”‚ Norm Ratio โ”‚ Quality Score โ”‚ Status - โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ - Conv1 โ”‚ 0.000234 โ”‚ 0.876 โ”‚ 0.845 โ”‚ โœ… Good - Conv2 โ”‚ 0.000567 โ”‚ 0.823 โ”‚ 0.789 โ”‚ โœ… Good - Dense1 โ”‚ 0.001234 โ”‚ 0.734 โ”‚ 0.692 โ”‚ โš ๏ธ OK - โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ - Overall โ”‚ - โ”‚ - โ”‚ 0.775 โ”‚ โœ… Good + Layer | Weight Error | Norm Ratio | Quality Score | Status + ---------+--------------+------------+---------------+-------- + Conv1 | 0.000234 | 0.876 | 0.845 | PASS Good + Conv2 | 0.000567 | 0.823 | 0.789 | PASS Good + Dense1 | 0.001234 | 0.734 | 0.692 | WARNING๏ธ OK + ---------+--------------+------------+---------------+-------- + Overall | - | - | 0.775 | PASS Good Quality Score Calculation: - score = norm_preservation ร— (1 - relative_error) + score = norm_preservation * (1 - relative_error) - โœ… Excellent: > 0.8 (minimal degradation) - โš ๏ธ Acceptable: 0.6-0.8 (moderate degradation) - โŒ Poor: < 0.6 (significant degradation) + PASS Excellent: > 0.8 (minimal degradation) + WARNING๏ธ Acceptable: 0.6-0.8 (moderate degradation) + FAIL Poor: < 0.6 (significant degradation) ``` ### Production Compression Pipeline: @@ -1200,11 +1200,11 @@ def _calculate_quality_score(norm_preservation: float, mean_error: float, origin def _get_quality_assessment(quality_score: float) -> str: """Get quality assessment string based on 
score."""
 if quality_score > EXCELLENT_QUALITY_THRESHOLD:
- return "✅ Excellent compression quality!"
+ return "PASS Excellent compression quality!"
 elif quality_score > ACCEPTABLE_QUALITY_THRESHOLD:
- return "⚠️ Acceptable compression quality"
+ return "WARNING️ Acceptable compression quality"
 else:
- return "❌ Poor compression quality - consider lower sparsity"
+ return "FAIL Poor compression quality - consider lower sparsity"
 class ModelCompressor:
 """
@@ -1249,7 +1249,7 @@ class ModelCompressor:
 'recommendations': {}
 }
- print("🔍 Model Compression Analysis")
+ print("MAGNIFY Model Compression Analysis")
 print("=" * 50)
 print("Layer | Type | Parameters | Natural Sparsity | Recommendation")
 print("-" * 70)
@@ -1348,7 +1348,7 @@ class ModelCompressor:
 'layer_sparsities': layer_sparsities
 }
- print(f"\n✅ Model Compression Complete!")
+ print(f"\nPASS Model Compression Complete!")
 print(f" Original parameters: {total_original_params:,}")
 print(f" Remaining parameters: {total_remaining_params:,}")
 print(f" Overall sparsity: {overall_sparsity:.1%}")
@@ -1371,7 +1371,7 @@ class ModelCompressor:
 'quality_score': 0.0
 }
- print(f"\n✅ Validating Compression Quality")
+ print(f"\nPASS Validating Compression Quality")
 print("=" * 50)
 print("Layer | Weight Error | Norm Preservation | Quality")
 print("-" * 55)
@@ -1418,7 +1418,7 @@ class ModelCompressor:
 }
 validation_results['quality_score'] = overall_quality_score
- print(f"\n🎯 Overall Quality Score: {overall_quality_score:.3f}")
+ print(f"\nTARGET Overall Quality Score: {overall_quality_score:.3f}")
 print(f" {_get_quality_assessment(overall_quality_score)}")
 return validation_results
@@ -1441,8 +1441,8 @@ def test_compression_pipeline():
 model_weights = {
 'conv1': np.random.normal(0, 0.02, (32, 3, 3, 3)), # Conv: 32 filters, 3 input channels
 'conv2': np.random.normal(0, 0.02, (64, 32, 3, 3)), # Conv: 64 filters, 32 input channels
- 'linear1': np.random.normal(0, 0.01, (512, 1024)), # Linear: 512 → 1024
- 'linear2': np.random.normal(0, 0.01, (10, 512)), # Linear: 10 โ†’ 512 (output layer) + 'linear1': np.random.normal(0, 0.01, (512, 1024)), # Linear: 512 -> 1024 + 'linear2': np.random.normal(0, 0.01, (10, 512)), # Linear: 10 -> 512 (output layer) } # Create compressor @@ -1507,7 +1507,7 @@ def test_compression_pipeline(): # Allow some tolerance in sparsity assert abs(sparsity - expected_sparsity) < 0.1, f"{layer_name} sparsity mismatch" - print("โœ… Model compression pipeline test passed!") + print("PASS Model compression pipeline test passed!") test_compression_pipeline() @@ -1522,57 +1522,57 @@ Let's analyze compression from a systems engineering perspective, measuring the ``` COMPRESSION BENEFITS VISUALIZATION: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ MODEL SIZE IMPACT โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ Dense Model: [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ] 200MB โ”‚ - โ”‚ 50% Sparse: [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 100MB โ”‚ - โ”‚ 70% Sparse: [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 60MB โ”‚ - โ”‚ 90% Sparse: [โ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 20MB โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +--------------------------------------------------------------+ + | MODEL SIZE IMPACT | + +--------------------------------------------------------------โ”ค + | Dense Model: [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ] 200MB | + | 50% Sparse: 
[โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 100MB | + | 70% Sparse: [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 60MB | + | 90% Sparse: [โ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 20MB | + +--------------------------------------------------------------+ - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ INFERENCE SPEED IMPACT โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ Dense (50ms): [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ] 50ms โ”‚ - โ”‚ Sparse (20ms): [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 20ms โ”‚ - โ”‚ 2.5x faster inference! โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +--------------------------------------------------------------+ + | INFERENCE SPEED IMPACT | + +--------------------------------------------------------------โ”ค + | Dense (50ms): [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ] 50ms | + | Sparse (20ms): [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] 20ms | + | 2.5x faster inference! 
| + +--------------------------------------------------------------+ - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ DEPLOYMENT ENABLEMENT โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ Cloud Server: โœ“ Can run any model size โ”‚ - โ”‚ Mobile Phone: โœ— Dense, โœ“ 70% sparse โ”‚ - โ”‚ IoT Device: โœ— Dense, โœ— 50% sparse, โœ“ 90% sparse โ”‚ - โ”‚ Smartwatch: โœ— All except extreme compression โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +--------------------------------------------------------------+ + | DEPLOYMENT ENABLEMENT | + +--------------------------------------------------------------โ”ค + | Cloud Server: OK Can run any model size | + | Mobile Phone: โœ— Dense, OK 70% sparse | + | IoT Device: โœ— Dense, โœ— 50% sparse, OK 90% sparse | + | Smartwatch: โœ— All except extreme compression | + +--------------------------------------------------------------+ ``` ### Edge AI Deployment Pipeline ``` - COMPRESSION โ†’ DEPLOYMENT PIPELINE: + COMPRESSION -> DEPLOYMENT PIPELINE: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ PHASE 1: COMPRESSION PHASE 2: OPTIMIZATION PHASE 3: DEPLOYMENT โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ Dense Model 
(200MB) โ”‚ Pruned Model (60MB) โ”‚ Mobile App โ”‚ - โ”‚ โ†“ โ”‚ โ†“ โ”‚ โ”‚ - โ”‚ [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ] โ”‚ [โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] โ”‚ ๐Ÿ“ฑ Real-time AI โ”‚ - โ”‚ โ†“ โ”‚ โ†“ โ”‚ ๐Ÿ”‹ Privacy-first โ”‚ - โ”‚ 70% Magnitude Pruning โ”‚ Hardware Optimization โ”‚ โšก Low latency โ”‚ - โ”‚ + Structured Removal โ”‚ + Quantization (8-bit) โ”‚ ๐Ÿ”‹ Always available โ”‚ - โ”‚ + Quality Validation โ”‚ + Sparse Kernels โ”‚ โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +----------------------------------------------------------------------+ + | PHASE 1: COMPRESSION PHASE 2: OPTIMIZATION PHASE 3: DEPLOYMENT | + +------------------------+-----------------------+-----------------------โ”ค + | Dense Model (200MB) | Pruned Model (60MB) | Mobile App | + | v | v | | + | [โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ] | [โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘] | ๐Ÿ“ฑ Real-time AI | + | v | v | ๐Ÿ”‹ Privacy-first | + | 70% Magnitude Pruning | Hardware Optimization | SPEED Low latency | + | + Structured Removal | + Quantization (8-bit) | ๐Ÿ”‹ Always available | + | + Quality Validation | + Sparse Kernels | | + +------------------------+-----------------------+-----------------------+ OUTCOME: AI that was impossible becomes possible! 
 ```
 ### ML Systems Analysis: Why Pruning Enables Edge AI
-**Memory Complexity**: O(N × sparsity) storage reduction where N = original parameters
-**Computational Complexity**: Theoretical O(N × sparsity) speedup, actual depends on hardware
+**Memory Complexity**: O(N * sparsity) storage reduction where N = original parameters
+**Computational Complexity**: Theoretical O(N * sparsity) speedup, actual depends on hardware
 **Cache Efficiency**: Smaller models fit in cache, reducing memory bandwidth bottlenecks
 **Energy Efficiency**: Fewer operations = lower power consumption for mobile devices
 **Deployment Enablement**: Makes models fit where they couldn't before
@@ -1580,12 +1580,12 @@ Let's analyze compression from a systems engineering perspective, measuring the
 # %% nbgrader={"grade": false, "grade_id": "compression-systems-analysis", "locked": false, "schema_version": 3, "solution": true, "task": false}
 #| export
-# ✅ IMPLEMENTATION CHECKPOINT: Ensure compression pipeline is complete
+# PASS IMPLEMENTATION CHECKPOINT: Ensure compression pipeline is complete
-# 🤔 PREDICTION: How much memory do you think a 5M parameter model uses?
+# THINK PREDICTION: How much memory do you think a 5M parameter model uses?
 # Dense model: _______ MB, 80% sparse model: _______ MB
-# 🔍 SYSTEMS INSIGHT #1: Memory Profiling Analysis
+# MAGNIFY SYSTEMS INSIGHT #1: Memory Profiling Analysis
 def profile_compression_memory():
 """
 Profile memory usage patterns during model compression.
@@ -1647,7 +1647,7 @@ def profile_compression_memory():
 tracemalloc.stop()
- # 💡 WHY THIS MATTERS: Memory is often the limiting factor for edge deployment
+ # TIP WHY THIS MATTERS: Memory is often the limiting factor for edge deployment
 # A 200MB model won't fit on a 128MB mobile device, but a 40MB compressed model will!
 return {
@@ -1659,15 +1659,15 @@ def profile_compression_memory():
 'size_reduction': original_size_mb / compressed_size_mb
 }
-# ✅ IMPLEMENTATION CHECKPOINT: Memory profiling analysis complete
+# PASS IMPLEMENTATION CHECKPOINT: Memory profiling analysis complete
-# 🤔 PREDICTION: Which device constraint is more limiting - memory or compute?
+# THINK PREDICTION: Which device constraint is more limiting - memory or compute?
 # Your guess: Memory / Compute (circle one)
-# 🔍 SYSTEMS INSIGHT #2: Deployment Constraint Analysis
+# MAGNIFY SYSTEMS INSIGHT #2: Deployment Constraint Analysis
 def analyze_deployment_scenarios():
 """Analyze how compression enables different deployment scenarios."""
- print("\n🚀 Compression Deployment Impact Analysis")
+ print("\nROCKET Compression Deployment Impact Analysis")
 print("=" * 60)
 # Define deployment constraints
@@ -1742,37 +1742,37 @@ def analyze_deployment_scenarios():
 fits_mem = config['size_mb'] <= mem_limit
 fits_comp = config['gflops'] <= compute_limit
 if fits_mem and fits_comp:
- status = "✅"
+ status = "PASS"
 elif fits_mem:
- status = "⚡" # Memory OK, compute too high
+ status = "SPEED" # Memory OK, compute too high
 elif fits_comp:
 status = "💾" # Compute OK, memory too high
 else:
- status = "❌"
+ status = "FAIL"
 fit_status.append(status)
 print(f"{name:14} | {mem_limit:4d}MB | {compute_limit:5.1f}G | "
 f"{fit_status[0]:3} | {fit_status[1]:3} | {fit_status[2]:3} | {fit_status[3]:3} | {best_option}")
- print(f"\n💡 Key Insights:")
+ print(f"\nTIP Key Insights:")
 print(f" • Compression often determines deployment feasibility")
 print(f" • Edge devices require 70-90% sparsity for deployment")
 print(f" • Mobile devices can use moderate compression (50-70%)")
 print(f" • Power constraints favor sparse models (fewer operations)")
 print(f" • Memory limits are often more restrictive than compute limits")
- # 💡 WHY THIS MATTERS: Compression is often about enabling deployment, not optimizing it
+ # TIP WHY
THIS MATTERS: Compression is often about enabling deployment, not optimizing it
 # Without compression, many edge AI applications simply wouldn't be possible!
-# ✅ IMPLEMENTATION CHECKPOINT: Deployment analysis complete
+# PASS IMPLEMENTATION CHECKPOINT: Deployment analysis complete
-# 🤔 PREDICTION: Will 90% sparsity give 10x speedup in practice?
+# THINK PREDICTION: Will 90% sparsity give 10x speedup in practice?
 # Your prediction: ___x actual speedup (vs 10x theoretical)
-# 🔍 SYSTEMS INSIGHT #3: Sparse Computation Reality Check
+# MAGNIFY SYSTEMS INSIGHT #3: Sparse Computation Reality Check
 def benchmark_sparse_inference_speedup():
 """Benchmark actual vs theoretical speedup from sparsity."""
- print("\n⚡ Sparse Inference Speedup Analysis")
+ print("\nSPEED Sparse Inference Speedup Analysis")
 print("=" * 50)
 import time
@@ -1818,13 +1818,13 @@ def benchmark_sparse_inference_speedup():
 print(f"{size[0]}x{size[1]:4} | {sparsity:6.0%} | {theoretical:9.1f}x | "
 f"{actual:5.1f}x | {efficiency:8.1%} | {notes}")
- print(f"\n🎯 Speedup Reality Check:")
+ print(f"\nTARGET Speedup Reality Check:")
 print(f" • Theoretical speedup assumes perfect sparse hardware")
 print(f" • Actual speedup limited by memory bandwidth and overhead")
 print(f" • High sparsity (>80%) shows diminishing returns")
 print(f" • Production sparse hardware (GPUs, TPUs) achieve better efficiency")
- # 💡 WHY THIS MATTERS: The gap between theoretical and actual speedup reveals
+ # TIP WHY THIS MATTERS: The gap between theoretical and actual speedup reveals
 # why structured pruning and specialized hardware are essential for production deployment!
# %% [markdown] @@ -1851,7 +1851,7 @@ def test_systems_analysis(): benchmark_sparse_inference_speedup() # All functions should run without errors - print("โœ… Systems analysis test passed!") + print("PASS Systems analysis test passed!") test_systems_analysis() @@ -1866,23 +1866,23 @@ Let's explore how pruning is used in production ML systems and connect our imple ``` PRODUCTION PRUNING LANDSCAPE: - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ FRAMEWORKS & HARDWARE โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ RESEARCH โ”‚ PRODUCTION โ”‚ DEPLOYMENT โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ ๐Ÿ” PyTorch โ”‚ โš™๏ธ TensorRT โ”‚ ๐Ÿ“ฑ Mobile Apps โ”‚ - โ”‚ torch.nn.utils โ”‚ Structured pruning โ”‚ Apple Neural Eng โ”‚ - โ”‚ .prune โ”‚ 2:4 sparsity โ”‚ Google Edge TPU โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ ๐Ÿง  TensorFlow โ”‚ ๐Ÿš€ OpenVINO โ”‚ ๐Ÿ  Smart Home โ”‚ - โ”‚ Model Optimization โ”‚ Intel optimization โ”‚ Always-on AI โ”‚ - โ”‚ Gradual pruning โ”‚ CPU/GPU sparse โ”‚ Voice assistants โ”‚ - โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค - โ”‚ ๐Ÿ”ฌ Our TinyTorch โ”‚ ๐ŸŽฏ Production-Ready โ”‚ ๐Ÿ† Success Stories โ”‚ - โ”‚ Educational 
impl. โ”‚ Magnitude + struct โ”‚ Tesla Autopilot โ”‚ - โ”‚ Magnitude pruning โ”‚ Quality validation โ”‚ Google Pixel โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + +----------------------------------------------------------------------+ + | FRAMEWORKS & HARDWARE | + +---------------------+---------------------+---------------------โ”ค + | RESEARCH | PRODUCTION | DEPLOYMENT | + +---------------------+---------------------+---------------------โ”ค + | MAGNIFY PyTorch | โš™๏ธ TensorRT | ๐Ÿ“ฑ Mobile Apps | + | torch.nn.utils | Structured pruning | Apple Neural Eng | + | .prune | 2:4 sparsity | Google Edge TPU | + +---------------------+---------------------+---------------------โ”ค + | ๐Ÿง  TensorFlow | ROCKET OpenVINO | ๐Ÿ  Smart Home | + | Model Optimization | Intel optimization | Always-on AI | + | Gradual pruning | CPU/GPU sparse | Voice assistants | + +---------------------+---------------------+---------------------โ”ค + | ๐Ÿ”ฌ Our TinyTorch | TARGET Production-Ready | ๐Ÿ† Success Stories | + | Educational impl. 
| Magnitude + struct | Tesla Autopilot | + | Magnitude pruning | Quality validation | Google Pixel | + +---------------------+---------------------+---------------------+ ``` ### Real-World Application Examples @@ -1892,22 +1892,22 @@ Let's explore how pruning is used in production ML systems and connect our imple ๐Ÿ“ฑ MOBILE PHOTOGRAPHY (Google Pixel) Original: Portrait CNN, 45MB, 120ms - Compressed: 70% pruning + quantization โ†’ 12MB, 35ms + Compressed: 70% pruning + quantization -> 12MB, 35ms Result: Real-time portrait mode on phone ๐Ÿš— AUTONOMOUS VEHICLES (Tesla FSD) Original: Object detection, 2GB, 80ms - Compressed: 50% structured pruning โ†’ 1GB, 35ms + Compressed: 50% structured pruning -> 1GB, 35ms Result: Real-time object detection for safety ๐Ÿ  SMART HOME (Alexa) Original: Wake word detection, 15MB - Compressed: 95% pruning + 8-bit quantization โ†’ 0.5MB + Compressed: 95% pruning + 8-bit quantization -> 0.5MB Result: Always-on listening with <1mW power ๐ŸŽฅ AUGMENTED REALITY (Apple ARKit) Original: Hand tracking, 80MB, 16ms - Compressed: Channel pruning + mobile optimization โ†’ 25MB, 8ms + Compressed: Channel pruning + mobile optimization -> 25MB, 8ms Result: 60fps hand tracking on mobile GPU ``` @@ -1980,7 +1980,7 @@ def compare_with_production_pruning(): print(f"{name:9} | {methods_str:12} | {hw_str:16} | {deploy_str:12} | {sim_str}") - print(f"\n๐ŸŽฏ Key Production Insights:") + print(f"\nTARGET Key Production Insights:") print(f" โ€ข Our magnitude approach is industry standard") print(f" โ€ข Production systems emphasize structured pruning for hardware") print(f" โ€ข Real frameworks integrate pruning with quantization") @@ -2048,7 +2048,7 @@ def demonstrate_pruning_applications(): print(f" Example: {app['example']}") print() - print("๐Ÿ’ก Common Patterns in Production Pruning:") + print("TIP Common Patterns in Production Pruning:") print(" โ€ข Latency-critical apps use structured pruning (regular sparsity)") print(" โ€ข Memory-constrained devices 
use aggressive unstructured pruning") print(" โ€ข Safety-critical systems use conservative pruning with validation") @@ -2057,14 +2057,14 @@ def demonstrate_pruning_applications(): # Visual success metrics print(f"๐Ÿ† Production Success Metrics:") - print(f" โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”") - print(f" โ”‚ Application โ”‚ Size Reduction โ”‚ Latency Gain โ”‚") - print(f" โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค") - print(f" โ”‚ Mobile Camera โ”‚ 4x โ”‚ 3.5x โ”‚") - print(f" โ”‚ Voice Assistant โ”‚ 30x โ”‚ 10x โ”‚") - print(f" โ”‚ Autonomous Car โ”‚ 2x โ”‚ 2.3x โ”‚") - print(f" โ”‚ AR Hand Tracking โ”‚ 3x โ”‚ 2x โ”‚") - print(f" โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜") + print(f" +----------------------------------------------------+") + print(f" | Application | Size Reduction | Latency Gain |") + print(f" +-------------------+---------------+--------------โ”ค") + print(f" | Mobile Camera | 4x | 3.5x |") + print(f" | Voice Assistant | 30x | 10x |") + print(f" | Autonomous Car | 2x | 2.3x |") + print(f" | AR Hand Tracking | 3x | 2x |") + print(f" +-------------------+---------------+--------------+") # %% [markdown] """ @@ -2085,7 +2085,7 @@ def test_production_context(): demonstrate_pruning_applications() # Both functions should run without errors and provide insights - print("โœ… Production context analysis test passed!") + print("PASS Production context analysis test passed!") test_production_context() @@ -2099,7 +2099,7 @@ Let's run a comprehensive test of all compression functionality to ensure everyt # %% nbgrader={"grade": false, "grade_id": "comprehensive-testing", "locked": false, "schema_version": 
3, "solution": false, "task": false}
 def run_all_tests():
 """Run comprehensive test suite for compression module."""
- print("🧪 Running Comprehensive Compression Test Suite")
+ print("TEST Running Comprehensive Compression Test Suite")
 print("=" * 60)
 test_functions = [
@@ -2119,18 +2119,18 @@ def run_all_tests():
 print(f"\n{'='*20} {test_name} {'='*20}")
 try:
 test_func()
- print(f"✅ {test_name}: PASSED")
+ print(f"PASS {test_name}: PASSED")
 passed += 1
 except Exception as e:
- print(f"❌ {test_name}: FAILED - {e}")
+ print(f"FAIL {test_name}: FAILED - {e}")
- print(f"\n🎯 Test Results: {passed}/{total} tests passed")
+ print(f"\nTARGET Test Results: {passed}/{total} tests passed")
 if passed == total:
- print("🎉 All compression tests passed! Module implementation complete.")
+ print("CELEBRATE All compression tests passed! Module implementation complete.")
 # Show final demo
- print(f"\n🚀 Final Compression Demo:")
+ print(f"\nROCKET Final Compression Demo:")
 print("=" * 50)
 # Create a realistic model and compress it
@@ -2146,15 +2146,15 @@ def run_all_tests():
 original_params = sum(w.size for w in demo_model.values())
 compressed_params = sum(np.sum(info['weights'] != 0) for info in compressed.values())
- print(f"🎯 FINAL RESULT:")
+ print(f"TARGET FINAL RESULT:")
 print(f" Original model: {original_params:,} parameters")
 print(f" Compressed model: {compressed_params:,} parameters")
 print(f" Compression achieved: {original_params/compressed_params:.1f}x smaller")
 print(f" Size reduction: {(1-compressed_params/original_params)*100:.1f}% of parameters removed")
- print(f" ✅ Ready for edge deployment!")
+ print(f" PASS Ready for edge deployment!")
 else:
- print(f"⚠️ {total - passed} tests failed. Review implementation.")
+ print(f"WARNING️ {total - passed} tests failed.
Review implementation.")
 # Run all systems insights
 profile_compression_memory()
@@ -2166,7 +2166,7 @@ if __name__ == "__main__":
 # %% [markdown]
 """
-## 🤔 ML Systems Thinking: Interactive Questions
+## THINK ML Systems Thinking: Interactive Questions
 Now that you've implemented neural network pruning, let's reflect on the systems engineering principles and production deployment considerations.
@@ -2189,7 +2189,7 @@ d) Explain why the threshold approach guarantees the target sparsity level
 ### BEGIN SOLUTION
 a) Sorted weights by magnitude: [0.02, 0.03, 0.05, 0.09, 0.1, 0.3, 0.4, 0.6, 0.7, 0.8]
 70th percentile (keep top 30%) = weights[7] = 0.6
- Threshold = 0.6 (keep weights ≥ 0.6)
+ Threshold = 0.6 (keep weights >= 0.6)
 b) Binary mask for original array [0.8, 0.1, 0.05, 0.3, 0.02, 0.7, 0.4, 0.09, 0.6, 0.03]:
 Mask: [1, 0, 0, 0, 0, 1, 0, 0, 1, 0]
@@ -2197,7 +2197,7 @@ b) Binary mask for original array [0.8, 0.1, 0.05, 0.3, 0.02, 0.7, 0.4, 0.09, 0.
 c) Compression ratio calculation:
 - Original parameters: 10
- - Surviving parameters: 3 (values ≥ 0.6)
+ - Surviving parameters: 3 (values >= 0.6)
 - Actual sparsity: 7/10 = 70% exactly
 - Compression ratio: 10/3 = 3.33x
@@ -2224,28 +2224,28 @@ d) For mobile deployment requiring <50 parameters, which pruning strategy works?
 ### BEGIN SOLUTION
 a) Unstructured pruning (75% sparsity):
- - Original parameters: 8 × 4 × 3 × 3 = 288
+ - Original parameters: 8 * 4 * 3 * 3 = 288
 - Sparsity = 75% means keep 25% of weights
- - Remaining parameters: 288 × 0.25 = 72 parameters
+ - Remaining parameters: 288 * 0.25 = 72 parameters
 - Compression ratio: 288/72 = 4x
 - BUT: Still need to store 288 values (with zeros), irregular sparsity pattern
 b) Structured pruning (remove 6 filters, keep 2):
 - Filters removed: 6/8 = 75% of filters
- - Remaining parameters: 2 × 4 × 3 × 3 = 72 parameters
+ - Remaining parameters: 2 * 4 * 3 * 3 = 72 parameters
 - Compression ratio: 288/72 = 4x (same as unstructured)
- - BUT: Dense 2×4×3×3 tensor, no zeros to store
+ - BUT: Dense 2*4*3*3 tensor, no zeros to store
 c) Structured provides better actual speedup because:
- - Dense computation on smaller tensor (2×4×3×3) vs sparse on large (8×4×3×3)
+ - Dense computation on smaller tensor (2*4*3*3) vs sparse on large (8*4*3*3)
 - No conditional branching (if weight != 0) in inner loops
 - Better cache locality with contiguous memory access
 - Can use optimized BLAS/convolution libraries
 - Unstructured requires specialized sparse kernels (often unavailable)
 d) For <50 parameters mobile constraint:
- - Unstructured: 72 remaining parameters > 50 → doesn't fit
- - Structured: Need 50/36 ≈ 1.4 filters → keep 1 filter = 36 parameters ✓
+ - Unstructured: 72 remaining parameters > 50 -> doesn't fit
+ - Structured: Need 50/36 ~= 1.4 filters -> keep 1 filter = 36 parameters OK
 - Structured pruning better for extreme resource constraints
 ### END SOLUTION
 """
@@ -2272,9 +2272,9 @@ c) Recommend optimal compression strategy for each deployment target
 ### BEGIN SOLUTION
 a) Device compatibility analysis:
- - Mobile (50MB, 10 GFLOPS): ✗ 50% (100MB > 50MB), ✓ 70% (60MB, 12 GFLOPS), ✓ 90%
+ - Mobile (50MB, 10 GFLOPS): ✗ 50% (100MB > 50MB), OK 70% (60MB, 12 GFLOPS), OK 90%
 - IoT (10MB, 1 GFLOPS): ✗ 50%, ✗ 70%, ✗
90% (4 GFLOPS > 1 GFLOPS)
- - Edge Server (500MB, 100 GFLOPS): ✓ All options work
+ - Edge Server (500MB, 100 GFLOPS): OK All options work
 b) Accuracy-efficiency tradeoff (accuracy/memory ratio):
 - 50% sparse: 94%/100MB = 0.94%/MB
@@ -2287,8 +2287,8 @@ c) Optimal recommendations:
 - Edge Server: 50% sparse (maximum accuracy 94% with abundant resources)
 IoT solution: Combine 90% pruning + 8-bit quantization + structured pruning:
- - Memory: 20MB → 5MB (quantization) → 2MB (structured) ✓
- - Compute: 4 GFLOPS → 1 GFLOPS (structured optimization) ✓
+ - Memory: 20MB -> 5MB (quantization) -> 2MB (structured) OK
+ - Compute: 4 GFLOPS -> 1 GFLOPS (structured optimization) OK
 ### END SOLUTION
 """
@@ -2311,32 +2311,32 @@ d) Analyze the business case for different deployment scales
 ### BEGIN SOLUTION
 a) Daily compute cost calculation:
 Dense model:
- - 1M requests × 50ms = 50,000 seconds = 13.9 hours
- - Daily cost: 13.9 hours × $0.10 = $1.39/day
+ - 1M requests * 50ms = 50,000 seconds = 13.9 hours
+ - Daily cost: 13.9 hours * $0.10 = $1.39/day
 Compressed model:
- - 1M requests × 20ms = 20,000 seconds = 5.6 hours
- - Daily cost: 5.6 hours × $0.04 = $0.22/day
+ - 1M requests * 20ms = 20,000 seconds = 5.6 hours
+ - Daily cost: 5.6 hours * $0.04 = $0.22/day
 Daily savings: $1.39 - $0.22 = $1.17/day
 b) Infrastructure analysis:
- - Memory savings: 500MB → 100MB = 5x reduction
+ - Memory savings: 500MB -> 100MB = 5x reduction
 - Server capacity: 5x more models per server (memory bound)
- - Latency improvement: 50ms → 20ms = 2.5x faster response
+ - Latency improvement: 50ms -> 20ms = 2.5x faster response
 - Throughput: 2.5x more requests per server
 c) Break-even timeline:
 - Development cost: $50,000
 - Daily savings: $1.17
- - Break-even: $50,000 ÷ $1.17 = 42,735 days ≈ 117 years!
+ - Break-even: $50,000 / $1.17 = 42,735 days ~= 117 years!
 This seems wrong - let me recalculate for realistic scale:
 At 100M requests/day (large scale):
- Dense: 1,389 hours × $0.10 = $138.90/day
- Compressed: 556 hours × $0.04 = $22.24/day
+ - Dense: 1,389 hours * $0.10 = $138.90/day
+ - Compressed: 556 hours * $0.04 = $22.24/day
 - Daily savings: $116.66
- Break-even: $50,000 ÷ $116.66 = 428 days ≈ 14 months ✓
+ - Break-even: $50,000 / $116.66 = 428 days ~= 14 months OK
 d) Business case by scale:
 - Small scale (<1M/day): ROI unclear, focus on accuracy
@@ -2377,11 +2377,11 @@ b) The structured vs unstructured tradeoff:
 - Inference speed: structured pruning provides actual speedup, unstructured often theoretical only
 c) Layer-specific sparsity tolerance:
-- Linear layers: High redundancy, many parameters, more overparametrized → tolerate 80% sparsity
-- Conv layers: Fewer parameters, each filter captures important spatial features → more sensitive
-- First layers: Extract low-level features (edges, textures) → very sensitive to pruning
-- Later layers: More abstract features with redundancy → can handle moderate pruning
-- Output layers: Critical for final predictions → require conservative pruning
+- Linear layers: High redundancy, many parameters, more overparametrized -> tolerate 80% sparsity
+- Conv layers: Fewer parameters, each filter captures important spatial features -> more sensitive
+- First layers: Extract low-level features (edges, textures) -> very sensitive to pruning
+- Later layers: More abstract features with redundancy -> can handle moderate pruning
+- Output layers: Critical for final predictions -> require conservative pruning
 """
@@ -2409,10 +2409,10 @@ a) Lower actual speedup due to multiple bottlenecks:
 - Hardware mismatch: Most CPUs/GPUs optimized for dense linear algebra, not sparse
 b) Hardware-driven pruning requirements:
-- Mobile: Strict memory (4GB total), battery, thermal constraints → need aggressive 70-90% sparsity
-- Edge servers: More memory (16GB+), power,
cooling โ†’ moderate 50% sparsity sufficient -- Cloud: Abundant resources โ†’ pruning for cost optimization, not necessity -- Embedded/IoT: Extreme constraints (MB not GB) โ†’ need structured pruning + quantization +- Mobile: Strict memory (4GB total), battery, thermal constraints -> need aggressive 70-90% sparsity +- Edge servers: More memory (16GB+), power, cooling -> moderate 50% sparsity sufficient +- Cloud: Abundant resources -> pruning for cost optimization, not necessity +- Embedded/IoT: Extreme constraints (MB not GB) -> need structured pruning + quantization - Different hardware accelerators: Edge TPU loves sparsity, standard GPUs don't benefit much c) Pruning-friendly architecture design: @@ -2515,7 +2515,7 @@ c) Future evolution predictions: # %% [markdown] """ -## ๐ŸŽฏ MODULE SUMMARY: Compression - Neural Network Pruning for Edge Deployment +## TARGET MODULE SUMMARY: Compression - Neural Network Pruning for Edge Deployment ### What You Accomplished diff --git a/modules/18_caching/caching_dev.py b/modules/18_caching/caching_dev.py index 28fe94a4..2dea1b0b 100644 --- a/modules/18_caching/caching_dev.py +++ b/modules/18_caching/caching_dev.py @@ -21,7 +21,7 @@ Welcome to the KV Caching module! You'll implement the key-value cache optimizat - Systems insight: How memory management enables dramatic speedups - Incremental computation: Build systems that efficiently reuse previous work -## Build โ†’ Profile โ†’ Optimize +## Build -> Profile -> Optimize 1. **Build**: Implement KV caching for multi-head attention with incremental generation 2. **Profile**: Compare O(Nยฒ) vs O(N) performance and memory usage patterns 3. 
**Optimize**: Apply caching to complete transformer inference pipeline @@ -35,9 +35,9 @@ By the end of this module, you'll understand: - Connection to how ChatGPT, GPT-4, and other LLMs achieve fast response times ## Systems Reality Check -๐Ÿ’ก **Production Context**: GPT-4 uses KV caching for all inference - without it, generating 100 tokens would take minutes instead of seconds -โšก **Performance Note**: KV caching is the difference between research models and production LLMs -๐Ÿ”ฅ **Memory Trade-off**: Cache grows with sequence length but saves quadratic recomputation +TIP **Production Context**: GPT-4 uses KV caching for all inference - without it, generating 100 tokens would take minutes instead of seconds +SPEED **Performance Note**: KV caching is the difference between research models and production LLMs +FIRE **Memory Trade-off**: Cache grows with sequence length but saves quadratic recomputation """ # %% nbgrader={"grade": false, "grade_id": "caching-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} @@ -82,13 +82,13 @@ except ImportError: self.dropout = dropout # %% nbgrader={"grade": false, "grade_id": "caching-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} -print("๐Ÿš€ TinyTorch KV Caching Module") +print("ROCKET TinyTorch KV Caching Module") print(f"NumPy version: {np.__version__}") print("Ready to implement the most sophisticated optimization!") # %% [markdown] """ -## ๐Ÿ“ฆ Where This Code Lives in the Final Package +## PACKAGE Where This Code Lives in the Final Package **Learning Side:** You work in `modules/source/19_caching/caching_dev.py` **Building Side:** Code exports to `tinytorch.core.caching` @@ -128,7 +128,7 @@ Generate token N: Attend to [token_1, ..., token_{N-1}] ### Memory and Compute Analysis For each new token, traditional attention: 1. **Recomputes K,V** for all previous tokens (wasted computation) -2. 
**Attention matrix** grows: 1ร—1, 2ร—2, 3ร—3, ..., Nร—N (quadratic memory) +2. **Attention matrix** grows: 1*1, 2*2, 3*3, ..., N*N (quadratic memory) 3. **Total operations**: 1ยฒ + 2ยฒ + 3ยฒ + ... + Nยฒ = O(Nยณ) for full sequence! **This is why naive transformer generation is impossibly slow for long sequences.** @@ -190,7 +190,7 @@ class KVCache: MEMORY LAYOUT: - Cache per layer: keys[seq_len, n_heads, head_dim] - Cache per layer: values[seq_len, n_heads, head_dim] - - Total memory: 2 ร— n_layers ร— max_seq_len ร— n_heads ร— head_dim + - Total memory: 2 * n_layers * max_seq_len * n_heads * head_dim Args: max_seq_len: Maximum sequence length to cache @@ -387,20 +387,20 @@ def test_kv_cache(): cache.reset() assert cache.current_position == 0, "Reset should return to position 0" - print("โœ… KV Cache tests passed!") + print("PASS KV Cache tests passed!") print(f" Cache capacity: {memory_info['total_cache_size_mb']:.2f} MB") - print(f" Memory efficiency: O(L ร— N ร— H ร— D) scaling") + print(f" Memory efficiency: O(L * N * H * D) scaling") # Run the test test_kv_cache() -# โœ… IMPLEMENTATION CHECKPOINT: Basic KV Cache complete +# PASS IMPLEMENTATION CHECKPOINT: Basic KV Cache complete -# ๐Ÿค” PREDICTION: How much memory would a KV cache use for GPT-3? +# THINK PREDICTION: How much memory would a KV cache use for GPT-3? 
# GPT-3: 96 layers, 96 heads, 128 head_dim, 2048 max tokens # Your guess: _____ GB -# ๐Ÿ” SYSTEMS INSIGHT #1: Cache Memory Scaling Analysis +# MAGNIFY SYSTEMS INSIGHT #1: Cache Memory Scaling Analysis def analyze_cache_memory_scaling(): """Analyze how KV cache memory scales with model and sequence parameters.""" try: @@ -434,18 +434,18 @@ def analyze_cache_memory_scaling(): print(f"{config['name']:<15} {config['layers']:<8} {total_mb:<12.1f}MB {per_token_kb:<12.1f}KB") - print(f"\n๐Ÿ” Key Insights:") - print(f" โ€ข Memory scales as: O(Layers ร— Heads ร— HeadDim ร— SeqLen)") - print(f" โ€ข Each token adds: 2 ร— Layers ร— Heads ร— HeadDim ร— 4 bytes") + print(f"\nMAGNIFY Key Insights:") + print(f" โ€ข Memory scales as: O(Layers * Heads * HeadDim * SeqLen)") + print(f" โ€ข Each token adds: 2 * Layers * Heads * HeadDim * 4 bytes") print(f" โ€ข GPT-3 cache: ~2.4GB for full 2048-token context!") print(f" โ€ข Trade-off: Large memory cost but eliminates O(Nยฒ) recomputation") - # ๐Ÿ’ก WHY THIS MATTERS: Understanding memory scaling helps design + # TIP WHY THIS MATTERS: Understanding memory scaling helps design # systems that can handle large models and long sequences efficiently. # Real inference servers must budget memory for multiple concurrent caches! 
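The per-token rule printed in the insights above (2 * Layers * Heads * HeadDim * 4 bytes) can be checked with a standalone calculation. This is a minimal sketch, not code from the module: the helper name `kv_cache_bytes` and the float32 default are assumptions made for illustration. For the GPT-3 shape given in the prediction prompt it comes to about 18 GiB at float32, which is well above the ~2.4GB quoted in the printed insight, so that figure is worth re-checking.

```python
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=4):
    """Bytes needed to cache K and V for one sequence (hypothetical helper)."""
    # Factor of 2 covers the separate key and value tensors in every layer.
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem
    return per_token * seq_len


# GPT-3 shape from the prediction prompt: 96 layers, 96 heads, head_dim 128.
per_token = kv_cache_bytes(96, 96, 128, seq_len=1)
full_context = kv_cache_bytes(96, 96, 128, seq_len=2048)

print(f"per token:          {per_token / 1024**2:.1f} MiB")
print(f"2048-token context: {full_context / 1024**3:.1f} GiB")
print(f"same cache in FP16: {kv_cache_bytes(96, 96, 128, 2048, 2) / 1024**3:.1f} GiB")
```

Halving `bytes_per_elem` halves the cache, which is the effect the mixed-precision item in the production optimizations list relies on.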
except Exception as e: - print(f"โš ๏ธ Error in memory analysis: {e}") + print(f"WARNING๏ธ Error in memory analysis: {e}") print("Make sure KVCache class is implemented correctly") # Analyze cache memory scaling @@ -745,23 +745,23 @@ def test_cached_attention(): # The outputs should be similar (not exactly equal due to different computation paths) assert full_output.shape == (batch_size, 2, embed_dim), "Full sequence output should have correct shape" - print("โœ… Cached Attention tests passed!") + print("PASS Cached Attention tests passed!") print(f" Memory saved: {cache.get_memory_usage()['used_cache_size_mb']:.2f} MB cache vs full recomputation") print(f" Cache position: {cache.current_position}") # Run the test test_cached_attention() -# โœ… IMPLEMENTATION CHECKPOINT: Cached Attention complete +# PASS IMPLEMENTATION CHECKPOINT: Cached Attention complete -# ๐Ÿค” PREDICTION: How much faster is cached vs non-cached attention for 100 tokens? +# THINK PREDICTION: How much faster is cached vs non-cached attention for 100 tokens? 
# Your guess: ___x faster -# ๐Ÿ” SYSTEMS INSIGHT #2: Attention Performance Comparison +# MAGNIFY SYSTEMS INSIGHT #2: Attention Performance Comparison def analyze_attention_performance_scaling(): """Compare cached vs non-cached attention across different sequence lengths.""" try: - print("\nโšก Attention Performance Scaling Analysis") + print("\nSPEED Attention Performance Scaling Analysis") print("=" * 45) embed_dim = 64 @@ -801,19 +801,19 @@ def analyze_attention_performance_scaling(): print(f"{seq_len:<10} {cached_time:<12.2f} {non_cached_time:<15.2f} {speedup:<10.2f}x") - print(f"\n๐Ÿ” Key Insights:") + print(f"\nMAGNIFY Key Insights:") print(f" โ€ข Speedup increases with sequence length (more reuse!)") print(f" โ€ข Cached: O(N) complexity per token") print(f" โ€ข Non-cached: O(Nยฒ) complexity per token") print(f" โ€ข Break-even typically around 20-50 tokens") print(f" โ€ข Memory cost: Linear cache vs quadratic recomputation") - # ๐Ÿ’ก WHY THIS MATTERS: This analysis shows why KV caching is essential + # TIP WHY THIS MATTERS: This analysis shows why KV caching is essential # for any practical transformer deployment. The speedup becomes dramatic # for longer sequences that are common in real applications! 
except Exception as e: - print(f"โš ๏ธ Error in performance analysis: {e}") + print(f"WARNING๏ธ Error in performance analysis: {e}") print("Make sure cached attention is implemented correctly") # Analyze attention performance scaling @@ -1048,7 +1048,7 @@ def test_cached_generation(): "Initial tokens should be preserved in output" ) - print("โœ… Cached Generation tests passed!") + print("PASS Cached Generation tests passed!") print(f" Generated sequence length: {generated_sequence.shape[1]}") print(f" Processing time: {cached_time:.3f}s") print(f" Memory efficiency: O(N) per step instead of O(Nยฒ)") @@ -1056,17 +1056,17 @@ def test_cached_generation(): # Run the test test_cached_generation() -# โœ… IMPLEMENTATION CHECKPOINT: Cached Generation complete +# PASS IMPLEMENTATION CHECKPOINT: Cached Generation complete -# ๐Ÿค” PREDICTION: For a 1000-token story, how many fewer operations does caching save? +# THINK PREDICTION: For a 1000-token story, how many fewer operations does caching save? 
# Without cache: ~333 million operations, With cache: ~1 million operations # Your calculation: _____ million operations saved -# ๐Ÿ” SYSTEMS INSIGHT #3: Generation Efficiency Analysis +# MAGNIFY SYSTEMS INSIGHT #3: Generation Efficiency Analysis def analyze_generation_efficiency(): """Analyze the computational savings from KV caching in text generation.""" try: - print("\n๐Ÿš€ Text Generation Efficiency Analysis") + print("\nROCKET Text Generation Efficiency Analysis") print("=" * 45) # Analyze different generation scenarios @@ -1098,10 +1098,10 @@ def analyze_generation_efficiency(): print(f"{scenario['name']:<15} {n:<8} {ops_without_str:<15} {ops_with_str:<12} {reduction:<12.0f}x") - print(f"\n๐Ÿ” Computational Complexity:") + print(f"\nMAGNIFY Computational Complexity:") print(f" โ€ข Without Cache: O(Nยณ) total operations for N-token generation") print(f" โ€ข With Cache: O(Nยฒ) total operations for N-token generation") - print(f" โ€ข Memory Trade-off: O(Lร—Hร—Dร—N) cache vs O(Nยณ) recomputation") + print(f" โ€ข Memory Trade-off: O(L*H*D*N) cache vs O(Nยณ) recomputation") print(f" โ€ข Real Impact: Makes GPT-style models practical for generation") # Test actual generation performance @@ -1124,12 +1124,12 @@ def analyze_generation_efficiency(): print(f" Rate: {result.shape[1]/generation_time:.1f} tokens/second") print(f" This enables real-time conversational AI!") - # ๐Ÿ’ก WHY THIS MATTERS: This dramatic computational savings is what + # TIP WHY THIS MATTERS: This dramatic computational savings is what # makes conversational AI possible. Without KV caching, chatbots would # take minutes to generate simple responses! 
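The operation counts in the prediction prompt can be reproduced with a simple cost model. This sketch assumes that generating token k costs k*k units without a cache (the full k-by-k attention is recomputed) and k units with a cache (one new query against k stored keys). The function names are hypothetical, and the module's own `calculate_theoretical_speedup` may count operations differently.

```python
def ops_without_cache(n_tokens):
    """Total cost when every step recomputes the full k x k attention."""
    return sum(k * k for k in range(1, n_tokens + 1))


def ops_with_cache(n_tokens):
    """Total cost when each step only attends from one new query to k cached keys."""
    return sum(range(1, n_tokens + 1))


n = 1000
without = ops_without_cache(n)
with_cache = ops_with_cache(n)

print(f"without cache: {without:,} ops")
print(f"with cache:    {with_cache:,} ops")
print(f"reduction:     {without / with_cache:.0f}x")
```

Under this model the 1000-token case comes to roughly 334 million versus 0.5 million operations, the same orders of magnitude as the ~333 million and ~1 million quoted above. The ratio grows linearly with sequence length, matching the O(N³) versus O(N²) totals.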
except Exception as e: - print(f"โš ๏ธ Error in efficiency analysis: {e}") + print(f"WARNING๏ธ Error in efficiency analysis: {e}") print("Make sure generation functions are implemented correctly") # Analyze generation efficiency @@ -1182,7 +1182,7 @@ def calculate_theoretical_speedup(seq_len: int) -> Dict[str, int]: def format_performance_results(results: List[Dict[str, Any]]) -> None: """Format and display performance analysis results in a readable table.""" - print(f"\n๐Ÿ“ˆ Performance Summary:") + print(f"\nPROGRESS Performance Summary:") print(f"{'Seq Len':<8} {'Memory(MB)':<12} {'Speedup':<10} {'Memory/Speedup':<15}") print("-" * 50) @@ -1197,7 +1197,7 @@ def analyze_kv_cache_performance(): This function has been refactored into smaller, focused helper functions for better readability and maintainability. """ - print("๐Ÿ” Analyzing KV Cache Performance Characteristics...") + print("MAGNIFY Analyzing KV Cache Performance Characteristics...") # Define test configuration (reduced for faster testing) test_config = { @@ -1277,8 +1277,8 @@ def _display_analysis_summary(results: List[Dict[str, Any]], sequence_lengths: L """Display formatted summary and key insights.""" format_performance_results(results) - print(f"\n๐ŸŽฏ Key Insights:") - print(f" โ€ข Memory scales as O(L ร— N ร— H ร— D) where L=layers, N=seq_len, H=heads, D=head_dim") + print(f"\nTARGET Key Insights:") + print(f" โ€ข Memory scales as O(L * N * H * D) where L=layers, N=seq_len, H=heads, D=head_dim") print(f" โ€ข Computation scales as O(Nยฒ) with cache vs O(Nยณ) without") print(f" โ€ข Break-even point: ~{sequence_lengths[1]} tokens for this configuration") print(f" โ€ข Memory-efficiency trade-off: more cache memory for better performance") @@ -1345,14 +1345,14 @@ def explore_production_kv_caching(): for system in systems: # Calculate cache memory requirements - # 2 (K + V) ร— layers ร— max_context ร— heads ร— head_dim ร— 4 bytes (float32) + # 2 (K + V) * layers * max_context * heads * head_dim * 4 
bytes (float32) cache_size_bytes = (2 * system['layers'] * system['max_context'] * system['heads'] * system['head_dim'] * 4) cache_size_gb = cache_size_bytes / (1024**3) print(f"{system['name']:<15} {cache_size_gb:<12.2f}GB {system['max_context']:<12} {system['use_case']:<15}") - print(f"\n๐Ÿ’ก Production Optimizations:") + print(f"\nTIP Production Optimizations:") print(f" โ€ข Memory pooling: Reuse cache memory across requests") print(f" โ€ข Batch processing: Share cache computation across multiple queries") print(f" โ€ข Attention masks: Skip computation for padded tokens") @@ -1360,13 +1360,13 @@ def explore_production_kv_caching(): print(f" โ€ข Mixed precision: Use FP16/INT8 to reduce cache memory") print(f" โ€ข Flash Attention: Optimize memory access patterns") - print(f"\nโšก Real-World Performance Impact:") + print(f"\nSPEED Real-World Performance Impact:") print(f" โ€ข Without KV cache: GPT would take minutes to generate short responses") print(f" โ€ข With KV cache: Real-time conversation becomes possible") print(f" โ€ข Memory cost: 1-10GB RAM per conversation depending on model size") print(f" โ€ข Speedup: 10-100x faster generation for typical use cases") - print(f"\n๐ŸŽฏ Why This Matters for ML Engineers:") + print(f"\nTARGET Why This Matters for ML Engineers:") print(f" โ€ข KV caching is THE optimization that makes LLMs practical") print(f" โ€ข Memory management becomes critical at scale") print(f" โ€ข Understanding trade-offs helps design better systems") @@ -1385,7 +1385,7 @@ Complete validation of our KV caching implementation. 
# %% nbgrader={"grade": true, "grade_id": "comprehensive-tests", "locked": false, "points": 20, "schema_version": 3, "solution": false, "task": false} def run_comprehensive_tests(): """Run all tests to validate KV caching implementation.""" - print("๐Ÿงช Running Comprehensive KV Caching Tests") + print("TEST Running Comprehensive KV Caching Tests") print("=" * 50) # Test 1: Cache capacity and bounds checking @@ -1409,7 +1409,7 @@ def run_comprehensive_tests(): except ValueError: pass # Expected - print(" โœ… Capacity management works") + print(" PASS Capacity management works") # Test 2: Multi-layer cache consistency print("Test 2: Multi-layer Consistency...") @@ -1432,7 +1432,7 @@ def run_comprehensive_tests(): np.testing.assert_array_equal(cached_k.data, expected_k, f"Layer {layer} keys incorrect") np.testing.assert_array_equal(cached_v.data, expected_v, f"Layer {layer} values incorrect") - print(" โœ… Multi-layer consistency works") + print(" PASS Multi-layer consistency works") # Test 3: Attention output consistency print("Test 3: Attention Consistency...") @@ -1467,7 +1467,7 @@ def run_comprehensive_tests(): diff = np.abs(cached_outputs[-1] - full_output.data[:, -1:, :]).mean() assert diff < 1.0, f"Cached and non-cached outputs too different: {diff}" - print(" โœ… Attention consistency acceptable") + print(" PASS Attention consistency acceptable") # Test 4: Memory profiling print("Test 4: Memory Profiling...") @@ -1486,9 +1486,9 @@ def run_comprehensive_tests(): print(f" Actual memory usage: {memory_mb:.2f} MB") print(f" Theoretical cache size: {theoretical_mb:.2f} MB") - print(" โœ… Memory usage within expected range") + print(" PASS Memory usage within expected range") - print("\n๐ŸŽ‰ All Comprehensive Tests Passed!") + print("\nCELEBRATE All Comprehensive Tests Passed!") print("KV caching implementation is working correctly!") # Run comprehensive tests @@ -1503,7 +1503,7 @@ Consolidate all test execution for when the module is run directly. 
# %% if __name__ == "__main__": - print("๐Ÿš€ TinyTorch KV Caching Module - Complete Test Suite") + print("ROCKET TinyTorch KV Caching Module - Complete Test Suite") print("=" * 60) # Run all tests in sequence @@ -1525,16 +1525,16 @@ if __name__ == "__main__": run_comprehensive_tests() print("\n" + "=" * 60) - print("๐ŸŽฏ MODULE COMPLETE: KV Caching Implementation") + print("TARGET MODULE COMPLETE: KV Caching Implementation") print("=" * 60) - print("โœ… All tests passed!") - print("โœ… Performance analysis complete!") - print("โœ… Production context understood!") + print("PASS All tests passed!") + print("PASS Performance analysis complete!") + print("PASS Production context understood!") print("\nYou now understand the most sophisticated transformer optimization!") # %% [markdown] """ -## ๐Ÿค” ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Reflect on how KV caching transforms transformer systems and enables production deployments. """ @@ -1618,14 +1618,14 @@ Reflect on how KV caching transforms transformer systems and enables production # %% [markdown] """ -## ๐ŸŽฏ MODULE SUMMARY: KV Caching - The Most Sophisticated Optimization +## TARGET MODULE SUMMARY: KV Caching - The Most Sophisticated Optimization ### What You've Accomplished -โœ… **KVCache Implementation**: 200+ lines of sophisticated cache management with memory-efficient storage and retrieval -โœ… **CachedMultiHeadAttention**: Complete attention mechanism with O(N) complexity instead of O(Nยฒ) -โœ… **Autoregressive Generation**: Full text generation pipeline with dramatic performance improvements -โœ… **Systems Analysis**: Comprehensive memory profiling and performance benchmarking across model scales -โœ… **Production Context**: Understanding of real-world deployment challenges and optimization strategies +PASS **KVCache Implementation**: 200+ lines of sophisticated cache management with memory-efficient storage and retrieval +PASS 
**CachedMultiHeadAttention**: Complete attention mechanism with O(N) complexity instead of O(N²)
+PASS **Autoregressive Generation**: Full text generation pipeline with dramatic performance improvements
+PASS **Systems Analysis**: Comprehensive memory profiling and performance benchmarking across model scales
+PASS **Production Context**: Understanding of real-world deployment challenges and optimization strategies

 ### Key Learning Outcomes
 - **Algorithmic Transformation**: Mastered how changing the algorithm (not just implementation) achieves orders-of-magnitude speedups
@@ -1634,8 +1634,8 @@ Reflect on how KV caching transforms transformer systems and enables production
 - **Systems Engineering**: Gained insight into memory management, cache eviction, and resource optimization at scale

 ### Mathematical Foundations Mastered
-- **Complexity Analysis**: O(N³) → O(N²) total operations transformation for sequence generation
-- **Memory Scaling**: O(L × N × H × D) cache memory requirements across layers, sequence length, heads, and dimensions
+- **Complexity Analysis**: O(N³) -> O(N²) total operations transformation for sequence generation
+- **Memory Scaling**: O(L * N * H * D) cache memory requirements across layers, sequence length, heads, and dimensions
 - **Performance Metrics**: Break-even analysis between cache memory cost and computational savings

 ### Professional Skills Developed
@@ -1648,32 +1648,32 @@ Reflect on how KV caching transforms transformer systems and enables production
 Complexity Transformation Achieved:

 Without KV Cache (O(N³) total):
-Token 1: [■] ← 0 ops
-Token 2: [■]───[■] ← 1 op
-Token 3: [■]───[■]───[■] ← 4 ops (recompute all)
-Token 4: [■]───[■]───[■]───[■] ← 9 ops (recompute all)
+Token 1: [■] <- 0 ops
+Token 2: [■]---[■] <- 1 op
+Token 3: [■]---[■]---[■] <- 4 ops (recompute all)
+Token 4: [■]---[■]---[■]---[■] <- 9 ops (recompute all)
 ...
 Total: 0 + 1 + 4 + 9 + 16 + ... = O(N³) scaling

 With KV Cache (O(N²) total):
-Token 1: [■] → Cache ← 1 op + store
-Token 2: [C]───[■] → Cache ← 1 op + reuse
-Token 3: [C]───[C]───[■] ← 1 op + reuse
-Token 4: [C]───[C]───[C]───[■] ← 1 op + reuse
+Token 1: [■] -> Cache <- 1 op + store
+Token 2: [C]---[■] -> Cache <- 1 op + reuse
+Token 3: [C]---[C]---[■] <- 1 op + reuse
+Token 4: [C]---[C]---[C]---[■] <- 1 op + reuse
 ...
 Total: 1 + 1 + 1 + 1 + ... = O(N) per token, O(N²) total

 Memory Layout You Implemented:
-┌──────────────────────────────────────────────────┐
-│ KVCache: Multi-Layer Storage System              │
-├──────────────────────────────────────────────────┤
-│ Layer 0: K[seq_len, heads, head_dim]             │
-│          V[seq_len, heads, head_dim]             │
-├──────────────────────────────────────────────────┤
-│ Layer 1: K[seq_len, heads, head_dim]             │
-│          V[seq_len, heads, head_dim]             │
-└──────────────────────────────────────────────────┘
-Position Tracking: current_position → shared across layers
+--------------------------------------------------+
+| KVCache: Multi-Layer Storage System              |
++--------------------------------------------------┤
+| Layer 0: K[seq_len, heads, head_dim]             |
+|          V[seq_len, heads, head_dim]             |
++--------------------------------------------------┤
+| Layer 1: K[seq_len, heads, head_dim]             |
+|          V[seq_len, heads, head_dim]             |
++--------------------------------------------------+
+Position Tracking: current_position -> shared across layers
 ```

 ### Ready for Advanced Applications
@@
-1700,7 +1700,7 @@ Your implementation mirrors production systems: 3. **Explore advanced features**: Multi-precision caching, Flash Attention integration 4. **Ready for Production**: Apply these techniques to real transformer deployments -**Congratulations!** Your KV caching implementation represents the pinnacle of transformer optimization - the algorithmic innovation that makes conversational AI possible. You've mastered the most sophisticated optimization in modern ML systems! ๐Ÿš€ +**Congratulations!** Your KV caching implementation represents the pinnacle of transformer optimization - the algorithmic innovation that makes conversational AI possible. You've mastered the most sophisticated optimization in modern ML systems! ROCKET This completes your journey through transformer optimization techniques - from basic implementations to the algorithmic innovations that power production AI systems. """ \ No newline at end of file diff --git a/modules/19_benchmarking/benchmarking_dev.py b/modules/19_benchmarking/benchmarking_dev.py index 9ab33c6d..fabeac77 100644 --- a/modules/19_benchmarking/benchmarking_dev.py +++ b/modules/19_benchmarking/benchmarking_dev.py @@ -85,10 +85,10 @@ def _check_profiler_availability(): """Check if TinyTorch profiler is available and explain implications.""" try: from tinytorch.utils.profiler import SimpleProfiler, profile_function - print("โœ… TinyTorch profiler loaded - using advanced timing") + print("PASS TinyTorch profiler loaded - using advanced timing") return True, SimpleProfiler, profile_function except ImportError: - print("โš ๏ธ TinyTorch profiler not available") + print("WARNING๏ธ TinyTorch profiler not available") print(" Make sure Module 15 (Profiling) is completed first") print(" Using basic timing as fallback") return False, None, None @@ -104,11 +104,11 @@ Before diving into the full competition, let's understand the core concepts step # %% def simple_timing_demo(): - """๐ŸŽฏ Learning Checkpoint 1: Basic Performance 
Measurement + """TARGET Learning Checkpoint 1: Basic Performance Measurement Understand why we need systematic timing for fair comparison. """ - print("๐Ÿ” Learning Checkpoint 1: Basic Performance Measurement") + print("MAGNIFY Learning Checkpoint 1: Basic Performance Measurement") print("=" * 60) # Simple function to time @@ -147,19 +147,19 @@ def simple_timing_demo(): print(f" Slow version: {slow_time*1000:.2f} ms") print(f" Fast version: {fast_time*1000:.2f} ms") - print(f" ๐Ÿš€ Speedup: {speedup:.2f}x faster") + print(f" ROCKET Speedup: {speedup:.2f}x faster") - print(f"\n๐Ÿ’ก Key Insight: Optimization can provide dramatic speedups!") + print(f"\nTIP Key Insight: Optimization can provide dramatic speedups!") print(f" This is why we need systematic benchmarking to measure improvements.") return {'slow_time': slow_time, 'fast_time': fast_time, 'speedup': speedup} def statistical_timing_demo(): - """๐ŸŽฏ Learning Checkpoint 2: Why We Need Multiple Runs + """TARGET Learning Checkpoint 2: Why We Need Multiple Runs Understand timing variability and the need for statistical reliability. 
""" - print("\n๐Ÿ” Learning Checkpoint 2: Statistical Timing Reliability") + print("\nMAGNIFY Learning Checkpoint 2: Statistical Timing Reliability") print("=" * 60) # Simple operation to time @@ -193,19 +193,19 @@ def statistical_timing_demo(): print(f" Range: {min_time*1000:.2f} - {max_time*1000:.2f} ms") variability = (std_time / mean_time) * 100 - print(f" ๐Ÿ“ˆ Variability: {variability:.1f}% coefficient of variation") + print(f" PROGRESS Variability: {variability:.1f}% coefficient of variation") - print(f"\n๐Ÿ’ก Key Insight: Single measurements are unreliable!") + print(f"\nTIP Key Insight: Single measurements are unreliable!") print(f" We need {DEFAULT_TIMING_RUNS}+ runs with warmup for statistical reliability.") return {'times': times, 'mean': mean_time, 'std': std_time} def benchmark_model_demo(): - """๐ŸŽฏ Learning Checkpoint 3: Model Benchmarking Basics + """TARGET Learning Checkpoint 3: Model Benchmarking Basics Understand how to benchmark ML models specifically. """ - print("\n๐Ÿ” Learning Checkpoint 3: ML Model Benchmarking") + print("\nMAGNIFY Learning Checkpoint 3: ML Model Benchmarking") print("=" * 60) # Simple model for demonstration @@ -248,7 +248,7 @@ def benchmark_model_demo(): print(f" ๐Ÿ”ข Size ratio: {256/64:.0f}x parameters") print(f" โฑ๏ธ Time ratio: {large_time/small_time:.1f}x slower") - print(f"\n๐Ÿ’ก Key Insight: Model complexity directly affects inference time!") + print(f"\nTIP Key Insight: Model complexity directly affects inference time!") print(f" This is why standardized models are crucial for fair competition.") return {'small_time': small_time, 'large_time': large_time} @@ -270,7 +270,7 @@ def run_learning_checkpoints(): model_results = benchmark_model_demo() print("\n" + "=" * 80) - print("๐ŸŽ‰ Learning checkpoints complete! Ready for TinyMLPerf competition.") + print("CELEBRATE Learning checkpoints complete! 
Ready for TinyMLPerf competition.") print("=" * 80) return { @@ -291,7 +291,7 @@ def test_learning_checkpoints(): """Test the learning checkpoint system""" print("Testing learning checkpoints...") results = run_learning_checkpoints() - print("\nโœ… Learning checkpoints test complete!") + print("\nPASS Learning checkpoints test complete!") return results # %% [markdown] @@ -484,7 +484,7 @@ class TinyMLPerf: self.benchmark_datasets = {} print("๐Ÿ† TinyMLPerf Competition Suite Initialized!") - print("๐ŸŽฏ Three Events: MLP Sprint, CNN Marathon, Transformer Decathlon") + print("TARGET Three Events: MLP Sprint, CNN Marathon, Transformer Decathlon") # Load standard benchmark models self._load_benchmark_models() @@ -501,7 +501,7 @@ class TinyMLPerf: 'transformer_decathlon': TransformerBenchmark() } - print("โœ… Benchmark models loaded successfully!") + print("PASS Benchmark models loaded successfully!") for event, model in self.benchmark_models.items(): print(f" ๐Ÿ“‹ {event.replace('_', ' ').title()}: {type(model).__name__}") @@ -543,9 +543,9 @@ class TinyMLPerf: 'transformer_decathlon': transformer_data } - print("โœ… Benchmark datasets loaded successfully!") + print("PASS Benchmark datasets loaded successfully!") for event, data in self.benchmark_datasets.items(): - print(f" ๐ŸŽฏ {data['event']}: {data['inputs'].shape} -> {data['targets'].shape}") + print(f" TARGET {data['event']}: {data['inputs'].shape} -> {data['targets'].shape}") def load_benchmark(self, event_name: str) -> Tuple[Any, Dict[str, Any]]: """ @@ -605,14 +605,14 @@ def test_tinymlperf_benchmark_suite(): inputs = dataset['inputs'] outputs = model.predict(inputs) - print(f" โœ… Inference successful: {inputs.shape} -> {outputs.shape}") + print(f" PASS Inference successful: {inputs.shape} -> {outputs.shape}") # Verify output shape makes sense batch_size = inputs.shape[0] assert outputs.shape[0] == batch_size, f"Batch size mismatch: {outputs.shape[0]} != {batch_size}" - print(f" โœ… Output shape verified") + 
print(f" PASS Output shape verified") - print(f"\nโœ… TinyMLPerf benchmark suite test complete!") + print(f"\nPASS TinyMLPerf benchmark suite test complete!") return benchmark_suite # %% [markdown] @@ -648,9 +648,9 @@ class CompetitionProfiler: self.has_profiler = HAS_PROFILER if not self.has_profiler: - print("โš ๏ธ Warning: Advanced profiling unavailable, using basic timing") + print("WARNING๏ธ Warning: Advanced profiling unavailable, using basic timing") else: - print("โœ… Using TinyTorch Module 15 profiler for advanced metrics") + print("PASS Using TinyTorch Module 15 profiler for advanced metrics") def benchmark_model(self, model, dataset: Dict[str, Any]) -> Dict[str, Any]: """ @@ -731,7 +731,7 @@ class CompetitionProfiler: print(f"๐Ÿ“Š Baseline: {comparison['baseline_time']*1000:.2f} ms") print(f"๐Ÿ“Š Optimized: {comparison['optimized_time']*1000:.2f} ms") - print(f"๐Ÿš€ Speedup: {speedup:.2f}x {'faster' if speedup > 1.0 else 'slower'}") + print(f"ROCKET Speedup: {speedup:.2f}x {'faster' if speedup > 1.0 else 'slower'}") return comparison @@ -751,7 +751,7 @@ class CompetitionProfiler: speedup = baseline_time / results['mean_inference_time'] results['speedup_vs_baseline'] = speedup - print(f"๐Ÿš€ Speedup vs baseline: {speedup:.2f}x {'faster' if speedup > 1.0 else 'slower'}") + print(f"ROCKET Speedup vs baseline: {speedup:.2f}x {'faster' if speedup > 1.0 else 'slower'}") return results def _run_basic_profiling(self, model, inputs: np.ndarray) -> Dict[str, Any]: @@ -863,7 +863,7 @@ class CompetitionProfiler: print(f" P95 Time: {results['p95_inference_time']*1000:.2f} ms") if 'speedup_vs_baseline' in results: - print(f" ๐Ÿš€ Speedup: {results['speedup_vs_baseline']:.2f}x faster") + print(f" ROCKET Speedup: {results['speedup_vs_baseline']:.2f}x faster") if 'memory_delta_mb' in results: print(f" ๐Ÿ’พ Memory: {results['memory_delta_mb']:.2f} MB delta, {results['peak_memory_mb']:.2f} MB peak") @@ -901,7 +901,7 @@ def test_competition_profiler(): 
baseline_time=mlp_results['mean_inference_time'] # Use MLP as baseline ) - print(f"\nโœ… Competition profiler test complete!") + print(f"\nPASS Competition profiler test complete!") return competition_profiler, mlp_results, cnn_results # %% [markdown] @@ -1167,7 +1167,7 @@ class TinyMLPerfCompetition: self.baselines = self._establish_baselines() print("๐Ÿ† TinyMLPerf Competition Initialized!") - print("๐ŸŽฏ Three Events Ready for Competition!") + print("TARGET Three Events Ready for Competition!") def _establish_baselines(self) -> Dict[str, float]: """Establish baseline performance for relative scoring.""" @@ -1201,13 +1201,13 @@ class TinyMLPerfCompetition: # Validate event if event_name not in self.baselines: available = list(self.baselines.keys()) - print(f"โŒ Event '{event_name}' not recognized!") - print("๐ŸŽฏ Available competitions:") + print(f"FAIL Event '{event_name}' not recognized!") + print("TARGET Available competitions:") for event in available: print(f" โ€ข {event.replace('_', ' ').title()}") return None - print(f"๐Ÿš€ TINYMLPERF SUBMISSION") + print(f"ROCKET TINYMLPERF SUBMISSION") print(f"๐Ÿ† Event: {event_name.replace('_', ' ').title()}") print(f"๐Ÿ‘ฅ Team: {team_name}") print("-" * 60) @@ -1294,25 +1294,25 @@ class TinyMLPerfCompetition: print(f"\nโฑ๏ธ Performance:") print(f" Your Time: {submission['submission_time_ms']:.2f} ms") print(f" Baseline: {submission['baseline_time_ms']:.2f} ms") - print(f" ๐Ÿš€ Speedup: {speedup:.2f}x {'FASTER' if speedup > 1.0 else 'slower'}") + print(f" ROCKET Speedup: {speedup:.2f}x {'FASTER' if speedup > 1.0 else 'slower'}") if 'memory_delta_mb' in metrics: print(f" ๐Ÿ’พ Memory: {metrics['memory_delta_mb']:.2f} MB") # Award celebration for good performance if speedup >= 3.0: - print(f"\n๐ŸŽ‰ AMAZING! 3x+ speedup achieved!") + print(f"\nCELEBRATE AMAZING! 3x+ speedup achieved!") elif speedup >= 2.0: print(f"\n๐Ÿ† EXCELLENT! 2x+ speedup!") elif speedup >= 1.5: print(f"\nโญ GREAT! 
50%+ speedup!") elif speedup >= 1.1: - print(f"\nโœ… Good optimization!") + print(f"\nPASS Good optimization!") else: - print(f"\n๐Ÿค” Keep optimizing - you can do better!") + print(f"\nTHINK Keep optimizing - you can do better!") if submission['optimization_description']: - print(f"\n๐Ÿ’ก Techniques Used:") + print(f"\nTIP Techniques Used:") print(f" {submission['optimization_description']}") def display_leaderboard(self, event_name: str, sort_by: str = 'speed', top_n: int = 10) -> List[Dict[str, Any]]: @@ -1395,7 +1395,7 @@ def test_tinymlperf_competition(): return x_flat @ self.fc_weights + self.fc_bias # Submit optimized models to competition - print("\n๐Ÿš€ Submitting Competition Entries...") + print("\nROCKET Submitting Competition Entries...") # MLP Sprint submissions mlp_submission1 = competition.submit_entry( @@ -1427,7 +1427,7 @@ def test_tinymlperf_competition(): print("\n๐Ÿ“Š Competition Leaderboards:") competition.display_all_leaderboards() - print("\nโœ… TinyMLPerf competition framework test complete!") + print("\nPASS TinyMLPerf competition framework test complete!") return competition # %% [markdown] @@ -1474,7 +1474,7 @@ def test_simplified_competition_features(): return x_flat @ self.fc_weights + self.fc_bias # Submit entries with different optimization descriptions - print("\n๐Ÿš€ Submitting Competition Entries...") + print("\nROCKET Submitting Competition Entries...") # MLP submissions with different techniques submission1 = competition.submit_entry( @@ -1513,7 +1513,7 @@ def test_simplified_competition_features(): print("\n3. Composite Leaderboard:") competition.display_leaderboard("mlp_sprint", sort_by="composite", top_n=5) - print("\nโœ… Simplified competition features test complete!") + print("\nPASS Simplified competition features test complete!") return competition # %% [markdown] @@ -1532,11 +1532,11 @@ def run_complete_tinymlperf_demo(): # Test benchmark suite benchmark_suite = test_tinymlperf_benchmark_suite() - print("\n2. 
โšก Testing Competition Profiling...") + print("\n2. SPEED Testing Competition Profiling...") # Test profiling infrastructure competition_profiler, mlp_results, cnn_results = test_competition_profiler() - print("\n3. ๐Ÿš€ Running Basic Competition...") + print("\n3. ROCKET Running Basic Competition...") # Test basic competition basic_competition = test_tinymlperf_competition() @@ -1545,25 +1545,25 @@ def run_complete_tinymlperf_demo(): simplified_competition = test_simplified_competition_features() print("\n" + "=" * 80) - print("๐ŸŽ‰ TINYMLPERF DEMO COMPLETE!") + print("CELEBRATE TINYMLPERF DEMO COMPLETE!") print("=" * 80) print("\n๐Ÿ† TinyMLPerf Competition Ready:") - print("โœ… Three exciting events: MLP Sprint, CNN Marathon, Transformer Decathlon") - print("โœ… TinyTorch Module 15 profiler integration for rigorous benchmarking") - print("โœ… Hardware-independent relative scoring (speedup ratios)") - print("โœ… Transparent leaderboards with evidence requirements") - print("โœ… Simplified innovation detection and creativity rewards") - print("โœ… Three leaderboard types: speed, innovation, and composite scoring") + print("PASS Three exciting events: MLP Sprint, CNN Marathon, Transformer Decathlon") + print("PASS TinyTorch Module 15 profiler integration for rigorous benchmarking") + print("PASS Hardware-independent relative scoring (speedup ratios)") + print("PASS Transparent leaderboards with evidence requirements") + print("PASS Simplified innovation detection and creativity rewards") + print("PASS Three leaderboard types: speed, innovation, and composite scoring") - print("\n๐Ÿš€ Competition Features:") + print("\nROCKET Competition Features:") print("โ€ข Standardized benchmark models and datasets") print("โ€ข Statistical reliability with multiple timing runs") print("โ€ข Multiple leaderboard categories with simple keyword detection") print("โ€ข GitHub integration for transparency and reproducibility") print("โ€ข Focused classes with single responsibilities") 
-    print("\n🎯 Ready to Compete:")
+    print("\nTARGET Ready to Compete:")
     print("1. Optimize your models using techniques from Modules 16-19")
     print("2. Submit to TinyMLPerf events using competition.submit_entry()")
     print("3. See your results on speed, innovation, or composite leaderboards")
@@ -1589,7 +1589,7 @@ This simplified TinyMLPerf competition module demonstrates advanced ML systems e
 - **Consistent API**: Single parameterized leaderboard method replaces three separate implementations
 - **Student-Friendly**: Reduced cognitive load while maintaining all essential functionality

-### ⚡ **Streamlined Performance Optimization**
+### SPEED **Streamlined Performance Optimization**
 - **Single Leaderboard Interface**: One method with sort_by parameter ('speed', 'innovation', 'composite') replaces complex multiple methods
 - **Simple Innovation Detection**: Basic keyword matching replaces complex pattern analysis and model introspection
 - **Consistent Formatting**: Centralized header templates ensure visual consistency across all leaderboard types
@@ -1601,13 +1601,13 @@ This simplified TinyMLPerf competition module demonstrates advanced ML systems e
 - **Visual Clarity**: Clear section headers and spacing prevent information overload
 - **Focused Testing**: Each test function validates one specific capability

-### 💡 **Educational Improvements**
+### TIP **Educational Improvements**
 - **Reduced Complexity**: Eliminated 100+ line classes in favor of focused 20-30 line classes
 - **Better Mental Models**: Students understand leaderboard concepts instead of getting lost in implementation details
 - **Maintainable Code**: Consistent patterns and centralized formatting make code easier to debug and extend
 - **KISS Principle**: Keep It Simple, Stupid - core pedagogical value preserved with implementation complexity reduced

-### 🎯 **Key Learning Objectives Maintained**
+### TARGET **Key Learning Objectives Maintained**
 - Competition still accelerates optimization
learning through concrete performance measurements
 - Hardware-independent scoring ensures fair comparison across different development environments
 - Multiple leaderboard types prevent single-metric tunnel vision
@@ -1638,13 +1638,13 @@ if __name__ == "__main__":
     # Run complete TinyMLPerf demonstration
     results = run_complete_tinymlperf_demo()

-    print(f"\n🎉 Module 20 complete!")
+    print(f"\nCELEBRATE Module 20 complete!")
     print(f"🏆 TinyMLPerf competition infrastructure ready!")
-    print(f"🚀 Time to optimize your models and climb the leaderboards!")
+    print(f"ROCKET Time to optimize your models and climb the leaderboards!")

 # %% [markdown]
 """
-## 🤔 ML Systems Thinking: Interactive Questions
+## THINK ML Systems Thinking: Interactive Questions

 1. **Why is separation of concerns crucial in competition software architecture?** Your refactored TinyMLPerf breaks large classes into focused components: CompetitionSubmission, CompetitionStorage, CompetitionLeaderboard, and SimpleInnovationDetector. Explain why this modular design is essential for educational software and how it teaches students professional software development practices beyond just ML systems concepts.
@@ -1657,7 +1657,7 @@ if __name__ == "__main__":

 # %% [markdown]
 """
-## 🎯 MODULE SUMMARY: TinyMLPerf - Simplified Competition Framework
+## TARGET MODULE SUMMARY: TinyMLPerf - Simplified Competition Framework

 This refactored module demonstrates the power of the KISS principle in educational software design, proving that complex systems can be both pedagogically effective and professionally engineered.
@@ -1673,7 +1673,7 @@ This refactored module demonstrates the power of the KISS principle in education
 - **SimpleInnovationDetector**: Basic keyword matching replacing complex pattern analysis
 - **TinyMLPerfCompetition**: Orchestrates components with clean delegation patterns

-### 🎯 **Educational Excellence**
+### TARGET **Educational Excellence**
 Students learn both ML systems concepts AND professional software engineering:
 - **Modular Design**: How to break complex problems into manageable components
 - **API Consistency**: Why parameterized methods beat specialized implementations
@@ -1688,7 +1688,7 @@ All essential functionality preserved with improved usability:
 - Evidence requirements ensuring reproducible, honest performance claims
 - Simple but effective innovation detection rewarding creative optimization

-### 💡 **Professional Development**
+### TIP **Professional Development**
 This refactor teaches students that excellent engineering means:
 - Choosing clarity over clever complexity
 - Building maintainable systems that others can understand and extend
diff --git a/modules/20_capstone/capstone_dev.py b/modules/20_capstone/capstone_dev.py
index e4f10f97..82910b38 100644
--- a/modules/20_capstone/capstone_dev.py
+++ b/modules/20_capstone/capstone_dev.py
@@ -4,7 +4,7 @@

 Welcome to the TinyGPT Capstone!
You'll integrate everything from modules 02-19 **Connection Map**: ``` -All Previous Modules โ†’ TinyGPT Integration โ†’ Complete ML System +All Previous Modules -> TinyGPT Integration -> Complete ML System (components) (assembly) (text generation) ``` ## Learning Goals 1. **Systems Integration**: Combine all TinyTorch components into working language model -2. **End-to-End Pipeline**: Build complete tokenization โ†’ inference โ†’ generation workflow +2. **End-to-End Pipeline**: Build complete tokenization -> inference -> generation workflow 3. **Performance Analysis**: Profile and optimize complete system bottlenecks 4. **Production Readiness**: Deploy working model with monitoring and optimization 5. **Mastery Demonstration**: Prove comprehensive ML systems engineering capability -## Build โ†’ Use โ†’ Reflect +## Build -> Use -> Reflect 1. **Build**: Complete TinyGPT integration from all previous modules 2. **Use**: Generate text and analyze end-to-end performance characteristics 3. **Reflect**: Evaluate system design decisions and optimization opportunities ## Systems Reality Check -๐Ÿ’ก **Production Context**: Real language models require careful component integration and system optimization -โšก **Performance Insight**: End-to-end systems reveal bottlenecks invisible in isolated components +TIP **Production Context**: Real language models require careful component integration and system optimization +SPEED **Performance Insight**: End-to-end systems reveal bottlenecks invisible in isolated components """ # %% @@ -63,9 +63,9 @@ try: from tinytorch.core.attention import MultiHeadAttention from tinytorch.utils.profiler import SimpleProfiler TINYTORCH_AVAILABLE = True - print("โœ… TinyTorch components loaded successfully") + print("PASS TinyTorch components loaded successfully") except ImportError as e: - print(f"โš ๏ธ TinyTorch components not available: {e}") + print(f"WARNING๏ธ TinyTorch components not available: {e}") print(" Some functionality will use NumPy 
fallbacks") TINYTORCH_AVAILABLE = False @@ -137,11 +137,11 @@ def _check_component_availability(): except (ImportError, AttributeError): COMPONENT_STATUS[component_name] = False - print(f"๐Ÿ” Component Integration Status: {available_count}/{len(components_to_check)} available") + print(f"MAGNIFY Component Integration Status: {available_count}/{len(components_to_check)} available") # Display detailed status for component, available in COMPONENT_STATUS.items(): - status = "โœ…" if available else "โŒ" + status = "PASS" if available else "FAIL" print(f" {status} {component.capitalize()}") return available_count, len(components_to_check) @@ -161,46 +161,46 @@ Before building the complete system, let's understand how all TinyTorch componen TinyGPT Language Model Pipeline: Input Text - โ”‚ - โ†“ (Tokenization) + | + v (Tokenization) Token IDs [7, 23, 145, ...] - โ”‚ - โ†“ (Token Embedding) - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Token + Position Embeddings โ”‚ - โ”‚ Shape: (batch, seq_len, d_model) โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ†“ (Transformer Layers x6) - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Layer 1: MultiHeadAttention โ”‚ - โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ - โ”‚ โ”‚ โ”‚ Q, K, V โ†’ Attention โ”‚ โ”‚ - โ”‚ โ”‚ โ”‚ O(nยฒ) complexity โ”‚ โ”‚ - โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ - โ”‚ โ†“ โ”‚ - โ”‚ LayerNorm + Residual โ”‚ - โ”‚ โ†“ โ”‚ - โ”‚ Feed Forward (Linear โ†’ GELU โ†’ Linear) โ”‚ - โ”‚ โ†“ โ”‚ - โ”‚ LayerNorm + Residual โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ (Repeat for layers 2-6) - โ†“ - 
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Final Layer Norm โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ†“ (Language Modeling Head) - โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” - โ”‚ Linear: d_model โ†’ vocab_size โ”‚ - โ”‚ Output: (batch, seq_len, vocab) โ”‚ - โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ - โ”‚ - โ†“ (Softmax + Sampling) + | + v (Token Embedding) + +-----------------------------------+ + | Token + Position Embeddings | + | Shape: (batch, seq_len, d_model) | + +-----------------------------------+ + | + v (Transformer Layers x6) + +-----------------------------------+ + | Layer 1: MultiHeadAttention | + | | +--------------------------+ | + | | | Q, K, V -> Attention | | + | | | O(nยฒ) complexity | | + | | +--------------------------+ | + | v | + | LayerNorm + Residual | + | v | + | Feed Forward (Linear -> GELU -> Linear) | + | v | + | LayerNorm + Residual | + +-----------------------------------+ + | (Repeat for layers 2-6) + v + +-----------------------------------+ + | Final Layer Norm | + +-----------------------------------+ + | + v (Language Modeling Head) + +-----------------------------------+ + | Linear: d_model -> vocab_size | + | Output: (batch, seq_len, vocab) | + +-----------------------------------+ + | + v (Softmax + Sampling) Next Token Probabilities - โ”‚ - โ†“ (Generation Loop) + | + v (Generation Loop) Generated Text Output ``` @@ -209,17 +209,17 @@ TinyGPT Language Model Pipeline: ``` TinyGPT Memory Footprint (Educational Scale): -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Component โ”‚ Parameters โ”‚ Memory (MB) โ”‚ 
-โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ Token Embedding โ”‚ 128,000 โ”‚ 0.5 โ”‚ vocab ร— d_model -โ”‚ Position Embedding โ”‚ 8,192 โ”‚ 0.03 โ”‚ seq_len ร— d_model -โ”‚ 6x Attention Layers โ”‚ 294,912 โ”‚ 1.1 โ”‚ 4 ร— d_modelยฒ ร— layers -โ”‚ 6x Feed Forward โ”‚ 393,216 โ”‚ 1.5 โ”‚ 8 ร— d_modelยฒ ร— layers -โ”‚ Output Head โ”‚ 128,000 โ”‚ 0.5 โ”‚ d_model ร— vocab -โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค -โ”‚ TOTAL MODEL โ”‚ 952,320 โ”‚ 3.6 โ”‚ โ†’ 1M parameters! -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++------------------------------------------+ +| Component | Parameters | Memory (MB) | ++------------------------------------------โ”ค +| Token Embedding | 128,000 | 0.5 | vocab * d_model +| Position Embedding | 8,192 | 0.03 | seq_len * d_model +| 6x Attention Layers | 294,912 | 1.1 | 4 * d_modelยฒ * layers +| 6x Feed Forward | 393,216 | 1.5 | 8 * d_modelยฒ * layers +| Output Head | 128,000 | 0.5 | d_model * vocab ++------------------------------------------โ”ค +| TOTAL MODEL | 952,320 | 3.6 | -> 1M parameters! ++------------------------------------------+ Runtime Memory (per batch): - Forward pass activations: ~2-4 MB @@ -228,7 +228,7 @@ Runtime Memory (per batch): - Total training memory: ~15-20 MB ``` -### โšก Performance Characteristics +### SPEED Performance Characteristics ``` Inference Performance Analysis: @@ -240,20 +240,20 @@ Sequence Length Scaling (O(nยฒ) attention bottleneck): 128 tokens: ~128ms (64x slower) Bottleneck Analysis: -1. ๐Ÿ” Attention: 60-70% of computation time -2. ๐Ÿ” Feed Forward: 20-25% of computation time -3. ๐Ÿ” Embedding Lookup: 5-10% of computation time -4. ๐Ÿ” Other Operations: 5-10% of computation time +1. 
MAGNIFY Attention: 60-70% of computation time
+2. MAGNIFY Feed Forward: 20-25% of computation time
+3. MAGNIFY Embedding Lookup: 5-10% of computation time
+4. MAGNIFY Other Operations: 5-10% of computation time
 ```
 """

 # %%
 def simple_tokenizer_demo():
-    """🎯 Learning Checkpoint 1: Basic Text Tokenization
+    """TARGET Learning Checkpoint 1: Basic Text Tokenization

     Understand how text becomes numerical tokens for language modeling.
     """
-    print("🔍 Learning Checkpoint 1: Text Tokenization for Language Models")
+    print("MAGNIFY Learning Checkpoint 1: Text Tokenization for Language Models")
     print("=" * 60)

     # Simple vocabulary for demonstration (real tokenizers are much more sophisticated)
@@ -305,17 +305,17 @@ def simple_tokenizer_demo():
             'length': len(token_ids)
         })

-    print(f"💡 Key Insight: Language models work with token IDs, not raw text!")
+    print(f"TIP Key Insight: Language models work with token IDs, not raw text!")
     print(f" Tokenization quality directly affects model performance.")

     return {'vocab': vocab, 'results': tokenization_results}

 def attention_scaling_demo():
-    """🎯 Learning Checkpoint 2: Understanding Attention Complexity
+    """TARGET Learning Checkpoint 2: Understanding Attention Complexity

     Understand why attention is O(n²) and becomes the bottleneck in large models.
""" - print("\n๐Ÿ” Learning Checkpoint 2: Attention Scaling Analysis") + print("\nMAGNIFY Learning Checkpoint 2: Attention Scaling Analysis") print("=" * 60) def simple_attention(query, key, value): @@ -380,7 +380,7 @@ def attention_scaling_demo(): base_time = scaling_results[0]['time_ms'] base_length = scaling_results[0]['seq_len'] - print(f"\n๐Ÿ“ˆ Scaling Analysis:") + print(f"\nPROGRESS Scaling Analysis:") for result in scaling_results[1:]: length_ratio = result['seq_len'] / base_length time_ratio = result['time_ms'] / base_time @@ -388,17 +388,17 @@ def attention_scaling_demo(): print(f" {result['seq_len']}vs{base_length}: {time_ratio:.1f}x time (expected O(nยฒ): {expected_quadratic:.1f}x)") - print(f"\n๐Ÿ’ก Key Insight: Attention scales quadratically with sequence length!") + print(f"\nTIP Key Insight: Attention scales quadratically with sequence length!") print(f" This is why long sequences are expensive in transformers.") return {'results': scaling_results} def transformer_component_demo(): - """๐ŸŽฏ Learning Checkpoint 3: Transformer Component Integration + """TARGET Learning Checkpoint 3: Transformer Component Integration Understand how transformer components work together in language models. 
""" - print("\n๐Ÿ” Learning Checkpoint 3: Transformer Component Integration") + print("\nMAGNIFY Learning Checkpoint 3: Transformer Component Integration") print("=" * 60) # Simple transformer components for demonstration @@ -526,7 +526,7 @@ def transformer_component_demo(): print(f" Feed Forward: {ff_params:,} parameters ({ff_params/total_params*100:.1f}%)") print(f" Total Layer: {total_params:,} parameters") - print(f"\n๐Ÿ’ก Key Insight: Attention dominates compute, FF dominates parameters!") + print(f"\nTIP Key Insight: Attention dominates compute, FF dominates parameters!") print(f" Understanding component characteristics guides optimization.") return {'timing': components_timing, 'params': {'attention': attn_params, 'ff': ff_params}} @@ -548,7 +548,7 @@ def run_learning_checkpoints(): component_results = transformer_component_demo() print("\n" + "=" * 80) - print("๐ŸŽ‰ Learning checkpoints complete! Ready for TinyGPT integration.") + print("CELEBRATE Learning checkpoints complete! 
Ready for TinyGPT integration.")
     print("=" * 80)

     return {
@@ -569,7 +569,7 @@ def test_learning_checkpoints():
     """Test the TinyGPT learning checkpoint system"""
     print("Testing TinyGPT learning checkpoints...")
     results = run_learning_checkpoints()
-    print("\n✅ TinyGPT learning checkpoints test complete!")
+    print("\nPASS TinyGPT learning checkpoints test complete!")
     return results

 # %% [markdown]
@@ -830,7 +830,7 @@ class TinyGPTModel:
         # Calculate total parameters
         self.total_parameters = self._count_parameters()

-        print(f"🚀 TinyGPT Model Initialized:")
+        print(f"ROCKET TinyGPT Model Initialized:")
         print(f" 📊 Parameters: {self.total_parameters:,}")
         print(f" 🏗️ Architecture: {n_layers} layers, {n_heads} heads, {d_model} dim")
         print(f" 📚 Vocabulary: {vocab_size} tokens")
@@ -979,8 +979,8 @@ class TinyGPTSystem:
         self.warmup_runs = warmup_runs
         self.timing_runs = timing_runs

-        print("🚀 TinyGPT Complete System Initializing...")
-        print("🎯 Integrating All TinyTorch Components (Modules 02-19)")
+        print("ROCKET TinyGPT Complete System Initializing...")
+        print("TARGET Integrating All TinyTorch Components (Modules 02-19)")

         # Initialize tokenizer (text processing foundation)
         self.tokenizer = TinyGPTTokenizer(vocab_size)
@@ -997,9 +997,9 @@ class TinyGPTSystem:
         # Initialize profiler for performance analysis
         self.profiler_available = TINYTORCH_AVAILABLE and available_components >= 6
         if self.profiler_available:
-            print("✅ Advanced profiling available (Module 15 integrated)")
+            print("PASS Advanced profiling available (Module 15 integrated)")
         else:
-            print("⚠️ Using basic timing (complete TinyTorch integration recommended)")
+            print("WARNING️ Using basic timing (complete TinyTorch integration recommended)")

         # System status and integration validation
         self._validate_system_integration()
@@ -1007,7 +1007,7 @@ class TinyGPTSystem:

     def _validate_system_integration(self):
         """Validate that all TinyTorch components are properly integrated."""
-        print("🔍
Validating TinyGPT System Integration...") + print("MAGNIFY Validating TinyGPT System Integration...") integration_checks = { 'tokenizer': self.tokenizer is not None, @@ -1019,15 +1019,15 @@ class TinyGPTSystem: all_passed = True for check_name, passed in integration_checks.items(): - status = "โœ…" if passed else "โŒ" + status = "PASS" if passed else "FAIL" print(f" {status} {check_name.replace('_', ' ').title()}") if not passed: all_passed = False if all_passed: - print("โœ… All integration checks passed!") + print("PASS All integration checks passed!") else: - print("โš ๏ธ Some integration issues detected - functionality may be limited") + print("WARNING๏ธ Some integration issues detected - functionality may be limited") return all_passed @@ -1056,13 +1056,13 @@ class TinyGPTSystem: print(f" โ€ข Integration: {available_components}/{total_components} components") # System capabilities - print(f"\n๐Ÿš€ Capabilities:") - print(f" โ€ข Text Generation: โœ… Autoregressive generation with sampling") - print(f" โ€ข Performance Analysis: {'โœ…' if self.profiler_available else 'โš ๏ธ '} {'Advanced' if self.profiler_available else 'Basic'} profiling") - print(f" โ€ข Scaling Analysis: โœ… Memory and compute profiling") - print(f" โ€ข Production Ready: โœ… Complete end-to-end pipeline") + print(f"\nROCKET Capabilities:") + print(f" โ€ข Text Generation: PASS Autoregressive generation with sampling") + print(f" โ€ข Performance Analysis: {'PASS' if self.profiler_available else 'WARNING๏ธ '} {'Advanced' if self.profiler_available else 'Basic'} profiling") + print(f" โ€ข Scaling Analysis: PASS Memory and compute profiling") + print(f" โ€ข Production Ready: PASS Complete end-to-end pipeline") - print("\n๐ŸŽฏ Ready for text generation and performance analysis!") + print("\nTARGET Ready for text generation and performance analysis!") def encode_text(self, text: str) -> np.ndarray: """ @@ -1078,7 +1078,7 @@ class TinyGPTSystem: # Ensure sequence doesn't exceed max length if 
len(token_ids) > self.model.max_seq_len: - print(f"โš ๏ธ Text truncated: {len(token_ids)} -> {self.model.max_seq_len} tokens") + print(f"WARNING๏ธ Text truncated: {len(token_ids)} -> {self.model.max_seq_len} tokens") token_ids = token_ids[:self.model.max_seq_len] return token_ids @@ -1114,9 +1114,9 @@ class TinyGPTSystem: Complete generated text (prompt + new tokens) """ if verbose: - print(f"๐Ÿš€ TinyGPT Text Generation Starting...") + print(f"ROCKET TinyGPT Text Generation Starting...") print(f" ๐Ÿ“ Prompt: '{prompt}'") - print(f" ๐ŸŽฏ Generating {max_new_tokens} tokens with temp={temperature}, top_k={top_k}") + print(f" TARGET Generating {max_new_tokens} tokens with temp={temperature}, top_k={top_k}") # Encode prompt to token IDs initial_tokens = self.encode_text(prompt) @@ -1131,7 +1131,7 @@ class TinyGPTSystem: # Check if we've reached max sequence length if current_tokens.shape[1] >= self.model.max_seq_len: if verbose: - print(f" โš ๏ธ Reached max sequence length ({self.model.max_seq_len}), stopping generation") + print(f" WARNING๏ธ Reached max sequence length ({self.model.max_seq_len}), stopping generation") break # Generate next token using the model @@ -1144,7 +1144,7 @@ class TinyGPTSystem: # Check for end-of-sequence token if next_token[0] == self.tokenizer.vocab['']: if verbose: - print(f" โœ… Generated token, stopping generation") + print(f" PASS Generated token, stopping generation") break # Add new token to sequence @@ -1161,7 +1161,7 @@ class TinyGPTSystem: final_text = self.decode_tokens(current_tokens[0]) if verbose: - print(f" โœ… Generation complete: {len(generated_tokens)} new tokens") + print(f" PASS Generation complete: {len(generated_tokens)} new tokens") print(f" ๐Ÿ“š Final text: '{final_text}'") return final_text @@ -1215,7 +1215,7 @@ class TinyGPTSystem: Returns: Performance profiling results """ - print(f"โšก Profiling TinyGPT Inference Performance...") + print(f"SPEED Profiling TinyGPT Inference Performance...") # Encode text once 
token_ids = self.encode_text(text) @@ -1266,10 +1266,10 @@ class TinyGPTSystem: return performance_results -# ๐Ÿ” SYSTEMS INSIGHT: Complete System Performance Analysis +# MAGNIFY SYSTEMS INSIGHT: Complete System Performance Analysis def analyze_complete_system_performance(): """Comprehensive performance analysis of the complete TinyGPT system.""" - print("๐Ÿ” SYSTEMS INSIGHT: Complete TinyGPT Performance Analysis") + print("MAGNIFY SYSTEMS INSIGHT: Complete TinyGPT Performance Analysis") print("=" * 70) # Initialize system @@ -1283,8 +1283,8 @@ def analyze_complete_system_performance(): # 1. Tokenization analysis complexity = system.analyze_text_complexity(test_text) print(f" ๐Ÿ“ Text: '{test_text}'") - print(f" ๐Ÿ”ค Tokenization: {complexity['word_count']} words โ†’ {complexity['token_count']} tokens") - print(f" ๐Ÿ“ˆ Compression: {complexity['compression_ratio']:.2f} chars/token") + print(f" ๐Ÿ”ค Tokenization: {complexity['word_count']} words -> {complexity['token_count']} tokens") + print(f" PROGRESS Compression: {complexity['compression_ratio']:.2f} chars/token") print(f" ๐Ÿ“š Coverage: {complexity['vocabulary_coverage']*100:.1f}% known tokens") # 2. Model size analysis @@ -1300,10 +1300,10 @@ def analyze_complete_system_performance(): attention_memory = seq_len * seq_len * 4 / 1024 / 1024 # Attention matrix in MB attention_flops = seq_len * seq_len * system.model.d_model # Approximate FLOPs - print(f"\n โšก Attention Analysis (seq_len={seq_len}):") + print(f"\n SPEED Attention Analysis (seq_len={seq_len}):") print(f" ๐Ÿ’พ Attention Memory: {attention_memory:.3f} MB per head") print(f" ๐Ÿงฎ Total Attention Memory: {attention_memory * system.model.n_heads:.2f} MB") - print(f" โšก Attention FLOPs: {attention_flops:,}") + print(f" SPEED Attention FLOPs: {attention_flops:,}") # 4. 
Performance profiling print(f"\n โฑ๏ธ Performance Profiling:") @@ -1316,8 +1316,8 @@ def analyze_complete_system_performance(): actual_scaling = batch_results[1]['mean_time_ms'] / batch_results[0]['mean_time_ms'] efficiency = linear_scaling / actual_scaling - print(f" ๐Ÿ“ˆ Batch Scaling Efficiency: {efficiency:.2f} (1.0 = perfect)") - print(f" ๐ŸŽฏ Best Throughput: {max(r['tokens_per_second'] for r in batch_results):.1f} tokens/sec") + print(f" PROGRESS Batch Scaling Efficiency: {efficiency:.2f} (1.0 = perfect)") + print(f" TARGET Best Throughput: {max(r['tokens_per_second'] for r in batch_results):.1f} tokens/sec") # 5. Memory scaling with sequence length print(f"\n ๐Ÿ“Š Memory Scaling Analysis:") @@ -1328,11 +1328,11 @@ def analyze_complete_system_performance(): print(f" ๐Ÿ“ Seq {seq_len:2d}: {total_attn_mem:.2f} MB attention ({seq_len*seq_len:,} elements)") - print(f"\n๐Ÿ’ก KEY INSIGHTS:") - print(f" ๐Ÿ” Attention dominates memory: O(nยฒ) scaling with sequence length") - print(f" ๐Ÿš€ Batch processing improves throughput via parallelization") + print(f"\nTIP KEY INSIGHTS:") + print(f" MAGNIFY Attention dominates memory: O(nยฒ) scaling with sequence length") + print(f" ROCKET Batch processing improves throughput via parallelization") print(f" ๐Ÿ’พ Model parameters: {memory_mb:.1f} MB, Attention: varies with sequence") - print(f" โšก Total system uses all TinyTorch components from modules 02-19") + print(f" SPEED Total system uses all TinyTorch components from modules 02-19") return { 'complexity': complexity, @@ -1345,10 +1345,10 @@ def analyze_complete_system_performance(): } } -# ๐Ÿ” SYSTEMS INSIGHT: Scaling Behavior Analysis +# MAGNIFY SYSTEMS INSIGHT: Scaling Behavior Analysis def analyze_scaling_bottlenecks(): """Analyze how TinyGPT performance scales with different dimensions.""" - print("\n๐Ÿ” SYSTEMS INSIGHT: TinyGPT Scaling Bottleneck Analysis") + print("\nMAGNIFY SYSTEMS INSIGHT: TinyGPT Scaling Bottleneck Analysis") print("=" * 70) test_text = 
"the quick brown fox jumps over the lazy dog" @@ -1404,7 +1404,7 @@ def analyze_scaling_bottlenecks(): # Analyze scaling relationships if len(scaling_results) >= 2: - print(f"\n๐Ÿ“ˆ Scaling Analysis:") + print(f"\nPROGRESS Scaling Analysis:") base = scaling_results[0] for result in scaling_results[1:]: @@ -1417,18 +1417,18 @@ def analyze_scaling_bottlenecks(): print(f" โฑ๏ธ Time: {time_ratio:.1f}x") print(f" ๐Ÿ’พ Memory: {memory_ratio:.1f}x") - print(f"\n๐Ÿ’ก SCALING INSIGHTS:") - print(f" ๐Ÿ” Parameter count grows roughly O(d_modelยฒ) due to attention") + print(f"\nTIP SCALING INSIGHTS:") + print(f" MAGNIFY Parameter count grows roughly O(d_modelยฒ) due to attention") print(f" โฑ๏ธ Inference time scales with both parameters and sequence length") print(f" ๐Ÿ’พ Memory usage is dominated by model parameters (not activations)") - print(f" ๐ŸŽฏ Sweet spot: Balance model size with inference speed requirements") + print(f" TARGET Sweet spot: Balance model size with inference speed requirements") return scaling_results -# ๐Ÿ” SYSTEMS INSIGHT: End-to-End Pipeline Analysis +# MAGNIFY SYSTEMS INSIGHT: End-to-End Pipeline Analysis def analyze_end_to_end_pipeline(): """Analyze the complete text generation pipeline from input to output.""" - print("\n๐Ÿ” SYSTEMS INSIGHT: End-to-End Pipeline Analysis") + print("\nMAGNIFY SYSTEMS INSIGHT: End-to-End Pipeline Analysis") print("=" * 70) system = TinyGPTSystem() @@ -1442,7 +1442,7 @@ def analyze_end_to_end_pipeline(): tokenization_time = (time.perf_counter() - start_time) * 1000 print(f" 1๏ธโƒฃ Tokenization: {tokenization_time:.3f} ms") - print(f" '{test_prompt}' โ†’ {token_ids.tolist()}") + print(f" '{test_prompt}' -> {token_ids.tolist()}") # Stage 2: Model Forward Pass batch_tokens = token_ids.reshape(1, -1) @@ -1451,7 +1451,7 @@ def analyze_end_to_end_pipeline(): forward_time = (time.perf_counter() - start_time) * 1000 print(f" 2๏ธโƒฃ Model Forward: {forward_time:.3f} ms") - print(f" {batch_tokens.shape} โ†’ 
{logits.shape}") + print(f" {batch_tokens.shape} -> {logits.shape}") # Stage 3: Next Token Generation start_time = time.perf_counter() @@ -1468,7 +1468,7 @@ def analyze_end_to_end_pipeline(): detokenization_time = (time.perf_counter() - start_time) * 1000 print(f" 4๏ธโƒฃ Detokenization: {detokenization_time:.3f} ms") - print(f" {complete_tokens.tolist()} โ†’ '{output_text}'") + print(f" {complete_tokens.tolist()} -> '{output_text}'") # Total pipeline time total_time = tokenization_time + forward_time + generation_time + detokenization_time @@ -1478,27 +1478,27 @@ def analyze_end_to_end_pipeline(): print(f" ๐Ÿง  Model Forward: {forward_time:6.3f} ms ({forward_time/total_time*100:4.1f}%)") print(f" ๐ŸŽฒ Token Generation: {generation_time:6.3f} ms ({generation_time/total_time*100:4.1f}%)") print(f" ๐Ÿ”ค Detokenization: {detokenization_time:6.3f} ms ({detokenization_time/total_time*100:4.1f}%)") - print(f" โšก TOTAL: {total_time:6.3f} ms (100.0%)") + print(f" SPEED TOTAL: {total_time:6.3f} ms (100.0%)") # Calculate tokens per second for generation tokens_per_second = 1000 / total_time # 1 token generated per total_time ms print(f"\n๐Ÿ“Š Generation Performance:") - print(f" ๐Ÿš€ Speed: {tokens_per_second:.1f} tokens/second") + print(f" ROCKET Speed: {tokens_per_second:.1f} tokens/second") print(f" ๐Ÿ“ Latency: {total_time:.1f} ms per token") # Estimate full text generation time target_tokens = 50 estimated_time = target_tokens * total_time / 1000 # Convert to seconds - print(f"\n๐ŸŽฏ Scaling Projection:") + print(f"\nTARGET Scaling Projection:") print(f" ๐Ÿ“ Generate {target_tokens} tokens: ~{estimated_time:.1f} seconds") print(f" ๐Ÿ“Š Rate: {target_tokens/estimated_time:.1f} tokens/sec sustained") - print(f"\n๐Ÿ’ก PIPELINE INSIGHTS:") - print(f" ๐Ÿ” Model forward pass dominates computation time") - print(f" โšก Tokenization/detokenization are negligible overhead") - print(f" ๐Ÿš€ Autoregressive generation requires N forward passes for N tokens") + print(f"\nTIP 
PIPELINE INSIGHTS:") + print(f" MAGNIFY Model forward pass dominates computation time") + print(f" SPEED Tokenization/detokenization are negligible overhead") + print(f" ROCKET Autoregressive generation requires N forward passes for N tokens") print(f" ๐Ÿ’พ Memory usage stays constant (no KV caching implemented)") return { @@ -1526,14 +1526,14 @@ def test_tinygpt_complete_system(): # Initialize complete system system = TinyGPTSystem() - print(f"\n๐Ÿงช Component Integration Tests:") + print(f"\nTEST Component Integration Tests:") # Test 1: Tokenization test_text = "hello world how are you" token_ids = system.encode_text(test_text) decoded_text = system.decode_tokens(token_ids) - print(f" โœ… Tokenization: '{test_text}' โ†’ {len(token_ids)} tokens โ†’ '{decoded_text}'") + print(f" PASS Tokenization: '{test_text}' -> {len(token_ids)} tokens -> '{decoded_text}'") # Test 2: Model forward pass batch_tokens = token_ids.reshape(1, -1) @@ -1541,24 +1541,24 @@ def test_tinygpt_complete_system(): expected_shape = (1, len(token_ids), system.model.vocab_size) assert logits.shape == expected_shape, f"Shape mismatch: {logits.shape} != {expected_shape}" - print(f" โœ… Model Forward: {batch_tokens.shape} โ†’ {logits.shape}") + print(f" PASS Model Forward: {batch_tokens.shape} -> {logits.shape}") # Test 3: Text generation generated_text = system.generate_text("the cat", max_new_tokens=5, verbose=False) - print(f" โœ… Text Generation: 'the cat' โ†’ '{generated_text}'") + print(f" PASS Text Generation: 'the cat' -> '{generated_text}'") # Test 4: Performance analysis complexity = system.analyze_text_complexity(test_text) - print(f" โœ… Text Analysis: {complexity['word_count']} words, {complexity['token_count']} tokens") + print(f" PASS Text Analysis: {complexity['word_count']} words, {complexity['token_count']} tokens") # Test 5: Performance profiling perf_results = system.profile_inference_performance(test_text, batch_sizes=[1, 2]) - print(f" โœ… Performance Profiling: 
{len(perf_results['batch_results'])} batch sizes tested") + print(f" PASS Performance Profiling: {len(perf_results['batch_results'])} batch sizes tested") - print(f"\n๐ŸŽฏ Integration Validation:") + print(f"\nTARGET Integration Validation:") # Validate component integration validation_results = { @@ -1570,22 +1570,22 @@ def test_tinygpt_complete_system(): } for test_name, passed in validation_results.items(): - status = "โœ…" if passed else "โŒ" + status = "PASS" if passed else "FAIL" print(f" {status} {test_name.replace('_', ' ').title()}") all_tests_passed = all(validation_results.values()) if all_tests_passed: - print(f"\n๐ŸŽ‰ ALL TESTS PASSED! TinyGPT system fully operational.") - print(f" ๐Ÿš€ Ready for comprehensive text generation and analysis") + print(f"\nCELEBRATE ALL TESTS PASSED! TinyGPT system fully operational.") + print(f" ROCKET Ready for comprehensive text generation and analysis") else: - print(f"\nโš ๏ธ Some tests failed - check TinyTorch component integration") + print(f"\nWARNING๏ธ Some tests failed - check TinyTorch component integration") return system, validation_results except Exception as e: - print(f"\nโŒ System test failed: {e}") - print(f" ๐Ÿ’ก Ensure all TinyTorch modules (02-19) are properly integrated") + print(f"\nFAIL System test failed: {e}") + print(f" TIP Ensure all TinyTorch modules (02-19) are properly integrated") return None, {} # %% [markdown] @@ -2082,7 +2082,7 @@ Let's test the complete TinyGPT system with all systems insights and demonstrate # %% def run_complete_tinygpt_demonstration(): """Comprehensive demonstration of the complete TinyGPT system capabilities.""" - print("๐Ÿš€ TINYGPT CAPSTONE DEMONSTRATION") + print("ROCKET TINYGPT CAPSTONE DEMONSTRATION") print("=" * 80) print("Complete ML Systems Integration - Modules 02-19 Working Together!") print("=" * 80) @@ -2101,22 +2101,22 @@ def run_complete_tinygpt_demonstration(): print("๐Ÿ† TINYGPT CAPSTONE COMPLETION SUMMARY") print("=" * 80) - print(f"\n๐ŸŽฏ 
Complete Integration Achieved:") - print(f" ✅ Tokenizer: {system.tokenizer.get_vocab_size():,} token vocabulary") - print(f" ✅ Model: {system.model.total_parameters:,} parameters across {system.model.n_layers} layers") - print(f" ✅ Generation: Working autoregressive text generation") - print(f" ✅ Systems Analysis: Memory, compute, and scaling characteristics") + print(f"\nTARGET Complete Integration Achieved:") + print(f" PASS Tokenizer: {system.tokenizer.get_vocab_size():,} token vocabulary") + print(f" PASS Model: {system.model.total_parameters:,} parameters across {system.model.n_layers} layers") + print(f" PASS Generation: Working autoregressive text generation") + print(f" PASS Systems Analysis: Memory, compute, and scaling characteristics") print(f"\n🔧 TinyTorch Component Integration:") integrated_components = [name for name, status in COMPONENT_STATUS.items() if status] - print(f" ✅ Integrated: {', '.join(integrated_components)}") + print(f" PASS Integrated: {', '.join(integrated_components)}") print(f" 📊 Coverage: {len(integrated_components)}/{len(COMPONENT_STATUS)} components") print(f"\n🎓 Educational Achievement:") - print(f" ✅ End-to-end language model built from scratch") - print(f" ✅ All TinyTorch modules integrated into working system") - print(f" ✅ Production-ready systems understanding demonstrated") - print(f" ✅ Complete ML systems engineering pipeline mastered") + print(f" PASS End-to-end language model built from scratch") + print(f" PASS All TinyTorch modules integrated into working system") + print(f" PASS Production-ready systems understanding demonstrated") + print(f" PASS Complete ML systems engineering pipeline mastered") return {'system': system} @@ -2129,8 +2129,8 @@ Test the complete TinyGPT system functionality. 
# %% def test_unit_tinygpt_system(): - """๐Ÿงช Unit Test: Complete TinyGPT System Integration""" - print("๐Ÿงช Unit Test: TinyGPT Complete System") + """TEST Unit Test: Complete TinyGPT System Integration""" + print("TEST Unit Test: TinyGPT Complete System") print("-" * 50) try: @@ -2138,70 +2138,70 @@ def test_unit_tinygpt_system(): system = TinyGPTSystem() assert system.model is not None, "Model should be initialized" assert system.tokenizer is not None, "Tokenizer should be initialized" - print(" โœ… System initialization successful") + print(" PASS System initialization successful") # Test tokenization test_text = "hello world" token_ids = system.encode_text(test_text) decoded_text = system.decode_tokens(token_ids) assert len(token_ids) > 0, "Tokenization should produce tokens" - print(f" โœ… Tokenization works: '{test_text}' โ†’ {len(token_ids)} tokens โ†’ '{decoded_text}'") + print(f" PASS Tokenization works: '{test_text}' -> {len(token_ids)} tokens -> '{decoded_text}'") # Test model forward pass batch_tokens = token_ids.reshape(1, -1) logits = system.model.forward(batch_tokens) expected_shape = (1, len(token_ids), system.model.vocab_size) assert logits.shape == expected_shape, f"Shape mismatch: {logits.shape} != {expected_shape}" - print(f" โœ… Model forward pass: {batch_tokens.shape} โ†’ {logits.shape}") + print(f" PASS Model forward pass: {batch_tokens.shape} -> {logits.shape}") # Test text generation generated = system.generate_text("the", max_new_tokens=3, verbose=False) assert len(generated) > len("the"), "Generation should add tokens" - print(f" โœ… Text generation: 'the' โ†’ '{generated}'") + print(f" PASS Text generation: 'the' -> '{generated}'") # Test performance profiling performance = system.profile_inference_performance(test_text, batch_sizes=[1]) assert len(performance['batch_results']) > 0, "Performance profiling should work" - print(f" โœ… Performance profiling: {performance['batch_results'][0]['tokens_per_second']:.1f} tokens/sec") + 
print(f" PASS Performance profiling: {performance['batch_results'][0]['tokens_per_second']:.1f} tokens/sec") - print("โœ… TinyGPT system integration test passed!") + print("PASS TinyGPT system integration test passed!") return True except Exception as e: - print(f"โŒ TinyGPT system test failed: {e}") + print(f"FAIL TinyGPT system test failed: {e}") return False def test_unit_systems_insights(): - """๐Ÿงช Unit Test: Systems Insights Functions""" - print("๐Ÿงช Unit Test: Systems Insights Analysis") + """TEST Unit Test: Systems Insights Functions""" + print("TEST Unit Test: Systems Insights Analysis") print("-" * 50) try: # Test complete system analysis analysis = analyze_complete_system_performance() assert 'complexity' in analysis, "Should include complexity analysis" - print(" โœ… Complete system performance analysis works") + print(" PASS Complete system performance analysis works") # Test scaling analysis scaling = analyze_scaling_bottlenecks() assert len(scaling) > 0, "Should return scaling results" - print(" โœ… Scaling bottleneck analysis works") + print(" PASS Scaling bottleneck analysis works") # Test pipeline analysis pipeline = analyze_end_to_end_pipeline() assert 'tokenization_ms' in pipeline, "Should include pipeline timing" - print(" โœ… End-to-end pipeline analysis works") + print(" PASS End-to-end pipeline analysis works") - print("โœ… Systems insights test passed!") + print("PASS Systems insights test passed!") return True except Exception as e: - print(f"โŒ Systems insights test failed: {e}") + print(f"FAIL Systems insights test failed: {e}") return False def test_unit_computational_assessments(): - """๐Ÿงช Unit Test: Computational Assessment Questions""" - print("๐Ÿงช Unit Test: Computational Assessment Questions") + """TEST Unit Test: Computational Assessment Questions""" + print("TEST Unit Test: Computational Assessment Questions") print("-" * 50) try: @@ -2210,33 +2210,33 @@ def test_unit_computational_assessments(): # Test integration 
analysis integration = analyze_system_integration_bottlenecks(system) assert 'pipeline_breakdown' in integration, "Should analyze pipeline" - print(" โœ… System integration analysis assessment works") + print(" PASS System integration analysis assessment works") # Test scaling analysis scaling = analyze_scaling_characteristics(system) assert 'sequence_scaling' in scaling, "Should analyze sequence scaling" - print(" โœ… Scaling characteristics assessment works") + print(" PASS Scaling characteristics assessment works") # Test optimization strategy optimization = design_optimization_strategy(system) assert 'current_performance' in optimization, "Should analyze current performance" - print(" โœ… Optimization strategy assessment works") + print(" PASS Optimization strategy assessment works") # Test deployment strategy deployment = design_production_deployment_strategy(system) assert 'system_analysis' in deployment, "Should analyze system" - print(" โœ… Production deployment assessment works") + print(" PASS Production deployment assessment works") - print("โœ… Computational assessments test passed!") + print("PASS Computational assessments test passed!") return True except Exception as e: - print(f"โŒ Computational assessments test failed: {e}") + print(f"FAIL Computational assessments test failed: {e}") return False def test_unit_all(): """Run all TinyGPT capstone unit tests.""" - print("๐Ÿงช Running All TinyGPT Capstone Unit Tests...") + print("TEST Running All TinyGPT Capstone Unit Tests...") print("=" * 60) tests = [ @@ -2253,11 +2253,11 @@ def test_unit_all(): print("=" * 60) if passed == len(tests): - print(f"๐ŸŽ‰ ALL TESTS PASSED! ({passed}/{len(tests)})") - print("โœ… TinyGPT Capstone module is fully operational!") + print(f"CELEBRATE ALL TESTS PASSED! 
({passed}/{len(tests)})") + print("PASS TinyGPT Capstone module is fully operational!") else: - print(f"⚠️ {len(tests) - passed}/{len(tests)} tests failed") - print("💡 Check TinyTorch component integration") + print(f"WARNING️ {len(tests) - passed}/{len(tests)} tests failed") + print("TIP Check TinyTorch component integration") return passed == len(tests) @@ -2283,20 +2283,20 @@ if __name__ == "__main__": checkpoint_results = run_learning_checkpoints() # Test complete system - print("\n🧪 Testing Complete TinyGPT System...") + print("\nTEST Testing Complete TinyGPT System...") system_tests_passed = test_unit_all() # Run comprehensive demonstration - print("\n🚀 Running Complete TinyGPT Demonstration...") + print("\nROCKET Running Complete TinyGPT Demonstration...") demo_results = run_complete_tinygpt_demonstration() - print(f"\n🎉 Module 20 Capstone Complete!") + print(f"\nCELEBRATE Module 20 Capstone Complete!") print(f"🏆 TinyGPT system fully integrated and operational!") - print(f"🚀 Ready for real-world ML systems engineering!") + print(f"ROCKET Ready for real-world ML systems engineering!") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions 1. **How does end-to-end system integration reveal bottlenecks invisible in isolated components?** Your TinyGPT system integrates tokenization, transformer layers, attention mechanisms, and generation into a complete pipeline. Analyze how profiling the complete system revealed different performance characteristics than testing individual components in isolation, and explain why production ML systems require end-to-end optimization rather than component-wise optimization. @@ -2309,7 +2309,7 @@ if __name__ == "__main__": # %% [markdown] """ -## 🎯 MODULE SUMMARY: TinyGPT Capstone - Complete ML Systems Mastery +## TARGET MODULE SUMMARY: TinyGPT Capstone - Complete ML Systems Mastery Congratulations! 
You have successfully completed the ultimate ML systems engineering challenge by building a complete language model from first principles. @@ -2325,11 +2325,11 @@ Congratulations! You have successfully completed the ultimate ML systems enginee - **TinyGPTSystem**: End-to-end pipeline with profiling, analysis, and optimization capabilities ### 🔧 **Technical Integration Achieved** -✅ **Component Integration**: All TinyTorch modules (02-19) working together seamlessly -✅ **Text Generation**: Working autoregressive language model with sampling and temperature control -✅ **Performance Analysis**: Complete system profiling with bottleneck identification and scaling analysis -✅ **Production Strategy**: Comprehensive deployment planning with monitoring and reliability considerations -✅ **Optimization Roadmap**: Phased optimization strategy based on actual performance profiling results +PASS **Component Integration**: All TinyTorch modules (02-19) working together seamlessly +PASS **Text Generation**: Working autoregressive language model with sampling and temperature control +PASS **Performance Analysis**: Complete system profiling with bottleneck identification and scaling analysis +PASS **Production Strategy**: Comprehensive deployment planning with monitoring and reliability considerations +PASS **Optimization Roadmap**: Phased optimization strategy based on actual performance profiling results ### 📊 **Systems Engineering Mastery** Your implementation demonstrates mastery of: @@ -2339,7 +2339,7 @@ Your implementation demonstrates mastery of: - **Production Deployment**: Real-world architecture design, monitoring strategies, and reliability planning - **End-to-End Thinking**: Integration challenges that only emerge when components work together -### 🎯 **Real-World Capability Achieved** +### TARGET **Real-World Capability Achieved** You can now: - **Build**: Complete language models from individual components - **Analyze**: System performance 
characteristics and scaling bottlenecks @@ -2357,7 +2357,7 @@ This capstone proves you understand: **You are now equipped to tackle real-world ML systems engineering challenges with confidence and expertise!** -### 🚀 **Next Steps** +### ROCKET **Next Steps** 1. **Apply Knowledge**: Use your TinyGPT system as foundation for more advanced projects 2. **Optimize Further**: Implement advanced optimizations from your roadmap 3. **Scale Up**: Deploy your system and measure real-world performance diff --git a/modules/source/08_normalization/normalization_dev.py b/modules/source/08_normalization/normalization_dev.py index 69f2cb49..1a9b7bd9 100644 --- a/modules/source/08_normalization/normalization_dev.py +++ b/modules/source/08_normalization/normalization_dev.py @@ -4,7 +4,7 @@ Welcome to Normalization! You'll implement the normalization techniques that make deep neural networks trainable and stable. -## 🔗 Building on Previous Learning +## LINK Building on Previous Learning **What You Built Before**: - Module 02 (Tensor): Data structures with gradient tracking - Module 04 (Layers): Neural network layer primitives @@ -19,7 +19,7 @@ Welcome to Normalization! You'll implement the normalization techniques that mak **Connection Map**: ``` -Layers → Normalization → Stable Training +Layers -> Normalization -> Stable Training (unstable) (stabilized) (convergence) ``` @@ -30,14 +30,14 @@ Layers → Normalization → Stable Training - **Framework connections**: Connect to PyTorch's nn.BatchNorm2d, nn.LayerNorm, nn.GroupNorm - **Optimization trade-offs**: Analyze memory vs stability vs computation trade-offs -## Build → Use → Reflect +## Build -> Use -> Reflect 1. **Build**: Implementation of BatchNorm, LayerNorm, and GroupNorm with running statistics 2. **Use**: Apply normalization to stabilize training of deep networks 3. **Reflect**: How do different normalization schemes affect memory, computation, and training dynamics? 
## Systems Reality Check -💡 **Production Context**: Normalization is critical in all modern deep learning - ResNet uses BatchNorm, Transformers use LayerNorm, modern ConvNets use GroupNorm -⚡ **Performance Insight**: BatchNorm adds 2× parameters per layer but often enables 10× larger learning rates, dramatically accelerating training +TIP **Production Context**: Normalization is critical in all modern deep learning - ResNet uses BatchNorm, Transformers use LayerNorm, modern ConvNets use GroupNorm +SPEED **Performance Insight**: BatchNorm adds 2* parameters per layer but often enables 10* larger learning rates, dramatically accelerating training ## What You'll Achieve By the end of this module, you'll have implemented the normalization arsenal that makes modern deep learning possible, with complete understanding of their memory characteristics and performance trade-offs. @@ -51,9 +51,9 @@ Internal covariate shift occurs when the distribution of inputs to each layer ch ### The Core Problem: ``` -Layer 1: x₁ → f₁(x₁) → y₁ (distribution D₁) -Layer 2: y₁ → f₂(y₁) → y₂ (distribution changes as f₁ changes!) -Layer 3: y₂ → f₃(y₂) → y₃ (distribution keeps shifting!) +Layer 1: x₁ -> f₁(x₁) -> y₁ (distribution D₁) +Layer 2: y₁ -> f₂(y₁) -> y₂ (distribution changes as f₁ changes!) +Layer 3: y₂ -> f₃(y₂) -> y₃ (distribution keeps shifting!) 
``` ### The Normalization Solution: @@ -65,7 +65,7 @@ Normalize activations to have stable statistics (mean=0, variance=1): Where: - μ = E[x] (mean) -- σ = √(Var[x] + ε) (standard deviation) +- σ = sqrt(Var[x] + ε) (standard deviation) - γ = learnable scale parameter - β = learnable shift parameter - ε = numerical stability constant (usually 1e-5) @@ -89,7 +89,7 @@ Where: - **Object Detection**: GroupNorm enables small-batch training with stable results ### Memory vs Performance Trade-offs -- **BatchNorm**: 2× parameters, but enables 5-10× larger learning rates +- **BatchNorm**: 2* parameters, but enables 5-10* larger learning rates - **LayerNorm**: No batch dimension dependence, consistent across batch sizes - **GroupNorm**: Balance between batch and layer normalization benefits """ @@ -141,9 +141,9 @@ Building normalization layers teaches: 1. **Normalization Axis Selection**: ``` - BatchNorm: Normalize across batch dimension (N, C, H, W) → across N - LayerNorm: Normalize across feature dimensions → across C, H, W - GroupNorm: Normalize across channel groups → within groups of C + BatchNorm: Normalize across batch dimension (N, C, H, W) -> across N + LayerNorm: Normalize across feature dimensions -> across C, H, W + GroupNorm: Normalize across channel groups -> within groups of C ``` 2. **Parameter Organization**: @@ -205,13 +205,13 @@ Batch Normalization normalizes activations across the batch dimension, making tr #| export class BatchNorm2d(Module): """ - Batch Normalization for 2D convolutions (4D tensors: N×C×H×W). + Batch Normalization for 2D convolutions (4D tensors: N*C*H*W). Normalizes across the batch dimension, computing μ and σ² across N, H, W for each channel C independently. MATHEMATICAL FOUNDATION: - BN(x) = γ * (x - μ_batch) / √(σ²_batch + ε) + β + BN(x) = γ * (x - μ_batch) / sqrt(σ²_batch + ε) + β Where μ_batch and σ²_batch are computed across (N, H, W) dimensions. 
""" @@ -229,13 +229,13 @@ class BatchNorm2d(Module): 4. Set training mode flag for different train/eval behavior MEMORY ANALYSIS: - - Learnable parameters: 2 ร— num_features (ฮณ and ฮฒ) - - Running statistics: 2 ร— num_features (running_mean and running_var) - - Total memory: 4 ร— num_features parameters + - Learnable parameters: 2 * num_features (ฮณ and ฮฒ) + - Running statistics: 2 * num_features (running_mean and running_var) + - Total memory: 4 * num_features parameters EXAMPLE (BatchNorm Usage): >>> bn = BatchNorm2d(64) # For 64 channels - >>> x = Tensor(np.random.randn(32, 64, 28, 28)) # batch ร— channels ร— height ร— width + >>> x = Tensor(np.random.randn(32, 64, 28, 28)) # batch * channels * height * width >>> normalized = bn(x) >>> print(f"Normalized shape: {normalized.shape}") # (32, 64, 28, 28) @@ -283,7 +283,7 @@ class BatchNorm2d(Module): 5. Update running statistics during training DIMENSION ANALYSIS for 4D input (N, C, H, W): - - Batch statistics computed across dims (0, 2, 3) โ†’ shape (C,) + - Batch statistics computed across dims (0, 2, 3) -> shape (C,) - ฮณ and ฮฒ broadcasted to match input: (1, C, 1, 1) - Output has same shape as input @@ -347,11 +347,11 @@ class BatchNorm2d(Module): self.training = False return self -# ๐Ÿ” SYSTEMS INSIGHT: Batch Normalization Memory Analysis +# MAGNIFY SYSTEMS INSIGHT: Batch Normalization Memory Analysis def analyze_batchnorm_memory(): """Let's analyze BatchNorm memory usage and batch dependency!""" try: - print("๐Ÿ” SYSTEMS INSIGHT: Batch Normalization Analysis") + print("MAGNIFY SYSTEMS INSIGHT: Batch Normalization Analysis") print("=" * 50) # Different channel sizes to show scaling @@ -361,38 +361,38 @@ def analyze_batchnorm_memory(): bn = BatchNorm2d(channels) # Parameter memory calculation - param_memory = 4 * channels * 4 # 4 params per channel ร— 4 bytes (float32) + param_memory = 4 * channels * 4 # 4 params per channel * 4 bytes (float32) print(f"Channels: {channels:4d} | Parameters: {4 * channels:4d} 
| Memory: {param_memory / 1024:.2f} KB") - print("\n๐Ÿ’ก KEY INSIGHTS:") + print("\nTIP KEY INSIGHTS:") print("โ€ข BatchNorm memory scales linearly with channel count") print("โ€ข Only 4 parameters per channel (ฮณ, ฮฒ, running_mean, running_var)") print("โ€ข Memory overhead is typically < 1% of layer weights") # Batch size dependency demonstration - print("\n๐ŸŽฏ BATCH SIZE DEPENDENCY:") + print("\nTARGET BATCH SIZE DEPENDENCY:") bn = BatchNorm2d(64) for batch_size in [1, 8, 32, 128]: x = Tensor(np.random.randn(batch_size, 64, 32, 32)) if batch_size == 1: - print(f"Batch size {batch_size:3d}: โš ๏ธ May be unstable (poor statistics)") + print(f"Batch size {batch_size:3d}: WARNING๏ธ May be unstable (poor statistics)") else: - print(f"Batch size {batch_size:3d}: โœ… Good statistics") + print(f"Batch size {batch_size:3d}: PASS Good statistics") print("\n๐Ÿšจ CRITICAL: BatchNorm needs batch_size > 1 for stable training!") print(" Single-sample batches have undefined variance") except Exception as e: - print(f"โš ๏ธ Error in BatchNorm analysis: {e}") + print(f"WARNING๏ธ Error in BatchNorm analysis: {e}") # Run the analysis analyze_batchnorm_memory() # %% [markdown] """ -### ๐Ÿงช Unit Test: Batch Normalization +### TEST Unit Test: Batch Normalization This test validates BatchNorm2d implementation, ensuring proper normalization across batch dimension and correct running statistics updates. 
""" @@ -472,11 +472,11 @@ def test_unit_batch_norm(): assert hasattr(bn, 'beta'), "Should have beta parameter" assert len(bn.parameters) == 2, "Should have 2 learnable parameters" - print("โœ… Batch normalization tests passed!") - print(f"โœ… Properly normalizes across batch dimension") - print(f"โœ… Updates running statistics during training") - print(f"โœ… Uses running statistics during evaluation") - print(f"โœ… Maintains gradient flow through learnable parameters") + print("PASS Batch normalization tests passed!") + print(f"PASS Properly normalizes across batch dimension") + print(f"PASS Updates running statistics during training") + print(f"PASS Uses running statistics during evaluation") + print(f"PASS Maintains gradient flow through learnable parameters") # Test function defined (called in main block) @@ -499,7 +499,7 @@ class LayerNorm(Module): Unlike BatchNorm, LayerNorm doesn't depend on batch statistics. MATHEMATICAL FOUNDATION: - LN(x) = ฮณ * (x - ฮผ) / โˆš(ฯƒยฒ + ฮต) + ฮฒ + LN(x) = ฮณ * (x - ฮผ) / sqrt(ฯƒยฒ + ฮต) + ฮฒ Where ฮผ and ฯƒยฒ are computed across feature dimensions for each sample. """ @@ -603,16 +603,16 @@ class LayerNorm(Module): """Allow LayerNorm to be called directly.""" return self.forward(x) -# โœ… IMPLEMENTATION CHECKPOINT: Basic LayerNorm complete +# PASS IMPLEMENTATION CHECKPOINT: Basic LayerNorm complete -# ๐Ÿค” PREDICTION: How does LayerNorm memory scale compared to BatchNorm? +# THINK PREDICTION: How does LayerNorm memory scale compared to BatchNorm? 
# Your guess: LayerNorm uses _____ memory than BatchNorm for the same feature size -# ๐Ÿ” SYSTEMS INSIGHT: LayerNorm vs BatchNorm Memory Comparison +# MAGNIFY SYSTEMS INSIGHT: LayerNorm vs BatchNorm Memory Comparison def compare_normalization_memory(): """Compare memory usage between different normalization techniques.""" try: - print("๐Ÿ” SYSTEMS INSIGHT: Normalization Memory Comparison") + print("MAGNIFY SYSTEMS INSIGHT: Normalization Memory Comparison") print("=" * 60) # Test different feature configurations @@ -637,13 +637,13 @@ def compare_normalization_memory(): print(f"{features:<8} {bn_memory/1024:.2f} KB {ln_memory/1024:.2f} KB {ratio:.1f}x {context}") - print(f"\n๐Ÿ’ก KEY INSIGHTS:") - print("โ€ข BatchNorm uses 2ร— more memory than LayerNorm") + print(f"\nTIP KEY INSIGHTS:") + print("โ€ข BatchNorm uses 2* more memory than LayerNorm") print("โ€ข BatchNorm stores running statistics (inference requirements)") print("โ€ข LayerNorm has no running state (batch-independent)") # Batch size independence demonstration - print(f"\n๐ŸŽฏ BATCH SIZE INDEPENDENCE:") + print(f"\nTARGET BATCH SIZE INDEPENDENCE:") ln = LayerNorm(256) for batch_size in [1, 8, 32, 128]: @@ -654,19 +654,19 @@ def compare_normalization_memory(): sample_mean = np.mean(output.data[0, :, :]) # First sample mean sample_var = np.var(output.data[0, :, :]) # First sample variance - print(f"Batch size {batch_size:3d}: Mean={sample_mean:.6f}, Var={sample_var:.6f} โœ…") + print(f"Batch size {batch_size:3d}: Mean={sample_mean:.6f}, Var={sample_var:.6f} PASS") print(f"\nโœจ LayerNorm gives consistent results regardless of batch size!") except Exception as e: - print(f"โš ๏ธ Error in normalization comparison: {e}") + print(f"WARNING๏ธ Error in normalization comparison: {e}") # Run the comparison compare_normalization_memory() # %% [markdown] """ -### ๐Ÿงช Unit Test: Layer Normalization +### TEST Unit Test: Layer Normalization This test validates LayerNorm implementation, ensuring proper normalization 
across feature dimensions and batch-size independence. """ @@ -751,11 +751,11 @@ def test_unit_layer_norm(): assert ln.gamma in ln.parameters, "Gamma should be tracked" assert ln.beta in ln.parameters, "Beta should be tracked" - print("โœ… Layer normalization tests passed!") - print(f"โœ… Properly normalizes across feature dimensions") - print(f"โœ… Works with any input shape") - print(f"โœ… Batch-size independent behavior") - print(f"โœ… Supports multi-dimensional normalization") + print("PASS Layer normalization tests passed!") + print(f"PASS Properly normalizes across feature dimensions") + print(f"PASS Works with any input shape") + print(f"PASS Batch-size independent behavior") + print(f"PASS Supports multi-dimensional normalization") # Test function defined (called in main block) @@ -780,7 +780,7 @@ class GroupNorm(Module): MATHEMATICAL FOUNDATION: For input (N, C, H, W) with G groups: 1. Reshape to (N, G, C//G, H, W) - 2. Normalize within each group: GN(x) = ฮณ * (x - ฮผ_group) / โˆš(ฯƒยฒ_group + ฮต) + ฮฒ + 2. Normalize within each group: GN(x) = ฮณ * (x - ฮผ_group) / sqrt(ฯƒยฒ_group + ฮต) + ฮฒ 3. 
Reshape back to (N, C, H, W) """ @@ -802,14 +802,14 @@ class GroupNorm(Module): - Parameters ฮณ and ฮฒ have shape (num_channels,) for per-channel scaling EXAMPLE (GroupNorm Configurations): - >>> gn1 = GroupNorm(32, 64) # 32 groups, 64 channels โ†’ 2 channels per group - >>> gn2 = GroupNorm(8, 256) # 8 groups, 256 channels โ†’ 32 channels per group - >>> gn3 = GroupNorm(1, 128) # 1 group, 128 channels โ†’ LayerNorm equivalent + >>> gn1 = GroupNorm(32, 64) # 32 groups, 64 channels -> 2 channels per group + >>> gn2 = GroupNorm(8, 256) # 8 groups, 256 channels -> 32 channels per group + >>> gn3 = GroupNorm(1, 128) # 1 group, 128 channels -> LayerNorm equivalent HINTS: - Use assert to validate num_channels % num_groups == 0 - - Special case: num_groups = num_channels โ†’ InstanceNorm (each channel is a group) - - Special case: num_groups = 1 โ†’ LayerNorm for spatial data + - Special case: num_groups = num_channels -> InstanceNorm (each channel is a group) + - Special case: num_groups = 1 -> LayerNorm for spatial data Args: num_groups: Number of groups to divide channels into @@ -846,7 +846,7 @@ class GroupNorm(Module): TODO: Implement group normalization forward pass. STEP-BY-STEP IMPLEMENTATION: - 1. Reshape input to separate groups: (N, C, H, W) โ†’ (N, G, C//G, H, W) + 1. Reshape input to separate groups: (N, C, H, W) -> (N, G, C//G, H, W) 2. Compute mean and variance within each group 3. Normalize within groups 4. 
Reshape back to original shape @@ -873,7 +873,7 @@ class GroupNorm(Module): N, C, H, W = x.shape assert C == self.num_channels, f"Expected {self.num_channels} channels, got {C}" - # Reshape to separate groups: (N, C, H, W) → (N, G, C//G, H, W) + # Reshape to separate groups: (N, C, H, W) -> (N, G, C//G, H, W) x_grouped = x.data.reshape(N, self.num_groups, self.channels_per_group, H, W) # Compute mean and variance within each group @@ -884,7 +884,7 @@ class GroupNorm(Module): # Normalize within groups normalized = (x_grouped - mean) / np.sqrt(var + self.eps) - # Reshape back to original shape: (N, G, C//G, H, W) → (N, C, H, W) + # Reshape back to original shape: (N, G, C//G, H, W) -> (N, C, H, W) normalized = normalized.reshape(N, C, H, W) # Apply per-channel learnable parameters @@ -896,16 +896,16 @@ class GroupNorm(Module): return Tensor(output) ### END SOLUTION -# ✅ IMPLEMENTATION CHECKPOINT: All normalization techniques complete +# PASS IMPLEMENTATION CHECKPOINT: All normalization techniques complete -# 🤔 PREDICTION: Which normalization uses the most memory - Batch, Layer, or Group? +# THINK PREDICTION: Which normalization uses the most memory - Batch, Layer, or Group?
# Your answer: _______ because _______ -# ๐Ÿ” SYSTEMS INSIGHT: Complete Normalization Scaling Analysis +# MAGNIFY SYSTEMS INSIGHT: Complete Normalization Scaling Analysis def analyze_normalization_scaling(): """Analyze how different normalization techniques scale with architecture size.""" try: - print("๐Ÿ” SYSTEMS INSIGHT: Normalization Scaling Analysis") + print("MAGNIFY SYSTEMS INSIGHT: Normalization Scaling Analysis") print("=" * 70) # Different model scales to analyze @@ -927,13 +927,13 @@ def analyze_normalization_scaling(): print(f"{channels:<8} {bn_memory/1024:.2f} KB {ln_memory/1024:.2f} KB {gn_memory/1024:.2f} KB {context}") - print(f"\n๐Ÿ’ก MEMORY INSIGHTS:") + print(f"\nTIP MEMORY INSIGHTS:") print("โ€ข BatchNorm: Highest memory (stores running statistics)") print("โ€ข LayerNorm: 50% less memory than BatchNorm") print("โ€ข GroupNorm: Same memory as LayerNorm") # Computational complexity analysis - print(f"\nโšก COMPUTATIONAL COMPLEXITY:") + print(f"\nSPEED COMPUTATIONAL COMPLEXITY:") batch_size, height, width = 32, 64, 64 channels = 256 @@ -949,7 +949,7 @@ def analyze_normalization_scaling(): print(f"LayerNorm FLOPs: ~{base_flops/1e6:.1f}M (per-sample statistics)") print(f"GroupNorm FLOPs: ~{base_flops/1e6:.1f}M (group statistics)") - print(f"\n๐ŸŽฏ WHEN TO USE EACH:") + print(f"\nTARGET WHEN TO USE EACH:") print("โ€ข BatchNorm: Large batches, CNNs, stable batch sizes") print("โ€ข LayerNorm: Transformers, variable batch sizes, RNNs") print("โ€ข GroupNorm: Small batches, object detection, fine-tuning") @@ -981,14 +981,14 @@ def analyze_normalization_scaling(): f"LN={ln_mean:.6f} GN={gn_mean:.6f}") except Exception as e: - print(f"โš ๏ธ Error in scaling analysis: {e}") + print(f"WARNING๏ธ Error in scaling analysis: {e}") # Run the scaling analysis analyze_normalization_scaling() # %% [markdown] """ -### ๐Ÿงช Unit Test: Group Normalization +### TEST Unit Test: Group Normalization This test validates GroupNorm implementation, ensuring proper grouping 
and normalization within channel groups. """ @@ -1080,11 +1080,11 @@ def test_unit_group_norm(): assert gn.gamma in gn.parameters, "Gamma should be tracked" assert gn.beta in gn.parameters, "Beta should be tracked" - print("✅ Group normalization tests passed!") - print(f"✅ Properly groups channels and normalizes within groups") - print(f"✅ Validates configuration constraints") - print(f"✅ Supports special cases (Instance/Layer norm variants)") - print(f"✅ Maintains gradient flow through learnable parameters") + print("PASS Group normalization tests passed!") + print(f"PASS Properly groups channels and normalizes within groups") + print(f"PASS Validates configuration constraints") + print(f"PASS Supports special cases (Instance/Layer norm variants)") + print(f"PASS Maintains gradient flow through learnable parameters") # Test function defined (called in main block) @@ -1103,17 +1103,17 @@ Here's how normalization layers are typically used in different architectures: **ConvNet with BatchNorm:** ``` -Conv2d → BatchNorm2d → ReLU → Conv2d → BatchNorm2d → ReLU → ... +Conv2d -> BatchNorm2d -> ReLU -> Conv2d -> BatchNorm2d -> ReLU -> ... ``` **Transformer with LayerNorm:** ``` -Embedding → LayerNorm → Attention → Add & Norm → FFN → Add & Norm → ... +Embedding -> LayerNorm -> Attention -> Add & Norm -> FFN -> Add & Norm -> ...
``` **ResNet Block with GroupNorm:** ``` -Conv2d → GroupNorm → ReLU → Conv2d → GroupNorm → Add → ReLU +Conv2d -> GroupNorm -> ReLU -> Conv2d -> GroupNorm -> Add -> ReLU ``` """ @@ -1168,8 +1168,8 @@ def demonstrate_normalization_usage(): print(f" Mean: {np.mean(gn_output.data):.6f}") print(f" Std: {np.std(gn_output.data):.3f}") - print(f"\n✅ All normalization techniques stabilize activations!") - print(f"✅ Mean ≈ 0, Std ≈ 1 for all methods") + print(f"\nPASS All normalization techniques stabilize activations!") + print(f"PASS Mean ~= 0, Std ~= 1 for all methods") ### END SOLUTION # Run the demonstration @@ -1182,16 +1182,16 @@ demonstrate_normalization_usage() Let's compare how different normalization techniques affect training stability by simulating gradient updates. """ -# ✅ IMPLEMENTATION CHECKPOINT: All normalization implementations complete +# PASS IMPLEMENTATION CHECKPOINT: All normalization implementations complete -# 🤔 PREDICTION: Which normalization technique will be most stable for very small batch sizes? +# THINK PREDICTION: Which normalization technique will be most stable for very small batch sizes?
# Your answer: _______ because _______ -# ๐Ÿ” SYSTEMS INSIGHT: Training Stability Analysis +# MAGNIFY SYSTEMS INSIGHT: Training Stability Analysis def analyze_training_stability(): """Analyze how normalization affects training stability across different scenarios.""" try: - print("๐Ÿ” SYSTEMS INSIGHT: Training Stability Analysis") + print("MAGNIFY SYSTEMS INSIGHT: Training Stability Analysis") print("=" * 60) # Test stability across different batch sizes @@ -1235,7 +1235,7 @@ def analyze_training_stability(): print(f"{batch_size:<12} {bn_stability:<12} {ln_stability:<12} {gn_stability:<12} {scenario}") - print(f"\n๐Ÿ’ก STABILITY INSIGHTS:") + print(f"\nTIP STABILITY INSIGHTS:") print("โ€ข BatchNorm: Unstable with batch_size=1, best with large batches") print("โ€ข LayerNorm: Consistent across all batch sizes") print("โ€ข GroupNorm: Consistent across all batch sizes") @@ -1258,20 +1258,20 @@ def analyze_training_stability(): print(f"After LayerNorm: ~{np.linalg.norm(ln_out.data):.3f} (normalized)") print(f"After GroupNorm: ~{np.linalg.norm(gn_out.data):.3f} (normalized)") - print(f"\n๐ŸŽฏ PRACTICAL RECOMMENDATIONS:") - print("โ€ข Use BatchNorm for: CNNs with batch_size โ‰ฅ 8, stable training") + print(f"\nTARGET PRACTICAL RECOMMENDATIONS:") + print("โ€ข Use BatchNorm for: CNNs with batch_size >= 8, stable training") print("โ€ข Use LayerNorm for: Transformers, RNNs, variable batch sizes") print("โ€ข Use GroupNorm for: Object detection, fine-tuning, small batches") except Exception as e: - print(f"โš ๏ธ Error in stability analysis: {e}") + print(f"WARNING๏ธ Error in stability analysis: {e}") # Run the stability analysis analyze_training_stability() # %% [markdown] """ -### ๐Ÿงช Integration Test: Complete Normalization Suite +### TEST Integration Test: Complete Normalization Suite This test validates that all normalization techniques work together and can be used interchangeably in neural network architectures. 
""" @@ -1352,11 +1352,11 @@ def test_unit_normalization_integration(): assert bn_total_memory > ln_total_memory, "BatchNorm should use more memory (running stats)" assert ln_total_memory == gn_total_memory, "LayerNorm and GroupNorm should use same memory" - print("โœ… Normalization integration tests passed!") - print(f"โœ… All techniques work with same input format") - print(f"โœ… All produce appropriately normalized outputs") - print(f"โœ… Memory usage patterns are as expected") - print(f"โœ… Batch size independence works correctly") + print("PASS Normalization integration tests passed!") + print(f"PASS All techniques work with same input format") + print(f"PASS All produce appropriately normalized outputs") + print(f"PASS Memory usage patterns are as expected") + print(f"PASS Batch size independence works correctly") # Test function defined (called in main block) @@ -1380,7 +1380,7 @@ def benchmark_normalization_performance(): This function is PROVIDED for educational analysis. """ - print("โšก Performance Benchmark: Normalization Techniques") + print("SPEED Performance Benchmark: Normalization Techniques") print("=" * 55) import time @@ -1432,7 +1432,7 @@ def benchmark_normalization_performance(): speedup = baseline / time_ms print(f" {name}: {speedup:.2f}x relative to BatchNorm") - print(f"\n๐Ÿ’ก Performance Insights:") + print(f"\nTIP Performance Insights:") print(f" โ€ข All normalizations have similar computational complexity") print(f" โ€ข Differences mainly due to memory access patterns") print(f" โ€ข BatchNorm may be slightly faster due to batch parallelization") @@ -1449,7 +1449,7 @@ Run all tests to validate our normalization implementations. 
if __name__ == "__main__": """Main execution block - runs all normalization tests.""" - print("🧪 Running Complete Normalization Test Suite") + print("TEST Running Complete Normalization Test Suite") print("=" * 50) # Run all unit tests @@ -1465,13 +1465,13 @@ if __name__ == "__main__": test_unit_normalization_integration() print() - print("✅ All normalization tests passed!") - print("\n🎯 NORMALIZATION SUITE COMPLETE") + print("PASS All normalization tests passed!") + print("\nTARGET NORMALIZATION SUITE COMPLETE") print("Your normalization implementations are ready for use in neural networks!") # %% [markdown] """ -## 🤔 ML Systems Thinking: Interactive Questions +## THINK ML Systems Thinking: Interactive Questions Now that you've implemented all three major normalization techniques, let's reflect on their systems implications and design trade-offs. """ @@ -1480,7 +1480,7 @@ Now that you've implemented all three major normalization techniques, let's refl """ ### Question 1: Memory and Batch Size Trade-offs -**Context**: In your BatchNorm2d implementation, you saw that running statistics require additional memory (4× parameters vs 2× for LayerNorm/GroupNorm), but BatchNorm fails completely with batch_size=1. Your memory analysis showed that BatchNorm needs 2× the memory of other techniques, while your stability analysis revealed batch size dependencies. +**Context**: In your BatchNorm2d implementation, you saw that running statistics require additional memory (4* parameters vs 2* for LayerNorm/GroupNorm), but BatchNorm fails completely with batch_size=1. Your memory analysis showed that BatchNorm needs 2* the memory of other techniques, while your stability analysis revealed batch size dependencies. **Reflection Question**: Analyze the memory vs batch size trade-offs in your normalization implementations. When you tested different batch sizes, you discovered BatchNorm becomes unstable with small batches while LayerNorm/GroupNorm remain consistent.
For a production system that needs to handle both training (large batches) and inference (single samples), how would you modify your current normalization implementations to optimize memory usage while maintaining stability? Consider the running statistics storage in your BatchNorm class and the per-sample computation in your LayerNorm class. @@ -1517,39 +1517,39 @@ Think about: automatic technique selection, runtime adaptation, memory budget co # %% [markdown] """ -## 🎯 MODULE SUMMARY: Normalization +## TARGET MODULE SUMMARY: Normalization Congratulations! You have successfully implemented the complete normalization toolkit that makes modern deep learning possible: -### ✅ What You Have Built +### PASS What You Have Built - **BatchNorm2d**: Complete batch normalization with running statistics and train/eval modes - **LayerNorm**: Batch-independent normalization for any tensor dimensions - **GroupNorm**: Channel group normalization balancing batch and layer norm benefits - **🆕 Comprehensive Analysis**: Memory scaling, training stability, and performance benchmarking - **🆕 Integration Examples**: How normalization fits into different network architectures -### ✅ Technical Mastery +### PASS Technical Mastery - **Statistical Computing**: Efficient mean/variance computation across different tensor dimensions - **Memory Management**: Understanding parameter storage vs running statistics trade-offs - **Training Dynamics**: How normalization affects gradient flow and training stability - **Batch Dependencies**: When and why batch size affects normalization behavior - **🆕 Production Patterns**: Architecture-specific normalization choices and deployment considerations -### ✅ Systems Understanding -- **Memory Scaling**: BatchNorm uses 2× memory of LayerNorm/GroupNorm due to running statistics +### PASS Systems Understanding +- **Memory Scaling**: BatchNorm uses 2* memory of LayerNorm/GroupNorm due to running statistics - **Computational Complexity**: All
techniques have similar O(N) complexity but different access patterns - **Batch Size Effects**: BatchNorm requires batch_size > 1, others work with any batch size - **Cache Efficiency**: How normalization axes affect memory access patterns and vectorization - **🆕 Training Stability**: Why normalization enables higher learning rates and deeper networks -### 🔗 Connection to Real ML Systems +### LINK Connection to Real ML Systems Your implementations mirror production systems: - **PyTorch nn.BatchNorm2d**: Your BatchNorm2d matches PyTorch's interface and behavior - **BERT LayerNorm**: Your LayerNorm enables transformer training stability - **Object Detection GroupNorm**: Your GroupNorm provides batch-independent normalization - **Production Deployment**: Understanding of when to use each technique in real systems -### 🚀 What You Can Build Now +### ROCKET What You Can Build Now - **Stable CNNs**: Use BatchNorm for ResNet-style architectures with large batches - **Transformer Models**: Use LayerNorm for attention-based architectures - **Detection Systems**: Use GroupNorm for models with variable batch sizes @@ -1560,5 +1560,5 @@ Your implementations mirror production systems: 2. **Integration ready**: Your normalization layers integrate with any neural network architecture 3. **Ready for Module 09**: Spatial operations will use your normalization for CNN stability -**🎉 Achievement Unlocked**: You've mastered the normalization techniques that enable modern deep learning, with complete understanding of their memory characteristics and performance trade-offs! +**CELEBRATE Achievement Unlocked**: You've mastered the normalization techniques that enable modern deep learning, with complete understanding of their memory characteristics and performance trade-offs!
""" \ No newline at end of file diff --git a/modules/source/13_kernels/kernels_dev.py b/modules/source/13_kernels/kernels_dev.py index 5ed12a11..f26aa3e5 100644 --- a/modules/source/13_kernels/kernels_dev.py +++ b/modules/source/13_kernels/kernels_dev.py @@ -4,7 +4,7 @@ Welcome to Kernels! You'll implement high-performance computational kernels that power modern ML systems! -## ๐Ÿ”— Building on Previous Learning +## LINK Building on Previous Learning **What You Built Before**: - Module 11 (Training): Complete training loops with gradient computation - Module 12 (Regularization): Advanced training techniques for robust models @@ -17,7 +17,7 @@ Welcome to Kernels! You'll implement high-performance computational kernels that **Connection Map**: ``` -Training โ†’ Kernels โ†’ Benchmarking +Training -> Kernels -> Benchmarking (correct) (fast) (measured) ``` @@ -28,14 +28,14 @@ Training โ†’ Kernels โ†’ Benchmarking - **Framework connections**: Understanding how PyTorch and TensorFlow achieve high performance - **Optimization trade-offs**: Balancing memory usage, computational complexity, and parallelism -## Build โ†’ Use โ†’ Reflect +## Build -> Use -> Reflect 1. **Build**: Implement optimized kernels for matrix operations, activations, and memory management 2. **Use**: Apply kernels to real ML workloads and measure performance improvements 3. 
**Reflect**: Analyze optimization patterns and design production-grade kernel architectures ## Systems Reality Check -๐Ÿ’ก **Production Context**: PyTorch uses custom CUDA kernels and CPU vectorization for 10-100x speedups -โšก **Performance Insight**: Memory bandwidth is often the limiting factor, not compute - optimize data movement first +TIP **Production Context**: PyTorch uses custom CUDA kernels and CPU vectorization for 10-100x speedups +SPEED **Performance Insight**: Memory bandwidth is often the limiting factor, not compute - optimize data movement first """ # %% [markdown] @@ -46,18 +46,18 @@ High-performance kernels are optimized computational functions that leverage har ``` CPU Kernels: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ SIMD Instructions (AVX, SSE) โ”‚ โ† Process 4-16 floats simultaneously -โ”‚ Cache-Friendly Memory Patterns โ”‚ โ† Minimize cache misses -โ”‚ Loop Unrolling & Vectorization โ”‚ โ† Eliminate loop overhead -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++-------------------------------------+ +| SIMD Instructions (AVX, SSE) | <- Process 4-16 floats simultaneously +| Cache-Friendly Memory Patterns | <- Minimize cache misses +| Loop Unrolling & Vectorization | <- Eliminate loop overhead ++-------------------------------------+ GPU Kernels: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Thread Blocks & Shared Memory โ”‚ โ† Parallel processing with fast memory -โ”‚ Memory Coalescing โ”‚ โ† Efficient global memory access -โ”‚ Warp-Level Operations โ”‚ โ† 32 threads execute together -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++-------------------------------------+ +| Thread Blocks & Shared Memory | <- Parallel processing with fast 
memory +| Memory Coalescing | <- Efficient global memory access +| Warp-Level Operations | <- 32 threads execute together ++-------------------------------------+ ``` **Why This Matters for ML Systems:** @@ -112,15 +112,15 @@ y = np.maximum(0, x) # 8 operations per cycle ``` Row-Major Access (Fast): -A[0,0] A[0,1] A[0,2] A[0,3] ... โ† Sequential memory access +A[0,0] A[0,1] A[0,2] A[0,3] ... <- Sequential memory access Column-Major Access (Slow): -A[0,0] A[1,0] A[2,0] A[3,0] ... โ† Strided memory access +A[0,0] A[1,0] A[2,0] A[3,0] ... <- Strided memory access Cache Line Impact: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” -โ”‚ A[0,0:4] loaded together โ”‚ โ† 64-byte cache line -โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ ++-----+-----+-----+-----+ +| A[0,0:4] loaded together | <- 64-byte cache line ++-----+-----+-----+-----+ ``` """ @@ -212,21 +212,21 @@ except ImportError: Our kernel optimization strategy follows a systematic hierarchy: ``` -๐ŸŽฏ Optimization Strategy: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ 1. Correctness: Get the right answer โ”‚ -โ”‚ 2. Cache Optimization: Memory patterns โ”‚ -โ”‚ 3. Vectorization: SIMD instructions โ”‚ -โ”‚ 4. Parallelization: Multi-core โ”‚ -โ”‚ 5. Quantization: Reduced precision โ”‚ -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +TARGET Optimization Strategy: ++-------------------------------------+ +| 1. Correctness: Get the right answer | +| 2. Cache Optimization: Memory patterns | +| 3. Vectorization: SIMD instructions | +| 4. Parallelization: Multi-core | +| 5. 
Quantization: Reduced precision | ++-------------------------------------+ ๐Ÿ”ง Implementation Layers: -โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” -โ”‚ Higher Level: Kernel Composition โ”‚ โ† Combine optimizations -โ”‚ Mid Level: Algorithm Optimization โ”‚ โ† Cache blocking, tiling -โ”‚ Lower Level: Hardware Primitives โ”‚ โ† SIMD, memory layout -โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ++-------------------------------------+ +| Higher Level: Kernel Composition | <- Combine optimizations +| Mid Level: Algorithm Optimization | <- Cache blocking, tiling +| Lower Level: Hardware Primitives | <- SIMD, memory layout ++-------------------------------------+ ``` **Design Principles:** @@ -292,12 +292,12 @@ def time_kernel(func: Callable, *args, **kwargs) -> Tuple[Any, float]: return result, execution_time_us ### END SOLUTION -# โœ… IMPLEMENTATION CHECKPOINT: Timing infrastructure complete +# PASS IMPLEMENTATION CHECKPOINT: Timing infrastructure complete -# ๐Ÿค” PREDICTION: How much timing overhead does our measurement add? +# THINK PREDICTION: How much timing overhead does our measurement add? # Your guess: _____ microseconds -# ๐Ÿ” SYSTEMS INSIGHT: Timing Overhead Analysis +# MAGNIFY SYSTEMS INSIGHT: Timing Overhead Analysis def analyze_timing_overhead(): """Measure the overhead of our timing infrastructure.""" try: @@ -321,7 +321,7 @@ def analyze_timing_overhead(): print(f" Minimum: {min_overhead:.3f} ฮผs") print(f" Relative precision: ยฑ{std_overhead/avg_overhead*100:.1f}%") - # ๐Ÿ’ก WHY THIS MATTERS: Timing overhead must be much smaller than + # TIP WHY THIS MATTERS: Timing overhead must be much smaller than # the operations we're measuring, or results will be meaningless. 
# Modern CPUs: ~1-10 ฮผs overhead, so measure operations >100 ฮผs @@ -331,7 +331,7 @@ def analyze_timing_overhead(): 'reliable_for_operations_above_us': avg_overhead * 10 } except Exception as e: - print(f"โš ๏ธ Timing analysis error: {e}") + print(f"WARNING๏ธ Timing analysis error: {e}") return None # Run the analysis @@ -339,14 +339,14 @@ timing_analysis = analyze_timing_overhead() # %% [markdown] """ -### ๐Ÿงช Unit Test: Timing Infrastructure +### TEST Unit Test: Timing Infrastructure This test validates `time_kernel`, ensuring accurate performance measurement """ # %% def test_unit_timing_infrastructure(): """Test timing infrastructure with known operations.""" - print("๐Ÿงช Unit Test: Timing Infrastructure") + print("TEST Unit Test: Timing Infrastructure") # Test 1: Basic timing functionality def test_operation(): @@ -357,7 +357,7 @@ def test_unit_timing_infrastructure(): assert result == "done", "Function result should be preserved" assert 800 <= elapsed_us <= 2000, f"1ms sleep should take ~1000ฮผs, got {elapsed_us:.1f}ฮผs" - print(f"โœ… Basic timing: {elapsed_us:.1f}ฮผs for 1ms operation") + print(f"PASS Basic timing: {elapsed_us:.1f}ฮผs for 1ms operation") # Test 2: Timing precision def fast_operation(): @@ -370,7 +370,7 @@ def test_unit_timing_infrastructure(): cv = np.std(measurements) / np.mean(measurements) assert cv < 0.5, f"Timing precision should be reasonable, CV={cv:.3f}" - print(f"โœ… Timing precision: CV={cv:.3f} across 10 measurements") + print(f"PASS Timing precision: CV={cv:.3f} across 10 measurements") # Test 3: Argument passing def add_operation(a, b, c=0): @@ -378,7 +378,7 @@ def test_unit_timing_infrastructure(): result, _ = time_kernel(add_operation, 5, 10, c=2) assert result == 17, f"Arguments should pass correctly, got {result}" - print("โœ… Argument passing works correctly") + print("PASS Argument passing works correctly") # Run the test test_unit_timing_infrastructure() @@ -432,9 +432,9 @@ def matmul_baseline(A: np.ndarray, B: 
np.ndarray) -> np.ndarray: return result ### END SOLUTION -# โœ… IMPLEMENTATION CHECKPOINT: Baseline matrix multiplication complete +# PASS IMPLEMENTATION CHECKPOINT: Baseline matrix multiplication complete -# ๐Ÿ” SYSTEMS INSIGHT: Matrix Multiplication Performance Scaling +# MAGNIFY SYSTEMS INSIGHT: Matrix Multiplication Performance Scaling def analyze_matmul_scaling(): """Analyze how matrix multiplication performance scales with size.""" try: @@ -471,14 +471,14 @@ def analyze_matmul_scaling(): print(f" Theoretical (O(nยณ)): {size_scaling:.1f}x") print(f" Efficiency: {efficiency:.3f} (1.0 = perfect scaling)") - # ๐Ÿ’ก WHY THIS MATTERS: Matrix multiplication is O(nยณ), but cache effects + # TIP WHY THIS MATTERS: Matrix multiplication is O(nยณ), but cache effects # and memory bandwidth limits mean real performance doesn't scale perfectly. # Understanding these limits helps size operations for optimal performance. return results except Exception as e: - print(f"โš ๏ธ Scaling analysis error: {e}") + print(f"WARNING๏ธ Scaling analysis error: {e}") return None # Run the analysis @@ -557,14 +557,14 @@ def cache_friendly_matmul(A: np.ndarray, B: np.ndarray, block_size: int = 64) -> # %% [markdown] """ -### ๐Ÿงช Unit Test: Cache-Friendly Matrix Multiplication +### TEST Unit Test: Cache-Friendly Matrix Multiplication This test validates `cache_friendly_matmul`, ensuring correctness and performance improvement """ # %% def test_unit_cache_friendly_matmul(): """Test cache-friendly matrix multiplication.""" - print("๐Ÿงช Unit Test: Cache-Friendly Matrix Multiplication") + print("TEST Unit Test: Cache-Friendly Matrix Multiplication") # Test 1: Correctness A = np.array([[1, 2], [3, 4]], dtype=np.float32) @@ -574,7 +574,7 @@ def test_unit_cache_friendly_matmul(): result_baseline = matmul_baseline(A, B) assert np.allclose(result_cache, result_baseline), "Cache-friendly result should match baseline" - print("โœ… Correctness: Matches baseline implementation") + print("PASS 
Correctness: Matches baseline implementation") # Test 2: Performance comparison size = 256 @@ -584,7 +584,7 @@ def test_unit_cache_friendly_matmul(): _, baseline_time = time_kernel(matmul_baseline, A_large, B_large) _, cache_time = time_kernel(cache_friendly_matmul, A_large, B_large, 64) - print(f"โœ… Performance: Baseline={baseline_time:.1f}ฮผs, Cache-friendly={cache_time:.1f}ฮผs") + print(f"PASS Performance: Baseline={baseline_time:.1f}ฮผs, Cache-friendly={cache_time:.1f}ฮผs") # Test 3: Different block sizes block_sizes = [32, 64, 128] @@ -592,7 +592,7 @@ def test_unit_cache_friendly_matmul(): result = cache_friendly_matmul(A, B, block_size=bs) assert np.allclose(result, result_baseline), f"Block size {bs} should be correct" - print(f"โœ… Block sizes: Tested {block_sizes}") + print(f"PASS Block sizes: Tested {block_sizes}") # Run the test test_unit_cache_friendly_matmul() @@ -714,9 +714,9 @@ def vectorized_operations(x: np.ndarray, y: np.ndarray) -> Dict[str, np.ndarray] return results ### END SOLUTION -# โœ… IMPLEMENTATION CHECKPOINT: Vectorized operations complete +# PASS IMPLEMENTATION CHECKPOINT: Vectorized operations complete -# ๐Ÿ” SYSTEMS INSIGHT: Vectorization Performance Analysis +# MAGNIFY SYSTEMS INSIGHT: Vectorization Performance Analysis def analyze_vectorization_performance(): """Compare vectorized vs scalar performance.""" try: @@ -755,7 +755,7 @@ def analyze_vectorization_performance(): print(f" Vectorized operations: {ops_time:.1f}ฮผs") print(f" Throughput: {operations_per_second/1e6:.1f}M ops/sec") - # ๐Ÿ’ก WHY THIS MATTERS: Vectorization provides 4-16x speedups on modern CPUs. + # TIP WHY THIS MATTERS: Vectorization provides 4-16x speedups on modern CPUs. # This is essential for real-time inference and efficient training. # ML frameworks like PyTorch rely heavily on vectorized operations. 
@@ -765,7 +765,7 @@ def analyze_vectorization_performance(): } except Exception as e: - print(f"โš ๏ธ Vectorization analysis error: {e}") + print(f"WARNING๏ธ Vectorization analysis error: {e}") return None # Run the analysis @@ -773,14 +773,14 @@ vectorization_analysis = analyze_vectorization_performance() # %% [markdown] """ -### ๐Ÿงช Unit Test: Vectorized Operations +### TEST Unit Test: Vectorized Operations This test validates vectorized implementations for correctness and performance """ # %% def test_unit_vectorized_operations(): """Test vectorized operations.""" - print("๐Ÿงช Unit Test: Vectorized Operations") + print("TEST Unit Test: Vectorized Operations") # Test 1: Vectorized ReLU correctness x = np.array([-2, -1, 0, 1, 2], dtype=np.float32) @@ -788,7 +788,7 @@ def test_unit_vectorized_operations(): expected = np.array([0, 0, 0, 1, 2], dtype=np.float32) assert np.allclose(result, expected), "Vectorized ReLU should be correct" - print("โœ… ReLU correctness: Produces expected outputs") + print("PASS ReLU correctness: Produces expected outputs") # Test 2: Vectorized operations correctness x = np.array([1, 2, 3, 4], dtype=np.float32) @@ -800,7 +800,7 @@ def test_unit_vectorized_operations(): assert np.allclose(results['element_wise_multiply'], [2, 6, 12, 20]), "Multiplication should be correct" assert np.allclose(results['dot_product'], 40), "Dot product should be correct" - print("โœ… Operations correctness: All operations produce expected results") + print("PASS Operations correctness: All operations produce expected results") # Test 3: Performance with larger arrays large_x = np.random.randn(10000).astype(np.float32) @@ -812,7 +812,7 @@ def test_unit_vectorized_operations(): assert relu_time < 1000, f"ReLU should be fast, took {relu_time:.1f}ฮผs" assert ops_time < 5000, f"Operations should be fast, took {ops_time:.1f}ฮผs" - print(f"โœ… Performance: ReLU={relu_time:.1f}ฮผs, Operations={ops_time:.1f}ฮผs") + print(f"PASS Performance: ReLU={relu_time:.1f}ฮผs, 
Operations={ops_time:.1f}μs")

 # Run the test
 test_unit_vectorized_operations()
@@ -958,9 +958,9 @@ def parallel_batch_processing(batch_data: np.ndarray, operation: Callable = None
     return np.concatenate(results, axis=0)
     ### END SOLUTION

-# ✅ IMPLEMENTATION CHECKPOINT: Parallel processing complete
+# PASS IMPLEMENTATION CHECKPOINT: Parallel processing complete

-# 🔍 SYSTEMS INSIGHT: Parallel Processing Scaling Analysis
+# MAGNIFY SYSTEMS INSIGHT: Parallel Processing Scaling Analysis
 def analyze_parallel_scaling():
     """Analyze how parallel processing scales with worker count."""
     try:
@@ -1017,7 +1017,7 @@ def analyze_parallel_scaling():
         print(f" ReLU efficiency: {max_speedup_relu/8:.2f} (theoretical max: 1.0)")
         print(f" Batch efficiency: {max_speedup_batch/8:.2f} (theoretical max: 1.0)")

-        # 💡 WHY THIS MATTERS: Parallel processing has diminishing returns due to:
+        # TIP WHY THIS MATTERS: Parallel processing has diminishing returns due to:
         # 1. Thread overhead and synchronization costs
         # 2. Memory bandwidth limitations
         # 3. Amdahl's law - sequential portions limit speedup
@@ -1026,7 +1026,7 @@ def analyze_parallel_scaling():
         return results

     except Exception as e:
-        print(f"⚠️ Parallel scaling analysis error: {e}")
+        print(f"WARNING️ Parallel scaling analysis error: {e}")
         return None

 # Run the analysis
@@ -1034,14 +1034,14 @@ parallel_scaling = analyze_parallel_scaling()

 # %% [markdown]
 """
-### 🧪 Unit Test: Parallel Processing
+### TEST Unit Test: Parallel Processing
 This test validates parallel implementations for correctness and performance scaling
 """

 # %%
 def test_unit_parallel_processing():
     """Test parallel processing implementations."""
-    print("🧪 Unit Test: Parallel Processing")
+    print("TEST Unit Test: Parallel Processing")

     # Test 1: Parallel ReLU correctness
     x = np.array([-2, -1, 0, 1, 2], dtype=np.float32)
@@ -1050,7 +1050,7 @@ def test_unit_parallel_processing():
     result_sequential = vectorized_relu(x)

     assert np.allclose(result_parallel, result_sequential), "Parallel ReLU should match sequential"
-    print("✅ ReLU correctness: Parallel matches sequential result")
+    print("PASS ReLU correctness: Parallel matches sequential result")

     # Test 2: Parallel batch processing correctness
     batch = np.random.randn(16, 10).astype(np.float32)
@@ -1060,7 +1060,7 @@ def test_unit_parallel_processing():

     assert np.allclose(result_parallel, result_sequential), "Parallel batch should match sequential"
     assert result_parallel.shape == batch.shape, "Output shape should match input"
-    print("✅ Batch correctness: Parallel matches sequential result")
+    print("PASS Batch correctness: Parallel matches sequential result")

     # Test 3: Performance with larger data
     large_x = np.random.randn(20000).astype(np.float32)
@@ -1069,7 +1069,7 @@ def test_unit_parallel_processing():
     _, sequential_time = time_kernel(vectorized_relu, large_x)
     _, parallel_time = time_kernel(parallel_relu, large_x, 4)

-    print(f"✅ Performance: Sequential={sequential_time:.1f}μs, Parallel={parallel_time:.1f}μs")
+    print(f"PASS Performance: Sequential={sequential_time:.1f}μs, Parallel={parallel_time:.1f}μs")

     # Test 4: Edge cases
     small_x = np.array([1, 2, 3])
@@ -1077,7 +1077,7 @@ def test_unit_parallel_processing():
     expected_small = vectorized_relu(small_x)

     assert np.allclose(result_small, expected_small), "Small arrays should work correctly"
-    print("✅ Edge cases: Small arrays handled correctly")
+    print("PASS Edge cases: Small arrays handled correctly")

 # Run the test
 test_unit_parallel_processing()
@@ -1125,7 +1125,7 @@ def quantized_matmul(A: np.ndarray, B: np.ndarray, bits: int = 8) -> np.ndarray:
     >>> C = quantized_matmul(A, B, bits=8)

     PERFORMANCE BENEFITS:
-    - 4x memory reduction (float32 → int8)
+    - 4x memory reduction (float32 -> int8)
     - Faster integer arithmetic on some hardware
     - Enables deployment on memory-constrained devices
     """
@@ -1226,9 +1226,9 @@ def quantized_relu(x: np.ndarray, bits: int = 8) -> np.ndarray:
     return result
     ### END SOLUTION

-# ✅ IMPLEMENTATION CHECKPOINT: Quantization kernels complete
+# PASS IMPLEMENTATION CHECKPOINT: Quantization kernels complete

-# 🔍 SYSTEMS INSIGHT: Quantization Analysis
+# MAGNIFY SYSTEMS INSIGHT: Quantization Analysis
 def analyze_quantization_impact():
     """Analyze the impact of quantization on accuracy and performance."""
     try:
@@ -1281,7 +1281,7 @@ def analyze_quantization_impact():
         print(f" MatMul: {baseline_time:.1f}μs, {baseline_size/1024:.1f}KB")
         print(f" ReLU: {baseline_relu_time:.1f}μs, {x.nbytes/1024:.1f}KB")

-        # 💡 WHY THIS MATTERS: Quantization trades accuracy for memory and speed.
+        # TIP WHY THIS MATTERS: Quantization trades accuracy for memory and speed.
         # 8-bit quantization: 4x memory reduction, variable performance impact
         # Critical for edge deployment where memory is constrained
         # Modern ML accelerators (TPUs, mobile chips) heavily use quantization
@@ -1293,7 +1293,7 @@ def analyze_quantization_impact():
         }

     except Exception as e:
-        print(f"⚠️ Quantization analysis error: {e}")
+        print(f"WARNING️ Quantization analysis error: {e}")
         return None

 # Run the analysis
@@ -1301,14 +1301,14 @@ quantization_analysis = analyze_quantization_impact()

 # %% [markdown]
 """
-### 🧪 Unit Test: Quantization Kernels
+### TEST Unit Test: Quantization Kernels
 This test validates quantization implementations for correctness and efficiency trade-offs
 """

 # %%
 def test_unit_quantization_kernels():
     """Test quantization kernel implementations."""
-    print("🧪 Unit Test: Quantization Kernels")
+    print("TEST Unit Test: Quantization Kernels")

     # Test 1: Quantized matrix multiplication correctness
     A = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
@@ -1320,7 +1320,7 @@ def test_unit_quantization_kernels():
     # Should be approximately correct (quantization introduces error)
     relative_error = np.mean(np.abs(result_quant - result_baseline) / np.abs(result_baseline + 1e-8))
     assert relative_error < 0.1, f"Quantization error too high: {relative_error:.3f}"
-    print(f"✅ MatMul quantization: relative error {relative_error:.3f}")
+    print(f"PASS MatMul quantization: relative error {relative_error:.3f}")

     # Test 2: Quantized ReLU correctness
     x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=np.float32)
@@ -1331,7 +1331,7 @@ def test_unit_quantization_kernels():
     # Check that negative values become zero and positive values remain positive
     assert np.all(result_quant_relu >= 0), "Quantized ReLU should be non-negative"
     assert np.allclose(result_quant_relu[x <= 0], 0, atol=0.1), "Negative inputs should become zero"
-    print("✅ ReLU quantization: maintains ReLU properties")
+    print("PASS ReLU quantization: maintains ReLU properties")

     # Test 3: Different bit depths
     for bits in [8, 16]:
@@ -1341,7 +1341,7 @@ def test_unit_quantization_kernels():
         result_relu_bits = quantized_relu(x, bits=bits)
         assert result_relu_bits.shape == x.shape, f"{bits}-bit ReLU shape should match"

-    print("✅ Bit depths: 8-bit and 16-bit quantization work correctly")
+    print("PASS Bit depths: 8-bit and 16-bit quantization work correctly")

     # Test 4: Performance characteristics
     large_A = np.random.randn(64, 64).astype(np.float32)
@@ -1350,7 +1350,7 @@ def test_unit_quantization_kernels():
     _, baseline_time = time_kernel(matmul_baseline, large_A, large_B)
     _, quant_time = time_kernel(quantized_matmul, large_A, large_B, 8)

-    print(f"✅ Performance: Baseline={baseline_time:.1f}μs, Quantized={quant_time:.1f}μs")
+    print(f"PASS Performance: Baseline={baseline_time:.1f}μs, Quantized={quant_time:.1f}μs")

 # Run the test
 test_unit_quantization_kernels()
@@ -1366,7 +1366,7 @@ At this level, you design comprehensive analyses from scratch - no scaffolding p

 # %% [markdown]
 """
-### 🎯 ADVANCED ANALYSIS CHALLENGE: Comprehensive Kernel Optimization Analysis
+### TARGET ADVANCED ANALYSIS CHALLENGE: Comprehensive Kernel Optimization Analysis

 **CHALLENGE**: Design and implement a complete kernel optimization analysis system that:

@@ -1558,7 +1558,7 @@ class KernelOptimizationAnalyzer:
             cache_analysis['recommendations'].append("Memory bandwidth limited - consider cache blocking")

         if max(data_sizes)**2 * 4 > l3_size:
-            cache_analysis['recommendations'].append(f"Large matrices exceed L3 cache - use block size ≤ {cache_analysis['optimal_block_sizes']['L2']}")
+            cache_analysis['recommendations'].append(f"Large matrices exceed L3 cache - use block size <= {cache_analysis['optimal_block_sizes']['L2']}")

         self.analysis_results['cache_efficiency'] = cache_analysis
         return cache_analysis
@@ -2177,16 +2177,16 @@ class KernelOptimizationAnalyzer:
         return roadmap
         ### END SOLUTION

-# ✅ IMPLEMENTATION CHECKPOINT: Advanced optimization analyzer complete
+# PASS IMPLEMENTATION CHECKPOINT: Advanced optimization analyzer complete

-# 🤔 PREDICTION: What will be the most impactful optimization for matrix operations?
+# THINK PREDICTION: What will be the most impactful optimization for matrix operations?
 # Your guess: _______

-# 🔍 SYSTEMS INSIGHT: Comprehensive Kernel Optimization Analysis
+# MAGNIFY SYSTEMS INSIGHT: Comprehensive Kernel Optimization Analysis
 def comprehensive_kernel_analysis():
     """Run complete kernel optimization analysis using the advanced analyzer."""
     try:
-        print("🚀 Comprehensive Kernel Optimization Analysis")
+        print("ROCKET Comprehensive Kernel Optimization Analysis")
         print("=" * 60)

         # Initialize analyzer
@@ -2206,7 +2206,7 @@ def comprehensive_kernel_analysis():
         print(f" Recommendations: {'; '.join(cache_results['recommendations'])}")

         # 2. Vectorization potential analysis
-        print("\n🚀 Vectorization Potential Analysis:")
+        print("\nROCKET Vectorization Potential Analysis:")
         vec_results = analyzer.analyze_vectorization_potential(
             ['matmul', 'relu', 'add', 'multiply'],
             [(1000,), (1000, 1000)]
@@ -2261,7 +2261,7 @@ def comprehensive_kernel_analysis():
         for rec in roadmap['recommendations'][:3]:
             print(f" • {rec}")

-        # 💡 WHY THIS MATTERS: Comprehensive analysis guides optimization decisions:
+        # TIP WHY THIS MATTERS: Comprehensive analysis guides optimization decisions:
         # 1. Cache analysis reveals memory bottlenecks and optimal algorithms
         # 2. Vectorization analysis shows where SIMD can provide biggest gains
         # 3. Parallel analysis identifies when threading helps vs hurts
@@ -2277,7 +2277,7 @@ def comprehensive_kernel_analysis():
         }

     except Exception as e:
-        print(f"⚠️ Comprehensive analysis error: {e}")
+        print(f"WARNING️ Comprehensive analysis error: {e}")
         return None

 # Run the comprehensive analysis
@@ -2285,21 +2285,21 @@ comprehensive_analysis = comprehensive_kernel_analysis()

 # %% [markdown]
 """
-### 🧪 Unit Test: Advanced Optimization Analyzer
+### TEST Unit Test: Advanced Optimization Analyzer
 This test validates the comprehensive kernel optimization analyzer
 """

 # %%
 def test_unit_advanced_optimization_analyzer():
     """Test the advanced kernel optimization analyzer."""
-    print("🧪 Unit Test: Advanced Optimization Analyzer")
+    print("TEST Unit Test: Advanced Optimization Analyzer")

     # Test 1: Analyzer initialization
     analyzer = KernelOptimizationAnalyzer()

     assert hasattr(analyzer, 'hardware_config'), "Analyzer should have hardware config"
     assert analyzer.hardware_config['cpu_cores'] > 0, "Should detect CPU cores"
-    print("✅ Initialization: Hardware configuration detected")
+    print("PASS Initialization: Hardware configuration detected")

     # Test 2: Cache efficiency analysis
     cache_results = analyzer.analyze_cache_efficiency(matmul_baseline, [64, 128])
@@ -2307,28 +2307,28 @@ def test_unit_advanced_optimization_analyzer():
     assert 'cache_efficiency' in cache_results, "Should return cache efficiency results"
     assert 'bandwidth_utilization' in cache_results, "Should analyze bandwidth utilization"
     assert 'recommendations' in cache_results, "Should provide recommendations"
-    print("✅ Cache analysis: Complete analysis with recommendations")
+    print("PASS Cache analysis: Complete analysis with recommendations")

     # Test 3: Vectorization potential analysis
     vec_results = analyzer.analyze_vectorization_potential(['relu', 'add'])

     assert 'simd_opportunities' in vec_results, "Should identify SIMD opportunities"
     assert 'speedup_estimates' in vec_results, "Should estimate speedup potential"
-    print("✅ Vectorization analysis: SIMD opportunities identified")
+    print("PASS Vectorization analysis: SIMD opportunities identified")

     # Test 4: Parallel scaling analysis
     parallel_results = analyzer.analyze_parallel_scaling(parallel_relu, [1, 2, 4])

     assert 'scaling_results' in parallel_results, "Should provide scaling results"
     assert 'efficiency_analysis' in parallel_results, "Should analyze efficiency"
-    print("✅ Parallel analysis: Scaling efficiency measured")
+    print("PASS Parallel analysis: Scaling efficiency measured")

     # Test 5: Quantization analysis
     quant_results = analyzer.analyze_quantization_trade_offs([vectorized_relu])

     assert 'deployment_recommendations' in quant_results, "Should provide deployment recommendations"
     assert 'accuracy_analysis' in quant_results, "Should analyze accuracy impact"
-    print("✅ Quantization analysis: Trade-offs evaluated")
+    print("PASS Quantization analysis: Trade-offs evaluated")

     # Test 6: Optimization roadmap
     roadmap = analyzer.generate_optimization_roadmap('cloud')
@@ -2337,11 +2337,11 @@ def test_unit_advanced_optimization_analyzer():
     assert 'implementation_plan' in roadmap, "Should provide implementation plan"
     assert 'expected_outcomes' in roadmap, "Should estimate outcomes"
     assert 'recommendations' in roadmap, "Should give actionable recommendations"
-    print("✅ Roadmap generation: Comprehensive optimization plan created")
+    print("PASS Roadmap generation: Comprehensive optimization plan created")

     # Test 7: Integration across analyses
     assert len(analyzer.analysis_results) >= 4, "Should store all analysis results"
-    print("✅ Integration: All analyses stored and accessible")
+    print("PASS Integration: All analyses stored and accessible")

 # Run the test
 test_unit_advanced_optimization_analyzer()
@@ -2356,7 +2356,7 @@ test_unit_advanced_optimization_analyzer()
 # %%
 def test_unit_all():
     """Run comprehensive kernel module validation."""
-    print("🧪 Running all kernel unit tests...")
+    print("TEST Running all kernel unit tests...")

     # Core infrastructure tests
     test_unit_timing_infrastructure()
@@ -2382,7 +2382,7 @@ def test_unit_all():
     test_unit_advanced_optimization_analyzer()
     print()

-    print("✅ All kernel unit tests passed! High-performance kernels ready for deployment.")
+    print("PASS All kernel unit tests passed! High-performance kernels ready for deployment.")

 # %% [markdown]
 """
@@ -2449,7 +2449,7 @@ if __name__ == "__main__":

 # %% [markdown]
 """
-## 🤔 ML Systems Thinking: Interactive Questions
+## THINK ML Systems Thinking: Interactive Questions

 Now that you've implemented high-performance computational kernels, let's explore the systems implications through hands-on analysis.
 """
@@ -2502,16 +2502,16 @@ Now that you've implemented high-performance computational kernels, let's explor

 # %% [markdown]
 """
-## 🎯 MODULE SUMMARY: Kernels
+## TARGET MODULE SUMMARY: Kernels

 Congratulations! You've successfully implemented high-performance computational kernels that power modern ML systems!

 ### What You've Accomplished
-✅ **High-Performance Implementation**: 200+ lines of optimized kernel code with cache blocking, vectorization, and parallelization
-✅ **Advanced Optimization Analysis**: Comprehensive `KernelOptimizationAnalyzer` with multi-dimensional performance evaluation
-✅ **Production-Ready Kernels**: Matrix multiplication, activation functions, and quantization kernels optimized for real-world deployment
-✅ **Systems Integration**: Complete optimization pipeline from profiling through deployment recommendations
-✅ **Performance Engineering**: Deep understanding of cache hierarchy, SIMD vectorization, and parallel processing trade-offs
+PASS **High-Performance Implementation**: 200+ lines of optimized kernel code with cache blocking, vectorization, and parallelization
+PASS **Advanced Optimization Analysis**: Comprehensive `KernelOptimizationAnalyzer` with multi-dimensional performance evaluation
+PASS **Production-Ready Kernels**: Matrix multiplication, activation functions, and quantization kernels optimized for real-world deployment
+PASS **Systems Integration**: Complete optimization pipeline from profiling through deployment recommendations
+PASS **Performance Engineering**: Deep understanding of cache hierarchy, SIMD vectorization, and parallel processing trade-offs

 ### Key Learning Outcomes
 - **Cache Optimization**: Implementing cache-friendly algorithms that minimize memory access latency
diff --git a/progress.json b/progress.json
index c55c4699..699b749d 100644
--- a/progress.json
+++ b/progress.json
@@ -6,8 +6,8 @@
     "06",
     "08"
   ],
-  "last_completed": "08",
-  "last_updated": "2025-09-28T08:07:12.088651",
+  "last_completed": "01",
+  "last_updated": "2025-09-28T08:14:37.700346",
   "started_modules": [
     "01",
     "04"