# 🔧 TinyTorch Milestone Troubleshooting Guide

## Common Issues and Solutions

This guide helps you overcome the most frequent challenges students encounter while pursuing TinyTorch milestones. Each section provides symptoms, diagnoses, and concrete solutions.
## 🎯 Milestone 1: Basic Inference

### Issue: "My neural network outputs don't make sense"

**Symptoms:**
- Network outputs NaN or inf values
- All predictions are the same number
- Accuracy stuck at random chance (10% for MNIST)
- Gradients exploding or vanishing

**Diagnosis & Solutions:**
**Weight Initialization Problems**

```python
# ❌ WRONG: weights too large, activations saturate or blow up
self.weight = Tensor(np.random.randn(input_size, output_size))

# ✅ CORRECT: Xavier initialization keeps activation variance stable
scale = np.sqrt(2.0 / (input_size + output_size))
self.weight = Tensor(np.random.randn(input_size, output_size) * scale)
```
**Shape Mismatch Issues**

```python
# Debug shapes at each step
print(f"Input shape: {x.shape}")
output = self.dense1(x)
print(f"After dense1: {output.shape}")
output = self.activation(output)
print(f"After activation: {output.shape}")
```
**Learning Rate Problems**

```python
# ❌ TOO HIGH: learning rate 1.0 causes instability
optimizer = SGD(model.parameters(), learning_rate=1.0)

# ✅ GOOD: start with a smaller learning rate
optimizer = SGD(model.parameters(), learning_rate=0.01)
```
### Issue: "MNIST accuracy stuck below 85%"

**Symptoms:**
- Network trains but plateaus at 60-70% accuracy
- Loss decreases but accuracy doesn't improve
- Similar performance on training and test sets

**Diagnosis & Solutions:**
**Insufficient Network Capacity**

```python
# ❌ TOO SIMPLE: not enough parameters
model = Sequential([
    Dense(784, 10),  # Only 7,850 parameters
    Softmax()
])

# ✅ BETTER: more capacity for complex patterns
model = Sequential([
    Dense(784, 128), ReLU(),  # Hidden layer for feature learning
    Dense(128, 64), ReLU(),   # Additional feature refinement
    Dense(64, 10), Softmax()  # Final classification
])
```
**Activation Function Issues**

```python
# ❌ WRONG: no activation between layers
model = Sequential([
    Dense(784, 128),
    Dense(128, 10),  # Linear combinations of linear functions = still linear
    Softmax()
])

# ✅ CORRECT: nonlinearity enables complex patterns
model = Sequential([
    Dense(784, 128), ReLU(),  # Nonlinearity crucial!
    Dense(128, 10), Softmax()
])
```
## 👁️ Milestone 2: Computer Vision

### Issue: "Convolution implementation is too slow"

**Symptoms:**
- Conv2D forward pass takes >10 seconds for small images
- Memory usage explodes during convolution
- System becomes unresponsive during training

**Diagnosis & Solutions:**
**Inefficient Convolution Loops**

```python
# ❌ SLOW: five nested Python loops
for batch in range(batch_size):
    for out_ch in range(out_channels):
        for in_ch in range(in_channels):
            for h in range(output_height):
                for w in range(output_width):
                    # Convolution computation
                    result[batch, out_ch, h, w] += ...

# ✅ FASTER: vectorized via im2col (convolution as matrix multiplication)
def im2col_convolution(input_tensor, weight, bias=None):
    batch_size, out_channels = input_tensor.shape[0], weight.shape[0]
    # Unfold input patches into columns, then do one big matmul
    input_cols = im2col(input_tensor, weight.shape[2:])
    output = input_cols @ weight.reshape(weight.shape[0], -1).T
    # (bias addition and output_height/output_width bookkeeping omitted in this sketch)
    return output.reshape(batch_size, out_channels, output_height, output_width)
```
**Memory Inefficiency**

```python
# ❌ MEMORY HOG: creating intermediate tensor copies in loops
for i in range(kernel_height):
    for j in range(kernel_width):
        temp_tensor = input[:, :, i:i+output_height, j:j+output_width]
        result += temp_tensor * kernel[:, :, i, j]

# ✅ MEMORY EFFICIENT: accumulate into a preallocated output using views
output = Tensor(np.zeros((batch_size, out_channels, output_height, output_width)))
for i in range(kernel_height):
    for j in range(kernel_width):
        # Use views instead of copies
        input_slice = input[:, :, i:i+output_height, j:j+output_width]
        output += input_slice * kernel[:, :, i, j]
```
### Issue: "CNN accuracy worse than dense network"

**Symptoms:**
- Dense network achieves 90%+ on MNIST
- CNN with the same parameter count gets 70-80%
- CNN training loss decreases more slowly than the dense network's

**Diagnosis & Solutions:**
**Poor CNN Architecture**

```python
# ❌ BAD: CNN worse than dense
model = Sequential([
    Conv2D(1, 32, kernel_size=7),  # Kernel too large
    ReLU(),
    Flatten(),
    Dense(32 * 22 * 22, 10)  # Huge dense layer
])

# ✅ GOOD: proper CNN design
model = Sequential([
    Conv2D(1, 16, kernel_size=3), ReLU(),  # Small kernels
    MaxPool2D(kernel_size=2),              # Reduce spatial size
    Conv2D(16, 32, kernel_size=3), ReLU(),
    MaxPool2D(kernel_size=2),
    Flatten(),
    Dense(32 * 5 * 5, 128), ReLU(),  # Reasonable dense size
    Dense(128, 10)
])
```
**Padding and Stride Issues**

```python
# ❌ WRONG: losing too much spatial information
conv = Conv2D(1, 16, kernel_size=5, stride=2, padding=0)  # Aggressive downsampling

# ✅ CORRECT: preserve spatial information
conv = Conv2D(1, 16, kernel_size=3, stride=1, padding=1)  # Same-size output
pool = MaxPool2D(kernel_size=2)  # Controlled downsampling
```
## ⚙️ Milestone 3: Full Training

### Issue: "Training loss not decreasing"

**Symptoms:**
- Loss remains constant across epochs
- Gradients are all zeros or very small
- Model predictions don't change during training

**Diagnosis & Solutions:**
**Learning Rate Too Small**

```python
# ❌ TOO SMALL: no visible progress
optimizer = Adam(model.parameters(), learning_rate=1e-6)

# ✅ GOOD RANGE: start here and adjust
optimizer = Adam(model.parameters(), learning_rate=1e-3)

# Monitor gradient norms to debug
def check_gradients(model):
    total_norm = 0.0
    for param in model.parameters():
        if param.grad is not None:
            total_norm += float(np.linalg.norm(param.grad.data)) ** 2
    return total_norm ** 0.5

print(f"Gradient norm: {check_gradients(model)}")
```
**Incorrect Loss Function Implementation**

```python
# ❌ WRONG: cross-entropy applied to raw logits without log-softmax
def cross_entropy_loss(predictions, targets):
    return -np.mean(predictions[range(len(targets)), targets])

# ✅ CORRECT: log-softmax + negative log-likelihood
def cross_entropy_loss(logits, targets):
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[range(len(targets)), targets])

def log_softmax(x):
    # Shift by the row max and stay in log space for numerical stability
    x_shifted = x - np.max(x, axis=1, keepdims=True)
    return x_shifted - np.log(np.sum(np.exp(x_shifted), axis=1, keepdims=True))
```
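A quick way to validate a log-softmax implementation: exponentiating each output row should give probabilities that sum to 1. This check is self-contained numpy, independent of TinyTorch:

```python
import numpy as np

def log_softmax(x):
    # Stable log-softmax: shift by the row max, stay in log space
    x_shifted = x - np.max(x, axis=1, keepdims=True)
    return x_shifted - np.log(np.sum(np.exp(x_shifted), axis=1, keepdims=True))

logits = np.array([[1.0, 2.0, 3.0],
                   [10.0, 10.0, 10.0]])
log_probs = log_softmax(logits)
row_sums = np.exp(log_probs).sum(axis=1)
print(row_sums)  # each ≈ 1.0
```

The second row also exercises the stability shift: without subtracting the max, `np.exp(10.0)` is large but harmless here, yet the same code survives logits in the hundreds where a naive softmax would overflow.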
### Issue: "CIFAR-10 training diverges or gets stuck"

**Symptoms:**
- Loss starts decreasing, then shoots up to infinity
- Accuracy drops during training
- NaN values appear in the loss or gradients

**Diagnosis & Solutions:**
**Data Preprocessing Issues**

```python
# ❌ WRONG: using raw pixel values 0-255
train_data = cifar10_data  # Values in [0, 255]

# ✅ CORRECT: normalize to a reasonable range
train_data = cifar10_data.astype(np.float32) / 255.0  # Values in [0, 1]

# Even better: zero-center and scale per channel
# (these are the commonly reused ImageNet statistics)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
train_data = (train_data - mean) / std
```
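The hard-coded mean/std values above are the widely reused ImageNet statistics. If you prefer dataset-specific values, you can compute per-channel statistics from your own training set. A sketch, assuming `train_data` is an `(N, H, W, 3)` float array in `[0, 1]` (a random array stands in for CIFAR-10 here):

```python
import numpy as np

def channel_stats(images):
    # Average over every axis except the trailing channel axis
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return mean, std

# Toy stand-in for CIFAR-10 training images
train_data = np.random.rand(100, 32, 32, 3).astype(np.float32)
mean, std = channel_stats(train_data)
normalized = (train_data - mean) / std
print(normalized.mean(axis=(0, 1, 2)))  # ≈ [0, 0, 0] per channel
```

After this transform each channel has approximately zero mean and unit variance, which is what keeps early training numerically well-behaved.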
**Batch Size Too Large**

```python
# ❌ PROBLEMATIC: batch size too large for stable training
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

# ✅ BETTER: moderate batch size for stability
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```
**Learning Rate Scheduling**

```python
# ❌ BASIC: fixed learning rate throughout training
optimizer = Adam(model.parameters(), learning_rate=0.001)

# ✅ ADVANCED: learning rate decay for convergence
# (assumes your optimizer exposes PyTorch-style param_groups)
def adjust_learning_rate(optimizer, epoch, initial_lr=0.001):
    lr = initial_lr * (0.9 ** (epoch // 10))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    return lr
```
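The step-decay schedule above is easy to check by hand: with `initial_lr=0.001`, the rate drops by 10% every 10 epochs. A pure-Python sketch of the same formula (the `scheduled_lr` helper is just for illustration):

```python
def scheduled_lr(epoch, initial_lr=0.001, decay=0.9, step=10):
    # Same step decay as adjust_learning_rate above
    return initial_lr * (decay ** (epoch // step))

for epoch in (0, 10, 20, 30):
    print(epoch, scheduled_lr(epoch))
# ≈ 0.001, 0.0009, 0.00081, 0.000729
```

Plotting or printing the schedule like this before training is a cheap way to catch off-by-one mistakes in the `epoch // step` logic.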
## 🚀 Milestone 4: Advanced Vision

### Issue: "Can't reach 75% CIFAR-10 accuracy"

**Symptoms:**
- Model plateaus at 65-70% accuracy
- Large gap between training and validation accuracy
- Loss continues decreasing but accuracy doesn't improve

**Diagnosis & Solutions:**
**Insufficient Model Complexity**

```python
# ❌ TOO SIMPLE: not enough capacity for CIFAR-10
model = Sequential([
    Conv2D(3, 16, 3), ReLU(),
    MaxPool2D(2),
    Flatten(),
    Dense(16 * 15 * 15, 10)  # 32 -> 30 after conv, -> 15 after pooling (no padding)
])

# ✅ BETTER: deeper architecture with more features
model = Sequential([
    Conv2D(3, 32, 3), ReLU(),
    Conv2D(32, 32, 3), ReLU(),
    MaxPool2D(2),
    Conv2D(32, 64, 3), ReLU(),
    Conv2D(64, 64, 3), ReLU(),
    MaxPool2D(2),
    Flatten(),
    Dense(64 * 5 * 5, 256), ReLU(),  # 32 -> 30 -> 28 -> 14 -> 12 -> 10 -> 5
    Dropout(0.5),
    Dense(256, 10)
])
```
**Overfitting Problems**

```python
# Add regularization: batch norm plus dropout at increasing rates
model = Sequential([
    Conv2D(3, 32, 3), BatchNorm2D(32), ReLU(),
    Conv2D(32, 32, 3), BatchNorm2D(32), ReLU(),
    MaxPool2D(2), Dropout(0.2),
    Conv2D(32, 64, 3), BatchNorm2D(64), ReLU(),
    Conv2D(64, 64, 3), BatchNorm2D(64), ReLU(),
    MaxPool2D(2), Dropout(0.3),
    Flatten(),
    Dense(64 * 5 * 5, 256), BatchNorm1D(256), ReLU(),
    Dropout(0.5),
    Dense(256, 10)
])
```
**Data Augmentation Missing**

```python
# ✅ ADD: data augmentation for better generalization
def augment_cifar10(image):
    # Random horizontal flip
    if np.random.random() > 0.5:
        image = np.fliplr(image)
    # Random crop after padding (pad 4 pixels, crop back to 32x32)
    pad_width = 4
    padded = np.pad(image, ((pad_width, pad_width), (pad_width, pad_width), (0, 0)),
                    mode='constant')
    crop_x = np.random.randint(0, 2 * pad_width + 1)
    crop_y = np.random.randint(0, 2 * pad_width + 1)
    image = padded[crop_y:crop_y+32, crop_x:crop_x+32]
    return image

class AugmentedCIFAR10Dataset(CIFAR10Dataset):
    def __getitem__(self, idx):
        image, label = super().__getitem__(idx)
        if self.train:
            image = augment_cifar10(image)
        return image, label
```
### Issue: "Model training takes too long"

**Symptooms:**
- A single epoch takes >10 minutes
- GPU utilization is low, or no GPU is being used
- Memory usage grows continuously

**Diagnosis & Solutions:**
**Inefficient Convolution Implementation**

```python
# Profile your convolution
import time

def time_convolution():
    input_tensor = Tensor(np.random.randn(32, 3, 32, 32))
    conv = Conv2D(3, 64, kernel_size=3)
    start_time = time.time()
    for _ in range(100):
        output = conv(input_tensor)
    end_time = time.time()
    print(f"100 convolutions took {end_time - start_time:.2f} seconds")
    print(f"Average time per convolution: {(end_time - start_time)/100:.4f} seconds")

time_convolution()
```
**Memory Leaks in Training Loop**

```python
# ❌ MEMORY LEAK: gradients accumulate across iterations
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        # Missing: optimizer.zero_grad()

# ✅ CORRECT: clear gradients each iteration
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()  # Clear previous gradients
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
```
## 🔥 Milestone 5: Language Generation

### Issue: "GPT generates nonsense text"

**Symptoms:**
- Generated text is random characters
- Model outputs the same character repeatedly
- Text has no recognizable patterns or structure

**Diagnosis & Solutions:**
**Tokenization Problems**

```python
# ❌ WRONG: set ordering is not deterministic across runs
def tokenize(text):
    chars = list(set(text))  # Order changes each run!
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    return [char_to_idx[ch] for ch in text]

# ✅ CORRECT: consistent character vocabulary
class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(set(text))  # Consistent ordering
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]

    def decode(self, indices):
        return ''.join(self.idx_to_char[i] for i in indices)
```
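A round-trip test is the easiest way to confirm a tokenizer is consistent: encode then decode should return the original string. A self-contained sketch of the `CharTokenizer` above (the `corpus` string is an arbitrary example):

```python
class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(set(text))  # Deterministic ordering
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]

    def decode(self, indices):
        return ''.join(self.idx_to_char[i] for i in indices)

corpus = "hello tinytorch"
tok = CharTokenizer(corpus)
ids = tok.encode("torch")
print(ids)
print(tok.decode(ids) == "torch")  # True
```

If this round trip ever fails, or produces different `ids` on a fresh run, the model is being trained against a moving vocabulary and its output will decode to garbage.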
**Sequence Length Issues**

```python
# ❌ TOO LONG: sequence length too large for the available data
sequence_length = 1000  # Only have 10,000 chars total

# ✅ REASONABLE: sequence length appropriate for the dataset
sequence_length = min(100, len(text) // 100)  # At least ~100 sequences
```
**Position Encoding Missing**

```python
# ❌ MISSING: no positional information (attention alone is permutation-invariant)
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)

    def forward(self, x):
        x = x + self.attention(x)  # No position info!
        x = x + self.mlp(x)
        return x

# ✅ CORRECT: add positional encoding
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, max_seq_len):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)
        self.pos_encoding = PositionalEncoding(embed_dim, max_seq_len)

    def forward(self, x):
        x = x + self.pos_encoding(x)  # Add position information
        x = x + self.attention(x)
        x = x + self.mlp(x)
        return x
```
### Issue: "Can't reuse components from vision modules"

**Symptoms:**
- Having to reimplement Dense layers, ReLU, etc.
- Components don't work with sequence data
- Different interfaces for vision vs. language components

**Diagnosis & Solutions:**
**Shape Incompatibility**

```python
# ❌ PROBLEM: Dense expects 2D input (batch_size, features),
# but sequences are 3D (batch_size, sequence_length, embed_dim)

# ✅ SOLUTION: reshape for compatibility
class SequenceDense(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.dense = Dense(input_dim, output_dim)  # Reuse vision component!

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        batch_size, seq_len, input_dim = x.shape
        # Reshape to 2D for the dense layer
        x_flat = x.reshape(batch_size * seq_len, input_dim)
        output_flat = self.dense(x_flat)
        # Reshape back to sequence format
        output_dim = output_flat.shape[-1]
        return output_flat.reshape(batch_size, seq_len, output_dim)
```
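The reshape trick above is easy to verify with plain numpy and a stand-in linear layer; the shapes here (batch 4, sequence 10, 16 -> 32 features) are arbitrary illustrative choices:

```python
import numpy as np

def linear(x, W, b):
    # Stand-in for Dense: (rows, in_dim) @ (in_dim, out_dim) + bias
    return x @ W + b

batch, seq_len, in_dim, out_dim = 4, 10, 16, 32
x = np.random.randn(batch, seq_len, in_dim)
W = np.random.randn(in_dim, out_dim) * 0.1
b = np.zeros(out_dim)

# Flatten (batch, seq, in) -> (batch*seq, in), apply the layer, restore the shape
x_flat = x.reshape(batch * seq_len, in_dim)
out = linear(x_flat, W, b).reshape(batch, seq_len, out_dim)
print(out.shape)  # (4, 10, 32)
```

Because the layer is applied row-by-row, flattening the batch and sequence axes together changes nothing about the math; it only changes the bookkeeping.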
**Different Data Types**

```python
# ❌ ISSUE: vision uses float32 tensors, language uses int64 token indices
# Vision: image_tensor = Tensor(np.float32([...]))
# Language: token_indices = [1, 5, 12, ...]

# ✅ SOLUTION: an embedding layer converts indices to vectors
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = Tensor(np.random.randn(vocab_size, embed_dim) * 0.1)

    def forward(self, token_indices):
        # Convert integer indices to float embeddings
        return self.embedding[token_indices]  # Now compatible with Dense layers!
```
## 🛠️ General Debugging Strategies

### Debugging Checklist

**Before Every Milestone Attempt:**
- Environment activated: `source .venv/bin/activate`
- Dependencies updated: `pip install -r requirements.txt`
- Previous modules working: `tito test --all-previous`
- Clean workspace: `git status` shows a clean state

**During Implementation:**
- Print shapes at every step
- Test with small data first (batch_size=1, small input)
- Use debugger breakpoints at critical functions
- Save intermediate results for inspection

**Before Milestone Submission:**
- Code runs without errors
- Performance benchmarks met
- All tests pass: `tito milestone test X`
- Code exported successfully: `tito export --module X`
### Performance Debugging

**Memory Usage:**

```python
import tracemalloc

def debug_memory_usage():
    tracemalloc.start()
    # Your code here
    model = build_model()
    train_one_epoch(model)
    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
    print(f"Peak memory usage: {peak / 1024 / 1024:.1f} MB")
    tracemalloc.stop()
```
**Training Speed:**

```python
import time

def benchmark_training_speed():
    model = build_model()
    dummy_data = create_dummy_batch()
    # Warm up
    for _ in range(5):
        _ = model(dummy_data)
    # Benchmark
    start_time = time.time()
    for _ in range(100):
        output = model(dummy_data)
    end_time = time.time()
    avg_time = (end_time - start_time) / 100
    print(f"Average forward pass time: {avg_time*1000:.2f} ms")
```
### Getting Help

**Documentation Resources:**
- Module READMEs: `modules/source/XX_module/README.md`
- API Reference: `book/appendices/api-reference.md`
- Troubleshooting: this guide!

**Community Support:**
- Discord/Slack: #tinytorch-help channel
- Office Hours: see the course calendar
- Study Groups: form one with classmates working on the same milestone

**Instructor Support:**
- Email for conceptual questions
- Office hours for debugging sessions
- Milestone review meetings for stuck students
### When to Ask for Help

**Ask for help if:**
- You've been stuck on the same issue for >2 hours
- Performance is far below milestone requirements
- You're unclear about the milestone requirements
- You suspect a bug in the provided code

**Before asking, prepare:**
- A minimal code example reproducing the issue
- Error messages and stack traces
- What you've already tried
- A specific question, not just "it doesn't work"
## 🎯 Success Strategies

### Milestone Achievement Tips

**Start Early:**
- Begin milestone attempts as soon as you complete the prerequisites
- Don't wait until the deadline to discover issues
- Use intermediate checkpoints to track progress

**Incremental Development:**
- Get a basic version working first
- Optimize performance second
- Add advanced features last

**Test-Driven Development:**
- Write tests for your functions before implementation
- Use the provided test suites as a specification
- Add your own tests for edge cases

**Systematic Debugging:**
- Isolate issues to the smallest possible code section
- Use print statements and the debugger strategically
- Keep a debugging log of what you've tried
### Building Confidence

**Celebrate Small Wins:**
- First successful forward pass
- First decreasing loss curve
- First accuracy improvement

**Learn from Failures:**
- Every bug teaches you something about the system
- Failed milestones often lead to deeper understanding
- Debugging skills are as valuable as implementation skills

**Connect to the Bigger Picture:**
- Each milestone represents a real-world capability
- Your implementations mirror industry practices
- Skills transfer directly to research and industry roles

**Remember the Goal:** You're not just completing assignments: you're building genuine ML systems engineering expertise that will serve you throughout your career. Every challenge overcome makes you a stronger engineer.

🚀 Keep going! Every milestone brings you closer to ML systems mastery.