
🔧 TinyTorch Milestone Troubleshooting Guide

Common Issues and Solutions

This guide covers the most frequent challenges encountered while pursuing TinyTorch milestones. Each section lists symptoms, a diagnosis, and concrete solutions.


🎯 Milestone 1: Basic Inference

Issue: "My neural network outputs don't make sense"

Symptoms:

  • Network outputs NaN or inf values
  • All predictions are the same number
  • Accuracy stuck at random chance (10% for MNIST)
  • Gradients exploding or vanishing

Diagnosis & Solutions:

Weight Initialization Problems

# ❌ WRONG: Weights too large
self.weight = Tensor(np.random.randn(input_size, output_size))

# ✅ CORRECT: Xavier initialization
scale = np.sqrt(2.0 / (input_size + output_size))
self.weight = Tensor(np.random.randn(input_size, output_size) * scale)

Shape Mismatch Issues

# Debug shapes at each step
print(f"Input shape: {x.shape}")
output = self.dense1(x)
print(f"After dense1: {output.shape}")
output = self.activation(output)
print(f"After activation: {output.shape}")

Learning Rate Problems

# ❌ TOO HIGH: Learning rate 1.0 causes instability
optimizer = SGD(model.parameters(), learning_rate=1.0)

# ✅ GOOD: Start with smaller learning rate
optimizer = SGD(model.parameters(), learning_rate=0.01)

Issue: "MNIST accuracy stuck below 85%"

Symptoms:

  • Network trains but plateaus at 60-70% accuracy
  • Loss decreases but accuracy doesn't improve
  • Similar performance on training and test sets

Diagnosis & Solutions:

Insufficient Network Capacity

# ❌ TOO SIMPLE: Not enough parameters
model = Sequential([
    Dense(784, 10),  # Only 7,850 parameters
    Softmax()
])

# ✅ BETTER: More capacity for complex patterns
model = Sequential([
    Dense(784, 128), ReLU(),  # Hidden layer for feature learning
    Dense(128, 64), ReLU(),   # Additional feature refinement
    Dense(64, 10), Softmax()  # Final classification
])

Activation Function Issues

# ❌ WRONG: No activation between layers
model = Sequential([
    Dense(784, 128),
    Dense(128, 10),  # Linear combinations of linear functions = linear
    Softmax()
])

# ✅ CORRECT: Nonlinearity enables complex patterns
model = Sequential([
    Dense(784, 128), ReLU(),  # Nonlinearity crucial!
    Dense(128, 10), Softmax()
])

👁️ Milestone 2: Computer Vision

Issue: "Convolution implementation is too slow"

Symptoms:

  • Conv2D forward pass takes >10 seconds for small images
  • Memory usage explodes during convolution
  • System becomes unresponsive during training

Diagnosis & Solutions:

Inefficient Convolution Loops

# ❌ SLOW: Nested Python loops
for batch in range(batch_size):
    for out_ch in range(out_channels):
        for in_ch in range(in_channels):
            for h in range(output_height):
                for w in range(output_width):
                    # Convolution computation
                    result[batch, out_ch, h, w] += ...

# ✅ FASTER: Vectorized operations using im2col
def im2col_convolution(input_tensor, weight, bias=None):
    batch_size, _, in_h, in_w = input_tensor.shape
    out_channels, _, k_h, k_w = weight.shape
    output_height, output_width = in_h - k_h + 1, in_w - k_w + 1
    # im2col unrolls each receptive field into one row: (batch * out_h * out_w, in_ch * k_h * k_w)
    input_cols = im2col(input_tensor, (k_h, k_w))
    output = input_cols @ weight.reshape(out_channels, -1).T
    if bias is not None:
        output = output + bias
    # Rows are ordered (batch, out_h, out_w), so reshape and then move channels to dim 1
    return output.reshape(batch_size, output_height, output_width, out_channels).transpose(0, 3, 1, 2)
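
Here im2col is assumed; a minimal sketch for stride 1 and no padding (operating on NumPy arrays in NCHW layout) might look like this:

def im2col(x, kernel_shape):
    k_h, k_w = kernel_shape
    batch, in_ch, in_h, in_w = x.shape
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    cols = np.zeros((batch, out_h, out_w, in_ch * k_h * k_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Flatten each (in_ch, k_h, k_w) receptive field into a single row
            cols[:, i, j, :] = x[:, :, i:i+k_h, j:j+k_w].reshape(batch, -1)
    return cols.reshape(batch * out_h * out_w, -1)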

Memory Inefficiency

# ❌ MEMORY HOG: Creating intermediate tensors in loops
for i in range(kernel_height):
    for j in range(kernel_width):
        temp_tensor = input[:, :, i:i+output_height, j:j+output_width]
        result += temp_tensor * kernel[:, :, i, j]

# ✅ MEMORY EFFICIENT: Preallocate the output once and accumulate into it
output = Tensor(np.zeros((batch_size, out_channels, output_height, output_width)))
for i in range(kernel_height):
    for j in range(kernel_width):
        # Slicing yields a view, not a copy
        input_slice = input[:, :, i:i+output_height, j:j+output_width]
        # kernel[:, :, i, j] is (out_ch, in_ch): sum over input channels per output channel
        # (assumes Tensors expose their NumPy buffer via .data)
        output.data += np.einsum('bihw,oi->bohw', input_slice.data, kernel.data[:, :, i, j])

Issue: "CNN accuracy worse than dense network"

Symptoms:

  • Dense network achieves 90%+ on MNIST
  • CNN with same parameters gets 70-80%
  • CNN training loss decreases slower than dense

Diagnosis & Solutions:

Poor CNN Architecture

# ❌ BAD: CNN worse than dense
model = Sequential([
    Conv2D(1, 32, kernel_size=7),  # Too large kernel
    ReLU(),
    Flatten(),
    Dense(32 * 22 * 22, 10)  # Huge dense layer
])

# ✅ GOOD: Proper CNN design
model = Sequential([
    Conv2D(1, 16, kernel_size=3), ReLU(),  # Small kernels
    MaxPool2D(kernel_size=2),               # Reduce spatial size
    Conv2D(16, 32, kernel_size=3), ReLU(),
    MaxPool2D(kernel_size=2),
    Flatten(),
    Dense(32 * 5 * 5, 128), ReLU(),        # Reasonable dense size
    Dense(128, 10)
])
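
Rather than computing the Flatten size by hand, you can measure it: run a dummy batch through just the convolutional stack and read off the shape. A sketch, reusing the layers above:

features = Sequential([
    Conv2D(1, 16, kernel_size=3), ReLU(), MaxPool2D(kernel_size=2),
    Conv2D(16, 32, kernel_size=3), ReLU(), MaxPool2D(kernel_size=2),
    Flatten(),
])
dummy = Tensor(np.zeros((1, 1, 28, 28), dtype=np.float32))
print(features(dummy).shape)  # (1, 800): the Dense input size is 32 * 5 * 5 = 800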

Padding and Stride Issues

# ❌ WRONG: Losing too much spatial information
conv = Conv2D(1, 16, kernel_size=5, stride=2, padding=0)  # Aggressive downsampling

# ✅ CORRECT: Preserve spatial information
conv = Conv2D(1, 16, kernel_size=3, stride=1, padding=1)  # Same size output
pool = MaxPool2D(kernel_size=2)  # Controlled downsampling
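
The general rule for any conv or pool layer's spatial output size is worth keeping at hand when debugging shapes:

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, 3))             # 26: Conv2D(kernel_size=3), no padding
print(conv_output_size(28, 3, padding=1))  # 28: padding=1 preserves size
print(conv_output_size(28, 2, stride=2))   # 14: MaxPool2D(kernel_size=2)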

⚙️ Milestone 3: Full Training

Issue: "Training loss not decreasing"

Symptoms:

  • Loss remains constant across epochs
  • Gradients are all zeros or very small
  • Model predictions don't change during training

Diagnosis & Solutions:

Learning Rate Too Small

# ❌ TOO SMALL: No visible progress
optimizer = Adam(model.parameters(), learning_rate=1e-6)

# ✅ GOOD RANGE: Start here and adjust
optimizer = Adam(model.parameters(), learning_rate=1e-3)

# Monitor gradient norms to debug
def check_gradients(model):
    total_norm = 0.0
    for param in model.parameters():
        if param.grad is not None:
            # Sum of squared entries across all parameter gradients
            total_norm += float(np.sum(param.grad.data ** 2))
    return total_norm ** 0.5

print(f"Gradient norm: {check_gradients(model)}")

Incorrect Loss Function Implementation

# ❌ WRONG: CrossEntropy without log-softmax
def cross_entropy_loss(predictions, targets):
    return -np.mean(predictions[range(len(targets)), targets])

# ✅ CORRECT: Proper log-softmax + NLL
def cross_entropy_loss(logits, targets):
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[range(len(targets)), targets])

def log_softmax(x):
    # Compute log-softmax directly: s - log(sum(exp(s))) avoids overflow and log(0)
    shifted = x - np.max(x, axis=1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
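
A quick sanity check for the loss: with all-equal logits, cross-entropy must equal ln(num_classes), since every class gets probability 1/num_classes:

logits = np.zeros((4, 10))        # 4 samples, 10 classes, uniform predictions
targets = np.array([0, 3, 5, 9])
print(cross_entropy_loss(logits, targets), np.log(10))  # both ≈ 2.3026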

Issue: "CIFAR-10 training diverges or gets stuck"

Symptoms:

  • Loss starts decreasing then shoots up to infinity
  • Accuracy drops during training
  • NaN values appear in loss or gradients

Diagnosis & Solutions:

Data Preprocessing Issues

# ❌ WRONG: Using raw pixel values 0-255
train_data = cifar10_data  # Values in [0, 255]

# ✅ CORRECT: Normalize to reasonable range
train_data = cifar10_data.astype(np.float32) / 255.0  # Values in [0, 1]

# Even better: Zero-center with CIFAR-10's own per-channel statistics
mean = np.array([0.4914, 0.4822, 0.4465])
std = np.array([0.2470, 0.2435, 0.2616])
train_data = (train_data - mean) / std
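
If you'd rather not hard-code the statistics, compute them from the training split. A sketch, assuming train_data is an (N, 32, 32, 3) float array scaled to [0, 1]:

mean = train_data.mean(axis=(0, 1, 2))  # per-channel mean over all pixels
std = train_data.std(axis=(0, 1, 2))    # per-channel standard deviation
train_data = (train_data - mean) / std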

Batch Size Too Large

# ❌ PROBLEMATIC: Batch size too large for dataset
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

# ✅ BETTER: Moderate batch size for stability
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

Learning Rate Scheduling

# ❌ BASIC: Fixed learning rate throughout training
optimizer = Adam(model.parameters(), learning_rate=0.001)

# ✅ ADVANCED: Learning rate decay for convergence
def adjust_learning_rate(optimizer, epoch, initial_lr=0.001):
    # Decay by 10% every 10 epochs (assumes the optimizer exposes a learning_rate attribute)
    lr = initial_lr * (0.9 ** (epoch // 10))
    optimizer.learning_rate = lr
    return lr
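
Call it once per epoch, before the inner batch loop:

for epoch in range(epochs):
    lr = adjust_learning_rate(optimizer, epoch)
    print(f"Epoch {epoch}: learning rate = {lr:.5f}")
    # ... run the training batches for this epoch ...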

🚀 Milestone 4: Advanced Vision

Issue: "Can't reach 75% CIFAR-10 accuracy"

Symptoms:

  • Model plateaus at 65-70% accuracy
  • Training and validation accuracy gap is large
  • Loss continues decreasing but accuracy doesn't improve

Diagnosis & Solutions:

Insufficient Model Complexity

# ❌ TOO SIMPLE: Not enough capacity for CIFAR-10
model = Sequential([
    Conv2D(3, 16, 3), ReLU(),
    MaxPool2D(2),
    Flatten(),
    Dense(16 * 15 * 15, 10)  # 30x30 conv output pools to 15x15
])

# ✅ BETTER: Deeper architecture with more features
model = Sequential([
    Conv2D(3, 32, 3), ReLU(),
    Conv2D(32, 32, 3), ReLU(),
    MaxPool2D(2),
    Conv2D(32, 64, 3), ReLU(),
    Conv2D(64, 64, 3), ReLU(),
    MaxPool2D(2),
    Flatten(),
    Dense(64 * 5 * 5, 256), ReLU(),        # unpadded convs: 32->30->28->14->12->10->5
    Dropout(0.5),
    Dense(256, 10)
])
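
When comparing architectures, it helps to check capacity directly. A rough parameter counter (a sketch, assuming each parameter exposes its NumPy array as .data):

def count_parameters(model):
    return sum(int(np.prod(p.data.shape)) for p in model.parameters())

print(f"{count_parameters(model):,} parameters")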

Overfitting Problems

# Add regularization techniques
model = Sequential([
    Conv2D(3, 32, 3), BatchNorm2D(32), ReLU(),
    Conv2D(32, 32, 3), BatchNorm2D(32), ReLU(),
    MaxPool2D(2), Dropout(0.2),
    
    Conv2D(32, 64, 3), BatchNorm2D(64), ReLU(),
    Conv2D(64, 64, 3), BatchNorm2D(64), ReLU(),
    MaxPool2D(2), Dropout(0.3),
    
    Flatten(),
    Dense(64 * 5 * 5, 256), BatchNorm1D(256), ReLU(),
    Dropout(0.5),
    Dense(256, 10)
])

Data Augmentation Missing

# ✅ ADD: Data augmentation for better generalization
def augment_cifar10(image):
    # Random horizontal flip
    if np.random.random() > 0.5:
        image = np.fliplr(image)
    
    # Random crop and pad
    pad_width = 4
    padded = np.pad(image, ((pad_width, pad_width), (pad_width, pad_width), (0, 0)), mode='constant')
    crop_x = np.random.randint(0, 2 * pad_width + 1)
    crop_y = np.random.randint(0, 2 * pad_width + 1)
    image = padded[crop_y:crop_y+32, crop_x:crop_x+32]
    
    return image

class AugmentedCIFAR10Dataset(CIFAR10Dataset):
    def __getitem__(self, idx):
        image, label = super().__getitem__(idx)
        if self.train:
            image = augment_cifar10(image)
        return image, label

Issue: "Model training takes too long"

Symptoms:

  • Single epoch takes >10 minutes
  • GPU utilization low or no GPU being used
  • Memory usage constantly growing

Diagnosis & Solutions:

Inefficient Convolution Implementation

# Profile your convolution
import time

def time_convolution():
    input_tensor = Tensor(np.random.randn(32, 3, 32, 32))
    conv = Conv2D(3, 64, kernel_size=3)
    
    start_time = time.time()
    for _ in range(100):
        output = conv(input_tensor)
    end_time = time.time()
    
    print(f"100 convolutions took {end_time - start_time:.2f} seconds")
    print(f"Average time per convolution: {(end_time - start_time)/100:.4f} seconds")

time_convolution()

Memory Leaks in Training Loop

# ❌ WRONG: Gradients accumulate across iterations (and can pin computation graphs in memory)
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        # Missing: optimizer.zero_grad()

# ✅ CORRECT: Clear gradients each iteration
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()  # Clear previous gradients
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()

🔥 Milestone 5: Language Generation

Issue: "GPT generates nonsense text"

Symptoms:

  • Generated text is random characters
  • Model outputs same character repeatedly
  • Text has no recognizable patterns or structure

Diagnosis & Solutions:

Tokenization Problems

# ❌ WRONG: Inconsistent character mapping
def tokenize(text):
    chars = list(set(text))  # Order changes each run!
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    return [char_to_idx[ch] for ch in text]

# ✅ CORRECT: Consistent character vocabulary
class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(list(set(text)))  # Consistent ordering
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
        
    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]
        
    def decode(self, indices):
        return ''.join([self.idx_to_char[i] for i in indices])
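
A quick roundtrip test (anything you encode must only use characters present in the vocabulary text):

tokenizer = CharTokenizer("hello world")
ids = tokenizer.encode("hello")
print(ids)                    # [3, 2, 4, 4, 5], stable across runs
print(tokenizer.decode(ids))  # "hello"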

Sequence Length Issues

# ❌ TOO LONG: Sequence length too large for available data
sequence_length = 1000  # Only have 10,000 chars total

# ✅ REASONABLE: Sequence length appropriate for dataset
sequence_length = min(100, len(text) // 100)  # At least 100 sequences
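
With the sequence length fixed, training pairs for next-character prediction are shifted windows over the encoded text. A sketch, assuming data is the full list of token indices:

def make_batch(data, sequence_length, batch_size):
    # Targets are the inputs shifted right by one position
    starts = np.random.randint(0, len(data) - sequence_length - 1, size=batch_size)
    x = np.stack([data[s : s + sequence_length] for s in starts])
    y = np.stack([data[s + 1 : s + sequence_length + 1] for s in starts])
    return x, y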

Position Encoding Missing

# ❌ MISSING: No positional information
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()  # Required before registering submodules
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)
        
    def forward(self, x):
        x = x + self.attention(x)  # No position info!
        x = x + self.mlp(x)
        return x

# ✅ CORRECT: Add positional encoding
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, max_seq_len):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)
        self.pos_encoding = PositionalEncoding(embed_dim, max_seq_len)
        
    def forward(self, x):
        x = x + self.pos_encoding(x)  # Add position information
        x = x + self.attention(x)
        x = x + self.mlp(x)
        return x
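
PositionalEncoding is referenced but not shown; a minimal sinusoidal version (the fixed encoding from the original Transformer paper, assuming an even embed_dim) could look like this:

class PositionalEncoding(nn.Module):
    def __init__(self, embed_dim, max_seq_len):
        super().__init__()
        position = np.arange(max_seq_len)[:, None]  # (max_seq_len, 1)
        div_term = np.exp(-np.log(10000.0) * np.arange(0, embed_dim, 2) / embed_dim)
        pe = np.zeros((max_seq_len, embed_dim))
        pe[:, 0::2] = np.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = np.cos(position * div_term)   # odd dimensions
        self.pe = Tensor(pe)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim); return encodings for the first seq_len positions
        return self.pe[:x.shape[1]]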

Issue: "Can't reuse components from vision modules"

Symptoms:

  • Having to reimplement Dense layers, ReLU, etc.
  • Components don't work with sequence data
  • Different interfaces for vision vs. language components

Diagnosis & Solutions:

Shape Incompatibility

# ❌ PROBLEM: Dense layer expects 2D input, sequences are 3D
# Sequence shape: (batch_size, sequence_length, embed_dim)
# Dense expects: (batch_size, features)

# ✅ SOLUTION: Reshape for compatibility
class SequenceDense(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.dense = Dense(input_dim, output_dim)  # Reuse vision component!
        
    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        batch_size, seq_len, input_dim = x.shape
        
        # Reshape to 2D for dense layer
        x_flat = x.reshape(batch_size * seq_len, input_dim)
        
        # Apply dense transformation
        output_flat = self.dense(x_flat)
        
        # Reshape back to sequence format
        output_dim = output_flat.shape[-1]
        return output_flat.reshape(batch_size, seq_len, output_dim)

Different Data Types

# ❌ ISSUE: Vision uses float32, language uses int64 indices
# Vision: image_tensor = Tensor(np.float32([...]))
# Language: token_indices = [1, 5, 12, ...]

# ✅ SOLUTION: Embedding layer converts indices to vectors
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = Tensor(np.random.randn(vocab_size, embed_dim) * 0.1)
        
    def forward(self, token_indices):
        # Convert integer indices to float embeddings
        return self.embedding[token_indices]  # Now compatible with Dense layers!
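
Putting the pieces together (a sketch, assuming the classes above and that modules are callable like layers elsewhere in TinyTorch):

tokenizer = CharTokenizer(text)
embed = TokenEmbedding(vocab_size=len(tokenizer.chars), embed_dim=64)
proj = SequenceDense(64, 128)

indices = np.array([tokenizer.encode(text[:100])])  # (1, 100) batch of token ids
vectors = embed(indices)                            # (1, 100, 64) float embeddings
features = proj(vectors)                            # (1, 100, 128) via the reused Dense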

🛠️ General Debugging Strategies

Debugging Checklist

Before Every Milestone Attempt:

  1. Environment activated: source .venv/bin/activate
  2. Dependencies updated: pip install -r requirements.txt
  3. Previous modules working: tito test --all-previous
  4. Clean workspace: git status shows clean state

During Implementation:

  1. Print shapes at every step
  2. Test with small data first (batch_size=1, small input; see the overfit-one-batch sketch after this list)
  3. Use debugger breakpoints at critical functions
  4. Save intermediate results for inspection
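
A classic first step for testing with small data: check that your model can overfit a single batch. A correct model and loop should drive that loss toward zero; if it can't, the bug is in the code, not the hyperparameters. A sketch, using the training API from earlier sections:

data, target = next(iter(train_loader))  # one fixed batch
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: loss = {float(loss.data):.4f}")  # should approach 0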

Before Milestone Submission:

  1. Code runs without errors
  2. Performance benchmarks met
  3. All tests pass: tito milestone test X
  4. Code exported successfully: tito export --module X

Performance Debugging

Memory Usage:

import tracemalloc

def debug_memory_usage():
    tracemalloc.start()
    
    # Your code here
    model = build_model()
    train_one_epoch(model)
    
    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
    print(f"Peak memory usage: {peak / 1024 / 1024:.1f} MB")
    tracemalloc.stop()

Training Speed:

import time

def benchmark_training_speed():
    model = build_model()
    dummy_data = create_dummy_batch()
    
    # Warm up
    for _ in range(5):
        _ = model(dummy_data)
    
    # Benchmark
    start_time = time.time()
    for _ in range(100):
        output = model(dummy_data)
    end_time = time.time()
    
    avg_time = (end_time - start_time) / 100
    print(f"Average forward pass time: {avg_time*1000:.2f} ms")

Getting Help

Documentation Resources:

  • Module READMEs: modules/source/XX_module/README.md
  • API Reference: book/appendices/api-reference.md
  • Troubleshooting: This guide!

Community Support:

  • Discord/Slack: #tinytorch-help channel
  • Office Hours: See course calendar
  • Study Groups: Form with classmates working on same milestone

Instructor Support:

  • Email for conceptual questions
  • Office hours for debugging sessions
  • Milestone review meetings for stuck students

When to Ask for Help

Ask for help if:

  • Stuck on same issue for >2 hours
  • Performance far below milestone requirements
  • Unclear about milestone requirements
  • Suspecting bug in provided code

Before asking, prepare:

  • Minimal code example reproducing the issue
  • Error messages and stack traces
  • What you've already tried
  • Specific question, not just "it doesn't work"

🎯 Success Strategies

Milestone Achievement Tips

Start Early:

  • Begin milestone attempts when you complete prerequisites
  • Don't wait until the deadline to discover issues
  • Use intermediate checkpoints to track progress

Incremental Development:

  • Get basic version working first
  • Optimize performance second
  • Add advanced features last

Test-Driven Development:

  • Write tests for your functions before implementation
  • Use provided test suites as specification
  • Add your own tests for edge cases

Systematic Debugging:

  • Isolate issues to smallest possible code section
  • Use print statements and debugger strategically
  • Keep a debugging log of what you've tried

Building Confidence

Celebrate Small Wins:

  • First successful forward pass
  • First decreasing loss curve
  • First accuracy improvement

Learn from Failures:

  • Every bug teaches you something about the system
  • Failed milestones often lead to deeper understanding
  • Debugging skills are as valuable as implementation skills

Connect to Bigger Picture:

  • Each milestone represents real-world capability
  • Your implementations mirror industry practices
  • Skills transfer directly to research and industry roles

Remember the Goal: You're not just completing assignments—you're building genuine ML systems engineering expertise that will serve you throughout your career. Every challenge overcome makes you a stronger engineer.

🚀 Keep going! Every milestone brings you closer to ML systems mastery.