# 🔧 TinyTorch Milestone Troubleshooting Guide

## Common Issues and Solutions

This guide helps you overcome the most frequent challenges students encounter while pursuing TinyTorch milestones. Each section provides symptoms, diagnoses, and concrete solutions.
## 🎯 Milestone 1: Basic Inference

### Issue: "My neural network outputs don't make sense"

**Symptoms:**
- Network outputs NaN or inf values
- All predictions are the same number
- Accuracy stuck at random chance (10% for MNIST)
- Gradients exploding or vanishing

**Diagnosis & Solutions:**
**Weight Initialization Problems**

```python
# ❌ WRONG: weights too large, activations saturate or blow up
self.weight = Tensor(np.random.randn(input_size, output_size))

# ✅ CORRECT: Xavier initialization keeps activation variance stable
scale = np.sqrt(2.0 / (input_size + output_size))
self.weight = Tensor(np.random.randn(input_size, output_size) * scale)
```
**Shape Mismatch Issues**

```python
# Debug shapes at each step
print(f"Input shape: {x.shape}")
output = self.dense1(x)
print(f"After dense1: {output.shape}")
output = self.activation(output)
print(f"After activation: {output.shape}")
```
**Learning Rate Problems**

```python
# ❌ TOO HIGH: learning rate 1.0 causes instability
optimizer = SGD(model.parameters(), learning_rate=1.0)

# ✅ GOOD: start with a smaller learning rate
optimizer = SGD(model.parameters(), learning_rate=0.01)
```
### Issue: "MNIST accuracy stuck below 85%"

**Symptoms:**
- Network trains but plateaus at 60-70% accuracy
- Loss decreases but accuracy doesn't improve
- Similar performance on training and test sets

**Diagnosis & Solutions:**
**Insufficient Network Capacity**

```python
# ❌ TOO SIMPLE: not enough parameters
model = Sequential([
    Dense(784, 10),  # Only 7,850 parameters
    Softmax()
])

# ✅ BETTER: more capacity for complex patterns
model = Sequential([
    Dense(784, 128), ReLU(),  # Hidden layer for feature learning
    Dense(128, 64), ReLU(),   # Additional feature refinement
    Dense(64, 10), Softmax()  # Final classification
])
```
**Activation Function Issues**

```python
# ❌ WRONG: no activation between layers
model = Sequential([
    Dense(784, 128),
    Dense(128, 10),  # Linear combinations of linear functions = still linear
    Softmax()
])

# ✅ CORRECT: nonlinearity enables complex patterns
model = Sequential([
    Dense(784, 128), ReLU(),  # Nonlinearity crucial!
    Dense(128, 10), Softmax()
])
```
## 👁️ Milestone 2: Computer Vision

### Issue: "Convolution implementation is too slow"

**Symptoms:**
- Conv2D forward pass takes >10 seconds for small images
- Memory usage explodes during convolution
- System becomes unresponsive during training

**Diagnosis & Solutions:**
**Inefficient Convolution Loops**

```python
# ❌ SLOW: five nested Python loops
for batch in range(batch_size):
    for out_ch in range(out_channels):
        for in_ch in range(in_channels):
            for h in range(output_height):
                for w in range(output_width):
                    # Convolution computation
                    result[batch, out_ch, h, w] += ...

# ✅ FASTER: vectorized via im2col (convolution as matrix multiplication)
def im2col_convolution(input_tensor, weight, bias=None):
    batch_size, out_channels = input_tensor.shape[0], weight.shape[0]
    # Unfold input patches into columns, then do one big matmul
    input_cols = im2col(input_tensor, weight.shape[2:])
    output = input_cols @ weight.reshape(weight.shape[0], -1).T
    # (bias addition and output_height/output_width bookkeeping omitted in this sketch)
    return output.reshape(batch_size, out_channels, output_height, output_width)
```
**Memory Inefficiency**

```python
# ❌ MEMORY HOG: creating intermediate tensor copies in loops
for i in range(kernel_height):
    for j in range(kernel_width):
        temp_tensor = input[:, :, i:i+output_height, j:j+output_width]
        result += temp_tensor * kernel[:, :, i, j]

# ✅ MEMORY EFFICIENT: accumulate into a preallocated output using views
output = Tensor(np.zeros((batch_size, out_channels, output_height, output_width)))
for i in range(kernel_height):
    for j in range(kernel_width):
        # Use views instead of copies
        input_slice = input[:, :, i:i+output_height, j:j+output_width]
        output += input_slice * kernel[:, :, i, j]
```
### Issue: "CNN accuracy worse than dense network"

**Symptoms:**
- Dense network achieves 90%+ on MNIST
- CNN with the same parameter count gets 70-80%
- CNN training loss decreases more slowly than the dense network's

**Diagnosis & Solutions:**
**Poor CNN Architecture**

```python
# ❌ BAD: CNN worse than dense
model = Sequential([
    Conv2D(1, 32, kernel_size=7),  # Kernel too large
    ReLU(),
    Flatten(),
    Dense(32 * 22 * 22, 10)  # Huge dense layer
])

# ✅ GOOD: proper CNN design
model = Sequential([
    Conv2D(1, 16, kernel_size=3), ReLU(),  # Small kernels
    MaxPool2D(kernel_size=2),              # Reduce spatial size
    Conv2D(16, 32, kernel_size=3), ReLU(),
    MaxPool2D(kernel_size=2),
    Flatten(),
    Dense(32 * 5 * 5, 128), ReLU(),  # Reasonable dense size
    Dense(128, 10)
])
```
**Padding and Stride Issues**

```python
# ❌ WRONG: losing too much spatial information
conv = Conv2D(1, 16, kernel_size=5, stride=2, padding=0)  # Aggressive downsampling

# ✅ CORRECT: preserve spatial information
conv = Conv2D(1, 16, kernel_size=3, stride=1, padding=1)  # Same-size output
pool = MaxPool2D(kernel_size=2)  # Controlled downsampling
```
## ⚙️ Milestone 3: Full Training

### Issue: "Training loss not decreasing"

**Symptoms:**
- Loss remains constant across epochs
- Gradients are all zeros or very small
- Model predictions don't change during training

**Diagnosis & Solutions:**
**Learning Rate Too Small**

```python
# ❌ TOO SMALL: no visible progress
optimizer = Adam(model.parameters(), learning_rate=1e-6)

# ✅ GOOD RANGE: start here and adjust
optimizer = Adam(model.parameters(), learning_rate=1e-3)

# Monitor gradient norms to debug
def check_gradients(model):
    total_norm = 0.0
    for param in model.parameters():
        if param.grad is not None:
            total_norm += float(np.linalg.norm(param.grad.data)) ** 2
    return total_norm ** 0.5

print(f"Gradient norm: {check_gradients(model)}")
```
**Incorrect Loss Function Implementation**

```python
# ❌ WRONG: cross-entropy applied to raw logits without log-softmax
def cross_entropy_loss(predictions, targets):
    return -np.mean(predictions[range(len(targets)), targets])

# ✅ CORRECT: log-softmax + negative log-likelihood
def cross_entropy_loss(logits, targets):
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[range(len(targets)), targets])

def log_softmax(x):
    # Shift by the row max and stay in log space for numerical stability
    x_shifted = x - np.max(x, axis=1, keepdims=True)
    return x_shifted - np.log(np.sum(np.exp(x_shifted), axis=1, keepdims=True))
```
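A quick way to validate a log-softmax implementation: exponentiating each output row should give probabilities that sum to 1. This check is self-contained numpy, independent of TinyTorch:

```python
import numpy as np

def log_softmax(x):
    # Stable log-softmax: shift by the row max, stay in log space
    x_shifted = x - np.max(x, axis=1, keepdims=True)
    return x_shifted - np.log(np.sum(np.exp(x_shifted), axis=1, keepdims=True))

logits = np.array([[1.0, 2.0, 3.0],
                   [10.0, 10.0, 10.0]])
log_probs = log_softmax(logits)
row_sums = np.exp(log_probs).sum(axis=1)
print(row_sums)  # each ≈ 1.0
```

The second row also exercises the stability shift: without subtracting the max, `np.exp(10.0)` is large but harmless here, yet the same code survives logits in the hundreds where a naive softmax would overflow.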
### Issue: "CIFAR-10 training diverges or gets stuck"

**Symptoms:**
- Loss starts decreasing, then shoots up to infinity
- Accuracy drops during training
- NaN values appear in the loss or gradients

**Diagnosis & Solutions:**
**Data Preprocessing Issues**

```python
# ❌ WRONG: using raw pixel values 0-255
train_data = cifar10_data  # Values in [0, 255]

# ✅ CORRECT: normalize to a reasonable range
train_data = cifar10_data.astype(np.float32) / 255.0  # Values in [0, 1]

# Even better: zero-center and scale per channel
# (these are the commonly reused ImageNet statistics)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
train_data = (train_data - mean) / std
```
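The hard-coded mean/std values above are the widely reused ImageNet statistics. If you prefer dataset-specific values, you can compute per-channel statistics from your own training set. A sketch, assuming `train_data` is an `(N, H, W, 3)` float array in `[0, 1]` (a random array stands in for CIFAR-10 here):

```python
import numpy as np

def channel_stats(images):
    # Average over every axis except the trailing channel axis
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return mean, std

# Toy stand-in for CIFAR-10 training images
train_data = np.random.rand(100, 32, 32, 3).astype(np.float32)
mean, std = channel_stats(train_data)
normalized = (train_data - mean) / std
print(normalized.mean(axis=(0, 1, 2)))  # ≈ [0, 0, 0] per channel
```

After this transform each channel has approximately zero mean and unit variance, which is what keeps early training numerically well-behaved.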
**Batch Size Too Large**

```python
# ❌ PROBLEMATIC: batch size too large for stable training
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

# ✅ BETTER: moderate batch size for stability
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```
**Learning Rate Scheduling**

```python
# ❌ BASIC: fixed learning rate throughout training
optimizer = Adam(model.parameters(), learning_rate=0.001)

# ✅ ADVANCED: learning rate decay for convergence
# (assumes your optimizer exposes PyTorch-style param_groups)
def adjust_learning_rate(optimizer, epoch, initial_lr=0.001):
    lr = initial_lr * (0.9 ** (epoch // 10))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    return lr
```
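The step-decay schedule above is easy to check by hand: with `initial_lr=0.001`, the rate drops by 10% every 10 epochs. A pure-Python sketch of the same formula (the `scheduled_lr` helper is just for illustration):

```python
def scheduled_lr(epoch, initial_lr=0.001, decay=0.9, step=10):
    # Same step decay as adjust_learning_rate above
    return initial_lr * (decay ** (epoch // step))

for epoch in (0, 10, 20, 30):
    print(epoch, scheduled_lr(epoch))
# ≈ 0.001, 0.0009, 0.00081, 0.000729
```

Plotting or printing the schedule like this before training is a cheap way to catch off-by-one mistakes in the `epoch // step` logic.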
## 🚀 Milestone 4: Advanced Vision

### Issue: "Can't reach 75% CIFAR-10 accuracy"

**Symptoms:**
- Model plateaus at 65-70% accuracy
- Large gap between training and validation accuracy
- Loss continues decreasing but accuracy doesn't improve

**Diagnosis & Solutions:**
**Insufficient Model Complexity**

```python
# ❌ TOO SIMPLE: not enough capacity for CIFAR-10
model = Sequential([
    Conv2D(3, 16, 3), ReLU(),
    MaxPool2D(2),
    Flatten(),
    Dense(16 * 15 * 15, 10)  # 32 -> 30 after conv, -> 15 after pooling (no padding)
])

# ✅ BETTER: deeper architecture with more features
model = Sequential([
    Conv2D(3, 32, 3), ReLU(),
    Conv2D(32, 32, 3), ReLU(),
    MaxPool2D(2),
    Conv2D(32, 64, 3), ReLU(),
    Conv2D(64, 64, 3), ReLU(),
    MaxPool2D(2),
    Flatten(),
    Dense(64 * 5 * 5, 256), ReLU(),  # 32 -> 30 -> 28 -> 14 -> 12 -> 10 -> 5
    Dropout(0.5),
    Dense(256, 10)
])
```
**Overfitting Problems**

```python
# Add regularization: batch norm plus dropout at increasing rates
model = Sequential([
    Conv2D(3, 32, 3), BatchNorm2D(32), ReLU(),
    Conv2D(32, 32, 3), BatchNorm2D(32), ReLU(),
    MaxPool2D(2), Dropout(0.2),
    Conv2D(32, 64, 3), BatchNorm2D(64), ReLU(),
    Conv2D(64, 64, 3), BatchNorm2D(64), ReLU(),
    MaxPool2D(2), Dropout(0.3),
    Flatten(),
    Dense(64 * 5 * 5, 256), BatchNorm1D(256), ReLU(),
    Dropout(0.5),
    Dense(256, 10)
])
```
**Data Augmentation Missing**

```python
# ✅ ADD: data augmentation for better generalization
def augment_cifar10(image):
    # Random horizontal flip
    if np.random.random() > 0.5:
        image = np.fliplr(image)
    # Random crop after padding (pad 4 pixels, crop back to 32x32)
    pad_width = 4
    padded = np.pad(image, ((pad_width, pad_width), (pad_width, pad_width), (0, 0)),
                    mode='constant')
    crop_x = np.random.randint(0, 2 * pad_width + 1)
    crop_y = np.random.randint(0, 2 * pad_width + 1)
    image = padded[crop_y:crop_y+32, crop_x:crop_x+32]
    return image

class AugmentedCIFAR10Dataset(CIFAR10Dataset):
    def __getitem__(self, idx):
        image, label = super().__getitem__(idx)
        if self.train:
            image = augment_cifar10(image)
        return image, label
```
### Issue: "Model training takes too long"

**Symptooms:**
- A single epoch takes >10 minutes
- GPU utilization is low, or no GPU is being used
- Memory usage grows continuously

**Diagnosis & Solutions:**
**Inefficient Convolution Implementation**

```python
# Profile your convolution
import time

def time_convolution():
    input_tensor = Tensor(np.random.randn(32, 3, 32, 32))
    conv = Conv2D(3, 64, kernel_size=3)
    start_time = time.time()
    for _ in range(100):
        output = conv(input_tensor)
    end_time = time.time()
    print(f"100 convolutions took {end_time - start_time:.2f} seconds")
    print(f"Average time per convolution: {(end_time - start_time)/100:.4f} seconds")

time_convolution()
```
**Memory Leaks in Training Loop**

```python
# ❌ MEMORY LEAK: gradients accumulate across iterations
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        # Missing: optimizer.zero_grad()

# ✅ CORRECT: clear gradients each iteration
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()  # Clear previous gradients
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
```
## 🔥 Milestone 5: Language Generation

### Issue: "GPT generates nonsense text"

**Symptoms:**
- Generated text is random characters
- Model outputs the same character repeatedly
- Text has no recognizable patterns or structure

**Diagnosis & Solutions:**
**Tokenization Problems**

```python
# ❌ WRONG: set ordering is not deterministic across runs
def tokenize(text):
    chars = list(set(text))  # Order changes each run!
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    return [char_to_idx[ch] for ch in text]

# ✅ CORRECT: consistent character vocabulary
class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(set(text))  # Consistent ordering
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]

    def decode(self, indices):
        return ''.join(self.idx_to_char[i] for i in indices)
```
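A round-trip test is the easiest way to confirm a tokenizer is consistent: encode then decode should return the original string. A self-contained sketch of the `CharTokenizer` above (the `corpus` string is an arbitrary example):

```python
class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(set(text))  # Deterministic ordering
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]

    def decode(self, indices):
        return ''.join(self.idx_to_char[i] for i in indices)

corpus = "hello tinytorch"
tok = CharTokenizer(corpus)
ids = tok.encode("torch")
print(ids)
print(tok.decode(ids) == "torch")  # True
```

If this round trip ever fails, or produces different `ids` on a fresh run, the model is being trained against a moving vocabulary and its output will decode to garbage.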
**Sequence Length Issues**

```python
# ❌ TOO LONG: sequence length too large for the available data
sequence_length = 1000  # Only have 10,000 chars total

# ✅ REASONABLE: sequence length appropriate for the dataset
sequence_length = min(100, len(text) // 100)  # At least ~100 sequences
```
**Position Encoding Missing**

```python
# ❌ MISSING: no positional information (attention alone is permutation-invariant)
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)

    def forward(self, x):
        x = x + self.attention(x)  # No position info!
        x = x + self.mlp(x)
        return x

# ✅ CORRECT: add positional encoding
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, max_seq_len):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)
        self.pos_encoding = PositionalEncoding(embed_dim, max_seq_len)

    def forward(self, x):
        x = x + self.pos_encoding(x)  # Add position information
        x = x + self.attention(x)
        x = x + self.mlp(x)
        return x
```
### Issue: "Can't reuse components from vision modules"

**Symptoms:**
- Having to reimplement Dense layers, ReLU, etc.
- Components don't work with sequence data
- Different interfaces for vision vs. language components

**Diagnosis & Solutions:**
**Shape Incompatibility**

```python
# ❌ PROBLEM: Dense expects 2D input (batch_size, features),
# but sequences are 3D (batch_size, sequence_length, embed_dim)

# ✅ SOLUTION: reshape for compatibility
class SequenceDense(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.dense = Dense(input_dim, output_dim)  # Reuse vision component!

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        batch_size, seq_len, input_dim = x.shape
        # Reshape to 2D for the dense layer
        x_flat = x.reshape(batch_size * seq_len, input_dim)
        output_flat = self.dense(x_flat)
        # Reshape back to sequence format
        output_dim = output_flat.shape[-1]
        return output_flat.reshape(batch_size, seq_len, output_dim)
```
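The reshape trick above is easy to verify with plain numpy and a stand-in linear layer; the shapes here (batch 4, sequence 10, 16 -> 32 features) are arbitrary illustrative choices:

```python
import numpy as np

def linear(x, W, b):
    # Stand-in for Dense: (rows, in_dim) @ (in_dim, out_dim) + bias
    return x @ W + b

batch, seq_len, in_dim, out_dim = 4, 10, 16, 32
x = np.random.randn(batch, seq_len, in_dim)
W = np.random.randn(in_dim, out_dim) * 0.1
b = np.zeros(out_dim)

# Flatten (batch, seq, in) -> (batch*seq, in), apply the layer, restore the shape
x_flat = x.reshape(batch * seq_len, in_dim)
out = linear(x_flat, W, b).reshape(batch, seq_len, out_dim)
print(out.shape)  # (4, 10, 32)
```

Because the layer is applied row-by-row, flattening the batch and sequence axes together changes nothing about the math; it only changes the bookkeeping.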
**Different Data Types**

```python
# ❌ ISSUE: vision uses float32 tensors, language uses int64 token indices
# Vision: image_tensor = Tensor(np.float32([...]))
# Language: token_indices = [1, 5, 12, ...]

# ✅ SOLUTION: an embedding layer converts indices to vectors
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = Tensor(np.random.randn(vocab_size, embed_dim) * 0.1)

    def forward(self, token_indices):
        # Convert integer indices to float embeddings
        return self.embedding[token_indices]  # Now compatible with Dense layers!
```
## 🛠️ General Debugging Strategies

### Debugging Checklist

**Before Every Milestone Attempt:**
- Environment activated: `source .venv/bin/activate`
- Dependencies updated: `pip install -r requirements.txt`
- Previous modules working: `tito test --all-previous`
- Clean workspace: `git status` shows a clean state

**During Implementation:**
- Print shapes at every step
- Test with small data first (batch_size=1, small input)
- Use debugger breakpoints at critical functions
- Save intermediate results for inspection

**Before Milestone Submission:**
- Code runs without errors
- Performance benchmarks met
- All tests pass: `tito milestone test X`
- Code exported successfully: `tito export --module X`
### Performance Debugging

**Memory Usage:**

```python
import tracemalloc

def debug_memory_usage():
    tracemalloc.start()
    # Your code here
    model = build_model()
    train_one_epoch(model)
    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
    print(f"Peak memory usage: {peak / 1024 / 1024:.1f} MB")
    tracemalloc.stop()
```
**Training Speed:**

```python
import time

def benchmark_training_speed():
    model = build_model()
    dummy_data = create_dummy_batch()
    # Warm up
    for _ in range(5):
        _ = model(dummy_data)
    # Benchmark
    start_time = time.time()
    for _ in range(100):
        output = model(dummy_data)
    end_time = time.time()
    avg_time = (end_time - start_time) / 100
    print(f"Average forward pass time: {avg_time*1000:.2f} ms")
```
### Getting Help

**Documentation Resources:**
- Module READMEs: `modules/source/XX_module/README.md`
- API Reference: `book/appendices/api-reference.md`
- Troubleshooting: this guide!

**Community Support:**
- Discord/Slack: #tinytorch-help channel
- Office Hours: see the course calendar
- Study Groups: form one with classmates working on the same milestone

**Instructor Support:**
- Email for conceptual questions
- Office hours for debugging sessions
- Milestone review meetings for stuck students
### When to Ask for Help

**Ask for help if:**
- You've been stuck on the same issue for >2 hours
- Performance is far below milestone requirements
- You're unclear about the milestone requirements
- You suspect a bug in the provided code

**Before asking, prepare:**
- A minimal code example reproducing the issue
- Error messages and stack traces
- What you've already tried
- A specific question, not just "it doesn't work"
## 🎯 Success Strategies

### Milestone Achievement Tips

**Start Early:**
- Begin milestone attempts as soon as you complete the prerequisites
- Don't wait until the deadline to discover issues
- Use intermediate checkpoints to track progress

**Incremental Development:**
- Get a basic version working first
- Optimize performance second
- Add advanced features last

**Test-Driven Development:**
- Write tests for your functions before implementation
- Use the provided test suites as a specification
- Add your own tests for edge cases

**Systematic Debugging:**
- Isolate issues to the smallest possible code section
- Use print statements and the debugger strategically
- Keep a debugging log of what you've tried
### Building Confidence

**Celebrate Small Wins:**
- First successful forward pass
- First decreasing loss curve
- First accuracy improvement

**Learn from Failures:**
- Every bug teaches you something about the system
- Failed milestones often lead to deeper understanding
- Debugging skills are as valuable as implementation skills

**Connect to the Bigger Picture:**
- Each milestone represents a real-world capability
- Your implementations mirror industry practices
- Skills transfer directly to research and industry roles

**Remember the Goal:** You're not just completing assignments: you're building genuine ML systems engineering expertise that will serve you throughout your career. Every challenge overcome makes you a stronger engineer.

🚀 Keep going! Every milestone brings you closer to ML systems mastery.