# πŸ”§ TinyTorch Milestone Troubleshooting Guide

## Common Issues and Solutions

This guide helps you overcome the most frequent challenges students encounter while pursuing TinyTorch milestones. Each section provides symptoms, diagnoses, and concrete solutions.

---

## 🎯 Milestone 1: Basic Inference

### Issue: "My neural network outputs don't make sense"

**Symptoms:**
- Network outputs NaN or inf values
- All predictions are the same number
- Accuracy stuck at random chance (10% for MNIST)
- Gradients exploding or vanishing

**Diagnosis & Solutions:**

#### Weight Initialization Problems
```python
# ❌ WRONG: Weights too large
self.weight = Tensor(np.random.randn(input_size, output_size))

# βœ… CORRECT: Xavier initialization
scale = np.sqrt(2.0 / (input_size + output_size))
self.weight = Tensor(np.random.randn(input_size, output_size) * scale)
```

#### Shape Mismatch Issues
```python
# Debug shapes at each step
print(f"Input shape: {x.shape}")
output = self.dense1(x)
print(f"After dense1: {output.shape}")
output = self.activation(output)
print(f"After activation: {output.shape}")
```

#### Learning Rate Problems
```python
# ❌ TOO HIGH: Learning rate 1.0 causes instability
optimizer = SGD(model.parameters(), learning_rate=1.0)

# βœ… GOOD: Start with a smaller learning rate
optimizer = SGD(model.parameters(), learning_rate=0.01)
```

### Issue: "MNIST accuracy stuck below 85%"

**Symptoms:**
- Network trains but plateaus at 60-70% accuracy
- Loss decreases but accuracy doesn't improve
- Similar performance on training and test sets

**Diagnosis & Solutions:**

#### Insufficient Network Capacity
```python
# ❌ TOO SIMPLE: Not enough parameters
model = Sequential([
    Dense(784, 10),  # Only 7,850 parameters
    Softmax()
])

# βœ… BETTER: More capacity for complex patterns
model = Sequential([
    Dense(784, 128), ReLU(),   # Hidden layer for feature learning
    Dense(128, 64), ReLU(),    # Additional feature refinement
    Dense(64, 10), Softmax()   # Final classification
])
```

#### Activation Function Issues
```python
# ❌ WRONG: No activation between layers
model = Sequential([
    Dense(784, 128),
    Dense(128, 10),  # Linear combinations of linear functions = linear
    Softmax()
])

# βœ… CORRECT: Nonlinearity enables complex patterns
model = Sequential([
    Dense(784, 128),
    ReLU(),  # Nonlinearity crucial!
    Dense(128, 10),
    Softmax()
])
```

---

## πŸ‘οΈ Milestone 2: Computer Vision

### Issue: "Convolution implementation is too slow"

**Symptoms:**
- Conv2D forward pass takes >10 seconds for small images
- Memory usage explodes during convolution
- System becomes unresponsive during training

**Diagnosis & Solutions:**

#### Inefficient Convolution Loops
```python
# ❌ SLOW: Nested Python loops
for batch in range(batch_size):
    for out_ch in range(out_channels):
        for in_ch in range(in_channels):
            for h in range(output_height):
                for w in range(output_width):
                    # Convolution computation
                    result[batch, out_ch, h, w] += ...

# βœ… FASTER: Vectorized operations using im2col
def im2col_convolution(input_tensor, weight, bias=None):
    # weight shape: (out_channels, in_channels, kH, kW)
    batch_size, _, H, W = input_tensor.shape
    out_channels, _, kH, kW = weight.shape
    output_height, output_width = H - kH + 1, W - kW + 1
    # Convert convolution to matrix multiplication
    input_cols = im2col(input_tensor, (kH, kW))               # (B*oH*oW, C*kH*kW)
    output = input_cols @ weight.reshape(out_channels, -1).T  # (B*oH*oW, out_ch)
    if bias is not None:
        output = output + bias
    # Restore the (batch, out_channels, height, width) layout
    output = output.reshape(batch_size, output_height, output_width, out_channels)
    return output.transpose(0, 3, 1, 2)
```
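The `im2col` helper is used above but never defined. Here is a minimal sketch of one possible implementation, assuming NCHW layout, stride 1, no padding, and plain NumPy arrays; the name and signature simply mirror the call in the snippet above and are not a fixed TinyTorch API:

```python
import numpy as np

def im2col(input_tensor, kernel_size, stride=1):
    """Unroll sliding windows into rows so convolution becomes one matmul."""
    batch, in_ch, H, W = input_tensor.shape
    kH, kW = kernel_size
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1

    # One row per output position, one column per kernel element
    cols = np.zeros((batch, out_H, out_W, in_ch * kH * kW), dtype=input_tensor.dtype)
    for i in range(out_H):
        for j in range(out_W):
            patch = input_tensor[:, :, i*stride:i*stride+kH, j*stride:j*stride+kW]
            cols[:, i, j, :] = patch.reshape(batch, -1)
    return cols.reshape(batch * out_H * out_W, -1)
```

The loops here only gather patches; the actual arithmetic happens in the single vectorized matrix multiply, which is where the speedup comes from.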
#### Memory Inefficiency
```python
# ❌ MEMORY HOG: Binding fresh intermediate tensors on every iteration
for i in range(kernel_height):
    for j in range(kernel_width):
        temp_tensor = input[:, :, i:i+output_height, j:j+output_width]
        result += temp_tensor * kernel[:, :, i, j]

# βœ… MEMORY EFFICIENT: Preallocate once, accumulate using views
output = Tensor(np.zeros((batch_size, out_channels, output_height, output_width)))
for i in range(kernel_height):
    for j in range(kernel_width):
        # Use views instead of copies
        input_slice = input[:, :, i:i+output_height, j:j+output_width]
        output += input_slice * kernel[:, :, i, j]
```

### Issue: "CNN accuracy worse than dense network"

**Symptoms:**
- Dense network achieves 90%+ on MNIST
- CNN with the same parameters gets 70-80%
- CNN training loss decreases slower than dense

**Diagnosis & Solutions:**

#### Poor CNN Architecture
```python
# ❌ BAD: CNN worse than dense
model = Sequential([
    Conv2D(1, 32, kernel_size=7),  # Too large a kernel
    ReLU(),
    Flatten(),
    Dense(32 * 22 * 22, 10)  # Huge dense layer
])

# βœ… GOOD: Proper CNN design
model = Sequential([
    Conv2D(1, 16, kernel_size=3), ReLU(),   # Small kernels
    MaxPool2D(kernel_size=2),               # Reduce spatial size
    Conv2D(16, 32, kernel_size=3), ReLU(),
    MaxPool2D(kernel_size=2),
    Flatten(),
    Dense(32 * 5 * 5, 128), ReLU(),         # Reasonable dense size
    Dense(128, 10)
])
```

#### Padding and Stride Issues
```python
# ❌ WRONG: Losing too much spatial information
conv = Conv2D(1, 16, kernel_size=5, stride=2, padding=0)  # Aggressive downsampling

# βœ… CORRECT: Preserve spatial information
conv = Conv2D(1, 16, kernel_size=3, stride=1, padding=1)  # Same-size output
pool = MaxPool2D(kernel_size=2)  # Controlled downsampling
```

---

## βš™οΈ Milestone 3: Full Training

### Issue: "Training loss not decreasing"

**Symptoms:**
- Loss remains constant across epochs
- Gradients are all zeros or very small
- Model predictions don't change during training

**Diagnosis & Solutions:**

#### Learning Rate Too Small
```python
# ❌ TOO SMALL: No visible progress
optimizer = Adam(model.parameters(), learning_rate=1e-6)

# βœ… GOOD RANGE: Start here and adjust
optimizer = Adam(model.parameters(), learning_rate=1e-3)

# Monitor gradient norms to debug
def check_gradients(model):
    total_norm = 0.0
    for param in model.parameters():
        if param.grad is not None:
            total_norm += np.linalg.norm(param.grad.data) ** 2
    return total_norm ** 0.5

print(f"Gradient norm: {check_gradients(model)}")
```

#### Incorrect Loss Function Implementation
```python
# ❌ WRONG: CrossEntropy without log-softmax
def cross_entropy_loss(predictions, targets):
    return -np.mean(predictions[range(len(targets)), targets])

# βœ… CORRECT: Proper log-softmax + NLL
def cross_entropy_loss(logits, targets):
    log_probs = log_softmax(logits)
    return -np.mean(log_probs[range(len(targets)), targets])

def log_softmax(x):
    # Subtract the row max for numerical stability, then use log-sum-exp
    x_shifted = x - np.max(x, axis=1, keepdims=True)
    return x_shifted - np.log(np.sum(np.exp(x_shifted), axis=1, keepdims=True))
```

### Issue: "CIFAR-10 training diverges or gets stuck"

**Symptoms:**
- Loss starts decreasing, then shoots up to infinity
- Accuracy drops during training
- NaN values appear in loss or gradients

**Diagnosis & Solutions:**

#### Data Preprocessing Issues
```python
# ❌ WRONG: Using raw pixel values 0-255
train_data = cifar10_data  # Values in [0, 255]

# βœ… CORRECT: Normalize to a reasonable range
train_data = cifar10_data.astype(np.float32) / 255.0  # Values in [0, 1]

# Even better: Zero-center and normalize per channel
# (these particular values are the standard ImageNet statistics)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
train_data = (train_data - mean) / std
```
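The mean and std values above are the widely used ImageNet channel statistics. For CIFAR-10 you can just as easily compute the dataset's own statistics; a minimal sketch, assuming `train_data` is the float array of shape `(N, 32, 32, 3)` produced above:

```python
# Per-channel mean/std over all images and pixels (axes 0, 1, 2 = N, H, W)
mean = train_data.mean(axis=(0, 1, 2))  # shape (3,), one value per channel
std = train_data.std(axis=(0, 1, 2))    # shape (3,)

# Broadcasting normalizes each channel independently
train_data = (train_data - mean) / std
```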
#### Batch Size Too Large
```python
# ❌ PROBLEMATIC: Batch size too large for the dataset
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)

# βœ… BETTER: Moderate batch size for stability
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```

#### Learning Rate Scheduling
```python
# ❌ BASIC: Fixed learning rate throughout training
optimizer = Adam(model.parameters(), learning_rate=0.001)

# βœ… ADVANCED: Learning rate decay for convergence
def adjust_learning_rate(optimizer, epoch, initial_lr=0.001):
    lr = initial_lr * (0.9 ** (epoch // 10))  # Decay by 10% every 10 epochs
    # Assumes the optimizer exposes param_groups, as in PyTorch-style APIs
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    return lr
```

---

## πŸš€ Milestone 4: Advanced Vision

### Issue: "Can't reach 75% CIFAR-10 accuracy"

**Symptoms:**
- Model plateaus at 65-70% accuracy
- Gap between training and validation accuracy is large
- Loss continues decreasing but accuracy doesn't improve

**Diagnosis & Solutions:**

#### Insufficient Model Complexity
```python
# ❌ TOO SIMPLE: Not enough capacity for CIFAR-10
model = Sequential([
    Conv2D(3, 16, 3), ReLU(),
    MaxPool2D(2),
    Flatten(),
    Dense(16 * 15 * 15, 10)  # 32β†’30 after unpadded conv, β†’15 after pooling
])

# βœ… BETTER: Deeper architecture with more features
model = Sequential([
    Conv2D(3, 32, 3), ReLU(),
    Conv2D(32, 32, 3), ReLU(),
    MaxPool2D(2),
    Conv2D(32, 64, 3), ReLU(),
    Conv2D(64, 64, 3), ReLU(),
    MaxPool2D(2),
    Flatten(),
    Dense(64 * 5 * 5, 256), ReLU(),  # Spatial size: 32β†’30β†’28β†’14β†’12β†’10β†’5
    Dropout(0.5),
    Dense(256, 10)
])
```

#### Overfitting Problems
```python
# Add regularization techniques
model = Sequential([
    Conv2D(3, 32, 3), BatchNorm2D(32), ReLU(),
    Conv2D(32, 32, 3), BatchNorm2D(32), ReLU(),
    MaxPool2D(2), Dropout(0.2),
    Conv2D(32, 64, 3), BatchNorm2D(64), ReLU(),
    Conv2D(64, 64, 3), BatchNorm2D(64), ReLU(),
    MaxPool2D(2), Dropout(0.3),
    Flatten(),
    Dense(64 * 5 * 5, 256), BatchNorm1D(256), ReLU(),
    Dropout(0.5),
    Dense(256, 10)
])
```

#### Data Augmentation Missing
```python
# βœ… ADD: Data augmentation for better generalization
def augment_cifar10(image):
    # Random horizontal flip
    if np.random.random() > 0.5:
        image = np.fliplr(image)

    # Random crop after padding (assumes HWC layout)
    pad_width = 4
    padded = np.pad(image, ((pad_width, pad_width), (pad_width, pad_width), (0, 0)),
                    mode='constant')
    crop_x = np.random.randint(0, 2 * pad_width + 1)
    crop_y = np.random.randint(0, 2 * pad_width + 1)
    image = padded[crop_y:crop_y+32, crop_x:crop_x+32]

    return image

class AugmentedCIFAR10Dataset(CIFAR10Dataset):
    def __getitem__(self, idx):
        image, label = super().__getitem__(idx)
        if self.train:
            image = augment_cifar10(image)
        return image, label
```

### Issue: "Model training takes too long"

**Symptoms:**
- A single epoch takes >10 minutes
- GPU utilization low or no GPU being used
- Memory usage constantly growing

**Diagnosis & Solutions:**

#### Inefficient Convolution Implementation
```python
# Profile your convolution
import time

def time_convolution():
    input_tensor = Tensor(np.random.randn(32, 3, 32, 32))
    conv = Conv2D(3, 64, kernel_size=3)

    start_time = time.time()
    for _ in range(100):
        output = conv(input_tensor)
    end_time = time.time()

    print(f"100 convolutions took {end_time - start_time:.2f} seconds")
    print(f"Average time per convolution: {(end_time - start_time)/100:.4f} seconds")

time_convolution()
```

#### Memory Leaks in Training Loop
```python
# ❌ MEMORY LEAK: Accumulating computation graphs
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        # Missing: optimizer.zero_grad()

# βœ… CORRECT: Clear gradients each iteration
for epoch in range(epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()  # Clear previous gradients
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
```
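A related safeguard when loss still spikes to infinity is clipping the global gradient norm between `loss.backward()` and `optimizer.step()`. This is a sketch, not a TinyTorch built-in; it assumes, as in the `check_gradients` helper earlier, that each parameter exposes its gradient as a NumPy array via `param.grad.data`:

```python
import numpy as np

def clip_grad_norm(parameters, max_norm=1.0):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(
        np.linalg.norm(p.grad.data) ** 2
        for p in parameters if p.grad is not None
    ))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for p in parameters:
            if p.grad is not None:
                p.grad.data *= scale  # In-place so the optimizer sees clipped values
    return total_norm

# Usage in the loop above, between loss.backward() and optimizer.step():
# clip_grad_norm(model.parameters(), max_norm=1.0)
```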
---

## πŸ”₯ Milestone 5: Language Generation

### Issue: "GPT generates nonsense text"

**Symptoms:**
- Generated text is random characters
- Model outputs the same character repeatedly
- Text has no recognizable patterns or structure

**Diagnosis & Solutions:**

#### Tokenization Problems
```python
# ❌ WRONG: Inconsistent character mapping
def tokenize(text):
    chars = list(set(text))  # Order changes each run!
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    return [char_to_idx[ch] for ch in text]

# βœ… CORRECT: Consistent character vocabulary
class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(list(set(text)))  # Consistent ordering
        self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, text):
        return [self.char_to_idx[ch] for ch in text]

    def decode(self, indices):
        return ''.join([self.idx_to_char[i] for i in indices])
```

#### Sequence Length Issues
```python
# ❌ TOO LONG: Sequence length too large for the available data
sequence_length = 1000  # Only have 10,000 chars total

# βœ… REASONABLE: Sequence length appropriate for the dataset
sequence_length = min(100, len(text) // 100)  # At least 100 sequences
```

#### Position Encoding Missing
```python
# ❌ MISSING: No positional information
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)

    def forward(self, x):
        x = x + self.attention(x)  # No position info!
        x = x + self.mlp(x)
        return x

# βœ… CORRECT: Add positional encoding
class GPTBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, max_seq_len):
        super().__init__()
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)
        self.pos_encoding = PositionalEncoding(embed_dim, max_seq_len)

    def forward(self, x):
        x = x + self.pos_encoding(x)  # Add position information
        x = x + self.attention(x)
        x = x + self.mlp(x)
        return x
```

### Issue: "Can't reuse components from vision modules"

**Symptoms:**
- Having to reimplement Dense layers, ReLU, etc.
- Components don't work with sequence data
- Different interfaces for vision vs. language components

**Diagnosis & Solutions:**

#### Shape Incompatibility
```python
# ❌ PROBLEM: Dense layer expects 2D input, sequences are 3D
# Sequence shape: (batch_size, sequence_length, embed_dim)
# Dense expects:  (batch_size, features)

# βœ… SOLUTION: Reshape for compatibility
class SequenceDense(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.dense = Dense(input_dim, output_dim)  # Reuse vision component!

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        batch_size, seq_len, input_dim = x.shape

        # Reshape to 2D for the dense layer
        x_flat = x.reshape(batch_size * seq_len, input_dim)

        # Apply dense transformation
        output_flat = self.dense(x_flat)

        # Reshape back to sequence format
        output_dim = output_flat.shape[-1]
        return output_flat.reshape(batch_size, seq_len, output_dim)
```
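A quick sanity check for the wrapper, assuming `Tensor` wraps a NumPy array and modules are callable as in the examples above (the dimensions are arbitrary toy values):

```python
# Batch of 2 sequences, 8 tokens each, 64-dim embeddings
x = Tensor(np.random.randn(2, 8, 64))

layer = SequenceDense(input_dim=64, output_dim=32)
output = layer(x)

# The sequence structure survives; only the feature dimension changes
print(output.shape)  # Expected: (2, 8, 32)
```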
#### Different Data Types
```python
# ❌ ISSUE: Vision uses float32 arrays, language uses int64 indices
# Vision:   image_tensor = Tensor(np.float32([...]))
# Language: token_indices = [1, 5, 12, ...]

# βœ… SOLUTION: Embedding layer converts indices to vectors
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = Tensor(np.random.randn(vocab_size, embed_dim) * 0.1)

    def forward(self, token_indices):
        # Convert integer indices to float embeddings
        return self.embedding[token_indices]  # Now compatible with Dense layers!
```

---

## πŸ› οΈ General Debugging Strategies

### Debugging Checklist

**Before Every Milestone Attempt:**
1. [ ] Environment activated: `source .venv/bin/activate`
2. [ ] Dependencies updated: `pip install -r requirements.txt`
3. [ ] Previous modules working: `tito test --all-previous`
4. [ ] Clean workspace: `git status` shows a clean state

**During Implementation:**
1. [ ] Print shapes at every step
2. [ ] Test with small data first (batch_size=1, small input)
3. [ ] Use debugger breakpoints at critical functions
4. [ ] Save intermediate results for inspection

**Before Milestone Submission:**
1. [ ] Code runs without errors
2. [ ] Performance benchmarks met
3. [ ] All tests pass: `tito milestone test X`
4. [ ] Code exported successfully: `tito export --module X`

### Performance Debugging

**Memory Usage:**
```python
import tracemalloc

def debug_memory_usage():
    tracemalloc.start()

    # Your code here
    model = build_model()
    train_one_epoch(model)

    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage: {current / 1024 / 1024:.1f} MB")
    print(f"Peak memory usage: {peak / 1024 / 1024:.1f} MB")
    tracemalloc.stop()
```

**Training Speed:**
```python
import time

def benchmark_training_speed():
    model = build_model()
    dummy_data = create_dummy_batch()

    # Warm up
    for _ in range(5):
        _ = model(dummy_data)

    # Benchmark
    start_time = time.time()
    for _ in range(100):
        output = model(dummy_data)
    end_time = time.time()

    avg_time = (end_time - start_time) / 100
    print(f"Average forward pass time: {avg_time*1000:.2f} ms")
```
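**Profiling Hotspots:**

If the timing numbers look bad but the cause isn't obvious, Python's built-in `cProfile` shows which functions dominate. A minimal sketch using only the standard library; `model` and `dummy_data` follow the benchmark above:

```python
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    output = model(dummy_data)
profiler.disable()

# Show the 10 functions with the most cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```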
### Getting Help

**Documentation Resources:**
- Module READMEs: `modules/source/XX_module/README.md`
- API Reference: `book/appendices/api-reference.md`
- Troubleshooting: This guide!

**Community Support:**
- Discord/Slack: #tinytorch-help channel
- Office Hours: See the course calendar
- Study Groups: Form with classmates working on the same milestone

**Instructor Support:**
- Email for conceptual questions
- Office hours for debugging sessions
- Milestone review meetings for stuck students

### When to Ask for Help

**Ask for help if:**
- Stuck on the same issue for >2 hours
- Performance far below milestone requirements
- Unclear about milestone requirements
- Suspecting a bug in provided code

**Before asking, prepare:**
- A minimal code example reproducing the issue
- Error messages and stack traces
- What you've already tried
- A specific question, not just "it doesn't work"

---

## 🎯 Success Strategies

### Milestone Achievement Tips

**Start Early:**
- Begin milestone attempts as soon as you complete the prerequisites
- Don't wait until the deadline to discover issues
- Use intermediate checkpoints to track progress

**Incremental Development:**
- Get a basic version working first
- Optimize performance second
- Add advanced features last

**Test-Driven Development:**
- Write tests for your functions before implementation
- Use the provided test suites as a specification
- Add your own tests for edge cases

**Systematic Debugging:**
- Isolate issues to the smallest possible code section
- Use print statements and the debugger strategically
- Keep a debugging log of what you've tried

### Building Confidence

**Celebrate Small Wins:**
- First successful forward pass
- First decreasing loss curve
- First accuracy improvement

**Learn from Failures:**
- Every bug teaches you something about the system
- Failed milestones often lead to deeper understanding
- Debugging skills are as valuable as implementation skills

**Connect to the Bigger Picture:**
- Each milestone represents a real-world capability
- Your implementations mirror industry practices
- Skills transfer directly to research and industry roles

**Remember the Goal:** You're not just completing assignments; you're building genuine ML systems engineering expertise that will serve you throughout your career. Every challenge overcome makes you a stronger engineer. πŸš€

**Keep going! Every milestone brings you closer to ML systems mastery.**