# 👩🏫 TinyTorch Instructor Guide
Complete guide for teaching ML Systems Engineering with TinyTorch.
## 🎯 Course Overview
TinyTorch teaches ML systems engineering through building, not just using. Students construct a complete ML framework from tensors to transformers, understanding memory, performance, and scaling at each step.
## 🛠️ Instructor Setup

### 1. Initial Setup

```bash
# Clone and setup
git clone https://github.com/MLSysBook/TinyTorch.git
cd TinyTorch

# Virtual environment (MANDATORY)
python -m venv .venv
source .venv/bin/activate

# Install with instructor tools
pip install -r requirements.txt
pip install nbgrader

# Setup grading infrastructure
tito grade setup
```
### 2. Verify Installation

```bash
tito system health
# Should show all green checkmarks

tito grade
# Should show available grade commands
```
## 📝 Assignment Workflow

### Simplified with Tito CLI

We've wrapped NBGrader behind simple `tito grade` commands so you don't need to learn NBGrader's complex interface.
### 1. Prepare Assignments

```bash
# Generate instructor version (with solutions)
tito grade generate 01_tensor

# Create student version (solutions removed)
tito grade release 01_tensor

# Student version will be in: release/tinytorch/01_tensor/
```
### 2. Distribute to Students

```bash
# Option A: GitHub Classroom (recommended)
#   1. Create an assignment repository from TinyTorch
#   2. Remove solutions from modules
#   3. Students clone and work

# Option B: Direct distribution
#   Share the release/ directory contents
```
### 3. Collect Submissions

```bash
# Collect from all students
tito grade collect 01_tensor

# Or a specific student
tito grade collect 01_tensor --student student_id
```
### 4. Auto-Grade

```bash
# Grade all submissions
tito grade autograde 01_tensor

# Grade a specific student
tito grade autograde 01_tensor --student student_id
```
### 5. Manual Review

```bash
# Open grading interface (browser-based)
tito grade manual 01_tensor

# This launches a web interface for:
# - Reviewing ML Systems question responses
# - Adding feedback comments
# - Adjusting auto-grades
```
### 6. Generate Feedback

```bash
# Create feedback files for students
tito grade feedback 01_tensor
```
### 7. Export Grades

```bash
# Export all grades to CSV
tito grade export

# Or a specific module
tito grade export --module 01_tensor --output grades_module01.csv
```
## 📊 Grading Components

### Auto-Graded (70%)

- Code implementation correctness
- Passing tests
- Correct function signatures
- Output validation
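
Concretely, the auto-graded cells boil down to assert-based checks of this shape (a minimal sketch for calibration; `student_relu` is a hypothetical submission under test, not an actual module function):

```python
import numpy as np

def student_relu(x):
    """Stand-in for a student's submitted implementation."""
    return np.maximum(x, 0)

# Output validation: exact values on known inputs
x = np.array([-2.0, 0.0, 3.0])
assert np.array_equal(student_relu(x), np.array([0.0, 0.0, 3.0]))

# Shape and dtype must be preserved
assert student_relu(x).shape == x.shape
assert student_relu(x).dtype == x.dtype
```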
### Manually Graded (30%)

- ML Systems Thinking questions (3 per module)
- Each question: 10 points
- Focus on understanding, not perfection

### Grading Rubric for ML Systems Questions
| Points | Criteria |
|---|---|
| 9-10 | Demonstrates deep understanding, references specific code, discusses systems implications |
| 7-8 | Good understanding, some code references, basic systems thinking |
| 5-6 | Surface understanding, generic response, limited systems perspective |
| 3-4 | Attempted but misses key concepts |
| 0-2 | No attempt or completely off-topic |
**What to Look For:**
- References to actual implemented code
- Memory/performance analysis
- Scaling considerations
- Production system comparisons
- Understanding of trade-offs
## 📋 Sample Solutions for Grading Calibration

This section provides sample solutions to help calibrate grading standards. Use these as reference points when evaluating student submissions.

### Module 01: Tensor - Memory Footprint

**Excellent Solution (9-10 points):**
```python
def memory_footprint(self):
    """Calculate tensor memory in bytes."""
    return self.data.nbytes
```
**Why Excellent:**
- Concise and correct
- Uses NumPy's built-in `nbytes` property
- Clear docstring
- Handles all tensor shapes correctly
**Good Solution (7-8 points):**

```python
def memory_footprint(self):
    """Calculate memory usage."""
    return np.prod(self.data.shape) * self.data.dtype.itemsize
```
**Why Good:**
- Correct implementation
- Manually calculates (shows understanding)
- Works but less efficient than using `nbytes`
- Minor: docstring could be more specific
**Acceptable Solution (5-6 points):**

```python
def memory_footprint(self):
    size = 1
    for dim in self.data.shape:
        size *= dim
    return size * 4  # Assumes float32
```
**Why Acceptable:**
- Correct logic but hardcoded dtype size
- Works for float32 but fails for other dtypes
- Shows understanding of memory calculation
- Missing proper dtype handling
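
A quick NumPy check makes the dtype issue concrete and shows why the hardcoded `4` only works for float32:

```python
import numpy as np

a32 = np.zeros((3, 4), dtype=np.float32)
a64 = np.zeros((3, 4), dtype=np.float64)

# nbytes tracks the dtype automatically
print(a32.nbytes)  # 48 (12 elements * 4 bytes)
print(a64.nbytes)  # 96 (12 elements * 8 bytes)

# The hardcoded version reports 48 for both, which is wrong for float64
print(np.prod(a32.shape) * 4)  # 48
print(np.prod(a64.shape) * 4)  # 48, but the real footprint is 96
```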
### Module 05: Autograd - Backward Pass

**Excellent Solution (9-10 points):**

```python
def backward(self, gradient=None):
    """Backward pass through the computational graph."""
    if gradient is None:
        gradient = np.ones_like(self.data)
    self.grad = gradient

    if self.grad_fn is not None:
        # Compute gradients for inputs
        input_grads = self.grad_fn.backward(gradient)

        # Propagate to input tensors
        if isinstance(input_grads, tuple):
            for input_tensor, input_grad in zip(self.grad_fn.inputs, input_grads):
                if input_tensor.requires_grad:
                    input_tensor.backward(input_grad)
        else:
            if self.grad_fn.inputs[0].requires_grad:
                self.grad_fn.inputs[0].backward(input_grads)
```
**Why Excellent:**
- Handles both scalar and tensor gradients
- Properly checks `requires_grad` before propagating
- Handles tuple returns from `grad_fn`
- Clear variable names and structure
**Good Solution (7-8 points):**

```python
def backward(self, gradient=None):
    if gradient is None:
        gradient = np.ones_like(self.data)
    self.grad = gradient
    if self.grad_fn:
        grads = self.grad_fn.backward(gradient)
        for inp, grad in zip(self.grad_fn.inputs, grads):
            inp.backward(grad)
```
**Why Good:**
- Correct logic
- Missing `requires_grad` check (minor issue)
- Assumes `grads` is always iterable (may fail for single input)
- Works for most cases but less robust
**Acceptable Solution (5-6 points):**

```python
def backward(self, grad):
    self.grad = grad
    if self.grad_fn:
        self.grad_fn.inputs[0].backward(self.grad_fn.backward(grad))
```
**Why Acceptable:**
- Basic backward pass works
- Only handles a single input (fails for multi-input operations)
- Missing None gradient handling
- Shows understanding but incomplete
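
A framework-agnostic way to grade backward passes is a finite-difference check against the student's analytic gradients; a minimal sketch in plain NumPy:

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx.flat[i] = eps
        grad.flat[i] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return grad

# Example: the analytic gradient of sum(x**2) is 2x
x = np.array([1.0, -2.0, 3.0])
numeric = numerical_grad(lambda v: np.sum(v ** 2), x)
assert np.allclose(numeric, 2 * x, atol=1e-4)
```

A two-input operation such as `z = x + y` is also the fastest way to separate the tiers above: a single-input-only `backward` never populates the second input's gradient.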
### Module 09: Spatial - Convolution Implementation

**Excellent Solution (9-10 points):**

```python
def forward(self, x):
    """Forward pass with explicit loops for clarity."""
    batch_size, in_channels, height, width = x.shape
    requires_grad = x.requires_grad  # capture before x is replaced by the padded array
    out_height = (height - self.kernel_size + 2 * self.padding) // self.stride + 1
    out_width = (width - self.kernel_size + 2 * self.padding) // self.stride + 1
    output = np.zeros((batch_size, self.out_channels, out_height, out_width))

    # Apply padding
    if self.padding > 0:
        x = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding),
                       (self.padding, self.padding)), mode='constant')

    # Explicit convolution loops
    for b in range(batch_size):
        for oc in range(self.out_channels):
            for oh in range(out_height):
                for ow in range(out_width):
                    h_start = oh * self.stride
                    w_start = ow * self.stride
                    h_end = h_start + self.kernel_size
                    w_end = w_start + self.kernel_size
                    window = x[b, :, h_start:h_end, w_start:w_end]
                    # Bias is added once per output element, outside the sum
                    output[b, oc, oh, ow] = np.sum(window * self.weight[oc]) + self.bias[oc]
    return Tensor(output, requires_grad=requires_grad)
```
**Why Excellent:**
- Clear output shape calculation
- Proper padding handling
- Explicit loops make O(kernel_size²) complexity visible
- Correct gradient tracking setup
- Well-structured and readable
**Good Solution (7-8 points):**

```python
def forward(self, x):
    B, C, H, W = x.shape
    out_h = (H - self.kernel_size) // self.stride + 1
    out_w = (W - self.kernel_size) // self.stride + 1
    out = np.zeros((B, self.out_channels, out_h, out_w))
    for b in range(B):
        for oc in range(self.out_channels):
            for i in range(out_h):
                for j in range(out_w):
                    h = i * self.stride
                    w = j * self.stride
                    out[b, oc, i, j] = np.sum(
                        x[b, :, h:h+self.kernel_size, w:w+self.kernel_size]
                        * self.weight[oc]
                    ) + self.bias[oc]
    return Tensor(out)
```
**Why Good:**
- Correct implementation
- Missing padding support (works only for padding=0)
- Less clear variable names
- Missing `requires_grad` propagation
**Acceptable Solution (5-6 points):**

```python
def forward(self, x):
    out = np.zeros((x.shape[0], self.out_channels, x.shape[2]-2, x.shape[3]-2))
    for b in range(x.shape[0]):
        for c in range(self.out_channels):
            for i in range(out.shape[2]):
                for j in range(out.shape[3]):
                    out[b, c, i, j] = np.sum(x[b, :, i:i+3, j:j+3] * self.weight[c])
    return Tensor(out)
```
**Why Acceptable:**
- Basic convolution works
- Hardcoded `kernel_size=3` (not general)
- No stride or padding support
- Shows understanding but incomplete
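
A single-channel toy case with a hand-checkable answer catches most shape and stride mistakes when grading; a minimal sketch (the `conv2d_reference` helper is hypothetical, not part of the module API):

```python
import numpy as np

def conv2d_reference(x, w, stride=1, padding=0):
    """Single-channel cross-correlation, the same operation the module implements."""
    if padding > 0:
        x = np.pad(x, padding, mode='constant')
    k = w.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i*stride:i*stride+k, j*stride:j*stride+k] * w)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
out = conv2d_reference(x, w)
assert out.shape == (2, 2)             # (4 - 3) // 1 + 1 = 2
assert out[0, 0] == x[0:3, 0:3].sum()  # an all-ones kernel just sums the window
```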
### Module 12: Attention - Scaled Dot-Product Attention

**Excellent Solution (9-10 points):**

```python
def forward(self, query, key, value, mask=None):
    """Scaled dot-product attention with numerical stability."""
    # Compute attention scores
    scores = np.dot(query, key.T) / np.sqrt(self.d_k)

    # Apply mask if provided
    if mask is not None:
        scores = np.where(mask, scores, -1e9)

    # Softmax with numerical stability
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

    # Apply attention to values
    output = np.dot(attention_weights, value)
    return output, attention_weights
```
**Why Excellent:**
- Proper scaling factor (1/√d_k)
- Numerical stability with max subtraction
- Mask handling
- Returns both output and attention weights
- Clear and well-documented
**Good Solution (7-8 points):**

```python
def forward(self, q, k, v):
    scores = np.dot(q, k.T) / np.sqrt(q.shape[-1])
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    return np.dot(weights, v)
```
**Why Good:**
- Correct implementation
- Missing numerical stability (may overflow)
- Missing mask support
- Works but less robust
**Acceptable Solution (5-6 points):**

```python
def forward(self, q, k, v):
    scores = np.dot(q, k.T)
    weights = np.exp(scores) / np.sum(np.exp(scores))
    return np.dot(weights, v)
```
**Why Acceptable:**
- Basic attention mechanism
- Missing scaling factor
- Missing numerical stability
- Incorrect softmax normalization (sums over the whole matrix instead of per row)
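
Both failure modes above are easy to demonstrate: without the max subtraction, `np.exp` overflows for large scores, and without `axis=-1` the normalization spans the whole matrix instead of each row:

```python
import numpy as np

scores = np.array([[1000.0, 1001.0],
                   [   1.0,    2.0]])

# Naive softmax overflows: exp(1000) is inf, so the weights become nan
naive = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
print(naive[0])   # [nan nan]

# Max-subtracted softmax is mathematically identical but stays finite
shifted = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
stable = shifted / np.sum(shifted, axis=-1, keepdims=True)
print(stable[0])  # [0.26894142 0.73105858]
```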
### Grading Guidelines Using Sample Solutions

**When Evaluating Student Code:**

1. **Correctness First**: Does it pass all tests?
   - If no: maximum 6 points (even if well-written)
   - If yes: proceed to quality evaluation
2. **Code Quality**:
   - Excellent (9-10): Production-ready, handles edge cases, well-documented
   - Good (7-8): Correct and functional, minor improvements possible
   - Acceptable (5-6): Works but incomplete or has issues
3. **Systems Thinking**:
   - Excellent: Discusses memory, performance, and scaling implications
   - Good: Some systems awareness
   - Acceptable: Focuses only on correctness
4. **Common Patterns**:
   - Look for: proper error handling, edge-case consideration, documentation
   - Red flags: hardcoded values, missing checks, unclear variable names

**Remember:** These are calibration examples. Adjust based on your course level and learning objectives. The goal is consistent evaluation, not perfection.
## 📚 Module Teaching Notes
### Module 01: Tensor
- **Focus**: Memory layout, data structures
- **Key Concept**: Understanding memory is crucial for ML performance
- **Demo**: Show memory profiling and copying behavior (see the sketch below)
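
For the copying-behavior demo, a plain NumPy view-vs-copy example works well (a sketch; the module's Tensor stores a NumPy array as `self.data`, so the same behavior applies):

```python
import numpy as np

x = np.arange(6).reshape(2, 3)
view = x[:, :2]          # basic slicing returns a view that shares x's memory
copy = x[:, :2].copy()   # an explicit copy gets its own buffer

x[0, 0] = 99
print(view[0, 0])        # 99: the view sees the change
print(copy[0, 0])        # 0: the copy does not
print(view.base is x)    # True: the view borrows x's buffer
```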
### Module 02: Activations
- **Focus**: Vectorization, numerical stability
- **Key Concept**: Small details matter at scale
- **Demo**: Gradient vanishing/exploding (see the sketch below)
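
For the vanishing-gradient demo, a minimal sketch: the sigmoid derivative s*(1-s) never exceeds 0.25, so gradients shrink multiplicatively with depth:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The local gradient s * (1 - s) peaks at 0.25 and collapses when saturated
for z in (0.0, 2.5, 5.0):
    s = sigmoid(z)
    print(f"z={z}: local grad = {s * (1 - s):.4f}")

# Best case through 10 layers: 0.25 ** 10, effectively vanished
print(0.25 ** 10)  # ~9.5e-07
```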
### Module 04-05: Layers & Networks
- **Focus**: Composition, parameter management
- **Key Concept**: Building blocks combine into complex systems
- **Project**: Build a small CNN
### Module 06-07: Spatial & Attention
- **Focus**: Algorithmic complexity, memory patterns
- **Key Concept**: O(N²) operations become bottlenecks
- **Demo**: Profile attention memory usage (see the sketch below)
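
For the attention-memory demo, the dominant cost is the N×N score matrix each head materializes, so a back-of-envelope print is often enough:

```python
# Each attention head materializes an N x N float32 score matrix: O(N^2) memory
for n in (512, 2048, 8192):
    bytes_per_head = n * n * 4
    print(f"seq_len={n:5d}: {bytes_per_head / 1e6:9.1f} MB per head")
```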
### Module 08-11: Training Pipeline
- **Focus**: End-to-end system integration
- **Key Concept**: Many components must work together
- **Project**: Train a real model
### Module 12-15: Production
- **Focus**: Deployment, optimization, monitoring
- **Key Concept**: Academic vs. production requirements
- **Demo**: Model compression, deployment
### Module 16: TinyGPT
- **Focus**: Framework generalization
- **Key Concept**: 70% component reuse from vision to language
- **Capstone**: Build a working language model
## 🎯 Learning Objectives
By course end, students should be able to:
- Build complete ML systems from scratch
- Analyze memory usage and computational complexity
- Debug performance bottlenecks
- Optimize for production deployment
- Understand framework design decisions
- Apply systems thinking to ML problems
## 📈 Tracking Progress

### Individual Progress

```bash
# Check specific student progress
tito checkpoint status --student student_id
```
### Class Overview

```bash
# Export all checkpoint achievements
tito checkpoint export --output class_progress.csv
```
### Identify Struggling Students
Look for:
- Missing checkpoint achievements
- Low scores on ML Systems questions
- Incomplete module submissions
## 💡 Teaching Tips

### 1. Emphasize Building Over Theory
- Have students type every line of code
- Run tests immediately after implementation
- Break and fix things intentionally
### 2. Connect to Production Systems
- Show PyTorch/TensorFlow equivalents
- Discuss real-world bottlenecks
- Share production war stories
### 3. Make Performance Visible

```python
# Use profilers liberally
with TimeProfiler("operation"):
    result = expensive_operation()

# Show memory usage
print(f"Memory: {get_memory_usage():.2f} MB")
```
### 4. Encourage Systems Questions

- "What would break at 1B parameters?"
- "How would you distribute this?"
- "What's the bottleneck here?"
## 🔧 Troubleshooting

### Common Student Issues

#### Environment Problems

```bash
# Student fix:
tito system health
tito system reset
```
#### Module Import Errors

```bash
# Rebuild package
tito export --all
```
#### Test Failures

```bash
# Detailed test output
tito module test MODULE --verbose
```
### NBGrader Issues

#### Database Locked

```bash
# Clear NBGrader database
rm gradebook.db
tito grade setup
```
#### Missing Submissions

```bash
# Check submission directory
ls submitted/*/MODULE/
```
## 📊 Sample Schedule (16 Weeks)
| Week | Module | Focus |
|---|---|---|
| 1 | 01 Tensor | Data Structures, Memory |
| 2 | 02 Activations | Non-linearity Functions |
| 3 | 03 Layers | Neural Network Components |
| 4 | 04 Losses | Optimization Objectives |
| 5 | 05 Autograd | Automatic Differentiation |
| 6 | 06 Optimizers | Training Algorithms |
| 7 | 07 Training | Complete Training Loop |
| 8 | Midterm Project | Build and Train Network |
| 9 | 08 DataLoader | Data Pipeline |
| 10 | 09 Spatial | Convolutions, CNNs |
| 11 | 10 Tokenization | Text Processing |
| 12 | 11 Embeddings | Word Representations |
| 13 | 12 Attention | Attention Mechanisms |
| 14 | 13 Transformers | Transformer Architecture |
| 15 | 14-19 Optimization | Profiling, Quantization, etc. |
| 16 | 20 Capstone | Torch Olympics Competition |
## 🎓 Assessment Strategy

### Continuous Assessment (70%)
- Module completion: 4% each × 16 = 64%
- Checkpoint achievements: 6%
### Projects (30%)
- Midterm: Build and train CNN (15%)
- Final: Extend TinyGPT (15%)
## 📚 Additional Resources
- MLSys Book - Companion textbook
- Course Discussions
- Issue Tracker
Need help? Open an issue or contact the TinyTorch team!