TinyTorch Instructor Guide#
Complete guide for teaching ML Systems Engineering with TinyTorch.
Course Overview#
TinyTorch teaches ML systems engineering through building, not just using. Students construct a complete ML framework from tensors to transformers, understanding memory, performance, and scaling at each step.
Instructor Setup#
1. Initial Setup#
# Clone and setup
git clone https://github.com/MLSysBook/TinyTorch.git
cd TinyTorch
# Virtual environment (MANDATORY)
python -m venv .venv
source .venv/bin/activate
# Install with instructor tools
pip install -r requirements.txt
pip install nbgrader
# Setup grading infrastructure
tito grade setup
2. Verify Installation#
tito system doctor
# Should show all green checkmarks
tito grade
# Should show available grade commands
Assignment Workflow#
Simplified with Tito CLI#
We’ve wrapped NBGrader behind simple tito grade commands so you don’t need to learn NBGrader’s complex interface.
1. Prepare Assignments#
# Generate instructor version (with solutions)
tito grade generate 01_tensor
# Create student version (solutions removed)
tito grade release 01_tensor
# Student version will be in: release/tinytorch/01_tensor/
2. Distribute to Students#
# Option A: GitHub Classroom (recommended)
# 1. Create assignment repository from TinyTorch
# 2. Remove solutions from modules
# 3. Students clone and work
# Option B: Direct distribution
# Share the release/ directory contents
3. Collect Submissions#
# Collect all students
tito grade collect 01_tensor
# Or specific student
tito grade collect 01_tensor --student student_id
4. Auto-Grade#
# Grade all submissions
tito grade autograde 01_tensor
# Grade specific student
tito grade autograde 01_tensor --student student_id
5. Manual Review#
# Open grading interface (browser-based)
tito grade manual 01_tensor
# This launches a web interface for:
# - Reviewing ML Systems question responses
# - Adding feedback comments
# - Adjusting auto-grades
6. Generate Feedback#
# Create feedback files for students
tito grade feedback 01_tensor
7. Export Grades#
# Export all grades to CSV
tito grade export
# Or specific module
tito grade export --module 01_tensor --output grades_module01.csv
Grading Components#
Auto-Graded (70%)#
Code implementation correctness
Test passing
Function signatures
Output validation
Manually Graded (30%)#
ML Systems Thinking questions (3 per module)
Each question: 10 points
Focus on understanding, not perfection
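As a rough sketch, the two components could be combined into a single module score as follows. The 70/30 split and the three 10-point questions come from the lists above; the function name `module_score` and its inputs are illustrative assumptions and are not part of the `tito` CLI.

```python
def module_score(auto_fraction, question_points):
    """Combine auto-graded and manually graded work into a 0-100 score.

    auto_fraction: share of auto-graded credit earned, between 0.0 and 1.0.
    question_points: points (0-10) for each of the three ML Systems questions.
    Both names are hypothetical, chosen for this illustration only.
    """
    if not 0.0 <= auto_fraction <= 1.0:
        raise ValueError("auto_fraction must be between 0 and 1")
    if len(question_points) != 3 or any(not 0 <= p <= 10 for p in question_points):
        raise ValueError("expected three question scores between 0 and 10")
    manual_fraction = sum(question_points) / 30  # 3 questions x 10 points
    return 70 * auto_fraction + 30 * manual_fraction


score = module_score(0.9, [8, 7, 9])
print(f"Module score: {score:.1f} / 100")
```

Keeping the manual component as a fraction of 30 points makes it easy to change the weights if your course uses a different split.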
Grading Rubric for ML Systems Questions#
| Points | Criteria |
|---|---|
| 9-10 | Demonstrates deep understanding, references specific code, discusses systems implications |
| 7-8 | Good understanding, some code references, basic systems thinking |
| 5-6 | Surface understanding, generic response, limited systems perspective |
| 3-4 | Attempted but misses key concepts |
| 0-2 | No attempt or completely off-topic |
What to Look For:
References to actual implemented code
Memory/performance analysis
Scaling considerations
Production system comparisons
Understanding of trade-offs
Sample Solutions for Grading Calibration#
This section provides sample solutions to help calibrate grading standards. Use these as reference points when evaluating student submissions.
Module 01: Tensor - Memory Footprint#
Excellent Solution (9-10 points):
def memory_footprint(self):
    """Calculate tensor memory in bytes."""
    return self.data.nbytes
Why Excellent:
Concise and correct
Uses NumPy’s built-in nbytes property
Clear docstring
Handles all tensor shapes correctly
Good Solution (7-8 points):
def memory_footprint(self):
    """Calculate memory usage."""
    return np.prod(self.data.shape) * self.data.dtype.itemsize
Why Good:
Correct implementation
Manually calculates (shows understanding)
Works but less efficient than using nbytes
Minor: docstring could be more specific
Acceptable Solution (5-6 points):
def memory_footprint(self):
    size = 1
    for dim in self.data.shape:
        size *= dim
    return size * 4  # Assumes float32
Why Acceptable:
Correct logic but hardcoded dtype size
Works for float32 but fails for other dtypes
Shows understanding of memory calculation
Missing proper dtype handling
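To see the difference between the three tiers concretely, the sketch below runs each approach on arrays of different dtypes. It assumes the tensor's `data` attribute is a NumPy array, as the sample solutions suggest; the standalone function names are hypothetical stand-ins for the `memory_footprint` method.

```python
import numpy as np


def footprint_nbytes(data):
    """Excellent tier: rely on NumPy's own byte count."""
    return data.nbytes


def footprint_manual(data):
    """Good tier: element count times the dtype's item size."""
    return int(np.prod(data.shape)) * data.dtype.itemsize


def footprint_hardcoded(data):
    """Acceptable tier: element count times 4, which assumes float32."""
    size = 1
    for dim in data.shape:
        size *= dim
    return size * 4


for dtype in (np.float32, np.float64, np.int8):
    data = np.zeros((2, 3), dtype=dtype)
    print(np.dtype(dtype).name,
          footprint_nbytes(data),
          footprint_manual(data),
          footprint_hardcoded(data))
```

The first two functions agree for every dtype, while the hardcoded version is only right for 4-byte element types. A test with a float64 or int8 tensor is therefore a quick way to separate the Acceptable tier from the others.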
Module 05: Autograd - Backward Pass#
Excellent Solution (9-10 points):
def backward(self, gradient=None):
    """Backward pass through computational graph."""
    if gradient is None:
        gradient = np.ones_like(self.data)
    self.grad = gradient
    if self.grad_fn is not None:
        # Compute gradients for inputs
        input_grads = self.grad_fn.backward(gradient)
        # Propagate to input tensors
        if isinstance(input_grads, tuple):
            for input_tensor, input_grad in zip(self.grad_fn.inputs, input_grads):
                if input_tensor.requires_grad:
                    input_tensor.backward(input_grad)
        else:
            if self.grad_fn.inputs[0].requires_grad:
                self.grad_fn.inputs[0].backward(input_grads)
Why Excellent:
Handles both scalar and tensor gradients
Properly checks requires_grad before propagating
Handles tuple returns from grad_fn
Clear variable names and structure
Good Solution (7-8 points):
def backward(self, gradient=None):
    if gradient is None:
        gradient = np.ones_like(self.data)
    self.grad = gradient
    if self.grad_fn:
        grads = self.grad_fn.backward(gradient)
        for inp, grad in zip(self.grad_fn.inputs, grads):
            inp.backward(grad)
Why Good:
Correct logic
Missing requires_grad check (minor issue)
Assumes grads is always iterable (may fail for single input)
Works for most cases but less robust
Acceptable Solution (5-6 points):
def backward(self, grad):
    self.grad = grad
    if self.grad_fn:
        self.grad_fn.inputs[0].backward(self.grad_fn.backward(grad))
Why Acceptable:
Basic backward pass works
Only handles single input (fails for multi-input operations)
Missing None gradient handling
Shows understanding but incomplete
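The single-input weakness noted for the Good solution can be subtle, because it may not raise an error at all. The sketch below assumes that a unary operation's `grad_fn.backward` returns a bare NumPy array rather than a tuple (which is what the `isinstance` check in the Excellent solution suggests); the string `"x"` and the helper `as_tuple` are hypothetical stand-ins.

```python
import numpy as np

inputs = ["x"]                      # stand-in for the one input tensor of a unary op
grads = np.array([1.0, 2.0, 3.0])   # a bare gradient array, not a tuple

# zip iterates over the array's elements, so "x" is paired with the scalar
# 1.0 instead of the full gradient, and no exception is raised.
naive_pairs = list(zip(inputs, grads))


def as_tuple(input_grads):
    """Wrap a bare gradient so single- and multi-input ops look alike."""
    return input_grads if isinstance(input_grads, tuple) else (input_grads,)


safe_pairs = list(zip(inputs, as_tuple(grads)))

print("naive gradient shape:", naive_pairs[0][1].shape)
print("safe gradient shape: ", safe_pairs[0][1].shape)
```

When calibrating, it can help to check whether a submission's tests include a unary operation with a non-scalar gradient, since that is the case where this pattern silently propagates the wrong value.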
Module 09: Spatial - Convolution Implementation#
Excellent Solution (9-10 points):
def forward(self, x):
    """Forward pass with explicit loops for clarity."""
    batch_size, in_channels, height, width = x.shape
    out_height = (height - self.kernel_size + 2 * self.padding) // self.stride + 1
    out_width = (width - self.kernel_size + 2 * self.padding) // self.stride + 1
    output = np.zeros((batch_size, self.out_channels, out_height, out_width))
    # Record gradient tracking before x is replaced by a padded array
    requires_grad = x.requires_grad
    # Apply padding
    if self.padding > 0:
        x = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding),
                       (self.padding, self.padding)), mode='constant')
    # Explicit convolution loops
    for b in range(batch_size):
        for oc in range(self.out_channels):
            for oh in range(out_height):
                for ow in range(out_width):
                    h_start = oh * self.stride
                    w_start = ow * self.stride
                    h_end = h_start + self.kernel_size
                    w_end = w_start + self.kernel_size
                    window = x[b, :, h_start:h_end, w_start:w_end]
                    # Add the bias once per output element, outside the sum
                    output[b, oc, oh, ow] = np.sum(
                        window * self.weight[oc]
                    ) + self.bias[oc]
    return Tensor(output, requires_grad=requires_grad)
Why Excellent:
Clear output shape calculation
Proper padding handling
Explicit loops make O(kernel_size²) complexity visible
Correct gradient tracking setup
Well-structured and readable
Good Solution (7-8 points):
def forward(self, x):
    B, C, H, W = x.shape
    out_h = (H - self.kernel_size) // self.stride + 1
    out_w = (W - self.kernel_size) // self.stride + 1
    out = np.zeros((B, self.out_channels, out_h, out_w))
    for b in range(B):
        for oc in range(self.out_channels):
            for i in range(out_h):
                for j in range(out_w):
                    h = i * self.stride
                    w = j * self.stride
                    out[b, oc, i, j] = np.sum(
                        x[b, :, h:h+self.kernel_size, w:w+self.kernel_size]
                        * self.weight[oc]
                    ) + self.bias[oc]
    return Tensor(out)
Why Good:
Correct implementation
Missing padding support (works only for padding=0)
Less clear variable names
Missing requires_grad propagation
Acceptable Solution (5-6 points):
def forward(self, x):
    out = np.zeros((x.shape[0], self.out_channels, x.shape[2]-2, x.shape[3]-2))
    for b in range(x.shape[0]):
        for c in range(self.out_channels):
            for i in range(out.shape[2]):
                for j in range(out.shape[3]):
                    out[b, c, i, j] = np.sum(x[b, :, i:i+3, j:j+3] * self.weight[c])
    return Tensor(out)
Why Acceptable:
Basic convolution works
Hardcoded kernel_size=3 (not general)
No stride or padding support
Shows understanding but incomplete
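The output-shape arithmetic is what separates the three tiers here, so a small helper can be handy when checking submissions by hand. This is a minimal sketch of the standard formula used in the Excellent solution; the function name `conv_output_size` is a hypothetical choice for illustration.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution."""
    if size + 2 * padding < kernel_size:
        raise ValueError("kernel does not fit inside the padded input")
    return (size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(32, 3))                       # 30
print(conv_output_size(32, 3, padding=1))            # 32
print(conv_output_size(32, 3, stride=2, padding=1))  # 16
```

The first call reproduces the `x.shape[2]-2` in the Acceptable solution, which is the special case of a 3×3 kernel with stride 1 and no padding.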
Module 12: Attention - Scaled Dot-Product Attention#
Excellent Solution (9-10 points):
def forward(self, query, key, value, mask=None):
    """Scaled dot-product attention with numerical stability."""
    # Compute attention scores
    scores = np.dot(query, key.T) / np.sqrt(self.d_k)
    # Apply mask if provided
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    # Softmax with numerical stability
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    # Apply attention to values
    output = np.dot(attention_weights, value)
    return output, attention_weights
Why Excellent:
Proper scaling factor (1/√d_k)
Numerical stability with max subtraction
Mask handling
Returns both output and attention weights
Clear and well-documented
Good Solution (7-8 points):
def forward(self, q, k, v):
    scores = np.dot(q, k.T) / np.sqrt(q.shape[-1])
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    return np.dot(weights, v)
Why Good:
Correct implementation
Missing numerical stability (may overflow)
Missing mask support
Works but less robust
Acceptable Solution (5-6 points):
def forward(self, q, k, v):
    scores = np.dot(q, k.T)
    weights = np.exp(scores) / np.sum(np.exp(scores))
    return np.dot(weights, v)
Why Acceptable:
Basic attention mechanism
Missing scaling factor
Missing numerical stability
Incorrect softmax (should be per-row)
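The numerical stability point is easy to demonstrate in isolation. The sketch below compares a naive row-wise softmax with the max-subtraction version used in the Excellent solution; the function names and the deliberately large scores are illustrative choices, not taken from the module.

```python
import numpy as np


def softmax_naive(scores):
    """Row-wise softmax without any protection against overflow."""
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)


def softmax_stable(scores):
    """Row-wise softmax that subtracts each row's maximum first."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)


scores = np.array([[1000.0, 1001.0, 1002.0],
                   [0.0, 1.0, 2.0]])

# Silence the overflow warnings so the NaN result is easy to see.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(scores)
stable = softmax_stable(scores)

print("naive: ", naive)
print("stable:", stable)
```

The first row overflows to NaN in the naive version, while the stable version gives both rows the same weights, since softmax is unchanged by adding a constant to every score in a row.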
Grading Guidelines Using Sample Solutions#
When Evaluating Student Code:
Correctness First: Does it pass all tests?
If no: Maximum 6 points (even if well-written)
If yes: Proceed to quality evaluation
Code Quality:
Excellent (9-10): Production-ready, handles edge cases, well-documented
Good (7-8): Correct and functional, minor improvements possible
Acceptable (5-6): Works but incomplete or has issues
Systems Thinking:
Excellent: Discusses memory, performance, scaling implications
Good: Some systems awareness
Acceptable: Focuses only on correctness
Common Patterns:
Look for: Proper error handling, edge case consideration, documentation
Red flags: Hardcoded values, missing checks, unclear variable names
Remember: These are calibration examples. Adjust based on your course level and learning objectives. The goal is consistent evaluation, not perfection.
Module Teaching Notes#
Module 01: Tensor#
Focus: Memory layout, data structures
Key Concept: Understanding memory is crucial for ML performance
Demo: Show memory profiling, copying behavior
Module 02: Activations#
Focus: Vectorization, numerical stability
Key Concept: Small details matter at scale
Demo: Gradient vanishing/exploding
Module 04-05: Layers & Networks#
Focus: Composition, parameter management
Key Concept: Building blocks combine into complex systems
Project: Build a small CNN
Module 06-07: Spatial & Attention#
Focus: Algorithmic complexity, memory patterns
Key Concept: O(N²) operations become bottlenecks
Demo: Profile attention memory usage
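For the attention memory demo, a back-of-the-envelope calculation can precede any real profiling. This sketch counts only the bytes of the attention score matrices, assuming float32 scores by default; the function name and the choice of 8 heads are hypothetical.

```python
def attention_scores_bytes(seq_len, batch_size=1, num_heads=1, itemsize=4):
    """Bytes needed to hold the (seq_len x seq_len) attention score matrices."""
    return batch_size * num_heads * seq_len * seq_len * itemsize


for seq_len in (512, 1024, 2048, 4096):
    mib = attention_scores_bytes(seq_len, num_heads=8) / 2**20
    print(f"seq_len={seq_len:5d}  scores={mib:8.1f} MiB")
```

Doubling the sequence length quadruples the memory, which students can then confirm against measured numbers.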
Module 08-11: Training Pipeline#
Focus: End-to-end system integration
Key Concept: Many components must work together
Project: Train a real model
Module 12-15: Production#
Focus: Deployment, optimization, monitoring
Key Concept: Academic vs production requirements
Demo: Model compression, deployment
Module 16: TinyGPT#
Focus: Framework generalization
Key Concept: 70% component reuse from vision to language
Capstone: Build a working language model
Learning Objectives#
By course end, students should be able to:
Build complete ML systems from scratch
Analyze memory usage and computational complexity
Debug performance bottlenecks
Optimize for production deployment
Understand framework design decisions
Apply systems thinking to ML problems
Tracking Progress#
Individual Progress#
# Check specific student progress
tito checkpoint status --student student_id
Class Overview#
# Export all checkpoint achievements
tito checkpoint export --output class_progress.csv
Identify Struggling Students#
Look for:
Missing checkpoint achievements
Low scores on ML Systems questions
Incomplete module submissions
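If you keep progress data in a spreadsheet, the three signals above can be checked automatically. The sketch below is illustrative only: the column names and the sample rows are made up, and the real layout written by `tito checkpoint export` may differ, so adapt the field names to your own export.

```python
import csv
import io

# Hypothetical export layout with made-up rows; the real columns written by
# `tito checkpoint export` may differ.
SAMPLE_CSV = """student_id,checkpoints_completed,avg_systems_score,modules_submitted
alice,5,8.5,5
bob,2,4.0,3
carol,5,5.5,5
"""


def flag_students(csv_text, expected_modules, min_systems_score=6.0):
    """Return {student_id: [reasons]} for students who may need support."""
    flagged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        reasons = []
        if int(row["checkpoints_completed"]) < expected_modules:
            reasons.append("missing checkpoints")
        if float(row["avg_systems_score"]) < min_systems_score:
            reasons.append("low ML Systems scores")
        if int(row["modules_submitted"]) < expected_modules:
            reasons.append("incomplete submissions")
        if reasons:
            flagged[row["student_id"]] = reasons
    return flagged


flagged = flag_students(SAMPLE_CSV, expected_modules=5)
for student, reasons in flagged.items():
    print(student, "->", ", ".join(reasons))
```

The threshold of 6.0 is an arbitrary default chosen to match the lower edge of the "Good" rubric band; pick whatever cutoff fits your course.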
Teaching Tips#
1. Emphasize Building Over Theory#
Have students type every line of code
Run tests immediately after implementation
Break and fix things intentionally
2. Connect to Production Systems#
Show PyTorch/TensorFlow equivalents
Discuss real-world bottlenecks
Share production war stories
3. Make Performance Visible#
# Use profilers liberally
with TimeProfiler("operation"):
result = expensive_operation()
# Show memory usage
print(f"Memory: {get_memory_usage():.2f} MB")
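The snippet above does not show where `TimeProfiler` and `get_memory_usage` come from. If your environment does not already provide them, a minimal stand-in built on the standard library might look like the sketch below. These definitions are assumptions made for illustration and may not match TinyTorch's own helpers.

```python
import time
import tracemalloc
from contextlib import contextmanager


@contextmanager
def TimeProfiler(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"{label}: {elapsed_ms:.2f} ms")


def get_memory_usage():
    """Return the peak traced allocation since tracing began, in MB."""
    _, peak = tracemalloc.get_traced_memory()
    return peak / 1e6


tracemalloc.start()
with TimeProfiler("operation"):
    result = [i * i for i in range(100_000)]
peak_mb = get_memory_usage()
print(f"Memory: {peak_mb:.2f} MB")
tracemalloc.stop()
```

`tracemalloc` only reports allocations made while tracing is active, so call `tracemalloc.start()` before the code you want to measure.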
4. Encourage Systems Questions#
“What would break at 1B parameters?”
“How would you distribute this?”
“What’s the bottleneck here?”
Troubleshooting#
Common Student Issues#
Environment Problems
# Student fix:
tito system doctor
tito system reset
Module Import Errors
# Rebuild package
tito export --all
Test Failures
# Detailed test output
tito module test MODULE --verbose
NBGrader Issues#
Database Locked
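# Back up, then clear the NBGrader database
# (deleting gradebook.db discards all recorded grades)
cp gradebook.db gradebook.db.bak
rm gradebook.db
tito grade setup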
Missing Submissions
# Check submission directory
ls submitted/*/MODULE/
Sample Schedule (16 Weeks)#
| Week | Module | Focus |
|---|---|---|
| 1 | 01 Tensor | Data Structures, Memory |
| 2 | 02 Activations | Non-linearity Functions |
| 3 | 03 Layers | Neural Network Components |
| 4 | 04 Losses | Optimization Objectives |
| 5 | 05 Autograd | Automatic Differentiation |
| 6 | 06 Optimizers | Training Algorithms |
| 7 | 07 Training | Complete Training Loop |
| 8 | Midterm Project | Build and Train Network |
| 9 | 08 DataLoader | Data Pipeline |
| 10 | 09 Spatial | Convolutions, CNNs |
| 11 | 10 Tokenization | Text Processing |
| 12 | 11 Embeddings | Word Representations |
| 13 | 12 Attention | Attention Mechanisms |
| 14 | 13 Transformers | Transformer Architecture |
| 15 | 14-19 Optimization | Profiling, Quantization, etc. |
| 16 | 20 Capstone | Torch Olympics Competition |
Assessment Strategy#
Continuous Assessment (70%)#
Module completion: 4% each × 16 = 64%
Checkpoint achievements: 6%
Projects (30%)#
Midterm: Build and train CNN (15%)
Final: Extend TinyGPT (15%)
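The weights above add up to 100%, and a short sketch makes the arithmetic explicit. The weights are taken from this section; the function name `course_grade` and its argument layout are hypothetical.

```python
def course_grade(module_fractions, checkpoint_fraction, midterm_fraction, final_fraction):
    """Weighted course grade on a 0-100 scale.

    Each argument is a fraction of available credit between 0.0 and 1.0;
    module_fractions holds one entry per module (16 under this scheme).
    """
    if len(module_fractions) != 16:
        raise ValueError("expected 16 module scores")
    modules = sum(4 * f for f in module_fractions)    # 16 x 4% = 64%
    checkpoints = 6 * checkpoint_fraction             # 6%
    projects = 15 * midterm_fraction + 15 * final_fraction  # 30%
    return modules + checkpoints + projects


grade = course_grade([1.0] * 16, 1.0, 1.0, 1.0)
print(f"Full credit: {grade:.1f}")
```

If you teach a different number of modules, adjust the per-module weight so the continuous assessment portion still sums to 70%.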
Additional Resources#
MLSys Book - Companion textbook
Need help? Open an issue or contact the TinyTorch team!