👩‍🏫 TinyTorch Instructor Guide

Complete guide for teaching ML Systems Engineering with TinyTorch.

🎯 Course Overview#

TinyTorch teaches ML systems engineering through building, not just using. Students construct a complete ML framework from tensors to transformers, understanding memory, performance, and scaling at each step.

🛠️ Instructor Setup#

1. Initial Setup#

# Clone and setup
git clone https://github.com/MLSysBook/TinyTorch.git
cd TinyTorch

# Virtual environment (MANDATORY)
python -m venv .venv
source .venv/bin/activate

# Install with instructor tools
pip install -r requirements.txt
pip install nbgrader

# Setup grading infrastructure
tito grade setup

2. Verify Installation#

tito system doctor
# Should show all green checkmarks

tito grade
# Should show available grade commands

📝 Assignment Workflow#

Simplified with Tito CLI#

We’ve wrapped NBGrader behind simple tito grade commands so you don’t need to learn NBGrader’s complex interface.

1. Prepare Assignments#

# Generate instructor version (with solutions)
tito grade generate 01_tensor

# Create student version (solutions removed)
tito grade release 01_tensor

# Student version will be in: release/tinytorch/01_tensor/

2. Distribute to Students#

# Option A: GitHub Classroom (recommended)
# 1. Create assignment repository from TinyTorch
# 2. Remove solutions from modules
# 3. Students clone and work

# Option B: Direct distribution
# Share the release/ directory contents

3. Collect Submissions#

# Collect all students
tito grade collect 01_tensor

# Or specific student
tito grade collect 01_tensor --student student_id

4. Auto-Grade#

# Grade all submissions
tito grade autograde 01_tensor

# Grade specific student
tito grade autograde 01_tensor --student student_id

5. Manual Review#

# Open grading interface (browser-based)
tito grade manual 01_tensor

# This launches a web interface for:
# - Reviewing ML Systems question responses
# - Adding feedback comments
# - Adjusting auto-grades

6. Generate Feedback#

# Create feedback files for students
tito grade feedback 01_tensor

7. Export Grades#

# Export all grades to CSV
tito grade export

# Or specific module
tito grade export --module 01_tensor --output grades_module01.csv

📊 Grading Components#

Auto-Graded (70%)#

Code implementation correctness
Test passing
Function signatures
Output validation

Manually Graded (30%)#

ML Systems Thinking questions (3 per module)
Each question: 10 points
Focus on understanding, not perfection

Grading Rubric for ML Systems Questions#

Points	Criteria
9-10	Demonstrates deep understanding, references specific code, discusses systems implications
7-8	Good understanding, some code references, basic systems thinking
5-6	Surface understanding, generic response, limited systems perspective
3-4	Attempted but misses key concepts
0-2	No attempt or completely off-topic
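As a rough illustration of how the 70/30 split and the rubric bands could be combined into a single module score, here is a minimal sketch. The function names, the band labels, and the assumption that the auto-graded part arrives as a fraction of tests passed are illustrative only; none of this is part of tito or NBGrader.

```python
# Hypothetical helpers for illustration; not part of tito or NBGrader.
AUTO_WEIGHT = 0.70           # auto-graded share of the module grade
MANUAL_WEIGHT = 0.30         # manually graded share
POINTS_PER_QUESTION = 10     # each ML Systems question is worth 10 points


def rubric_band(points):
    """Map a 0-10 question score to the rubric band it falls in."""
    if not 0 <= points <= POINTS_PER_QUESTION:
        raise ValueError("points must be between 0 and 10")
    if points >= 9:
        return "deep understanding"
    if points >= 7:
        return "good understanding"
    if points >= 5:
        return "surface understanding"
    if points >= 3:
        return "misses key concepts"
    return "no attempt or off-topic"


def module_score(auto_fraction, question_points):
    """Combine an auto-graded fraction (0-1) with manual points into a 0-100 score."""
    max_manual = POINTS_PER_QUESTION * len(question_points)
    manual_fraction = sum(question_points) / max_manual
    return 100 * (AUTO_WEIGHT * auto_fraction + MANUAL_WEIGHT * manual_fraction)


print(rubric_band(8))
print(round(module_score(0.9, [9, 7, 6]), 1))
```

With 90% of tests passing and question scores of 9, 7, and 6, the manual part contributes 22/30 of its 30% share, so the combined score comes out at about 85.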

What to Look For:

References to actual implemented code
Memory/performance analysis
Scaling considerations
Production system comparisons
Understanding of trade-offs

📋 Sample Solutions for Grading Calibration#

This section provides sample solutions to help calibrate grading standards. Use these as reference points when evaluating student submissions.

Module 01: Tensor - Memory Footprint#

Excellent Solution (9-10 points):

def memory_footprint(self):
    """Calculate tensor memory in bytes."""
    return self.data.nbytes

Why Excellent:

Concise and correct
Uses NumPy’s built-in nbytes property
Clear docstring
Handles all tensor shapes correctly

Good Solution (7-8 points):

def memory_footprint(self):
    """Calculate memory usage."""
    return np.prod(self.data.shape) * self.data.dtype.itemsize

Why Good:

Correct implementation
Manually calculates (shows understanding)
Works but less efficient than using nbytes
Minor: docstring could be more specific

Acceptable Solution (5-6 points):

def memory_footprint(self):
    size = 1
    for dim in self.data.shape:
        size *= dim
    return size * 4  # Assumes float32

Why Acceptable:

Correct logic but hardcoded dtype size
Works for float32 but fails for other dtypes
Shows understanding of memory calculation
Missing proper dtype handling
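To see why the hardcoded version loses points, it can help to run all three approaches on the same arrays. The sketch below rewrites the three samples as standalone functions that take the underlying NumPy array directly (an assumption made here so the example runs without the Tensor class).

```python
import numpy as np


def footprint_nbytes(data):
    """Excellent: let NumPy report the buffer size."""
    return data.nbytes


def footprint_manual(data):
    """Good: element count times per-element size."""
    return int(np.prod(data.shape)) * data.dtype.itemsize


def footprint_hardcoded(data):
    """Acceptable: element count times 4, which is only right for 4-byte dtypes."""
    size = 1
    for dim in data.shape:
        size *= dim
    return size * 4


for dtype in (np.float32, np.float64):
    data = np.zeros((2, 3), dtype=dtype)
    print(np.dtype(dtype).name,
          footprint_nbytes(data),
          footprint_manual(data),
          footprint_hardcoded(data))
```

For float32 all three agree on 24 bytes. For float64 the first two report 48 bytes while the hardcoded version still reports 24.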

Module 05: Autograd - Backward Pass#

Excellent Solution (9-10 points):

def backward(self, gradient=None):
    """Backward pass through computational graph."""
    if gradient is None:
        gradient = np.ones_like(self.data)
    
    self.grad = gradient
    
    if self.grad_fn is not None:
        # Compute gradients for inputs
        input_grads = self.grad_fn.backward(gradient)
        
        # Propagate to input tensors
        if isinstance(input_grads, tuple):
            for input_tensor, input_grad in zip(self.grad_fn.inputs, input_grads):
                if input_tensor.requires_grad:
                    input_tensor.backward(input_grad)
        else:
            if self.grad_fn.inputs[0].requires_grad:
                self.grad_fn.inputs[0].backward(input_grads)

Why Excellent:

Defaults the upstream gradient to ones, so it works for both scalar and tensor outputs
Properly checks requires_grad before propagating
Handles tuple returns from grad_fn
Clear variable names and structure

Good Solution (7-8 points):

def backward(self, gradient=None):
    if gradient is None:
        gradient = np.ones_like(self.data)
    self.grad = gradient
    if self.grad_fn:
        grads = self.grad_fn.backward(gradient)
        for inp, grad in zip(self.grad_fn.inputs, grads):
            inp.backward(grad)

Why Good:

Correct logic
Missing requires_grad check (minor issue)
Assumes grads is always iterable (may fail for single input)
Works for most cases but less robust

Acceptable Solution (5-6 points):

def backward(self, grad):
    self.grad = grad
    if self.grad_fn:
        self.grad_fn.inputs[0].backward(self.grad_fn.backward(grad))

Why Acceptable:

Basic backward pass works
Only handles single input (fails for multi-input operations)
Missing None gradient handling
Shows understanding but incomplete
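When calibrating, it may help to run a backward pass on a tiny graph and check which inputs receive gradients. The sketch below is a toy setup, not TinyTorch's actual classes: `Tensor` and `MulBackward` are simplified stand-ins written for this example, and the `backward` method follows the same logic as the 9-10 point sample, with a single gradient wrapped in a tuple instead of handled in a separate branch.

```python
import numpy as np


class MulBackward:
    """Hypothetical grad_fn for elementwise multiplication a * b."""

    def __init__(self, a, b):
        self.inputs = (a, b)

    def backward(self, gradient):
        a, b = self.inputs
        # d(a*b)/da = b and d(a*b)/db = a, each scaled by the upstream gradient
        return gradient * b.data, gradient * a.data


class Tensor:
    """Toy tensor with just enough state to exercise backward()."""

    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data, dtype=np.float64)
        self.requires_grad = requires_grad
        self.grad = None
        self.grad_fn = None

    def __mul__(self, other):
        out = Tensor(self.data * other.data,
                     requires_grad=self.requires_grad or other.requires_grad)
        out.grad_fn = MulBackward(self, other)
        return out

    def backward(self, gradient=None):
        if gradient is None:
            gradient = np.ones_like(self.data)
        self.grad = gradient
        if self.grad_fn is not None:
            input_grads = self.grad_fn.backward(gradient)
            if not isinstance(input_grads, tuple):
                input_grads = (input_grads,)
            for input_tensor, input_grad in zip(self.grad_fn.inputs, input_grads):
                if input_tensor.requires_grad:
                    input_tensor.backward(input_grad)


x = Tensor([2.0, 3.0], requires_grad=True)
y = Tensor([4.0, 5.0], requires_grad=False)
z = x * y
z.backward()
print(x.grad)  # gradient of z with respect to x equals y's data
print(y.grad)  # stays None because y does not require gradients
```

A submission without the `requires_grad` check would also populate `y.grad` here, which is a quick way to tell the 7-8 band from the 9-10 band.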

Module 09: Spatial - Convolution Implementation#

Excellent Solution (9-10 points):

def forward(self, x):
    """Forward pass with explicit loops for clarity."""
    batch_size, in_channels, height, width = x.shape
    out_height = (height - self.kernel_size + 2 * self.padding) // self.stride + 1
    out_width = (width - self.kernel_size + 2 * self.padding) // self.stride + 1
    
    output = np.zeros((batch_size, self.out_channels, out_height, out_width))
    
    # Record gradient tracking now: x is rebound to a plain NumPy array if padded
    requires_grad = x.requires_grad
    
    # Apply padding
    if self.padding > 0:
        x = np.pad(x, ((0, 0), (0, 0), (self.padding, self.padding), 
                      (self.padding, self.padding)), mode='constant')
    
    # Explicit convolution loops
    for b in range(batch_size):
        for oc in range(self.out_channels):
            for oh in range(out_height):
                for ow in range(out_width):
                    h_start = oh * self.stride
                    w_start = ow * self.stride
                    h_end = h_start + self.kernel_size
                    w_end = w_start + self.kernel_size
                    
                    window = x[b, :, h_start:h_end, w_start:w_end]
                    # Sum the weighted window, then add the bias once per output element
                    output[b, oc, oh, ow] = np.sum(window * self.weight[oc]) + self.bias[oc]
    
    return Tensor(output, requires_grad=requires_grad)

Why Excellent:

Clear output shape calculation
Proper padding handling
Explicit loops make O(kernel_size²) complexity visible
Correct gradient tracking setup
Well-structured and readable
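The output shape calculation is where many submissions go wrong, so a quick standalone check can be useful while grading. This is a minimal sketch of the same formula used in the sample; the helper name `conv_output_size` is made up for this example.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis for a convolution."""
    return (size - kernel_size + 2 * padding) // stride + 1


print(conv_output_size(32, 3, stride=1, padding=1))  # 32: "same" padding for a 3x3 kernel
print(conv_output_size(28, 5))                       # 24: no padding shrinks the map
print(conv_output_size(32, 3, stride=2, padding=1))  # 16: stride 2 halves the resolution
```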

Good Solution (7-8 points):

def forward(self, x):
    B, C, H, W = x.shape
    out_h = (H - self.kernel_size) // self.stride + 1
    out_w = (W - self.kernel_size) // self.stride + 1
    out = np.zeros((B, self.out_channels, out_h, out_w))
    
    for b in range(B):
        for oc in range(self.out_channels):
            for i in range(out_h):
                for j in range(out_w):
                    h = i * self.stride
                    w = j * self.stride
                    out[b, oc, i, j] = np.sum(
                        x[b, :, h:h+self.kernel_size, w:w+self.kernel_size] 
                        * self.weight[oc]
                    ) + self.bias[oc]
    return Tensor(out)

Why Good:

Correct implementation
Missing padding support (works only for padding=0)
Less clear variable names
Missing requires_grad propagation

Acceptable Solution (5-6 points):

def forward(self, x):
    out = np.zeros((x.shape[0], self.out_channels, x.shape[2]-2, x.shape[3]-2))
    for b in range(x.shape[0]):
        for c in range(self.out_channels):
            for i in range(out.shape[2]):
                for j in range(out.shape[3]):
                    out[b, c, i, j] = np.sum(x[b, :, i:i+3, j:j+3] * self.weight[c])
    return Tensor(out)

Why Acceptable:

Basic convolution works
Hardcoded kernel_size=3 (not general)
No stride or padding support
Shows understanding but incomplete

Module 12: Attention - Scaled Dot-Product Attention#

Excellent Solution (9-10 points):

def forward(self, query, key, value, mask=None):
    """Scaled dot-product attention with numerical stability."""
    # Compute attention scores
    scores = np.dot(query, key.T) / np.sqrt(self.d_k)
    
    # Apply mask if provided
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    
    # Softmax with numerical stability
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    
    # Apply attention to values
    output = np.dot(attention_weights, value)
    
    return output, attention_weights

Why Excellent:

Proper scaling factor (1/√d_k)
Numerical stability with max subtraction
Mask handling
Returns both output and attention weights
Clear and well-documented

Good Solution (7-8 points):

def forward(self, q, k, v):
    scores = np.dot(q, k.T) / np.sqrt(q.shape[-1])
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    return np.dot(weights, v)

Why Good:

Correct implementation
Missing numerical stability (may overflow)
Missing mask support
Works but less robust

Acceptable Solution (5-6 points):

def forward(self, q, k, v):
    scores = np.dot(q, k.T)
    weights = np.exp(scores) / np.sum(np.exp(scores))
    return np.dot(weights, v)

Why Acceptable:

Basic attention mechanism
Missing scaling factor
Missing numerical stability
Incorrect softmax (should be per-row)
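The two softmax problems named above (overflow without max subtraction, and normalizing over the whole matrix instead of per row) are easy to demonstrate on small inputs. The sketch below isolates just the softmax step; the function names are chosen for this example.

```python
import numpy as np


def softmax_naive(scores):
    """Per-row softmax without max subtraction (overflows for large scores)."""
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)


def softmax_stable(scores):
    """Per-row softmax with the row maximum subtracted first."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)


def softmax_global(scores):
    """Normalizes over the whole matrix, as in the 5-6 point sample."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)


large = np.array([[1000.0, 1001.0], [0.0, 1.0]])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(large)      # exp(1000) overflows to inf, inf/inf is nan
stable = softmax_stable(large)

small = np.array([[0.0, 1.0], [2.0, 3.0]])
global_weights = softmax_global(small)

print(np.isnan(naive[0]).all())       # the first row is entirely nan
print(stable.sum(axis=-1))            # each row sums to 1
print(global_weights.sum(axis=-1))    # rows do not sum to 1 individually
```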

Grading Guidelines Using Sample Solutions#

When Evaluating Student Code:

1. Correctness First: Does it pass all tests?
   - If no: Maximum 6 points (even if well-written)
   - If yes: Proceed to quality evaluation
2. Code Quality:
   - Excellent (9-10): Production-ready, handles edge cases, well-documented
   - Good (7-8): Correct and functional, minor improvements possible
   - Acceptable (5-6): Works but incomplete or has issues
3. Systems Thinking:
   - Excellent: Discusses memory, performance, scaling implications
   - Good: Some systems awareness
   - Acceptable: Focuses only on correctness
4. Common Patterns:
   - Look for: Proper error handling, edge case consideration, documentation
   - Red flags: Hardcoded values, missing checks, unclear variable names

Remember: These are calibration examples. Adjust based on your course level and learning objectives. The goal is consistent evaluation, not perfection.
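The correctness-first rule in the guidelines above can be written down as a small helper, which may be handy if you keep scores in a spreadsheet or script. This is only a sketch; the function name and the idea of passing in a raw quality score are assumptions for illustration.

```python
def calibrated_score(raw_points, passes_all_tests, cap_if_failing=6):
    """Apply the correctness-first rule: failing any test caps the score at 6."""
    if not 0 <= raw_points <= 10:
        raise ValueError("raw_points must be between 0 and 10")
    if passes_all_tests:
        return raw_points
    return min(raw_points, cap_if_failing)


print(calibrated_score(9, passes_all_tests=True))    # 9
print(calibrated_score(9, passes_all_tests=False))   # 6
print(calibrated_score(4, passes_all_tests=False))   # 4
```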

📚 Module Teaching Notes#

Module 01: Tensor#

Focus: Memory layout, data structures
Key Concept: Understanding memory is crucial for ML performance
Demo: Show memory profiling, copying behavior
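One possible version of the copying-behavior demo uses plain NumPy arrays, on the assumption that students' tensors wrap a NumPy array in `data`. It shows that a reshape of a contiguous array shares memory with the original while an explicit copy does not.

```python
import numpy as np

a = np.arange(6, dtype=np.float32)
view = a.reshape(2, 3)     # reshape of a contiguous array returns a view
copy = a.copy()            # explicit copy allocates a new buffer

view[0, 0] = 99.0          # writes through to a, but not to copy

print(a[0], copy[0])
print(np.shares_memory(a, view), np.shares_memory(a, copy))
print(a.nbytes, copy.nbytes)   # same size, but the copy doubles total memory
```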

Module 02: Activations#

Focus: Vectorization, numerical stability
Key Concept: Small details matter at scale
Demo: Gradient vanishing/exploding
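A small numeric demo can make vanishing gradients concrete. The sketch below uses sigmoid and ignores the weight matrices entirely (a simplifying assumption): it only multiplies the activation derivative across layers, which already shrinks quickly because the sigmoid derivative never exceeds 0.25.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_gradient(num_layers, x=0.0):
    """Product of sigmoid derivatives across layers, all evaluated at x."""
    return sigmoid_grad(x) ** num_layers


for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers: {chained_gradient(depth):.3e}")
```

Even at x = 0, where the derivative is at its maximum, ten layers reduce the gradient factor to below one millionth.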

Module 04-05: Layers & Networks#

Focus: Composition, parameter management
Key Concept: Building blocks combine into complex systems
Project: Build a small CNN

Module 06-07: Spatial & Attention#

Focus: Algorithmic complexity, memory patterns
Key Concept: O(N²) operations become bottlenecks
Demo: Profile attention memory usage
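Before profiling real code, a back-of-the-envelope calculation can set expectations for the attention demo. The sketch below counts only the attention score matrix and assumes float32 (4 bytes per element); the helper name and its parameters are made up for this example.

```python
def attention_scores_bytes(seq_len, batch_size=1, num_heads=1, bytes_per_element=4):
    """Bytes needed for one (seq_len x seq_len) score matrix per batch item and head."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


for seq_len in (512, 1024, 2048, 4096):
    mib = attention_scores_bytes(seq_len) / 2**20
    print(f"seq_len={seq_len:5d}  scores={mib:6.1f} MiB")
```

Doubling the sequence length quadruples the memory: 1 MiB at 512 tokens becomes 64 MiB at 4096 tokens, per head and per batch item.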

Module 08-11: Training Pipeline#

Focus: End-to-end system integration
Key Concept: Many components must work together
Project: Train a real model

Module 12-15: Production#

Focus: Deployment, optimization, monitoring
Key Concept: Academic vs production requirements
Demo: Model compression, deployment

Module 16: TinyGPT#

Focus: Framework generalization
Key Concept: 70% component reuse from vision to language
Capstone: Build a working language model

🎯 Learning Objectives#

By course end, students should be able to:

Build complete ML systems from scratch
Analyze memory usage and computational complexity
Debug performance bottlenecks
Optimize for production deployment
Understand framework design decisions
Apply systems thinking to ML problems

📈 Tracking Progress#

Individual Progress#

# Check specific student progress
tito checkpoint status --student student_id

Class Overview#

# Export all checkpoint achievements
tito checkpoint export --output class_progress.csv

Identify Struggling Students#

Look for:

Missing checkpoint achievements
Low scores on ML Systems questions
Incomplete module submissions
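If you keep progress data in a CSV, the three warning signs above can be checked with a short script. The sketch below is a loose illustration: the column names and the sample rows are invented for this example and are not the actual layout produced by `tito checkpoint export`, so adapt them to whatever your export contains.

```python
import csv
import io

# Made-up rows in an assumed layout; the real export columns may differ.
SAMPLE_CSV = """student_id,checkpoints_done,checkpoints_total,avg_question_score,modules_submitted,modules_assigned
s001,5,5,8.3,5,5
s002,2,5,4.7,4,5
s003,5,5,6.0,3,5
"""


def flag_students(csv_text, min_question_score=5.0):
    """Return {student_id: [reasons]} for students showing any warning sign."""
    flagged = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        reasons = []
        if int(row["checkpoints_done"]) < int(row["checkpoints_total"]):
            reasons.append("missing checkpoints")
        if float(row["avg_question_score"]) < min_question_score:
            reasons.append("low ML Systems question scores")
        if int(row["modules_submitted"]) < int(row["modules_assigned"]):
            reasons.append("incomplete submissions")
        if reasons:
            flagged[row["student_id"]] = reasons
    return flagged


for student, reasons in flag_students(SAMPLE_CSV).items():
    print(student, "->", ", ".join(reasons))
```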

💡 Teaching Tips#

1. Emphasize Building Over Theory#

Have students type every line of code
Run tests immediately after implementation
Break and fix things intentionally

2. Connect to Production Systems#

Show PyTorch/TensorFlow equivalents
Discuss real-world bottlenecks
Share production war stories

3. Make Performance Visible#

# Use profilers liberally
with TimeProfiler("operation"):
    result = expensive_operation()
    
# Show memory usage
print(f"Memory: {get_memory_usage():.2f} MB")
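The snippet above assumes that `TimeProfiler` and `get_memory_usage` already exist. If your environment does not provide them, one possible standard-library stand-in is sketched below. The names mirror the snippet, but the implementation is an assumption for illustration: it uses `time.perf_counter` for timing and `tracemalloc`, which only sees allocations made while tracing is active.

```python
import time
import tracemalloc


class TimeProfiler:
    """Context manager that prints and stores the wall-clock time of a block."""

    def __init__(self, label):
        self.label = label
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.label}: {self.elapsed * 1000:.2f} ms")
        return False  # do not swallow exceptions raised inside the block


def get_memory_usage():
    """Currently traced allocations in MB (call tracemalloc.start() first)."""
    current, _peak = tracemalloc.get_traced_memory()
    return current / (1024 * 1024)


tracemalloc.start()
with TimeProfiler("build list") as profiler:
    values = [float(i) for i in range(100_000)]
print(f"Memory: {get_memory_usage():.2f} MB")
tracemalloc.stop()
```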

4. Encourage Systems Questions#

“What would break at 1B parameters?”
“How would you distribute this?”
“What’s the bottleneck here?”

🔧 Troubleshooting#

Common Student Issues#

Environment Problems

# Student fix:
tito system doctor
tito system reset

Module Import Errors

# Rebuild package
tito export --all

Test Failures

# Detailed test output
tito module test MODULE --verbose

NBGrader Issues#

Database Locked

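# Reset the NBGrader database
# Warning: gradebook.db stores all recorded grades, so back it up first
cp gradebook.db gradebook.db.bak
rm gradebook.db
tito grade setup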

Missing Submissions

# Check submission directory
ls submitted/*/MODULE/

📊 Sample Schedule (16 Weeks)#

Week	Module	Focus
1	01 Tensor	Data Structures, Memory
2	02 Activations	Non-linearity Functions
3	03 Layers	Neural Network Components
4	04 Losses	Optimization Objectives
5	05 Autograd	Automatic Differentiation
6	06 Optimizers	Training Algorithms
7	07 Training	Complete Training Loop
8	Midterm Project	Build and Train Network
9	08 DataLoader	Data Pipeline
10	09 Spatial	Convolutions, CNNs
11	10 Tokenization	Text Processing
12	11 Embeddings	Word Representations
13	12 Attention	Attention Mechanisms
14	13 Transformers	Transformer Architecture
15	14-19 Optimization	Profiling, Quantization, etc.
16	20 Capstone	Torch Olympics Competition

🎓 Assessment Strategy#

Continuous Assessment (70%)#

Module completion: 4% each × 16 = 64%
Checkpoint achievements: 6%

Projects (30%)#

Midterm: Build and train CNN (15%)
Final: Extend TinyGPT (15%)
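The weights above add up to 100%, and a short calculation can make that explicit for students. This is a minimal sketch under the assumption that every component is recorded as a fraction between 0 and 1; the function and constant names are chosen for this example.

```python
# Weights taken from the assessment breakdown above (percent of final grade)
MODULE_WEIGHT = 4.0
NUM_MODULES = 16
CHECKPOINT_WEIGHT = 6.0
MIDTERM_WEIGHT = 15.0
FINAL_WEIGHT = 15.0


def course_grade(module_fractions, checkpoint_fraction, midterm_fraction, final_fraction):
    """Return the final grade in percent; every input is a fraction in [0, 1]."""
    if len(module_fractions) != NUM_MODULES:
        raise ValueError(f"expected {NUM_MODULES} module scores")
    modules = MODULE_WEIGHT * sum(module_fractions)
    return (modules
            + CHECKPOINT_WEIGHT * checkpoint_fraction
            + MIDTERM_WEIGHT * midterm_fraction
            + FINAL_WEIGHT * final_fraction)


print(course_grade([1.0] * 16, 1.0, 1.0, 1.0))   # 100.0
print(course_grade([0.5] * 16, 1.0, 1.0, 1.0))   # 68.0
```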

📚 Additional Resources#

Need help? Open an issue or contact the TinyTorch team!
