# The TinyTorch Vision

Training ML Systems Engineers: From Computer Vision to Language Models

## The Problem We're Solving
The ML field has a critical gap: most education teaches you to use frameworks, not build them.
**Traditional ML education:**

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)
optimizer = torch.optim.Adam(model.parameters())
```
**Questions students can't answer:**

- Why does Adam use 3× more memory than SGD?
- How does `loss.backward()` actually compute gradients?
- When should you use gradient accumulation vs larger batch sizes?
- Why do attention mechanisms limit context length?
**The TinyTorch difference:**

```python
class Linear:
    def __init__(self, in_features, out_features):
        self.weight = Tensor(np.random.randn(in_features, out_features))
        self.bias = Tensor(np.zeros(out_features))

    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return x @ self.weight + self.bias  # YOU implemented @

    def backward(self, grad_output):
        # YOU understand exactly how gradients flow
        self.weight.grad = self.x.T @ grad_output
        return grad_output @ self.weight.T
```
**Questions students CAN answer:**
- Exactly how automatic differentiation works
- Why certain optimizers use more memory
- How to debug training instability
- When to make performance vs accuracy trade-offs
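For instance, the gradient formulas in the `Linear.backward` above can be checked numerically. A minimal sketch using plain NumPy arrays (no `Tensor` class; the shapes here are illustrative) compares the analytic gradient `x.T @ grad_output` against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))            # batch of 4 inputs, 3 features
W = rng.standard_normal((3, 2))            # weights for a 3 -> 2 linear layer
grad_output = rng.standard_normal((4, 2))  # upstream gradient dL/dy

# Analytic gradient, as in backward() above: dL/dW = x.T @ dL/dy
analytic = x.T @ grad_output

# Finite-difference check: perturb each weight and measure the change in
# L = sum(y * grad_output), a scalar whose gradient w.r.t. y is grad_output.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        Lp = np.sum((x @ Wp) * grad_output)
        Lm = np.sum((x @ Wm) * grad_output)
        numeric[i, j] = (Lp - Lm) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be tiny, roughly 1e-9
```

This is exactly the kind of gradient check students use to debug their own autograd implementations.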
## What We Teach: Systems Thinking

### Beyond Algorithms: System-Level Understanding

**Memory Management:**
- Why Adam needs 3× parameter memory (parameters + momentum + variance)
- How attention matrices scale O(N²) with sequence length
- When gradient accumulation saves memory vs compute trade-offs
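The memory arithmetic above is easy to verify directly: Adam keeps, for every parameter, the parameter itself plus a momentum buffer and a variance buffer. A back-of-envelope sketch (the model and sequence sizes are illustrative, not from any specific system):

```python
n_params = 10_000_000        # illustrative model size: 10M parameters
bytes_per_float32 = 4

params_mb = n_params * bytes_per_float32 / 1e6
sgd_mb = params_mb           # plain SGD: only the parameters themselves
adam_mb = params_mb * 3      # parameters + momentum + variance

print(f"SGD state:  {sgd_mb:.0f} MB")   # 40 MB
print(f"Adam state: {adam_mb:.0f} MB")  # 120 MB: 3x the parameter memory

# Attention scales quadratically: one score per query-key pair
for seq_len in (1_024, 8_192):
    attn_mb = seq_len ** 2 * bytes_per_float32 / 1e6
    print(f"attention matrix at N={seq_len}: {attn_mb:.0f} MB per head")
```

An 8× longer sequence costs 64× the attention memory, which is exactly why context length is a systems constraint, not just a modeling choice.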
**Performance Analysis:**
- Why naive convolution is 100× slower than optimized versions
- How cache misses destroy performance in matrix operations
- When vectorization provides 10-100× speedups
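The vectorization claim is reproducible on any machine. A minimal timing sketch (exact speedups vary by hardware, array size, and NumPy build):

```python
import time
import numpy as np

a = np.random.randn(1_000_000)
b = np.random.randn(1_000_000)

# Pure-Python loop: one interpreted iteration per element
t0 = time.perf_counter()
loop_result = [x * y for x, y in zip(a, b)]
loop_s = time.perf_counter() - t0

# Vectorized: a single C-level loop over contiguous memory
t0 = time.perf_counter()
vec_result = a * b
vec_s = time.perf_counter() - t0

print(f"loop: {loop_s:.4f}s, vectorized: {vec_s:.4f}s, "
      f"speedup: {loop_s / vec_s:.0f}x")
```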
**Production Trade-offs:**
- SGD vs Adam: convergence speed vs memory constraints
- Gradient checkpointing: trading compute for memory
- Mixed precision: 2× memory savings with accuracy considerations
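The mixed-precision numbers can be confirmed in a few lines of NumPy. This is only a sketch of the memory side; real mixed-precision training also keeps a float32 master copy of the weights and applies loss scaling to avoid gradient underflow:

```python
import numpy as np

weights_fp32 = np.zeros(1_000_000, dtype=np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

print(weights_fp32.nbytes / 1e6)  # 4.0 MB
print(weights_fp16.nbytes / 1e6)  # 2.0 MB: half the memory per tensor

# The accuracy consideration: float16 has far less precision,
# so small gradient updates can vanish without loss scaling.
print(np.finfo(np.float16).eps)   # ~9.77e-04
print(np.finfo(np.float32).eps)   # ~1.19e-07
```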
**Hardware Awareness:**
- How memory bandwidth limits ML performance
- Why GPU utilization matters more than peak FLOPS
- When distributed training becomes necessary
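The memory-bandwidth point can be made concrete with a roofline-style estimate. The peak-FLOPS and bandwidth figures below are illustrative placeholders, not a specific device:

```python
peak_flops = 10e12   # illustrative accelerator: 10 TFLOP/s peak compute
bandwidth = 500e9    # illustrative memory bandwidth: 500 GB/s

# Elementwise add of two float32 vectors: 1 FLOP per 12 bytes moved
# (read a, read b, write out), so arithmetic intensity ~0.083 FLOP/byte.
intensity_add = 1 / 12
attainable_add = min(peak_flops, intensity_add * bandwidth)
print(f"elementwise add: {attainable_add / 1e9:.0f} GFLOP/s attainable")

# Large square matmul (N=4096): ~2N^3 FLOPs over ~3N^2 float32 values
# moved, so intensity grows with N and can approach peak compute.
N = 4096
intensity_mm = (2 * N**3) / (3 * N**2 * 4)
attainable_mm = min(peak_flops, intensity_mm * bandwidth)
print(f"matmul N={N}: {attainable_mm / 1e12:.0f} TFLOP/s attainable")
```

Under these assumptions the elementwise op is capped at a tiny fraction of peak FLOPS by bandwidth alone, which is why kernel fusion and high GPU utilization matter more than headline FLOPS.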
## Target Audience: Future ML Systems Engineers

Perfect for:

**Computer Science Students**
- Going beyond "use PyTorch" to "understand PyTorch"
- Building portfolio projects that demonstrate deep system knowledge
- Preparing for ML engineering roles (not just data science)
**Software Engineers → ML Engineers**
- Leveraging existing programming skills for ML systems
- Understanding performance, debugging, and optimization
- Learning production ML patterns and infrastructure
**ML Practitioners**
- Moving from model users to model builders
- Debugging training issues at the systems level
- Optimizing models for production deployment
**Researchers & Advanced Users**
- Implementing custom operations and architectures
- Understanding framework limitations and workarounds
- Building specialized ML systems for unique domains
**Career Transformation:**

- Before TinyTorch: "I can train models with PyTorch."
- After TinyTorch: "I can build and optimize ML systems."
You become the person your team asks:
- "Why is our training bottlenecked?"
- "Can we fit this model in memory?"
- "How do we implement this research paper?"
- "What's the best architecture for our constraints?"
## Pedagogical Philosophy: Build → Use → Understand

### 1. Build First
Every component implemented from scratch:
- Tensors with broadcasting and memory management
- Automatic differentiation with computational graphs
- Optimizers with state management and memory profiling
- Complete training loops with checkpointing and monitoring
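The autograd bullet above is the heart of the course. As a taste of what "from scratch" means, here is a minimal scalar reverse-mode sketch (a simplified, hypothetical `Value` class, far smaller than the module's Tensor-based version):

```python
class Value:
    """A scalar that records how it was computed, so gradients can flow back."""
    def __init__(self, data, parents=(), backward_fn=lambda: None):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = backward_fn

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            # d(a*b)/da = b and d(a*b)/db = a; chain rule scales by out.grad
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward_fn
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then propagate from the output back
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# d(x*y + x)/dx = y + 1 = 4, d(x*y + x)/dy = x = 2
x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Once this clicks at scalar scale, the tensor version is the same idea plus broadcasting and shape bookkeeping.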
### 2. Use Immediately

No toy examples, only real applications:
- Train CNNs on CIFAR-10 (90%+ accuracy achievable)
- Implement transformer attention mechanisms
- Deploy production systems with MLOps monitoring
- Profile and optimize for performance bottlenecks
### 3. Understand Systems
Connect implementations to production reality:
- How your tensor maps to PyTorch's memory model
- Why your optimizer choices affect GPU utilization
- How your autograd compares to production frameworks
- When your implementations would need modification at scale
### 4. Reflect on Trade-offs
ML Systems Thinking sections in every module:
- Memory vs compute trade-offs in different architectures
- Accuracy vs efficiency considerations for deployment
- Debugging strategies for common production issues
- Framework design principles and their implications
## Unique Value Proposition

What makes TinyTorch different:

**Systems-First Approach**
- Not just "how does attention work" but "why does attention scale O(N²) and how do production systems handle this?"
- Not just "implement SGD" but "when do you choose SGD vs Adam in production?"
**Production Relevance**
- Memory profiling, performance optimization, deployment patterns
- Real datasets, realistic scale, professional development workflow
- Connection to industry practices and framework design decisions
**Framework Generalization**
- 16 modules that build ONE cohesive ML framework supporting vision AND language
- 95% component reuse from computer vision to language models
- Professional package structure with CLI tools and testing
**Proven Pedagogy**
- Build → Use → Understand cycle creates deep intuition
- Immediate testing and feedback for every component
- Progressive complexity with solid foundations
- NBGrader integration for classroom deployment
## Learning Outcomes: Becoming an ML Systems Engineer

**Technical Mastery**
- Implement any ML paper from first principles
- Debug training issues at the systems level
- Optimize models for production deployment
- Profile and improve ML system performance
- Design custom architectures for specialized domains
- Understand framework generalization across vision and language
**Systems Understanding**
- Memory management in ML frameworks
- Computational complexity vs real-world performance
- Hardware utilization patterns and optimization
- Distributed training challenges and solutions
- Production deployment considerations and trade-offs
**Professional Skills**
- Test-driven development for ML systems
- Performance profiling and optimization techniques
- Code organization and package development
- Documentation and API design
- MLOps and production monitoring
**Career Impact**
- Technical interviews: Demonstrate deep ML systems knowledge
- Job opportunities: Qualify for ML engineer (not just data scientist) roles
- Team leadership: Become the go-to person for ML systems questions
- Research ability: Implement cutting-edge papers independently
- Entrepreneurship: Build ML products with full-stack understanding
## Success Stories: What Students Say

> "Finally understood what happens when I call `loss.backward()` - now I can debug gradient issues instead of just hoping they go away."

> "Built my own attention mechanism from scratch, then extended my vision framework to language models with 95% component reuse. When GPT-4 came out, I actually understood both the technical details AND the framework unification."

> "Got hired as an ML engineer specifically because I could explain how optimizers work at the memory level during the technical interview."

> "Used TinyTorch concepts to optimize our production training pipeline for both vision and language models - saved 40% on cloud costs by understanding memory bottlenecks across modalities."

> "Implemented a custom loss function for our research project in 30 minutes instead of spending days figuring out PyTorch internals."
## Ready to Become an ML Systems Engineer?
TinyTorch transforms ML users into ML builders.
Stop wondering how frameworks work. Start building them.
TinyTorch: Because understanding how to build ML systems makes you a more effective ML engineer.