The TinyTorch Vision
Training ML Systems Engineers: From Computer Vision to Language Models
The Problem We're Solving
The ML field has a critical gap: most education teaches you to use frameworks, not build them.
Traditional ML Education:
import torch
import torch.nn as nn
model = nn.Linear(784, 10)
optimizer = torch.optim.Adam(model.parameters())
Questions students can't answer:
- Why does Adam need about 3× the parameter memory of plain SGD?
- How does loss.backward() actually compute gradients?
- When should you use gradient accumulation vs larger batch sizes?
- Why do attention mechanisms limit context length?
The TinyTorch Difference:
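import numpy as np

class Linear:
    def __init__(self, in_features, out_features):
        # Tensor is the class you build in TinyTorch
        self.weight = Tensor(np.random.randn(in_features, out_features))
        self.bias = Tensor(np.zeros(out_features))

    def forward(self, x):
        self.x = x  # cache the input; backward needs it
        return x @ self.weight + self.bias  # YOU implemented @

    def backward(self, grad_output):
        # YOU understand exactly how gradients flow
        self.weight.grad = self.x.T @ grad_output
        # assumes Tensor supports sum(axis=...); bias is broadcast over the batch
        self.bias.grad = grad_output.sum(axis=0)
        return grad_output @ self.weight.T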
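One way to convince yourself that the backward pass above is right is a finite-difference gradient check. The sketch below is not TinyTorch code: it uses plain NumPy arrays in place of `Tensor`, and the helper names (`linear_forward`, `linear_backward`, `numerical_grad`) are made up for illustration. It also computes a bias gradient as the sum of `grad_output` over the batch axis.

```python
import numpy as np

def linear_forward(x, weight, bias):
    # Same math as Linear.forward, on raw NumPy arrays.
    return x @ weight + bias

def linear_backward(x, weight, grad_output):
    # Analytic gradients of the linear layer.
    grad_weight = x.T @ grad_output
    grad_bias = grad_output.sum(axis=0)   # bias is broadcast over the batch
    grad_input = grad_output @ weight.T
    return grad_input, grad_weight, grad_bias

def numerical_grad(loss_fn, arr, eps=1e-6):
    # Central differences: perturb one entry at a time, in place.
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        original = arr[idx]
        arr[idx] = original + eps
        loss_plus = loss_fn()
        arr[idx] = original - eps
        loss_minus = loss_fn()
        arr[idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))          # batch of 4, 3 input features
weight = rng.standard_normal((3, 2))
bias = np.zeros(2)
upstream = rng.standard_normal((4, 2))   # stands in for dLoss/dOutput

def loss_fn():
    # Scalar loss whose gradient w.r.t. the layer output is `upstream`.
    return float(np.sum(linear_forward(x, weight, bias) * upstream))

grad_input, grad_weight, grad_bias = linear_backward(x, weight, upstream)
weight_ok = np.allclose(grad_weight, numerical_grad(loss_fn, weight), atol=1e-5)
bias_ok = np.allclose(grad_bias, numerical_grad(loss_fn, bias), atol=1e-5)
input_ok = np.allclose(grad_input, numerical_grad(loss_fn, x), atol=1e-5)
print(weight_ok, bias_ok, input_ok)
```

If any of the three checks fails, the analytic formula and the forward pass disagree, which is usually the fastest way to find a transposed operand or a missing cached input.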
Questions students CAN answer:
- Exactly how automatic differentiation works
- Why certain optimizers use more memory
- How to debug training instability
- When to make performance vs accuracy trade-offs
What We Teach: Systems Thinking
Beyond Algorithms: System-Level Understanding
Memory Management:
- Why Adam needs 3× parameter memory (parameters + momentum + variance)
- How attention matrices scale O(N²) with sequence length
- When gradient accumulation saves memory, and what it costs in training time
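The first two points reduce to simple arithmetic. The sketch below is a back-of-the-envelope calculator, not part of TinyTorch; the function names are hypothetical, it assumes 4-byte float32 values, and it deliberately leaves out gradients and activations.

```python
def adam_state_bytes(num_params, bytes_per_value=4):
    # Adam keeps two extra buffers per parameter: first and second moment.
    params = num_params * bytes_per_value
    return {
        "parameters": params,
        "first_moment": params,
        "second_moment": params,
        "total": 3 * params,
    }

def attention_scores_bytes(seq_len, num_heads=1, batch_size=1, bytes_per_value=4):
    # One N x N score matrix per head and per batch element.
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value

# The nn.Linear(784, 10) layer from earlier: 784 * 10 weights + 10 biases.
linear_params = 784 * 10 + 10
adam = adam_state_bytes(linear_params)
print(linear_params, adam["parameters"], adam["total"])  # 7850 31400 94200

# Doubling the sequence length quadruples the score matrix.
short = attention_scores_bytes(1024)
long = attention_scores_bytes(2048)
print(short, long, long // short)  # 4194304 16777216 4
```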
Performance Analysis:
- Why naive convolution is 100× slower than optimized versions
- How cache misses destroy performance in matrix operations
- When vectorization provides 10-100× speedups
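The vectorization gap is easy to observe directly. The sketch below compares a hypothetical triple-loop `matmul_naive` against NumPy's `@` on a small matrix; the measured ratio depends heavily on the machine, the matrix size, and the BLAS library NumPy is linked against, so treat the timing as an illustration and not a benchmark.

```python
import time
import numpy as np

def matmul_naive(a, b):
    # Textbook triple loop: one Python-level multiply-add per step.
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            total = 0.0
            for p in range(k):
                total += a[i, p] * b[p, j]
            out[i, j] = total
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64))
b = rng.standard_normal((64, 64))

start = time.perf_counter()
slow = matmul_naive(a, b)
naive_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = a @ b
vectorized_seconds = time.perf_counter() - start

same_result = np.allclose(slow, fast)
print(f"naive: {naive_seconds:.4f}s, vectorized: {vectorized_seconds:.6f}s")
print("results match:", same_result)
```

Both versions perform the same number of multiply-adds; the difference comes from interpreter overhead, memory access patterns, and SIMD use inside the optimized routine.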
Production Trade-offs:
- SGD vs Adam: convergence speed vs memory constraints
- Gradient checkpointing: trading compute for memory
- Mixed precision: 2× memory savings with accuracy considerations
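The mixed-precision trade-off can be seen with NumPy dtypes alone. This is a minimal sketch with made-up variable names, not a training recipe: it shows the 2× storage saving from float16 and one accuracy hazard, a small update that vanishes when added to a float16 value.

```python
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.standard_normal((1024, 1024)).astype(np.float32)
weights_fp16 = weights_fp32.astype(np.float16)

# Storage: 4 bytes per value vs 2 bytes per value.
print(weights_fp32.nbytes, weights_fp16.nbytes)  # 4194304 2097152

# Rounding error introduced by the cast.
max_abs_error = float(np.max(np.abs(weights_fp32 - weights_fp16.astype(np.float32))))
print("max rounding error:", max_abs_error)

# Near 1.0, float16 values are spaced about 0.001 apart,
# so an update of 1e-4 is rounded away entirely.
update_lost_fp16 = (np.float16(1.0) + np.float16(1e-4)) == np.float16(1.0)
update_kept_fp32 = (np.float32(1.0) + np.float32(1e-4)) != np.float32(1.0)
print(update_lost_fp16, update_kept_fp32)
```

Lost updates of this kind are one reason mixed-precision setups commonly keep a float32 copy of the weights for the optimizer step.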
Hardware Awareness:
- How memory bandwidth limits ML performance
- Why GPU utilization matters more than peak FLOPS
- When distributed training becomes necessary
Target Audience: Future ML Systems Engineers
Perfect For:
Computer Science Students
- Going beyond "use PyTorch" to "understand PyTorch"
- Building portfolio projects that demonstrate deep system knowledge
- Preparing for ML engineering roles (not just data science)
Software Engineers → ML Engineers
- Leveraging existing programming skills for ML systems
- Understanding performance, debugging, and optimization
- Learning production ML patterns and infrastructure
ML Practitioners
- Moving from model users to model builders
- Debugging training issues at the systems level
- Optimizing models for production deployment
Researchers & Advanced Users
- Implementing custom operations and architectures
- Understanding framework limitations and workarounds
- Building specialized ML systems for unique domains
Career Transformation:
Before TinyTorch: "I can train models with PyTorch"
After TinyTorch: "I can build and optimize ML systems"
You become the person your team asks:
- "Why is our training bottlenecked?"
- "Can we fit this model in memory?"
- "How do we implement this research paper?"
- "What's the best architecture for our constraints?"
Pedagogical Philosophy: Build → Use → Understand
1. Build First
Every component implemented from scratch:
- Tensors with broadcasting and memory management
- Automatic differentiation with computational graphs
- Optimizers with state management and memory profiling
- Complete training loops with checkpointing and monitoring
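To give a flavor of what "automatic differentiation with computational graphs" means, here is a minimal scalar reverse-mode sketch. It is not TinyTorch's autograd: the `Value` class and its attributes are hypothetical, and it supports only addition and multiplication. Each operation records its parents and a small closure that pushes gradients back to them.

```python
class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # set by the operation that created this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()

x, w, b = Value(2.0), Value(3.0), Value(1.0)
y = x * w + b        # y = 7
loss = y * y         # loss = 49
loss.backward()
print(loss.data, x.grad, w.grad, b.grad)  # 49.0 42.0 28.0 14.0
```

Gradients are accumulated with `+=` because a node can feed into several later operations (here `y` is used twice in `y * y`), and each use contributes to its gradient.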
2. Use Immediately
No toy examples - recreate ML history with real results:
- MLP Era: Train MLPs to 52.7% CIFAR-10 accuracy (the baseline that motivated CNNs)
- CNN Revolution: Build LeNet-1 (39.4%) and LeNet-5 (47.5%) - witness the breakthrough
- Modern CNNs: Push beyond MLPs with optimized architectures (75%+ achievable)
- Transformer Era: Language models that reuse 95% of the vision framework
3. Understand Systems
Connect implementations to production reality:
- How your tensor maps to PyTorch's memory model
- Why your optimizer choices affect GPU utilization
- How your autograd compares to production frameworks
- When your implementations would need modification at scale
4. Reflect on Trade-offs
ML Systems Thinking sections in every module:
- Memory vs compute trade-offs in different architectures
- Accuracy vs efficiency considerations for deployment
- Debugging strategies for common production issues
- Framework design principles and their implications
Unique Value Proposition
What Makes TinyTorch Different:
Systems-First Approach
- Not just "how does attention work" but "why does attention scale O(N²) and how do production systems handle this?"
- Not just "implement SGD" but "when do you choose SGD vs Adam in production?"
Production Relevance
- Memory profiling, performance optimization, deployment patterns
- Real datasets, realistic scale, professional development workflow
- Connection to industry practices and framework design decisions
Framework Generalization
- 20 modules that build ONE cohesive ML framework supporting vision AND language
- 95% component reuse from computer vision to language models
- Professional package structure with CLI tools and testing
Proven Pedagogy
- Build → Use → Understand cycle creates deep intuition
- Immediate testing and feedback for every component
- Progressive complexity with solid foundations
- NBGrader integration for classroom deployment
Learning Outcomes: Becoming an ML Systems Engineer
Technical Mastery
- Implement any ML paper from first principles
- Debug training issues at the systems level
- Optimize models for production deployment
- Profile and improve ML system performance
- Design custom architectures for specialized domains
- Understand framework generalization across vision and language
Systems Understanding
- Memory management in ML frameworks
- Computational complexity vs real-world performance
- Hardware utilization patterns and optimization
- Distributed training challenges and solutions
- Production deployment considerations and trade-offs
Professional Skills
- Test-driven development for ML systems
- Performance profiling and optimization techniques
- Code organization and package development
- Documentation and API design
- MLOps and production monitoring
Career Impact
- Technical interviews: Demonstrate deep ML systems knowledge
- Job opportunities: Qualify for ML engineer (not just data scientist) roles
- Team leadership: Become the go-to person for ML systems questions
- Research ability: Implement cutting-edge papers independently
- Entrepreneurship: Build ML products with full-stack understanding
Ready to Become an ML Systems Engineer?
TinyTorch transforms ML users into ML builders.
Stop wondering how frameworks work. Start building them.
TinyTorch: Because understanding how to build ML systems makes you a more effective ML engineer.