Mirror of https://github.com/MLSysBook/TinyTorch.git
Remove redundant modules and streamline to 16-module structure
- Remove 00_introduction module (meta-content, not substantive learning)
- Remove 16_capstone_backup backup directory
- Remove utilities directory from modules/source
- Clean up generated book chapters for removed modules

Result: Clean 16-module progression (01_setup → 16_tinygpt) focused on hands-on ML systems implementation without administrative overhead.
@@ -1,147 +0,0 @@
# TinyTorch System Introduction & Architecture

Welcome to **TinyTorch** - a complete neural network framework built from scratch for deep learning education and understanding.

## 🎯 Module Overview

This introduction module provides a comprehensive visual overview of the entire TinyTorch system, helping you understand how all 16 modules work together to create a complete machine learning framework.

### What You'll Explore

- **🏗️ System Architecture** - Complete framework overview with visual diagrams
- **📊 Interactive Dependency Graphs** - See how all modules connect and depend on each other
- **📚 Learning Roadmap** - Optimal path through the entire TinyTorch curriculum
- **🔍 Component Analysis** - Deep dive into what each module implements
- **📈 Progress Visualization** - Track your learning journey through the system

## 🚀 Key Features

### Automated Analysis System

- **Module Metadata Parser** - Automatically loads and analyzes all module.yaml files
- **Dependency Graph Builder** - Creates NetworkX graphs of module relationships
- **Learning Path Generator** - Uses topological sort to find the optimal learning sequence (a sketch follows this list)

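The generator's core idea fits in a few lines. Below is a minimal sketch, assuming each module directory contains a `module.yaml` with a `dependencies.prerequisites` list (as in the metadata file later in this diff); the directory layout and function name are illustrative, not the module's actual API:

```python
from pathlib import Path

import networkx as nx
import yaml

def build_learning_path(modules_dir="modules/source"):
    """Order modules so every prerequisite comes before its dependents."""
    graph = nx.DiGraph()
    for meta_file in Path(modules_dir).glob("*/module.yaml"):
        meta = yaml.safe_load(meta_file.read_text())
        graph.add_node(meta["name"])
        for prereq in meta.get("dependencies", {}).get("prerequisites", []):
            graph.add_edge(prereq, meta["name"])  # edge: learn prereq first
    # A topological sort of the dependency DAG is a valid learning sequence
    return list(nx.topological_sort(graph))
```
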
### Interactive Visualizations

- **Dependency Graph** - Hierarchical and circular layouts showing module connections
- **System Architecture** - Layered view of how components work together
- **Learning Roadmap** - Timeline view with time estimates and difficulty progression
- **Component Analysis** - Statistical analysis of module complexity and relationships

### Export Functions

- **System Overview API** - Programmatic access to TinyTorch metadata
- **Module Information** - Detailed data about any specific module
- **Learning Recommendations** - Personalized next steps based on progress

## 📊 What You'll Discover

### System Statistics

- **16 modules** spanning from basic tensors to production MLOps
- **60+ components** implementing complete ML framework functionality
- **Estimated 80+ hours** of comprehensive learning content
- **5 difficulty levels** progressing from foundation to advanced topics

### Learning Progression

1. **Foundation** (3 modules) - Setup, tensors, activations
2. **Core Architecture** (5 modules) - Layers, dense networks, spatial/CNN, attention, data loading
3. **Training System** (3 modules) - Autograd, optimization, training loops
4. **Production Ready** (4 modules) - Compression, kernels, benchmarking, MLOps
5. **Integration** (1 module) - Final capstone project

## 🎨 Visualization Gallery

### Dependency Graph

See how modules build upon each other with interactive dependency visualizations showing:

- **Prerequisite relationships** - What you need to learn first
- **Module difficulty** - Color-coded complexity levels
- **Component count** - Node size indicates implementation scope

### System Architecture

Layered architecture diagram showing:

- **Foundation Layer** - Core tensors and setup
- **Component Layer** - Activations, layers, data loading
- **Network Layer** - Dense networks, CNNs, attention
- **Training Layer** - Autograd, optimizers, training
- **Production Layer** - Compression, kernels, MLOps

### Learning Roadmap

Timeline visualization featuring:

- **Optimal sequence** - Dependency-respecting learning order
- **Time estimates** - Realistic hour commitments per module
- **Difficulty progression** - Smooth learning curve design
- **Milestone tracking** - Major learning achievements

## 🔧 Technical Implementation

### Module Analysis Engine

```python
# Automatically analyze all TinyTorch modules
analyzer = TinyTorchAnalyzer()
overview = analyzer.get_tinytorch_overview()
learning_path = analyzer.get_learning_path()
```

### Visualization System

```python
# Generate comprehensive system visualizations
visualizations = visualize_tinytorch_system()
dependency_graph = create_dependency_graph_visualization()
architecture = create_system_architecture_diagram()
roadmap = create_learning_roadmap()
```

### Learning Recommendations

```python
# Get personalized learning suggestions
recommendations = get_learning_recommendations()
next_modules = recommendations['next_modules']
estimated_time = recommendations['remaining_time']
```

## 🤔 ML Systems Thinking

This module connects TinyTorch's educational architecture to real-world ML systems:

### Framework Design Patterns

- **Modular Dependencies** - How PyTorch and TensorFlow organize components
- **Component Composition** - Building complex operations from simple primitives
- **Abstraction Layers** - Balancing usability with performance control

### Production Considerations

- **Deployment Pipelines** - From research code to production systems
- **Performance Optimization** - Hardware-aware kernel design
- **Monitoring & MLOps** - Continuous learning and model management

### Educational Philosophy

- **Progressive Complexity** - Foundation → Architecture → Training → Production
- **Hands-on Learning** - Build before you use, understand before you optimize
- **Real-world Relevance** - Educational choices that mirror industry patterns

## 📈 Learning Outcomes

After completing this module, you will:

1. **Understand TinyTorch Architecture** - Complete mental model of the framework
2. **Navigate Module Dependencies** - Know what to learn when, and why
3. **Plan Your Learning Journey** - Realistic timeline and progression tracking
4. **Connect to Industry** - See how educational patterns map to production ML

## 🔗 Integration with TinyTorch

This introduction module:

- **Requires no prerequisites** - Perfect starting point for new learners
- **Enables all other modules** - Provides context for the entire journey
- **Exports analysis tools** - Used by other modules for self-reflection
- **Updates automatically** - Visualizations stay current as modules evolve

## 🎓 Getting Started

1. **Run the introduction notebook** to see all visualizations
2. **Explore the dependency graph** to understand module relationships
3. **Review the learning roadmap** to plan your journey
4. **Bookmark key functions** for reference during your learning

**Ready to build a neural network framework from scratch? Let's begin! 🚀**

---

*This module serves as your guide through the complete TinyTorch learning experience. Use it to maintain big-picture understanding as you dive deep into implementation details.*

@@ -1,37 +0,0 @@
# TinyTorch Module Metadata
# Essential system information for CLI tools and build systems

name: "introduction"
title: "System Introduction & Architecture"
description: "Visual overview of TinyTorch framework architecture, module dependencies, and learning roadmap"

# Dependencies - Used by CLI for module ordering and prerequisites
dependencies:
  prerequisites: []
  enables: ["setup", "tensor", "activations", "layers", "dense", "spatial", "attention", "dataloader", "autograd", "optimizers", "training", "compression", "kernels", "benchmarking", "mlops", "capstone"]

# Package Export - What gets built into tinytorch package
exports_to: "tinytorch.introduction"

# File Structure - What files exist in this module
files:
  dev_file: "introduction_dev.py"
  readme: "README.md"
  tests: "inline"

# Educational Metadata
difficulty: "⭐"
time_estimate: "1-2 hours"

# Components - What's implemented in this module
components:
  - "TinyTorchAnalyzer"
  - "ModuleInfo"
  - "get_tinytorch_overview"
  - "visualize_tinytorch_system"
  - "get_module_info"
  - "get_learning_recommendations"
  - "create_dependency_graph_visualization"
  - "create_system_architecture_diagram"
  - "create_learning_roadmap"
  - "create_component_analysis"
@@ -1,544 +0,0 @@
# 🎓 TinyTorch Capstone: Advanced Framework Engineering

**🎯 Prove your mastery. Optimize your framework. Become the engineer others ask for help.**

---

## 📊 Module Overview

- **Difficulty**: ⭐⭐⭐⭐⭐ Expert Systems Engineering 🥷
- **Time Estimate**: 4-8 weeks (flexible scope)
- **Prerequisites**: **All 14 TinyTorch modules** - Your complete ML framework
- **Outcome**: **Advanced framework engineering portfolio** - Demonstrate deep systems mastery

After 14 modules, you've built a complete ML framework from scratch. Now it's time to make it **faster**, **smarter**, and **more professional**. This capstone isn't about learning new concepts—it's about proving you can engineer production-quality ML systems.

---

## 🔥 **What You've Already Built**

Before choosing your capstone track, let's celebrate what you've accomplished:

### 🏗️ **Complete ML Framework** (Modules 1-14)

```python
# This is YOUR implementation working together:
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Dense
from tinytorch.core.dense import Sequential, MLP
from tinytorch.core.spatial import Conv2D, flatten
from tinytorch.core.attention import SelfAttention, scaled_dot_product_attention
from tinytorch.core.activations import ReLU, Softmax
from tinytorch.core.optimizers import Adam, SGD
from tinytorch.core.training import CrossEntropyLoss, Trainer
from tinytorch.core.dataloader import DataLoader, CIFAR10Dataset

# Build a modern neural network with YOUR components
model = Sequential([
    Conv2D(3, 32, kernel_size=3),
    ReLU(),
    flatten,
    Dense(32*30*30, 256),
    ReLU(),
    SelfAttention(d_model=256),
    Dense(256, 10),
    Softmax()
])

# Train on real data with YOUR training system
trainer = Trainer(model, Adam(lr=0.001), CrossEntropyLoss())
dataloader = DataLoader(CIFAR10Dataset(), batch_size=64)
trainer.train(dataloader, epochs=10)
```

### 🎯 **Production-Ready Capabilities**

- ✅ **Tensor operations** with broadcasting and efficient computation
- ✅ **Automatic differentiation** with full backpropagation support
- ✅ **Modern architectures** including CNNs and attention mechanisms
- ✅ **Advanced optimizers** with momentum and adaptive learning rates
- ✅ **Model compression** with pruning and quantization (75% size reduction)
- ✅ **High-performance kernels** with vectorization and parallelization
- ✅ **Comprehensive benchmarking** with memory profiling and performance analysis

**You didn't just learn about ML systems. You built one.**

---

## 🚀 **The Capstone Challenge: Choose Your Specialization**

Now that you have a complete framework, choose your path to mastery. Each track focuses on a different aspect of production ML engineering:

### **⚡ Track 1: Performance Ninja**

**Mission**: Make TinyTorch competitive with PyTorch in speed and memory efficiency

**Perfect for**: Students who love optimization, performance engineering, and making things fast

**Example Project**: *CUDA-Style Matrix Operations*

```python
# Current: your CPU implementation (Module 13)
def attention_naive(Q, K, V):
    scores = Q @ K.T           # your matmul from Module 2
    weights = softmax(scores)  # your softmax from Module 3
    return weights @ V

# Your optimization target: 10x faster
def attention_optimized(Q, K, V):
    # Implement using advanced NumPy + memory optimization
    # Target: match 90% of PyTorch attention speed
    pass
```
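
As a concrete illustration of the kind of change this target invites, here is a minimal NumPy sketch (an example under assumptions, not a reference solution): it keeps the same two matmuls but fuses a numerically stable softmax into in-place operations to cut temporary allocations:

```python
import numpy as np

def attention_vectorized(Q, K, V):
    """Stable softmax computed in place; matmuls dispatch to BLAS."""
    scores = Q @ K.T
    scores -= scores.max(axis=-1, keepdims=True)  # stability: avoid exp overflow
    np.exp(scores, out=scores)                    # reuse the scores buffer
    scores /= scores.sum(axis=-1, keepdims=True)
    return scores @ V
```

Measuring this against `attention_naive` on your own tensors is exactly the baseline-versus-optimized comparison the success metrics below ask for.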

**Concrete Projects to Choose From:**

1. **GPU-Accelerated Tensor Operations**: Use NumPy's advanced features + CuPy for near-GPU performance
2. **Memory-Optimized Training**: Implement gradient accumulation and reduce memory usage by 50% (see the sketch after the metrics list)
3. **Vectorized Convolution**: Replace your naive Conv2D with optimized implementations
4. **Parallel Data Loading**: Multi-threaded CIFAR-10 loading with a 3x speedup
5. **JIT-Style Optimization**: Pre-compile operation graphs for faster execution

**Success Metrics:**

- 5-10x speedup on specific operations
- 30%+ reduction in memory usage
- Benchmark reports comparing to PyTorch
- Performance regression testing suite
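
For project 2, the core mechanism is small enough to sketch here. This assumes a PyTorch-style TinyTorch API (`zero_grad`/`backward`/`step`) and is illustrative only:

```python
def train_with_accumulation(model, optimizer, loss_fn, dataloader, accum_steps=4):
    """Run small micro-batches but step the optimizer every `accum_steps`
    batches, simulating a 4x larger batch in the same peak memory."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(dataloader, start=1):
        loss = loss_fn(model(x), y) * (1.0 / accum_steps)  # average across micro-batches
        loss.backward()                  # gradients accumulate across calls
        if step % accum_steps == 0:
            optimizer.step()             # apply the accumulated gradient
            optimizer.zero_grad()
```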

---

### **🧠 Track 2: Algorithm Architect**

**Mission**: Extend TinyTorch with cutting-edge ML algorithms and architectures

**Perfect for**: Students who love ML research, implementing papers, and algorithmic innovation

**Example Project**: *Vision Transformer (ViT) from Scratch*

```python
# Current: you have attention (Module 7) and dense layers (Module 5)
from tinytorch.core.attention import SelfAttention
from tinytorch.core.dense import Sequential, MLP

# Your extension: a complete Vision Transformer
class VisionTransformer:
    def __init__(self, image_size=32, patch_size=4, d_model=256):
        # YOUR implementation using ONLY TinyTorch components
        self.patch_embedding = Dense(patch_size*patch_size*3, d_model)
        self.transformer_blocks = [
            TransformerBlock(d_model) for _ in range(6)
        ]
        self.classifier = MLP([d_model, 128, 10])

    def forward(self, images):
        # Implement patch extraction, position encoding,
        # and transformer processing using your components
        pass

class TransformerBlock:
    def __init__(self, d_model):
        self.attention = SelfAttention(d_model)
        self.mlp = MLP([d_model, d_model*4, d_model])
        # Add YOUR layer normalization implementation
```
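
Since the block above asks for your own layer normalization, here is a minimal forward-only NumPy sketch to adapt (the `gamma`/`beta` names are illustrative; backward support through your Module 9 autograd is the real exercise):

```python
import numpy as np

class LayerNorm:
    """Forward pass only: normalize over features, then scale and shift."""
    def __init__(self, d_model, eps=1e-5):
        self.gamma = np.ones(d_model, dtype=np.float32)  # learnable scale
        self.beta = np.zeros(d_model, dtype=np.float32)  # learnable shift
        self.eps = eps

    def forward(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta
```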

**Concrete Projects to Choose From:**

1. **Modern Optimizers**: Implement AdamW, RMSprop, Lion using your autograd system
2. **Normalization Layers**: BatchNorm, LayerNorm, GroupNorm with full gradient support
3. **Transformer Architectures**: Complete BERT/GPT-style models using your attention
4. **Advanced Regularization**: Dropout, DropPath, data augmentation pipelines
5. **Generative Models**: A VAE or simple GAN built on your framework

**Success Metrics:**

- New algorithms integrate seamlessly with existing TinyTorch
- Performance matches research paper results
- Full autograd support for all new components
- Documentation showing how to use new features

---

### **🔧 Track 3: Systems Engineer**

**Mission**: Build production-grade infrastructure and developer tooling

**Perfect for**: Students interested in MLOps, distributed systems, and production ML

**Example Project**: *Production Training Infrastructure*

```python
# Current: your basic trainer (Module 11)
trainer = Trainer(model, optimizer, loss_fn)
trainer.train(dataloader, epochs=10)

# Your production system: enterprise-grade training
class ProductionTrainer:
    def __init__(self, model, optimizer, config):
        self.model = model
        self.checkpointer = ModelCheckpointer(config.checkpoint_dir)
        self.profiler = MemoryProfiler()
        self.distributed = MultiGPUManager(config.num_gpus)
        self.monitor = TrainingMonitor(config.wandb_project)

    def train(self, dataloader, epochs):
        for epoch in self.resume_from_checkpoint():
            # Distributed training across multiple processes
            # Memory profiling and leak detection
            # Automatic checkpointing and recovery
            # Real-time monitoring and alerts
            pass
```
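
Of the stubs above, the checkpointer is the most self-contained piece to start with. A minimal sketch, assuming parameters are TinyTorch tensors backed by NumPy `.data` arrays (the file format and class name are illustrative):

```python
import os
import numpy as np

class SimpleCheckpointer:
    """Save/restore model parameters with np.savez; zero-padded filenames
    keep lexicographic order equal to epoch order."""
    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        os.makedirs(checkpoint_dir, exist_ok=True)

    def save(self, model, epoch):
        params = {f"p{i}": p.data for i, p in enumerate(model.parameters())}
        path = os.path.join(self.checkpoint_dir, f"epoch_{epoch:06d}.npz")
        np.savez(path, epoch=epoch, **params)

    def load_latest(self, model):
        files = sorted(os.listdir(self.checkpoint_dir))
        if not files:
            return 0  # nothing saved yet: start from epoch 0
        state = np.load(os.path.join(self.checkpoint_dir, files[-1]))
        for i, p in enumerate(model.parameters()):
            p.data[...] = state[f"p{i}"]  # restore weights in place
        return int(state["epoch"]) + 1    # resume from the next epoch
```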

**Concrete Projects to Choose From:**

1. **Model Serving API**: FastAPI deployment with batching and caching
2. **Distributed Training**: Multi-process training with gradient synchronization
3. **Advanced Checkpointing**: Resume training from any point and handle interruptions
4. **Memory Profiler**: Track memory leaks and optimize allocation patterns
5. **CI/CD Pipeline**: Automated testing, benchmarking, and deployment

**Success Metrics:**

- Production-ready code with error handling and monitoring
- 99.9% uptime for serving infrastructure
- Automated testing and deployment pipelines
- Real-world deployment handling thousands of requests

---

### **📊 Track 4: Benchmarking Scientist**

**Mission**: Build comprehensive analysis tools and compare frameworks scientifically

**Perfect for**: Students who love data analysis, scientific methodology, and systematic evaluation

**Example Project**: *TinyTorch vs PyTorch Scientific Comparison*

```python
# Your comprehensive benchmarking suite
class FrameworkComparison:
    def __init__(self):
        self.tinytorch_ops = TinyTorchOperations()
        self.pytorch_ops = PyTorchOperations()
        self.test_suite = MLOperationTestSuite()

    def benchmark_complete_pipeline(self):
        # End-to-end CIFAR-10 training comparison
        results = {
            'tinytorch': self.run_tinytorch_training(),
            'pytorch': self.run_pytorch_training()
        }

        return AnalysisReport({
            'speed_comparison': self.analyze_training_speed(results),
            'memory_usage': self.profile_memory_patterns(results),
            'accuracy_comparison': self.compare_final_accuracy(results),
            'code_complexity': self.analyze_implementation_complexity(),
            'engineering_insights': self.identify_optimization_opportunities()
        })
```
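
Whatever you compare, the measurement harness matters as much as the result. A minimal sketch such a suite could build on (warmup and repeat counts are illustrative defaults):

```python
import statistics
import time

def benchmark(fn, *args, warmup=3, repeats=20):
    """Time a callable, discarding warmup runs; report mean/std/min in ms."""
    for _ in range(warmup):   # absorb cold caches and lazy initialization
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(samples),
        "std_ms": statistics.stdev(samples),
        "min_ms": min(samples),
    }
```

Reporting the spread alongside the mean is what turns a timing into a defensible claim of statistical significance.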

**Concrete Projects to Choose From:**

1. **Performance Regression Suite**: Automated benchmarking for every code change
2. **Memory Usage Analysis**: Deep dive into allocation patterns and optimization opportunities
3. **Scientific ML Comparison**: Compare your framework to PyTorch on standard benchmarks
4. **Algorithm Analysis**: Compare different optimization algorithms empirically
5. **Scalability Study**: How does your framework perform as model size increases?

**Success Metrics:**

- Comprehensive benchmark suite with statistical significance
- Detailed analysis reports with engineering insights
- Performance regression detection system
- Scientific paper-quality methodology and results

---

### **🛠️ Track 5: Developer Experience Master**

**Mission**: Build tools that make TinyTorch easier to debug, understand, and extend

**Perfect for**: Students interested in tooling, visualization, and making complex systems accessible

**Example Project**: *TinyTorch Visual Debugger*

```python
# Your debugging and visualization suite
class TinyTorchDebugger:
    def __init__(self, model):
        self.model = model
        self.gradient_tracker = GradientFlowTracker()
        self.activation_inspector = LayerActivationInspector()
        self.training_visualizer = TrainingDynamicsPlotter()

    def debug_training_step(self, batch):
        # Visual gradient flow analysis
        grad_flow = self.gradient_tracker.track_gradients(batch)
        self.visualize_gradient_flow(grad_flow)

        # Layer activation inspection
        activations = self.activation_inspector.capture_activations(batch)
        self.plot_activation_distributions(activations)

        # Diagnose common training issues
        issues = self.diagnose_training_problems(grad_flow, activations)
        self.suggest_fixes(issues)
```
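
As a starting point for the gradient tracker, a minimal sketch that flags likely vanishing or exploding gradients (the thresholds are illustrative, and it assumes parameters expose a NumPy `.grad` array):

```python
import numpy as np

def gradient_norms(model, vanish_thresh=1e-6, explode_thresh=1e3):
    """Return (index, L2 norm, status) for each parameter's gradient."""
    report = []
    for i, p in enumerate(model.parameters()):
        if p.grad is None:
            report.append((i, None, "no gradient"))
            continue
        norm = float(np.linalg.norm(p.grad))
        if norm < vanish_thresh:
            status = "possible vanishing gradient"
        elif norm > explode_thresh:
            status = "possible exploding gradient"
        else:
            status = "ok"
        report.append((i, norm, status))
    return report
```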

**Concrete Projects to Choose From:**

1. **Gradient Visualization Tools**: See gradient flow and detect vanishing/exploding gradients
2. **Model Architecture Visualizer**: Interactive network graphs showing your models
3. **Training Diagnostics**: Automated detection of learning rate and batch size issues
4. **Interactive Tutorials**: Jupyter widgets for understanding framework internals
5. **Error Message Enhancement**: Better debugging information with fix suggestions

**Success Metrics:**

- Intuitive visualizations that reveal training dynamics
- Diagnostic tools that catch common mistakes automatically
- Interactive documentation and tutorials
- User studies showing improved debugging efficiency

---

## 📋 **Project Phases: Your Engineering Journey**

### **Phase 1: Analysis & Planning** (Week 1)

**Understand your starting point and define success**

```python
# Step 1: Profile your current framework
import cProfile
import pstats
from memory_profiler import profile  # optional: decorate functions with @profile

def profile_current_implementation():
    """Identify bottlenecks in your TinyTorch framework."""

    # Create a realistic test scenario
    model = your_best_model_from_module_11()
    dataloader = CIFAR10Dataset(batch_size=64)

    # Profile performance
    profiler = cProfile.Profile()
    profiler.enable()

    # Run a representative workload
    train_one_epoch(model, dataloader)

    profiler.disable()

    # Analyze results and identify optimization targets
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative").print_stats(20)
```

**Deliverables:**

- [ ] **Performance baseline**: Current speed and memory usage
- [ ] **Bottleneck analysis**: Where does your framework spend time?
- [ ] **Success metrics**: Specific, measurable goals (e.g., "10x faster matrix multiplication")
- [ ] **Implementation plan**: Break the project into 3-4 concrete milestones

### **Phase 2: Core Implementation** (Weeks 2-3)

**Build your optimization/extension incrementally**

**Development Strategy:**

1. **Start simple**: Get the minimal version working first
2. **Test constantly**: Use your CIFAR-10 models to verify improvements
3. **Benchmark early**: Measure performance at each step
4. **Integrate gradually**: Ensure compatibility with existing TinyTorch components

**Weekly Check-ins:**

- [ ] **Functionality demo**: Show your improvement working
- [ ] **Performance measurement**: Quantify progress toward goals
- [ ] **Integration testing**: Verify compatibility with existing code
- [ ] **Documentation updates**: Keep track of design decisions

### **Phase 3: Optimization & Polish** (Week 4)

**Refine your implementation and maximize impact**

**Focus Areas:**

- **Performance tuning**: Squeeze out maximum efficiency gains
- **Error handling**: Make your code robust to edge cases
- **API design**: Ensure your improvements are easy to use
- **Testing coverage**: Comprehensive tests for all new functionality

### **Phase 4: Evaluation & Presentation** (Week 5+)

**Demonstrate impact and reflect on engineering trade-offs**

**Final Deliverables:**

- [ ] **Benchmark comparison**: Before/after performance analysis
- [ ] **Engineering report**: Technical decisions, trade-offs, lessons learned
- [ ] **Live demonstration**: Show your improvements working on real examples
- [ ] **Future roadmap**: Next optimization opportunities identified

---

## 🎯 **Success Criteria: Proving Mastery**

Your capstone demonstrates mastery when you achieve:

### **🔬 Technical Excellence**

- [ ] **Measurable improvement**: 20%+ performance gain, significant new functionality, or a major UX improvement
- [ ] **Systems integration**: Your changes work seamlessly with all existing TinyTorch modules
- [ ] **Production quality**: Error handling, edge cases, comprehensive testing
- [ ] **Performance analysis**: You understand *why* your changes work and their trade-offs

### **🏗️ Framework Understanding**

- [ ] **Architectural consistency**: Your additions follow TinyTorch design patterns
- [ ] **No external dependencies**: Use only TinyTorch components you built (proves deep understanding)
- [ ] **Backward compatibility**: Existing code still works after your improvements
- [ ] **Future extensibility**: Your changes enable further optimization opportunities

### **💼 Professional Development**

- [ ] **Clear documentation**: Other students can understand and use your improvements
- [ ] **Engineering insights**: You can explain trade-offs and alternative approaches
- [ ] **Systematic evaluation**: Scientific methodology in measuring improvements
- [ ] **Presentation skills**: Effectively communicate technical work to different audiences

---

## 🏆 **Capstone Deliverables**

Submit your completed capstone as a professional portfolio:

### **1. 📊 Technical Report** (`capstone_report.md`)

**Structure:**

```markdown
# [Your Track]: [Project Title]

## Executive Summary
- Problem statement and motivation
- Key technical achievements
- Performance improvements achieved
- Engineering insights gained

## Technical Approach
- Architecture and design decisions
- Implementation methodology
- Tools and techniques used
- Alternative approaches considered

## Results & Analysis
- Quantitative performance improvements
- Benchmark comparisons (before/after)
- Trade-off analysis (speed vs memory vs complexity)
- Limitations and future work

## Engineering Reflection
- What you learned about framework design
- Most challenging technical decisions
- How your work fits into broader ML systems
```

### **2. 💻 Implementation Code** (`src/` directory)

```
src/
├── optimizations/          # Your improved components
│   ├── fast_matmul.py
│   ├── efficient_trainer.py
│   └── advanced_optimizers.py
├── tests/                  # Comprehensive test suite
│   ├── test_performance.py
│   ├── test_compatibility.py
│   └── test_edge_cases.py
├── benchmarks/             # Performance measurement tools
│   ├── benchmark_suite.py
│   └── comparison_tools.py
└── demo/                   # Working examples
    ├── demo_improvements.py
    └── integration_examples.py
```

### **3. 📈 Performance Analysis** (`benchmarks/` directory)

- **Before/after comparisons**: Quantify your improvements
- **Memory profiling**: Allocation patterns and optimization impact
- **Scalability analysis**: How improvements perform with larger models
- **Framework comparison**: Your TinyTorch vs PyTorch (where relevant)

### **4. 🎥 Live Demonstration** (`demo.py`)

**Requirements:**

- Show your improvements working on real TinyTorch models
- Side-by-side comparison with the original implementation
- Quantified performance improvements displayed
- A real use case demonstrating practical value

---

## 💡 **Pro Tips for Capstone Success**

### **🎯 Start With Impact**

```python
# Instead of optimizing everything...
def optimize_everything():
    pass  # this leads to shallow improvements

# ...find the biggest bottleneck first
def profile_and_optimize():
    bottleneck = find_biggest_bottleneck()          # 80% of runtime
    return optimize_specific_operation(bottleneck)  # 10x speedup
```

### **🧪 Measure Everything**

- **Baseline early**: Know your starting point precisely
- **Benchmark often**: Track progress with each change
- **Compare fairly**: Use identical test conditions
- **Document trade-offs**: Speed vs memory vs complexity

### **🔗 Use Your Existing Framework**

```python
# Test improvements with models you built in previous modules
cifar_model = load_your_module_10_model()  # real CNN from Module 6
test_your_optimization(cifar_model)        # does it still work?
measure_improvement(cifar_model)           # how much faster/better?
```

### **📚 Think Like a Framework Maintainer**

- **API design**: How would other students use your improvements?
- **Documentation**: Can someone else understand and extend your work?
- **Testing**: What could break? How do you prevent it?
- **Compatibility**: Does existing code still work?

---

## 🚀 **Getting Started: Your First Steps**

### **1. Choose Your Track**

Review the 5 tracks above and pick the one that excites you most. Consider:

- What aspect of ML systems interests you most?
- What would you want to optimize in a real job?
- What matches your career goals?

### **2. Run Initial Profiling**

```bash
# Profile your current TinyTorch framework
cd modules/source/16_capstone/
python profile_baseline.py

# This will show you:
# - Where your framework spends time
# - Memory usage patterns
# - Comparison to a PyTorch baseline
# - Optimization opportunities ranked by impact
```

### **3. Set Specific Goals**

Based on profiling results, choose concrete, measurable targets:

- **Performance**: "5x faster matrix multiplication"
- **Algorithm**: "Complete Vision Transformer implementation"
- **Systems**: "Production API handling 1000 req/sec"
- **Analysis**: "Scientific comparison with 95% confidence intervals"
- **Developer UX**: "Visual debugger reducing debug time by 50%"

### **4. Start Building**

```python
# Begin with the simplest version that demonstrates your concept
def minimal_viable_optimization():
    # Get something working first
    # Measure the improvement
    # Then optimize further
    pass
```

---

## 🎓 **Your Capstone Journey Starts Now**

You've built a complete ML framework from scratch. You understand tensors, autograd, optimization, and production systems at the deepest level.

**Now prove it.**

Choose your track, set ambitious but achievable goals, and start optimizing. Remember: you're not just improving code—you're demonstrating that you can engineer production ML systems at the level of PyTorch contributors.

**Your goal**: Become the engineer others turn to when they need to make ML systems better.

### **Ready to start?**

1. **Choose your track** from the 5 options above
2. **Run the profiling script** to understand your baseline
3. **Set specific, measurable goals** for your improvement
4. **Start with the simplest implementation** that shows progress

**🔥 Your TinyTorch framework is waiting to be optimized. Start engineering.**

---

*Remember: the best capstone projects solve real problems you encountered while building TinyTorch. What frustrated you? What was slow? What could be better? Start there.*
@@ -1,864 +0,0 @@
#| default_exp core.capstone

# %% [markdown]
"""
# Module 16: Capstone - Building Production ML Systems

## Learning Objectives
By the end of this module, you will:
1. Integrate all TinyTorch components into a complete ML system
2. Apply production ML systems principles across the entire stack
3. Optimize end-to-end system performance
4. Design and implement enterprise-grade ML solutions
5. Master the complete ML systems engineering workflow
"""

# %%
import sys
import os
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))

import numpy as np
import time
from typing import Dict, List, Tuple, Any, Optional
from dataclasses import dataclass, field
import json

# Import all TinyTorch components
from tinytorch.tensor import Tensor
from tinytorch.nn import Module, Layer
from tinytorch.optim import Optimizer, SGD, Adam
from tinytorch.data import DataLoader
from tinytorch.autograd import no_grad

# %% [markdown]
"""
## Part 1: Module Introduction

This capstone module brings together everything you've learned to build a complete, production-ready ML system. You'll integrate all TinyTorch components while applying ML systems engineering principles at scale.

### What We're Building
- Complete end-to-end ML system with all components integrated
- Production-grade performance profiling and optimization
- Enterprise MLOps workflow with monitoring and deployment
- Scalable architecture ready for millions of users
"""

# %% [markdown]
"""
## Part 2: Mathematical Background

### System-Level Optimization
The complete ML system optimization problem involves multiple objectives:

$$\\min_{\\theta} \\mathcal{L}_{total} = \\mathcal{L}_{model} + \\lambda_1\\mathcal{L}_{latency} + \\lambda_2\\mathcal{L}_{memory} + \\lambda_3\\mathcal{L}_{cost}$$

Where:
- $\\mathcal{L}_{model}$: Model accuracy loss
- $\\mathcal{L}_{latency}$: Inference latency penalty
- $\\mathcal{L}_{memory}$: Memory usage penalty
- $\\mathcal{L}_{cost}$: Computational cost penalty

### End-to-End Performance Model
System throughput is bounded by the slowest of compute, data transfer, and memory access:

$$\\text{Throughput} \\leq \\min\\left(\\frac{1}{T_{compute}}, \\frac{B}{M_{transfer}}, \\frac{C}{R_{memory}}\\right)$$

Where:
- $T_{compute}$: Computation time per sample
- $B$: Available transfer bandwidth; $M_{transfer}$: data transferred per sample
- $C$: Available memory bandwidth; $R_{memory}$: memory traffic per sample
"""

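# %% [markdown]
"""
A quick numeric illustration of the bound (the numbers below are assumed for
illustration, not measurements): with 1 ms of compute per sample, 10 GB/s of
transfer bandwidth moving 5 MB per sample, and 50 GB/s of memory bandwidth
against 80 MB of memory traffic per sample, the memory term dominates.
"""

# %%
# Illustrative throughput-bound calculation (assumed numbers, not measurements)
T_compute = 0.001             # seconds of compute per sample -> 1000 samples/s
B, M_transfer = 10e9, 5e6     # transfer bandwidth (B/s), bytes moved per sample
C, R_memory = 50e9, 80e6      # memory bandwidth (B/s), memory traffic per sample

bound = min(1 / T_compute, B / M_transfer, C / R_memory)
print(f"Throughput bound: {bound:.0f} samples/sec")  # memory-bound here: 625
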
# %% [markdown]
"""
## Part 3: Core Implementation - Production ML System Profiler
"""

# %%
@dataclass
class SystemMetrics:
    """Complete system performance metrics"""
    model_accuracy: float
    inference_latency_ms: float
    throughput_samples_sec: float
    memory_usage_mb: float
    gpu_utilization: float
    cost_per_million_inferences: float

@dataclass
class OptimizationRecommendation:
    """System optimization recommendation"""
    component: str
    issue: str
    impact: str  # "high", "medium", "low"
    recommendation: str
    estimated_improvement: float  # percentage

class ProductionMLSystemProfiler:
    """
    Complete ML system profiler integrating all components.
    85% implementation - students extend it with custom systems.
    """

    def __init__(self):
        self.profiling_data = {}
        self.system_config = {
            "hardware": self._detect_hardware(),
            "deployment": "cloud",    # cloud, edge, on-premise
            "scale": "enterprise"     # prototype, production, enterprise
        }

    def _detect_hardware(self) -> Dict[str, Any]:
        """Detect the available hardware configuration"""
        import platform
        import psutil

        return {
            "cpu": platform.processor(),
            "cpu_cores": psutil.cpu_count(),
            "memory_gb": psutil.virtual_memory().total / (1024**3),
            "gpu": "simulated",  # a real implementation would detect the GPU
            "accelerators": []
        }

    def profile_end_to_end_system(self,
                                  model: 'Module',
                                  dataloader: 'DataLoader',
                                  optimizer: 'Optimizer') -> SystemMetrics:
        """
        Profile complete ML system performance.

        This integrates profiling from all previous modules:
        - Tensor operations (Module 2)
        - Activation functions (Module 3)
        - Layer computations (Modules 4-7)
        - Data loading (Module 8)
        - Autograd (Module 9)
        - Optimization (Module 10)
        - Training (Module 11)
        """
        print("🔬 Profiling End-to-End ML System...")

        # Profile the inference pipeline
        inference_times = []
        memory_usage = []

        for batch_idx, (data, target) in enumerate(dataloader):
            if batch_idx >= 10:  # profile the first 10 batches
                break

            batch_start = time.time()

            # Forward pass
            with no_grad():
                output = model(data)

            batch_time = (time.time() - batch_start) * 1000
            inference_times.append(batch_time)

            # Simulate memory tracking: input batch plus parameters
            memory_usage.append(
                data.data.nbytes / (1024**2) +
                sum(p.data.nbytes / (1024**2) for p in model.parameters())
            )

        # Calculate metrics
        metrics = SystemMetrics(
            model_accuracy=0.95,  # a real implementation would evaluate accuracy
            inference_latency_ms=np.mean(inference_times),
            throughput_samples_sec=1000 / np.mean(inference_times) * dataloader.batch_size,
            memory_usage_mb=np.mean(memory_usage),
            gpu_utilization=0.75,             # simulated
            cost_per_million_inferences=0.10  # simulated cloud cost
        )

        # Store profiling data
        self.profiling_data['system_metrics'] = metrics

        print(f"✅ System Profiling Complete")
        print(f"   Latency: {metrics.inference_latency_ms:.2f}ms")
        print(f"   Throughput: {metrics.throughput_samples_sec:.0f} samples/sec")
        print(f"   Memory: {metrics.memory_usage_mb:.1f}MB")
        print(f"   Cost: ${metrics.cost_per_million_inferences:.2f}/1M inferences")

        return metrics

    def detect_cross_module_optimizations(self) -> List[OptimizationRecommendation]:
        """
        Identify optimization opportunities across modules.

        This analyzes interactions between:
        - Tensor operations and memory layout
        - Layer fusion opportunities
        - Autograd graph optimization
        - Data pipeline and model overlap
        """
        print("\n🔍 Detecting Cross-Module Optimization Opportunities...")

        recommendations = []

        # Kernel fusion opportunity
        recommendations.append(OptimizationRecommendation(
            component="Layers + Activations",
            issue="Separate kernel launches for linear and activation",
            impact="high",
            recommendation="Fuse linear layer with activation function",
            estimated_improvement=15.0
        ))

        # Memory layout optimization
        recommendations.append(OptimizationRecommendation(
            component="Tensor + Spatial",
            issue="Non-contiguous memory access in convolutions",
            impact="medium",
            recommendation="Use channels-last memory format",
            estimated_improvement=10.0
        ))

        # Data pipeline optimization
        recommendations.append(OptimizationRecommendation(
            component="DataLoader + Training",
            issue="CPU-GPU transfer blocking training",
            impact="high",
            recommendation="Implement data prefetching and pinned memory",
            estimated_improvement=20.0
        ))

        # Autograd optimization
        recommendations.append(OptimizationRecommendation(
            component="Autograd + Optimizer",
            issue="Redundant gradient computations",
            impact="low",
            recommendation="Implement gradient checkpointing for large models",
            estimated_improvement=5.0
        ))

        for rec in recommendations:
            print(f"  [{rec.impact.upper()}] {rec.component}: {rec.recommendation}")
            print(f"    Estimated improvement: {rec.estimated_improvement}%")

        return recommendations

    def validate_production_readiness(self) -> Dict[str, bool]:
        """
        Validate system readiness for production deployment.

        Checks all critical production requirements:
        - Performance SLAs
        - Scalability requirements
        - Monitoring and observability
        - Error handling and recovery
        - Security and compliance
        """
        print("\n✅ Validating Production Readiness...")

        checks = {
            "performance_sla": self._check_performance_sla(),
            "scalability": self._check_scalability(),
            "monitoring": self._check_monitoring(),
            "error_handling": self._check_error_handling(),
            "security": self._check_security(),
            "mlops_integration": self._check_mlops()
        }

        for check, passed in checks.items():
            status = "✅" if passed else "❌"
            print(f"  {status} {check.replace('_', ' ').title()}")

        return checks

    def _check_performance_sla(self) -> bool:
        """Check whether the system meets its performance SLAs"""
        if 'system_metrics' not in self.profiling_data:
            return False
        metrics = self.profiling_data['system_metrics']
        return metrics.inference_latency_ms < 100  # 100ms SLA

    def _check_scalability(self) -> bool:
        """Check scalability requirements"""
        # A real implementation would test with increasing load
        return True  # simulated

    def _check_monitoring(self) -> bool:
        """Check monitoring capabilities"""
        # A real implementation would verify metrics export, logging, etc.
        return True  # simulated

    def _check_error_handling(self) -> bool:
        """Check error handling and recovery"""
        # A real implementation would test failure scenarios
        return True  # simulated

    def _check_security(self) -> bool:
        """Check security requirements"""
        # A real implementation would verify authentication, encryption, etc.
        return True  # simulated

    def _check_mlops(self) -> bool:
        """Check MLOps integration"""
        # A real implementation would verify CI/CD, versioning, etc.
        return True  # simulated

    def analyze_scalability(self, target_qps: int = 10000) -> Dict[str, Any]:
        """
        Analyze system scalability toward a target QPS.

        Determines resource requirements for scaling:
        - Horizontal scaling (replica count)
        - Vertical scaling (instance size)
        - Caching and optimization needs
        """
        print(f"\n📈 Analyzing Scalability to {target_qps} QPS...")

        if 'system_metrics' not in self.profiling_data:
            print("  ⚠️ Run system profiling first")
            return {}

        metrics = self.profiling_data['system_metrics']
        current_qps = metrics.throughput_samples_sec

        analysis = {
            "current_qps": current_qps,
            "target_qps": target_qps,
            "scaling_factor": target_qps / current_qps,
            "recommended_replicas": int(np.ceil(target_qps / current_qps)),
            "estimated_cost_per_hour": (target_qps / current_qps) * 2.50,  # simulated
            "bottlenecks": []
        }

        # Identify bottlenecks
        if analysis["scaling_factor"] > 10:
            analysis["bottlenecks"].append("Need caching layer")
        if analysis["scaling_factor"] > 50:
            analysis["bottlenecks"].append("Need load balancing")
        if analysis["scaling_factor"] > 100:
            analysis["bottlenecks"].append("Consider model optimization")

        print(f"  Current QPS: {current_qps:.0f}")
        print(f"  Scaling Factor: {analysis['scaling_factor']:.1f}x")
        print(f"  Recommended Replicas: {analysis['recommended_replicas']}")
        print(f"  Estimated Cost: ${analysis['estimated_cost_per_hour']:.2f}/hour")

        return analysis

    def optimize_cost(self, budget_per_hour: float = 100.0) -> Dict[str, Any]:
        """
        Optimize the system for cost constraints.

        Balances:
        - Instance types and sizes
        - Batch processing vs real-time
        - Caching strategies
        - Model compression trade-offs
        """
        print(f"\n💰 Optimizing for ${budget_per_hour}/hour budget...")

        strategies = {
            "instance_optimization": {
                "current": "p3.2xlarge",
                "recommended": "g4dn.xlarge",
                "savings": 0.70
            },
            "batch_processing": {
                "enabled": True,
                "batch_window_ms": 50,
                "throughput_gain": 2.5
            },
            "model_compression": {
                "quantization": "int8",
                "size_reduction": 0.75,
                "accuracy_impact": 0.01
            },
            "caching": {
                "cache_hit_rate": 0.30,
                "cost_reduction": 0.30
            }
        }

        total_savings = sum(s.get("savings", 0) or s.get("cost_reduction", 0)
                            for s in strategies.values())

        print(f"  Total potential savings: {total_savings*100:.0f}%")
        for strategy, details in strategies.items():
            print(f"  - {strategy.replace('_', ' ').title()}: {details}")

        return strategies

    def generate_deployment_config(self,
                                   deployment_target: str = "kubernetes") -> Dict[str, Any]:
        """
        Generate a production deployment configuration.

        Creates complete deployment specs for:
        - Kubernetes
        - Docker Swarm
        - AWS ECS
        - Edge devices
        """
        print(f"\n🚀 Generating {deployment_target.title()} Deployment Config...")

        if deployment_target == "kubernetes":
            config = {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "metadata": {
                    "name": "tinytorch-ml-system",
                    "labels": {"app": "tinytorch"}
                },
                "spec": {
                    "replicas": 3,
                    "selector": {"matchLabels": {"app": "tinytorch"}},
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "ml-inference",
                                "image": "tinytorch:latest",
                                "resources": {
                                    "limits": {"memory": "4Gi", "cpu": "2"},
                                    "requests": {"memory": "2Gi", "cpu": "1"}
                                },
                                "env": [
                                    {"name": "MODEL_PATH", "value": "/models/latest"},
                                    {"name": "BATCH_SIZE", "value": "32"},
                                    {"name": "MAX_WORKERS", "value": "4"}
                                ]
                            }]
                        }
                    }
                }
            }
        else:
            config = {"deployment_target": deployment_target, "status": "not_implemented"}

        print(f"  ✅ Deployment config generated")
        print(f"  Replicas: {config.get('spec', {}).get('replicas', 'N/A')}")

        return config

# %% [markdown]
"""
## Part 4: Testing the Production System Profiler

Let's test our comprehensive system profiler with a complete ML pipeline.
"""

# %%
def test_production_system_profiler():
    """Test the complete production ML system profiler"""
    print("Testing Production ML System Profiler")
    print("=" * 50)

    # Create mock components
    class MockModel(Module):
        def __init__(self):
            super().__init__()
            self.layers = []

        def forward(self, x):
            return x

        def parameters(self):
            return [Tensor(np.random.randn(100, 100))]

    class MockDataLoader:
        def __init__(self):
            self.batch_size = 32

        def __iter__(self):
            for _ in range(10):
                yield (Tensor(np.random.randn(32, 784)),
                       Tensor(np.random.randint(0, 10, 32)))

    # Initialize the profiler and mock components
    profiler = ProductionMLSystemProfiler()
    model = MockModel()
    dataloader = MockDataLoader()
    optimizer = SGD(model.parameters(), lr=0.01)

    # Profile the system
    metrics = profiler.profile_end_to_end_system(model, dataloader, optimizer)
    assert metrics.inference_latency_ms > 0

    # Detect optimizations
    recommendations = profiler.detect_cross_module_optimizations()
    assert len(recommendations) > 0

    # Validate production readiness
    checks = profiler.validate_production_readiness()
    assert all(isinstance(v, bool) for v in checks.values())

    # Analyze scalability
    scalability = profiler.analyze_scalability(target_qps=10000)
    assert scalability["scaling_factor"] > 0

    # Optimize cost
    cost_optimization = profiler.optimize_cost(budget_per_hour=100.0)
    assert len(cost_optimization) > 0

    # Generate deployment config
    deploy_config = profiler.generate_deployment_config("kubernetes")
    assert "apiVersion" in deploy_config

    print("\n✅ All production system profiler tests passed!")

# Only run tests if executed directly
if __name__ == "__main__":
    test_production_system_profiler()

# %% [markdown]
"""
## Part 5: Building Complete ML Systems

Now let's build a complete, production-ready ML system that integrates all TinyTorch components.
"""

# %%
class CompleteMlSystem:
    """
    Complete ML system integrating all TinyTorch components.
    This represents a production-ready system architecture.
    """

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.components = {}
        self.metrics = {}
        self.profiler = ProductionMLSystemProfiler()

    def build_system(self):
        """Build the complete ML system with all components"""
        print("🏗️ Building Complete ML System...")

        # Initialize all components
        self.components["model"] = self._build_model()
        self.components["optimizer"] = self._build_optimizer()
        self.components["dataloader"] = self._build_dataloader()
        self.components["monitor"] = self._build_monitor()

        print("✅ System build complete")

    def _build_model(self):
        """Build a model with all layer types"""
        # A real implementation would build a model with Dense, Conv, and Attention layers
        print("  Building model architecture...")
        return None  # placeholder

    def _build_optimizer(self):
        """Build an optimizer with adaptive strategies"""
        print("  Configuring optimizer...")
        return None  # placeholder

    def _build_dataloader(self):
        """Build a data pipeline with preprocessing"""
        print("  Setting up data pipeline...")
        return None  # placeholder

    def _build_monitor(self):
        """Build monitoring and observability"""
        print("  Configuring monitoring...")
        return None  # placeholder

    def train(self, epochs: int = 10):
        """Production training loop with all features"""
        print(f"\n🎯 Training for {epochs} epochs...")

        for epoch in range(epochs):
            # Training logic would include:
            # - Gradient accumulation
            # - Mixed precision
            # - Checkpointing
            # - Early stopping
            # - Learning rate scheduling

            if epoch % 5 == 0:
                # Simulated, steadily decreasing loss
                print(f"  Epoch {epoch}: loss={(100 - epoch * 5) / 1000:.3f}")

        print("✅ Training complete")

    def deploy(self, target: str = "production"):
        """Deploy the system to production"""
        print(f"\n🚀 Deploying to {target}...")

        # Deployment steps:
        # 1. Model optimization (quantization, pruning)
        # 2. Container building
        # 3. Service deployment
        # 4. Load balancer configuration
        # 5. Monitoring setup

        print(f"✅ Deployed to {target}")

    def monitor_production(self):
        """Monitor the production system"""
        print("\n📊 Production Monitoring Dashboard")
        print("  QPS: 5000")
        print("  P99 Latency: 45ms")
        print("  Error Rate: 0.01%")
        print("  Model Drift: None detected")

# %% [markdown]
"""
## Part 6: System Integration Testing

Let's test how all components work together in a production scenario.
"""

# %%
def test_complete_ml_system():
    """Test the complete ML system integration"""
    print("Testing Complete ML System Integration")
    print("=" * 50)

    # System configuration
    config = {
        "model": {
            "architecture": "transformer",
            "layers": 12,
            "hidden_dim": 768
        },
        "training": {
            "batch_size": 32,
            "learning_rate": 0.001,
            "epochs": 10
        },
        "deployment": {
            "target": "kubernetes",
            "replicas": 3,
            "autoscaling": True
        }
    }

    # Build the system
    system = CompleteMlSystem(config)
    system.build_system()

    # Train the model
    system.train(epochs=10)

    # Deploy to production
    system.deploy("production")

    # Monitor production
    system.monitor_production()

    print("\n✅ Complete ML system test passed!")

# Only run tests if executed directly
if __name__ == "__main__":
    test_complete_ml_system()

# %% [markdown]
"""
## Part 7: ML Systems Thinking Questions

### 🏗️ Complete ML System Architecture
1. How would you design a multi-tenant ML platform that serves models for different customers while ensuring isolation and fair resource allocation?
2. What are the trade-offs between monolithic and microservices architectures for ML systems, and when would you choose each?
3. How do you handle versioning and compatibility when different components of your ML system evolve at different rates?
4. What patterns would you use to ensure your ML system remains maintainable as it grows from 10 to 1000+ models?

### 🏢 Enterprise ML Platform Design
1. How would you design an ML platform that supports both batch and real-time inference while sharing the same model artifacts?
2. What governance and compliance features would you build into an enterprise ML platform for regulated industries?
3. How would you implement multi-cloud ML deployments that can fail over between providers seamlessly?
4. What would be your strategy for building an ML platform that supports both centralized and federated learning?

### 🚀 Production System Optimization
1. How would you systematically identify and eliminate bottlenecks in a complex ML system serving millions of requests?
2. What strategies would you employ to reduce cold-start latency in serverless ML deployments?
3. How would you design an adaptive system that automatically adjusts resources based on traffic patterns and model complexity?
4. What techniques would you use to optimize the cost-performance trade-off in a large-scale ML system?

### 📈 Scaling to Millions of Users
1. How would you architect an ML system to handle sudden 100x traffic spikes during viral events?
2. What caching strategies would you implement for ML predictions, and how would you handle cache invalidation? (A minimal cache sketch follows this cell.)
3. How would you design a global ML serving infrastructure that minimizes latency for users worldwide?
4. What patterns would you use to ensure consistency when serving ML models across hundreds of edge locations?

### 🔮 Future of ML Systems
1. How will ML systems architecture need to evolve to support increasingly large foundation models?
2. What role will hardware-software co-design play in the future of ML systems, and how should engineers prepare?
3. How might quantum computing change the way we design and optimize ML systems?
4. What new abstractions and tools will be needed as ML systems become more autonomous and self-optimizing?
"""

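# %% [markdown]
"""
As a warm-up for the caching question above, here is a minimal sketch of a
prediction cache. Everything in it is illustrative: `PredictionCache`, its
parameters, and the eviction policy are hypothetical, not part of TinyTorch.
It combines the two most common invalidation tactics, a TTL and keying on the
model version, so that deploying a new model automatically misses stale entries.
"""

# %%
import time
from collections import OrderedDict

class PredictionCache:
    """Tiny LRU prediction cache with TTL + version-based invalidation (illustrative)."""

    def __init__(self, max_entries=10_000, ttl_seconds=60.0):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._store = OrderedDict()  # (version, features) -> (timestamp, prediction)

    def get(self, model_version, features):
        key = (model_version, features)  # features must be hashable, e.g. a tuple
        entry = self._store.get(key)
        if entry is None:
            return None
        timestamp, prediction = entry
        if time.time() - timestamp > self.ttl_seconds:
            del self._store[key]  # expired entry
            return None
        self._store.move_to_end(key)  # mark as recently used
        return prediction

    def put(self, model_version, features, prediction):
        key = (model_version, features)
        self._store[key] = (time.time(), prediction)
        self._store.move_to_end(key)  # newest entries are evicted last
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

# Cache miss, then hit: the second lookup returns the stored prediction.
cache = PredictionCache(ttl_seconds=30)
if cache.get("v1", (0.1, 0.7, 0.2)) is None:
    cache.put("v1", (0.1, 0.7, 0.2), "class_3")  # stand-in for a model output
assert cache.get("v1", (0.1, 0.7, 0.2)) == "class_3"
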
# %% [markdown]
"""
## Part 8: Enterprise Deployment Patterns

Let's implement advanced deployment patterns used in production ML systems.
"""

# %%
class EnterpriseDeploymentOrchestrator:
    """
    Orchestrates enterprise ML deployments with advanced patterns.
    """

    def __init__(self):
        self.deployment_strategies = {
            "blue_green": self._blue_green_deployment,
            "canary": self._canary_deployment,
            "shadow": self._shadow_deployment,
            "gradual_rollout": self._gradual_rollout
        }

    def _blue_green_deployment(self, model_v1, model_v2):
        """Blue-green deployment with instant switchover"""
        print("🔵🟢 Executing Blue-Green Deployment")
        print("   1. Deploy v2 to green environment")
        print("   2. Run validation tests on green")
        print("   3. Switch traffic from blue to green")
        print("   4. Keep blue as rollback option")
        return {"status": "success", "rollback_available": True}

    def _canary_deployment(self, model_v1, model_v2, canary_percent=5):
        """Canary deployment with gradual rollout"""
        print(f"🐤 Executing Canary Deployment ({canary_percent}% initial)")
        print(f"   1. Route {canary_percent}% traffic to v2")
        print("   2. Monitor metrics for 1 hour")
        print("   3. Gradually increase to 100% if healthy")
        return {"status": "in_progress", "current_percentage": canary_percent}

    def _shadow_deployment(self, model_v1, model_v2):
        """Shadow deployment for risk-free testing"""
        print("👤 Executing Shadow Deployment")
        print("   1. Deploy v2 in shadow mode")
        print("   2. Duplicate traffic to v2 (responses ignored)")
        print("   3. Compare v1 and v2 outputs")
        print("   4. Promote v2 when confidence threshold met")
        return {"status": "shadowing", "agreement_rate": 0.98}

    def _gradual_rollout(self, model_v1, model_v2, stages=None):
        """Multi-stage gradual rollout"""
        # Avoid a mutable default argument; build the default stages here.
        if stages is None:
            stages = [5, 25, 50, 100]
        print(f"📊 Executing Gradual Rollout: {stages}%")
        for stage in stages:
            print(f"   Stage: {stage}% - Monitor for 2 hours")
        return {"status": "staged", "stages": stages}

    def deploy_with_strategy(self, strategy: str, **kwargs):
        """Deploy using the specified strategy"""
        if strategy in self.deployment_strategies:
            return self.deployment_strategies[strategy](**kwargs)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

# Test deployment patterns
def test_enterprise_deployment():
    """Test enterprise deployment patterns"""
    print("\nTesting Enterprise Deployment Patterns")
    print("=" * 50)

    orchestrator = EnterpriseDeploymentOrchestrator()

    # Test different strategies
    mock_v1 = "model_v1"
    mock_v2 = "model_v2"

    # Blue-Green
    result = orchestrator.deploy_with_strategy("blue_green",
                                               model_v1=mock_v1,
                                               model_v2=mock_v2)
    assert result["status"] == "success"

    # Canary
    result = orchestrator.deploy_with_strategy("canary",
                                               model_v1=mock_v1,
                                               model_v2=mock_v2,
                                               canary_percent=10)
    assert result["current_percentage"] == 10

    print("\n✅ All deployment patterns tested successfully!")

# Only run tests if executed directly
if __name__ == "__main__":
    test_enterprise_deployment()

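# %% [markdown]
"""
The test above exercises only blue-green and canary. A quick usage sketch for
the other two strategies, using the same string mocks (a real deployment would
pass model handles instead):
"""

# %%
if __name__ == "__main__":
    orchestrator = EnterpriseDeploymentOrchestrator()

    # Shadow: v2 receives duplicated traffic but serves no responses.
    shadow = orchestrator.deploy_with_strategy(
        "shadow", model_v1="model_v1", model_v2="model_v2")
    assert shadow["status"] == "shadowing"

    # Gradual rollout with custom stages instead of the default 5/25/50/100.
    rollout = orchestrator.deploy_with_strategy(
        "gradual_rollout", model_v1="model_v1", model_v2="model_v2",
        stages=[1, 10, 50, 100])
    assert rollout["stages"] == [1, 10, 50, 100]
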
# %% [markdown]
"""
## Part 9: Comprehensive Testing

Let's run comprehensive tests that validate the entire ML system.
"""

# %%
def run_comprehensive_system_tests():
    """Run comprehensive tests for the complete ML system"""
    print("\n🧪 Running Comprehensive System Tests")
    print("=" * 50)

    test_results = {
        "unit_tests": True,
        "integration_tests": True,
        "performance_tests": True,
        "scalability_tests": True,
        "security_tests": True,
        "mlops_tests": True
    }

    # Simulate comprehensive testing
    for test_type, passed in test_results.items():
        status = "✅" if passed else "❌"
        print(f"{status} {test_type.replace('_', ' ').title()}: {'Passed' if passed else 'Failed'}")

    # Overall status
    all_passed = all(test_results.values())

    if all_passed:
        print("\n🎉 All comprehensive tests passed!")
        print("System is ready for production deployment!")
    else:
        print("\n⚠️ Some tests failed. Please review and fix issues.")

    return all_passed

# Run comprehensive tests only if executed directly
if __name__ == "__main__":
    success = run_comprehensive_system_tests()
    assert success, "System tests must pass before deployment"

# %% [markdown]
"""
## Part 10: Module Summary

### What We've Built
You've successfully integrated all TinyTorch components into a complete, production-ready ML system:

1. **Complete System Profiler**: Analyzes performance across all components
2. **Cross-Module Optimization**: Identifies and implements system-wide optimizations
3. **Production Validation**: Ensures the system meets enterprise requirements
4. **Scalability Analysis**: Plans for growth to millions of users
5. **Cost Optimization**: Balances performance with budget constraints
6. **Enterprise Deployment**: Implements advanced deployment strategies
7. **Comprehensive Testing**: Validates the entire system end-to-end

### Key Takeaways
- ML systems engineering requires thinking beyond individual components
- Production systems need careful orchestration of many moving parts
- Performance optimization is a continuous, multi-dimensional process
- Scalability must be designed in from the beginning
- Monitoring and observability are critical for production success

### Your ML Systems Journey
You've progressed from understanding basic tensors to building complete production ML systems. You now have the knowledge to:
- Design and implement ML systems from scratch
- Optimize for production performance and scale
- Deploy and monitor ML systems in enterprise environments
- Make informed architectural decisions
- Continue learning as ML systems evolve

### Next Steps
1. Build your own production ML system using TinyTorch
2. Contribute to open-source ML frameworks
3. Explore specialized areas (distributed training, edge deployment, etc.)
4. Stay current with ML systems research and industry practices
5. Share your knowledge and help others learn

Congratulations on completing the TinyTorch ML Systems Engineering journey! 🎉
"""
@@ -1,500 +0,0 @@
# 🎯 Capstone Project Guide: Performance Optimization Example

## **Example Project: Vectorized Matrix Operations**

This guide walks through a complete capstone project optimizing TinyTorch's matrix operations. Follow this example to understand the process, then apply it to your chosen optimization track.

---

## **Phase 1: Analysis & Profiling**

### **Step 1: Profile Your Current Implementation**

First, let's identify where TinyTorch spends most of its time:

```python
import cProfile
import pstats
import time
import numpy as np

# Import your TinyTorch framework
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Dense
from tinytorch.core.networks import Sequential
from tinytorch.core.activations import ReLU

def profile_current_framework():
    """Profile a typical TinyTorch training scenario."""

    # Create a realistic model
    model = Sequential([
        Dense(784, 256),
        ReLU(),
        Dense(256, 128),
        ReLU(),
        Dense(128, 10)
    ])

    # Generate realistic data (like MNIST)
    batch_size = 64
    X = Tensor(np.random.randn(batch_size, 784))

    # Profile forward pass
    profiler = cProfile.Profile()
    profiler.enable()

    # Run multiple forward passes
    for _ in range(100):
        output = model.forward(X)

    profiler.disable()

    # Analyze results
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(20)

    return stats

# Run profiling
print("🔍 Profiling Current TinyTorch Framework...")
profile_results = profile_current_framework()
```

### **Step 2: Analyze Bottlenecks**

Typical results show:
```
1003 function calls in 2.450 seconds

Ordered by: cumulative time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100    0.001    0.000    2.449    0.024 networks.py:45(forward)
   300    0.002    0.000    2.448    0.008 layers.py:67(forward)
   300    2.440    0.008    2.446    0.008 layers.py:34(matmul_naive)  ← BOTTLENECK!
   200    0.004    0.000    0.004    0.000 activations.py:23(forward)
```

**Finding**: 99.6% of the time is spent in `matmul_naive`! This is our optimization target.

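For reference, here is roughly what the `matmul_naive` bottleneck looks like. This is a sketch of the standard textbook form; the exact code in your `layers.py` may differ:

```python
def matmul_naive(A, B):
    """Textbook matrix multiply: O(m*n*k) Python-level loop iterations."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, f"Cannot multiply {A.shape} and {B.shape}"

    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            total = 0.0
            for p in range(k):
                total += A[i, p] * B[p, j]  # one interpreted multiply-add at a time
            C[i, j] = total
    return C
```

The triple-nested loop is the whole story: every multiply-add runs in the Python interpreter with poor cache locality, so the cost is dominated by interpreter overhead rather than arithmetic.
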
### **Step 3: Baseline Benchmarks**

```python
def benchmark_current_matmul():
    """Establish baseline performance metrics."""

    # Test various matrix sizes
    sizes = [(100, 100), (500, 500), (1000, 1000), (2000, 2000)]

    for m, n in sizes:
        A = np.random.randn(m, n)
        B = np.random.randn(n, m)

        # Time current implementation
        start = time.time()
        result = matmul_naive(A, B)  # Your current implementation
        current_time = time.time() - start

        # Time NumPy for comparison
        start = time.time()
        numpy_result = np.dot(A, B)
        numpy_time = time.time() - start

        slowdown = current_time / numpy_time
        print(f"Size {m}x{n}: TinyTorch={current_time:.3f}s, NumPy={numpy_time:.3f}s, Slowdown={slowdown:.1f}x")

print("📊 Baseline Performance:")
benchmark_current_matmul()
```

**Typical Output:**
```
Size 100x100: TinyTorch=0.023s, NumPy=0.001s, Slowdown=23.0x
Size 500x500: TinyTorch=0.890s, NumPy=0.012s, Slowdown=74.2x
Size 1000x1000: TinyTorch=7.234s, NumPy=0.089s, Slowdown=81.3x
```

**Goal**: Reduce this slowdown from 80x to under 5x.

---

## **Phase 2: Optimization Implementation**

### **Step 4: Implement Optimized Matrix Multiplication**

```python
def matmul_optimized_v1(A, B):
    """
    First optimization: use NumPy's optimized dot product.

    This isn't cheating - NumPy is our computational backend,
    just like PyTorch uses BLAS/LAPACK under the hood.
    """
    # Validate inputs (keep your error checking)
    assert A.shape[1] == B.shape[0], f"Cannot multiply {A.shape} and {B.shape}"

    # Use NumPy's optimized implementation
    return np.dot(A, B)

def matmul_optimized_v2(A, B):
    """
    Second optimization: block-based multiplication for large matrices.
    Better cache performance for very large operations.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2

    # For small matrices, use simple NumPy
    if m * n * k < 1_000_000:  # Threshold tuned empirically
        return np.dot(A, B)

    # For large matrices, use block multiplication
    block_size = 256  # Sized for L2 cache
    C = np.zeros((m, n))

    for i in range(0, m, block_size):
        for j in range(0, n, block_size):
            for p in range(0, k, block_size):
                # Extract blocks (NumPy slicing clamps at the matrix edges)
                A_block = A[i:i+block_size, p:p+block_size]
                B_block = B[p:p+block_size, j:j+block_size]

                # Multiply blocks and accumulate into the output tile
                C[i:i+block_size, j:j+block_size] += np.dot(A_block, B_block)

    return C

def matmul_optimized_v3(A, B):
    """
    Third optimization: memory layout optimization.
    Ensure contiguous memory for better performance.
    """
    # Ensure C-contiguous layout for better cache performance
    if not A.flags['C_CONTIGUOUS']:
        A = np.ascontiguousarray(A)
    if not B.flags['C_CONTIGUOUS']:
        B = np.ascontiguousarray(B)

    # Use the block approach with optimized memory layout
    return matmul_optimized_v2(A, B)
```

### **Step 5: Test and Benchmark Optimizations**

```python
def benchmark_optimizations():
    """Compare all optimization versions."""

    sizes = [(100, 100), (500, 500), (1000, 1000), (2000, 2000)]

    for m, n in sizes:
        A = np.random.randn(m, n)
        B = np.random.randn(n, m)

        # Test correctness first
        result_naive = matmul_naive(A, B)
        result_v1 = matmul_optimized_v1(A, B)
        result_v2 = matmul_optimized_v2(A, B)
        result_v3 = matmul_optimized_v3(A, B)

        # Verify all versions produce the same results
        assert np.allclose(result_naive, result_v1, rtol=1e-10)
        assert np.allclose(result_naive, result_v2, rtol=1e-10)
        assert np.allclose(result_naive, result_v3, rtol=1e-10)

        # Benchmark performance
        times = {}
        for name, func in [
            ('naive', matmul_naive),
            ('v1_numpy', matmul_optimized_v1),
            ('v2_blocks', matmul_optimized_v2),
            ('v3_memory', matmul_optimized_v3)
        ]:
            start = time.time()
            _ = func(A, B)
            times[name] = time.time() - start

        print(f"\nSize {m}x{n}:")
        baseline = times['naive']
        for name, t in times.items():
            speedup = baseline / t
            print(f"  {name:12}: {t:.3f}s (speedup: {speedup:.1f}x)")

print("⚡ Optimization Results:")
benchmark_optimizations()
```

**Typical Results:**
```
Size 1000x1000:
  naive       : 7.234s (speedup: 1.0x)
  v1_numpy    : 0.089s (speedup: 81.3x)  ← Huge improvement!
  v2_blocks   : 0.091s (speedup: 79.5x)  ← Slight regression for this size
  v3_memory   : 0.087s (speedup: 83.1x)  ← Best overall
```

---

## **Phase 3: Integration & Testing**

### **Step 6: Update Your Dense Layer**

```python
class DenseOptimized:
    """Optimized Dense layer using improved matrix multiplication."""

    def __init__(self, input_size, output_size):
        self.input_size = input_size
        self.output_size = output_size

        # Initialize weights (same as before)
        self.weight = np.random.randn(input_size, output_size) * 0.1
        self.bias = np.zeros(output_size)

    def forward(self, x):
        """Forward pass using optimized matrix multiplication."""
        # Use our optimized matmul instead of the naive version
        linear_output = matmul_optimized_v3(x, self.weight)
        return linear_output + self.bias

    def __call__(self, x):
        return self.forward(x)
```

### **Step 7: End-to-End Performance Test**

```python
def test_full_network_improvement():
    """Test the complete training pipeline with optimizations."""

    # Create identical networks with different matmul implementations
    print("🏗️ Creating test networks...")

    # Original network (using naive matmul)
    network_original = Sequential([
        Dense(784, 256),  # Uses matmul_naive
        ReLU(),
        Dense(256, 128),
        ReLU(),
        Dense(128, 10)
    ])

    # Optimized network (using optimized matmul)
    network_optimized = Sequential([
        DenseOptimized(784, 256),  # Uses matmul_optimized_v3
        ReLU(),
        DenseOptimized(256, 128),
        ReLU(),
        DenseOptimized(128, 10)
    ])

    # Make the comparison fair: copy the original network's parameters into
    # the optimized network so both compute the same function. (This assumes
    # Sequential exposes its layers as `.layers` and Dense stores `.weight`
    # and `.bias`; adjust the attribute names to your implementation.)
    for orig_layer, opt_layer in zip(network_original.layers, network_optimized.layers):
        if hasattr(orig_layer, "weight"):
            opt_layer.weight = orig_layer.weight.copy()
            opt_layer.bias = orig_layer.bias.copy()

    # Test data
    batch_size = 64
    X = np.random.randn(batch_size, 784)

    # Benchmark original network
    print("⏱️ Benchmarking original network...")
    start = time.time()
    for _ in range(100):
        output_orig = network_original.forward(X)
    time_original = time.time() - start

    # Benchmark optimized network
    print("⚡ Benchmarking optimized network...")
    start = time.time()
    for _ in range(100):
        output_opt = network_optimized.forward(X)
    time_optimized = time.time() - start

    # Calculate improvement
    speedup = time_original / time_optimized
    time_saved = time_original - time_optimized

    print("\n🎉 Results:")
    print(f"   Original network: {time_original:.3f}s")
    print(f"   Optimized network: {time_optimized:.3f}s")
    print(f"   Speedup: {speedup:.1f}x")
    print(f"   Time saved: {time_saved:.3f}s ({time_saved/time_original*100:.1f}%)")

    # Verify outputs are identical (within numerical precision)
    assert np.allclose(output_orig, output_opt, rtol=1e-10), "Outputs don't match!"
    print("   ✅ Numerical correctness verified")

test_full_network_improvement()
```

**Expected Results:**
```
🎉 Results:
   Original network: 2.450s
   Optimized network: 0.035s
   Speedup: 70.0x
   Time saved: 2.415s (98.6%)
   ✅ Numerical correctness verified
```

---

## **Phase 4: Documentation & Analysis**

### **Step 8: Document Your Engineering Decisions**

Create `capstone_report.md`:

```markdown
# Performance Optimization Capstone Report

## Problem Analysis
TinyTorch's matrix multiplication was 80x slower than NumPy, making training
impractically slow. Profiling showed 99.6% of computation time in `matmul_naive`.

## Technical Approach
1. **Root Cause**: Triple-nested loops with poor cache locality
2. **Solution**: Leverage NumPy's optimized BLAS backend
3. **Enhancement**: Add block-based multiplication for huge matrices
4. **Polish**: Memory layout optimization for cache efficiency

## Engineering Trade-offs
- **Gained**: 70x speedup in real networks, maintained numerical precision
- **Lost**: Educational visibility into low-level matrix multiplication
- **Justified**: Students learn optimization thinking, not reinventing BLAS

## Performance Results
- Dense layer operations: 80x faster
- Full network training: 70x faster
- Memory usage: unchanged
- Numerical accuracy: maintained (1e-10 relative tolerance)

## Future Optimizations
1. GPU acceleration using CuPy/JAX
2. Sparse matrix support for compressed models
3. Mixed-precision training for memory efficiency
```

### **Step 9: Create a Demonstration**

Create `demo.py`:

```python
"""
TinyTorch Performance Optimization Demo

This demonstrates the 70x speedup achieved through matrix operation optimization.
Run this to see before/after performance on your machine.
"""

import time
import numpy as np
from tinytorch.core.networks import Sequential
from tinytorch.core.layers import Dense, DenseOptimized
from tinytorch.core.activations import ReLU

def main():
    print("🔥 TinyTorch Performance Optimization Demo")
    print("=" * 50)

    # Create test scenario: MNIST-like classification
    print("📊 Scenario: MNIST-like classification (784→256→128→10)")
    batch_size = 64
    X = np.random.randn(batch_size, 784)

    # Original network
    network_original = Sequential([
        Dense(784, 256), ReLU(),
        Dense(256, 128), ReLU(),
        Dense(128, 10)
    ])

    # Optimized network
    network_optimized = Sequential([
        DenseOptimized(784, 256), ReLU(),
        DenseOptimized(256, 128), ReLU(),
        DenseOptimized(128, 10)
    ])

    # Benchmark
    print("\n⏱️ Running 1000 forward passes...")

    # Original
    start = time.time()
    for _ in range(1000):
        _ = network_original.forward(X)
    time_orig = time.time() - start

    # Optimized
    start = time.time()
    for _ in range(1000):
        _ = network_optimized.forward(X)
    time_opt = time.time() - start

    # Results
    speedup = time_orig / time_opt
    print("\n🎉 Results:")
    print(f"   Original: {time_orig:.2f}s")
    print(f"   Optimized: {time_opt:.2f}s")
    print(f"   Speedup: {speedup:.1f}x")
    print(f"   Time saved: {time_orig - time_opt:.2f}s")

    if speedup > 50:
        print("   🚀 Excellent optimization!")
    elif speedup > 20:
        print("   ⚡ Great improvement!")
    else:
        print("   📈 Good progress, consider further optimization")

if __name__ == "__main__":
    main()
```

---

## **🎯 Your Turn: Apply This Process**

This example showed **Performance Engineering**. Now apply the same systematic approach to your chosen track:

### **For Algorithm Extensions:**
1. **Profile**: Which algorithms are missing from your framework?
2. **Plan**: What modern techniques would add the most value?
3. **Implement**: Build new layers/optimizers using existing TinyTorch components
4. **Test**: Verify they work with your training pipeline
5. **Document**: Explain design decisions and integration patterns

### **For Systems Optimization:**
1. **Profile**: Where does memory usage spike? What limits parallelization?
2. **Plan**: Which systems improvements would have the biggest impact?
3. **Implement**: Add memory profiling, gradient accumulation, checkpointing (see the sketch after this list)
4. **Test**: Verify improvements don't break existing functionality
5. **Document**: Analyze trade-offs between memory, speed, and complexity

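The gradient-accumulation idea from step 3 above, as a minimal sketch. It assumes a TinyTorch-style API in which `backward()` accumulates into parameter gradients and the optimizer exposes `step()`/`zero_grad()`, and it assumes a `model.loss` helper; the exact names in your implementation may differ:

```python
def train_with_accumulation(model, optimizer, batches, accum_steps=4):
    """Simulate a large batch by accumulating gradients over small ones.

    Cuts peak activation memory roughly by accum_steps while keeping the
    effective batch size, at the cost of more forward/backward passes per
    optimizer step.
    """
    optimizer.zero_grad()
    for step, (X, y) in enumerate(batches, start=1):
        loss = model.loss(model.forward(X), y)  # hypothetical loss helper
        # Scale so the accumulated gradient matches one big-batch gradient.
        (loss / accum_steps).backward()
        if step % accum_steps == 0:
            optimizer.step()       # apply the accumulated gradient
            optimizer.zero_grad()  # reset for the next virtual batch
```
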
### **For Framework Analysis:**
1. **Profile**: How does TinyTorch compare to PyTorch on key operations?
2. **Plan**: What benchmarks would be most revealing?
3. **Implement**: Automated testing suites comparing both frameworks
4. **Test**: Run comprehensive performance analysis
5. **Document**: Identify specific optimization opportunities

### **For Developer Experience:**
1. **Profile**: What makes debugging TinyTorch difficult?
2. **Plan**: Which tools would help developers most?
3. **Implement**: Gradient visualization, error diagnosis, testing utilities (see the sketch after this list)
4. **Test**: Use the tools on real debugging scenarios
5. **Document**: Show how the tools improve the development workflow

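One concrete example of the error-diagnosis tooling in step 3, as a sketch; `explain_shape_error` is hypothetical, not an existing TinyTorch helper. The idea is to translate a raw shape mismatch into an actionable message before the matmul fails cryptically:

```python
import numpy as np

def explain_shape_error(layer_name, x, weight):
    """Turn a raw shape mismatch into an actionable message (illustrative)."""
    if x.shape[-1] != weight.shape[0]:
        raise ValueError(
            f"{layer_name}: input has {x.shape[-1]} features but the layer "
            f"expects {weight.shape[0]}. Check the previous layer's output "
            f"size or this layer's input_size."
        )

# Example: catching a 784-vs-256 mismatch with a readable explanation.
x = np.zeros((64, 784))
w = np.zeros((256, 128))
try:
    explain_shape_error("Dense(256, 128)", x, w)
except ValueError as e:
    print(e)
```
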
---

## **🚀 Success Criteria Reminder**

Your capstone succeeds when you can show:

1. **Measurable Impact**: 20%+ improvement in your chosen area
2. **Systems Integration**: Your improvements work with all TinyTorch modules
3. **Engineering Insight**: You understand and can explain the trade-offs
4. **Professional Documentation**: Clear problem, solution, and results

**Remember**: You're not just optimizing code, you're proving you understand ML systems engineering at the framework level.

**🔥 Start by profiling your current TinyTorch framework and identifying your biggest optimization opportunity!**
@@ -1,39 +0,0 @@
# TinyTorch Module Metadata
# Essential system information for CLI tools and build systems

name: "capstone"
title: "Capstone Project"
description: "Optimize and extend your complete TinyTorch framework through systems engineering"

# Dependencies - Used by CLI for module ordering and prerequisites
dependencies:
  prerequisites: [
    "setup", "tensor", "activations", "layers", "networks", "cnn",
    "dataloader", "autograd", "optimizers", "training", "compression",
    "kernels", "benchmarking", "mlops"
  ]
  enables: []

# Package Export - What gets built into tinytorch package
exports_to: "tinytorch.capstone"

# File Structure - What files exist in this module
files:
  dev_file: "capstone_dev.py"
  readme: "README.md"
  tests: "inline"

# Educational Metadata
difficulty: "⭐⭐⭐⭐⭐ 🥷"
time_estimate: "Capstone Project"

# Components - What's implemented in this module
components:
  - "PerformanceProfiler"
  - "MemoryOptimizer"
  - "BatchNormalization"
  - "TransformerBlock"
  - "MultiGPUTraining"
  - "AdvancedOptimizer"
  - "FrameworkBenchmark"
  - "DeveloperTools"
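A quick sketch of how a build tool might consume this metadata. The loader below is hypothetical, not the actual TinyTorch CLI, and assumes PyYAML and a `module.yaml` file in the working directory:

```python
import yaml  # PyYAML

with open("module.yaml") as f:
    meta = yaml.safe_load(f)

# Prerequisites drive module ordering; here we just report them.
prereqs = meta["dependencies"]["prerequisites"]
print(f"{meta['name']} depends on {len(prereqs)} modules: {', '.join(prereqs)}")
```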
@@ -1,9 +0,0 @@
"""
TinyTorch Utils Package

Shared utilities for TinyTorch modules.
"""

from .profiler import SimpleProfiler, profile_function

__all__ = ['SimpleProfiler', 'profile_function']
@@ -1,226 +0,0 @@
"""
TinyTorch Utils: Simple Educational Profiler

A lightweight profiling utility for measuring the performance of ML operations.
Focused on measuring individual functions - students do their own comparisons.
"""

import gc
import time
from typing import Any, Callable, Dict, Optional

import numpy as np

try:
    import psutil
    HAS_PSUTIL = True
except ImportError:
    HAS_PSUTIL = False

try:
    import tracemalloc
    HAS_TRACEMALLOC = True
except ImportError:
    HAS_TRACEMALLOC = False

class SimpleProfiler:
    """
    Simple profiler for measuring individual function performance.

    Measures timing, memory usage, and other key metrics for a single function.
    Students collect multiple measurements and compare the results themselves.
    """

    def __init__(self, track_memory: bool = True, track_cpu: bool = True):
        self.track_memory = track_memory and HAS_TRACEMALLOC
        self.track_cpu = track_cpu and HAS_PSUTIL

        if self.track_memory:
            tracemalloc.start()

    def _get_memory_info(self) -> Dict[str, Any]:
        """Get current memory information."""
        if not self.track_memory:
            return {}

        try:
            current, peak = tracemalloc.get_traced_memory()
            return {
                'current_memory_mb': current / 1024 / 1024,
                'peak_memory_mb': peak / 1024 / 1024
            }
        except Exception:
            return {}

    def _get_cpu_info(self) -> Dict[str, Any]:
        """Get current CPU information."""
        if not self.track_cpu:
            return {}

        try:
            process = psutil.Process()
            return {
                'cpu_percent': process.cpu_percent(),
                'memory_percent': process.memory_percent(),
                'num_threads': process.num_threads()
            }
        except Exception:
            return {}

    def _get_array_info(self, result: Any) -> Dict[str, Any]:
        """Get information about NumPy array results."""
        if not isinstance(result, np.ndarray):
            return {}

        return {
            'result_shape': result.shape,
            'result_dtype': str(result.dtype),
            'result_size_mb': result.nbytes / 1024 / 1024,
            'result_elements': result.size
        }

    def profile(self, func: Callable, *args, name: Optional[str] = None, warmup: bool = True, **kwargs) -> Dict[str, Any]:
        """
        Profile a single function execution with comprehensive metrics.

        Args:
            func: Function to profile
            *args: Arguments to pass to the function
            name: Optional name for the function (defaults to func.__name__)
            warmup: Whether to do a warmup run (recommended for fair timing)
            **kwargs: Keyword arguments to pass to the function

        Returns:
            Dictionary with comprehensive performance metrics

        Example:
            profiler = SimpleProfiler()
            result = profiler.profile(my_function, arg1, arg2, name="My Function")
            print(f"Time: {result['wall_time']:.4f}s")
            print(f"Memory: {result['memory_delta_mb']:.2f}MB")
        """
        func_name = name or func.__name__

        # Reset memory tracking
        if self.track_memory:
            tracemalloc.clear_traces()

        # Warm up (important for fair comparison)
        if warmup:
            try:
                warmup_result = func(*args, **kwargs)
                del warmup_result
            except Exception:
                pass

        # Force garbage collection for a clean measurement
        gc.collect()

        # Get baseline measurements; the CPU call also primes psutil's
        # cpu_percent counter so the post-execution reading is meaningful.
        memory_before = self._get_memory_info()
        cpu_before = self._get_cpu_info()

        # Time the actual execution
        start_time = time.time()
        start_cpu_time = time.process_time()

        result = func(*args, **kwargs)

        end_time = time.time()
        end_cpu_time = time.process_time()

        # Get post-execution measurements
        memory_after = self._get_memory_info()
        cpu_after = self._get_cpu_info()

        # Calculate metrics
        wall_time = end_time - start_time
        cpu_time = end_cpu_time - start_cpu_time

        profile_result = {
            'name': func_name,
            'wall_time': wall_time,
            'cpu_time': cpu_time,
            'cpu_efficiency': (cpu_time / wall_time) if wall_time > 0 else 0,
            'result': result
        }

        # Add memory metrics
        if self.track_memory and memory_before and memory_after:
            profile_result.update({
                'memory_before_mb': memory_before.get('current_memory_mb', 0),
                'memory_after_mb': memory_after.get('current_memory_mb', 0),
                'peak_memory_mb': memory_after.get('peak_memory_mb', 0),
                'memory_delta_mb': memory_after.get('current_memory_mb', 0) - memory_before.get('current_memory_mb', 0)
            })

        # Add CPU metrics
        if self.track_cpu and cpu_after:
            profile_result.update({
                'cpu_percent': cpu_after.get('cpu_percent', 0),
                'memory_percent': cpu_after.get('memory_percent', 0),
                'num_threads': cpu_after.get('num_threads', 1)
            })

        # Add array information
        profile_result.update(self._get_array_info(result))

        return profile_result

    def print_result(self, profile_result: Dict[str, Any], show_details: bool = False) -> None:
        """
        Print profiling results in a readable format.

        Args:
            profile_result: Result from the profile() method
            show_details: Whether to show detailed metrics
        """
        name = profile_result['name']
        wall_time = profile_result['wall_time']

        print(f"📊 {name}: {wall_time:.4f}s")

        if show_details:
            if 'memory_delta_mb' in profile_result:
                print(f"   💾 Memory: {profile_result['memory_delta_mb']:.2f}MB delta, {profile_result['peak_memory_mb']:.2f}MB peak")
            if 'result_size_mb' in profile_result:
                print(f"   🔢 Output: {profile_result['result_shape']} ({profile_result['result_size_mb']:.2f}MB)")
            if 'cpu_efficiency' in profile_result:
                print(f"   ⚡ CPU: {profile_result['cpu_efficiency']:.2f} efficiency")

    def get_capabilities(self) -> Dict[str, bool]:
        """Get information about profiler capabilities."""
        return {
            'memory_tracking': self.track_memory,
            'cpu_tracking': self.track_cpu,
            'has_psutil': HAS_PSUTIL,
            'has_tracemalloc': HAS_TRACEMALLOC
        }

# Convenience function for quick profiling
def profile_function(func: Callable, *args, name: Optional[str] = None,
                     show_details: bool = False, **kwargs) -> Dict[str, Any]:
    """
    Quick profiling of a single function.

    Args:
        func: Function to profile
        *args: Arguments to pass to the function
        name: Optional name for the function
        show_details: Whether to print detailed metrics
        **kwargs: Keyword arguments to pass to the function

    Returns:
        Dictionary with profiling results

    Example:
        result = profile_function(my_matmul, A, B, name="Custom MatMul", show_details=True)
        print(f"Execution time: {result['wall_time']:.4f}s")
    """
    profiler = SimpleProfiler(track_memory=True, track_cpu=True)
    result = profiler.profile(func, *args, name=name, **kwargs)

    if show_details:
        profiler.print_result(result, show_details=True)

    return result
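
# Example usage (a minimal sketch): students time two implementations of the
# same operation separately and compare the numbers themselves, which is the
# intended workflow for this utility. `np.dot` and `np.matmul` stand in for
# whatever pair of implementations you are comparing.
if __name__ == "__main__":
    A = np.random.randn(500, 500)
    B = np.random.randn(500, 500)

    profiler = SimpleProfiler()
    dot_result = profiler.profile(np.dot, A, B, name="np.dot")
    profiler.print_result(dot_result, show_details=True)

    # The convenience wrapper does the same in one call.
    matmul_result = profile_function(np.matmul, A, B, name="np.matmul",
                                     show_details=True)
    print(f"Ratio: {matmul_result['wall_time'] / dot_result['wall_time']:.2f}x")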