Remove redundant modules and streamline to 16-module structure

- Remove 00_introduction module (meta-content, not substantive learning)
- Remove 16_capstone_backup backup directory
- Remove utilities directory from modules/source
- Clean up generated book chapters for removed modules

Result: Clean 16-module progression (01_setup → 16_tinygpt) focused on
hands-on ML systems implementation without administrative overhead.
Vijay Janapa Reddi
2025-09-18 16:41:43 -04:00
parent ef487937bd
commit 9a366f7f45
13 changed files with 0 additions and 6699 deletions


@@ -1,147 +0,0 @@
# TinyTorch System Introduction & Architecture
Welcome to **TinyTorch** - a complete neural network framework built from scratch for deep learning education and understanding.
## 🎯 Module Overview
This introduction module provides a comprehensive visual overview of the entire TinyTorch system, helping you understand how all 16 modules work together to create a complete machine learning framework.
### What You'll Explore
- **🏗️ System Architecture** - Complete framework overview with visual diagrams
- **📊 Interactive Dependency Graphs** - See how all modules connect and depend on each other
- **📚 Learning Roadmap** - Optimal path through the entire TinyTorch curriculum
- **🔍 Component Analysis** - Deep dive into what each module implements
- **📈 Progress Visualization** - Track your learning journey through the system
## 🚀 Key Features
### Automated Analysis System
- **Module Metadata Parser** - Automatically loads and analyzes all module.yaml files
- **Dependency Graph Builder** - Creates NetworkX graphs of module relationships
- **Learning Path Generator** - Uses topological sort to find optimal learning sequence
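A minimal sketch of the topological-sort idea (not the module's actual `TinyTorchAnalyzer` code; the module names and `prerequisites` mapping below are illustrative):
```python
import networkx as nx

# Hypothetical prerequisite map: module -> list of prerequisite modules
prerequisites = {
    "tensor": ["setup"],
    "activations": ["tensor"],
    "layers": ["tensor", "activations"],
}

graph = nx.DiGraph()
for module, prereqs in prerequisites.items():
    for prereq in prereqs:
        graph.add_edge(prereq, module)  # edge points prerequisite -> module

# A topological order guarantees every prerequisite appears first
learning_path = list(nx.topological_sort(graph))
print(learning_path)  # e.g. ['setup', 'tensor', 'activations', 'layers']
```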
### Interactive Visualizations
- **Dependency Graph** - Hierarchical and circular layouts showing module connections
- **System Architecture** - Layered view of how components work together
- **Learning Roadmap** - Timeline view with time estimates and difficulty progression
- **Component Analysis** - Statistical analysis of module complexity and relationships
### Export Functions
- **System Overview API** - Programmatic access to TinyTorch metadata
- **Module Information** - Detailed data about any specific module
- **Learning Recommendations** - Personalized next steps based on progress
## 📊 What You'll Discover
### System Statistics
- **16 modules** spanning from basic tensors to production MLOps
- **60+ components** implementing complete ML framework functionality
- **Estimated 80+ hours** of comprehensive learning content
- **5 difficulty levels** progressing from foundation to advanced topics
### Learning Progression
1. **Foundation** (3 modules) - Setup, tensors, activations
2. **Core Architecture** (5 modules) - Layers, dense networks, spatial/CNN operations, attention, data loading
3. **Training System** (3 modules) - Autograd, optimization, training loops
4. **Production Ready** (4 modules) - Compression, kernels, benchmarking, MLOps
5. **Integration** (1 module) - Final capstone project
## 🎨 Visualization Gallery
### Dependency Graph
See how modules build upon each other with interactive dependency visualizations showing:
- **Prerequisite relationships** - What you need to learn first
- **Module difficulty** - Color-coded complexity levels
- **Component count** - Size indicates implementation scope
### System Architecture
Layered architecture diagram showing:
- **Foundation Layer** - Core tensors and setup
- **Component Layer** - Activations, layers, data loading
- **Network Layer** - Dense networks, CNNs, attention
- **Training Layer** - Autograd, optimizers, training
- **Production Layer** - Compression, kernels, MLOps
### Learning Roadmap
Timeline visualization featuring:
- **Optimal sequence** - Dependency-respecting learning order
- **Time estimates** - Realistic hour commitments per module
- **Difficulty progression** - Smooth learning curve design
- **Milestone tracking** - Major learning achievements
## 🔧 Technical Implementation
### Module Analysis Engine
```python
# Automatically analyze all TinyTorch modules
analyzer = TinyTorchAnalyzer()
overview = analyzer.get_tinytorch_overview()
learning_path = analyzer.get_learning_path()
```
### Visualization System
```python
# Generate comprehensive system visualizations
visualizations = visualize_tinytorch_system()
dependency_graph = create_dependency_graph_visualization()
architecture = create_system_architecture_diagram()
roadmap = create_learning_roadmap()
```
### Learning Recommendations
```python
# Get personalized learning suggestions
recommendations = get_learning_recommendations()
next_modules = recommendations['next_modules']
estimated_time = recommendations['remaining_time']
```
## 🤔 ML Systems Thinking
This module connects TinyTorch's educational architecture to real-world ML systems:
### Framework Design Patterns
- **Modular Dependencies** - How PyTorch and TensorFlow organize components
- **Component Composition** - Building complex operations from simple primitives
- **Abstraction Layers** - Balancing usability with performance control
### Production Considerations
- **Deployment Pipelines** - From research code to production systems
- **Performance Optimization** - Hardware-aware kernel design
- **Monitoring & MLOps** - Continuous learning and model management
### Educational Philosophy
- **Progressive Complexity** - Foundation → Architecture → Training → Production
- **Hands-on Learning** - Build before you use, understand before you optimize
- **Real-world Relevance** - Educational choices that mirror industry patterns
## 📈 Learning Outcomes
After completing this module, you will:
1. **Understand TinyTorch Architecture** - Complete mental model of the framework
2. **Navigate Module Dependencies** - Know what to learn when and why
3. **Plan Your Learning Journey** - Realistic timeline and progression tracking
4. **Connect to Industry** - See how educational patterns map to production ML
## 🔗 Integration with TinyTorch
This introduction module:
- **Requires no prerequisites** - Perfect starting point for new learners
- **Enables all other modules** - Provides context for the entire journey
- **Exports analysis tools** - Used by other modules for self-reflection
- **Updates automatically** - Visualization stays current as modules evolve
## 🎓 Getting Started
1. **Run the introduction notebook** to see all visualizations
2. **Explore the dependency graph** to understand module relationships
3. **Review the learning roadmap** to plan your journey
4. **Bookmark key functions** for reference during your learning
**Ready to build a neural network framework from scratch? Let's begin! 🚀**
---
*This module serves as your guide through the complete TinyTorch learning experience. Use it to maintain big-picture understanding as you dive deep into implementation details.*

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -1,37 +0,0 @@
# TinyTorch Module Metadata
# Essential system information for CLI tools and build systems
name: "introduction"
title: "System Introduction & Architecture"
description: "Visual overview of TinyTorch framework architecture, module dependencies, and learning roadmap"
# Dependencies - Used by CLI for module ordering and prerequisites
dependencies:
prerequisites: []
enables: ["setup", "tensor", "activations", "layers", "dense", "spatial", "attention", "dataloader", "autograd", "optimizers", "training", "compression", "kernels", "benchmarking", "mlops", "capstone"]
# Package Export - What gets built into tinytorch package
exports_to: "tinytorch.introduction"
# File Structure - What files exist in this module
files:
dev_file: "introduction_dev.py"
readme: "README.md"
tests: "inline"
# Educational Metadata
difficulty: "⭐"
time_estimate: "1-2 hours"
# Components - What's implemented in this module
components:
- "TinyTorchAnalyzer"
- "ModuleInfo"
- "get_tinytorch_overview"
- "visualize_tinytorch_system"
- "get_module_info"
- "get_learning_recommendations"
- "create_dependency_graph_visualization"
- "create_system_architecture_diagram"
- "create_learning_roadmap"
- "create_component_analysis"


@@ -1,544 +0,0 @@
# 🎓 TinyTorch Capstone: Advanced Framework Engineering
**🎯 Prove your mastery. Optimize your framework. Become the engineer others ask for help.**
---
## 📊 Module Overview
- **Difficulty**: ⭐⭐⭐⭐⭐ Expert Systems Engineering 🥷
- **Time Estimate**: 4-8 weeks (flexible scope)
- **Prerequisites**: **All 14 TinyTorch modules** - Your complete ML framework
- **Outcome**: **Advanced framework engineering portfolio** - Demonstrate deep systems mastery
After 14 modules, you've built a complete ML framework from scratch. Now it's time to make it **faster**, **smarter**, and **more professional**. This capstone isn't about learning new concepts—it's about proving you can engineer production-quality ML systems.
---
## 🔥 **What You've Already Built**
Before choosing your capstone track, let's celebrate what you've accomplished:
### 🏗️ **Complete ML Framework** (Modules 1-14)
```python
# This is YOUR implementation working together:
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Dense
from tinytorch.core.dense import Sequential, MLP
from tinytorch.core.spatial import Conv2D, flatten
from tinytorch.core.attention import SelfAttention, scaled_dot_product_attention
from tinytorch.core.activations import ReLU, Softmax
from tinytorch.core.optimizers import Adam, SGD
from tinytorch.core.training import CrossEntropyLoss, Trainer
from tinytorch.core.dataloader import DataLoader, CIFAR10Dataset
# Build a modern neural network with YOUR components
model = Sequential([
Conv2D(3, 32, kernel_size=3),
ReLU(),
flatten,
Dense(32*30*30, 256),
ReLU(),
SelfAttention(d_model=256),
Dense(256, 10),
Softmax()
])
# Train on real data with YOUR training system
trainer = Trainer(model, Adam(lr=0.001), CrossEntropyLoss())
dataloader = DataLoader(CIFAR10Dataset(), batch_size=64)
trainer.train(dataloader, epochs=10)
```
### 🎯 **Production-Ready Capabilities**
- **Tensor operations** with broadcasting and efficient computation
- **Automatic differentiation** with full backpropagation support
- **Modern architectures** including CNNs and attention mechanisms
- **Advanced optimizers** with momentum and adaptive learning rates
- **Model compression** with pruning and quantization (75% size reduction)
- **High-performance kernels** with vectorization and parallelization
- **Comprehensive benchmarking** with memory profiling and performance analysis
**You didn't just learn about ML systems. You built one.**
---
## 🚀 **The Capstone Challenge: Choose Your Specialization**
Now that you have a complete framework, choose your path to mastery. Each track focuses on different aspects of production ML engineering:
### **⚡ Track 1: Performance Ninja**
**Mission**: Make TinyTorch competitive with PyTorch in speed and memory efficiency
**Perfect for**: Students who love optimization, performance engineering, and making things fast
**Example Project**: *CUDA-Style Matrix Operations*
```python
# Current: Your CPU implementation (Module 13)
def attention_naive(Q, K, V):
scores = Q @ K.T # Your matmul from Module 2
weights = softmax(scores) # Your softmax from Module 3
return weights @ V
# Your optimization target: 10x faster
def attention_optimized(Q, K, V):
# Implement using advanced NumPy + memory optimization
# Target: Match 90% of PyTorch attention speed
pass
```
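One plausible starting direction, shown as a sketch rather than the reference solution: keep the whole computation inside vectorized, BLAS-backed NumPy calls with a numerically stable softmax. (The `1/sqrt(d)` scaling is an addition not present in the naive version above.)
```python
import numpy as np

def attention_batched(Q, K, V):
    """Sketch: scaled dot-product attention in pure vectorized NumPy."""
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)               # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = K = V = np.random.randn(128, 64)
out = attention_batched(Q, K, V)  # shape (128, 64)
```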
**Concrete Projects to Choose From:**
1. **GPU-Accelerated Tensor Operations**: Use NumPy's advanced features + CuPy for near-GPU performance
2. **Memory-Optimized Training**: Implement gradient accumulation and reduce memory usage by 50%
3. **Vectorized Convolution**: Replace your naive Conv2D with optimized implementations
4. **Parallel Data Loading**: Multi-threaded CIFAR-10 loading with 3x speedup (see the sketch after this list)
5. **JIT-Style Optimization**: Pre-compile operation graphs for faster execution
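For project 4, a minimal sketch, assuming a hypothetical `load_batch(index)` that reads and decodes one CIFAR-10 batch from disk:
```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def load_batch(index):
    # Stand-in for disk I/O + decoding of one CIFAR-10 batch
    return np.random.randn(64, 3, 32, 32)

def parallel_batches(num_batches, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Loads run concurrently; results are yielded in submission order
        futures = [pool.submit(load_batch, i) for i in range(num_batches)]
        for future in futures:
            yield future.result()

for batch in parallel_batches(8):
    pass  # train_step(batch) would go here
```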
**Success Metrics:**
- 5-10x speedup on specific operations
- 30%+ reduction in memory usage
- Benchmark reports comparing to PyTorch
- Performance regression testing suite
---
### **🧠 Track 2: Algorithm Architect**
**Mission**: Extend TinyTorch with cutting-edge ML algorithms and architectures
**Perfect for**: Students who love ML research, implementing papers, and algorithmic innovation
**Example Project**: *Vision Transformer (ViT) from Scratch*
```python
# Current: You have attention (Module 7) and dense layers (Module 5)
from tinytorch.core.attention import SelfAttention
from tinytorch.core.dense import Sequential, MLP
# Your extension: Complete Vision Transformer
class VisionTransformer:
def __init__(self, image_size=32, patch_size=4, d_model=256):
# YOUR implementation using ONLY TinyTorch components
self.patch_embedding = Dense(patch_size*patch_size*3, d_model)
self.transformer_blocks = [
TransformerBlock(d_model) for _ in range(6)
]
self.classifier = MLP([d_model, 128, 10])
def forward(self, images):
# Implement patch extraction, position encoding,
# transformer processing using your components
pass
class TransformerBlock:
def __init__(self, d_model):
self.attention = SelfAttention(d_model)
self.mlp = MLP([d_model, d_model*4, d_model])
# Add YOUR layer normalization implementation
```
**Concrete Projects to Choose From:**
1. **Modern Optimizers**: Implement AdamW, RMSprop, Lion using your autograd system
2. **Normalization Layers**: BatchNorm, LayerNorm, GroupNorm with full gradient support (see the sketch after this list)
3. **Transformer Architectures**: Complete BERT/GPT-style models using your attention
4. **Advanced Regularization**: Dropout, DropPath, data augmentation pipelines
5. **Generative Models**: VAE or simple GAN using your framework
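As a flavor of project 2, a forward-pass-only LayerNorm sketch in NumPy (a complete version would also implement the backward pass through your autograd system):
```python
import numpy as np

class LayerNorm:
    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)   # learnable scale
        self.beta = np.zeros(dim)   # learnable shift
        self.eps = eps

    def forward(self, x):
        # Normalize each sample over its feature dimension
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

ln = LayerNorm(256)
out = ln.forward(np.random.randn(32, 256))
print(out.mean(), out.std())  # approximately 0 and 1
```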
**Success Metrics:**
- New algorithms integrate seamlessly with existing TinyTorch
- Performance matches research paper results
- Full autograd support for all new components
- Documentation showing how to use new features
---
### **🔧 Track 3: Systems Engineer**
**Mission**: Build production-grade infrastructure and developer tooling
**Perfect for**: Students interested in MLOps, distributed systems, and production ML
**Example Project**: *Production Training Infrastructure*
```python
# Current: Your basic trainer (Module 11)
trainer = Trainer(model, optimizer, loss_fn)
trainer.train(dataloader, epochs=10)
# Your production system: Enterprise-grade training
class ProductionTrainer:
def __init__(self, model, optimizer, config):
self.model = model
self.checkpointer = ModelCheckpointer(config.checkpoint_dir)
self.profiler = MemoryProfiler()
self.distributed = MultiGPUManager(config.num_gpus)
self.monitor = TrainingMonitor(config.wandb_project)
def train(self, dataloader, epochs):
for epoch in self.resume_from_checkpoint():
# Distributed training across multiple processes
# Memory profiling and leak detection
# Automatic checkpointing and recovery
# Real-time monitoring and alerts
pass
```
**Concrete Projects to Choose From:**
1. **Model Serving API**: FastAPI deployment with batching and caching
2. **Distributed Training**: Multi-process training with gradient synchronization
3. **Advanced Checkpointing**: Resume training from any point, handle interruptions (see the sketch after this list)
4. **Memory Profiler**: Track memory leaks and optimize allocation patterns
5. **CI/CD Pipeline**: Automated testing, benchmarking, and deployment
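A minimal sketch of the checkpointing idea from project 3, assuming parameters are NumPy arrays (a production version would also persist optimizer and RNG state):
```python
import os
import numpy as np

def save_checkpoint(params, epoch, path="checkpoints"):
    os.makedirs(path, exist_ok=True)
    arrays = {f"p{i}": p for i, p in enumerate(params)}
    # Zero-padded epoch so lexicographic sort matches numeric order
    np.savez(os.path.join(path, f"epoch_{epoch:04d}.npz"), **arrays)

def load_latest_checkpoint(path="checkpoints"):
    if not os.path.isdir(path):
        return None, 0  # nothing saved yet; start from scratch
    files = sorted(f for f in os.listdir(path) if f.endswith(".npz"))
    if not files:
        return None, 0
    data = np.load(os.path.join(path, files[-1]))
    params = [data[f"p{i}"] for i in range(len(data.files))]
    epoch = int(files[-1].split("_")[1].split(".")[0])
    return params, epoch + 1  # resume from the next epoch
```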
**Success Metrics:**
- Production-ready code with error handling and monitoring
- 99.9% uptime for serving infrastructure
- Automated testing and deployment pipelines
- Real-world deployment handling thousands of requests
---
### **📊 Track 4: Benchmarking Scientist**
**Mission**: Build comprehensive analysis tools and compare frameworks scientifically
**Perfect for**: Students who love data analysis, scientific methodology, and systematic evaluation
**Example Project**: *TinyTorch vs PyTorch Scientific Comparison*
```python
# Your comprehensive benchmarking suite
class FrameworkComparison:
def __init__(self):
self.tinytorch_ops = TinyTorchOperations()
self.pytorch_ops = PyTorchOperations()
self.test_suite = MLOperationTestSuite()
def benchmark_complete_pipeline(self):
# End-to-end CIFAR-10 training comparison
results = {
'tinytorch': self.run_tinytorch_training(),
'pytorch': self.run_pytorch_training()
}
return AnalysisReport({
'speed_comparison': self.analyze_training_speed(results),
'memory_usage': self.profile_memory_patterns(results),
'accuracy_comparison': self.compare_final_accuracy(results),
'code_complexity': self.analyze_implementation_complexity(),
'engineering_insights': self.identify_optimization_opportunities()
})
```
**Concrete Projects to Choose From:**
1. **Performance Regression Suite**: Automated benchmarking for every code change (see the timing sketch after this list)
2. **Memory Usage Analysis**: Deep dive into allocation patterns and optimization opportunities
3. **Scientific ML Comparison**: Compare your framework to PyTorch on standard benchmarks
4. **Algorithm Analysis**: Compare different optimization algorithms empirically
5. **Scalability Study**: How does your framework perform as model size increases?
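Whatever project you pick, the measurement core looks the same. A minimal sketch of statistically honest timing: repeated runs with a warmup, reporting a mean and an approximate 95% confidence interval instead of a single number:
```python
import time

import numpy as np

def benchmark(fn, repeats=30, warmup=3):
    """Return (mean, ~95% CI half-width) of fn's runtime in seconds."""
    for _ in range(warmup):  # warm caches and the allocator first
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples = np.array(samples)
    ci = 1.96 * samples.std(ddof=1) / np.sqrt(repeats)
    return samples.mean(), ci

A, B = np.random.randn(500, 500), np.random.randn(500, 500)
mean, ci = benchmark(lambda: A @ B)
print(f"matmul 500x500: {mean * 1e3:.2f} ms ± {ci * 1e3:.2f} ms")
```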
**Success Metrics:**
- Comprehensive benchmark suite with statistical significance
- Detailed analysis reports with engineering insights
- Performance regression detection system
- Scientific paper-quality methodology and results
---
### **🛠️ Track 5: Developer Experience Master**
**Mission**: Build tools that make TinyTorch easier to debug, understand, and extend
**Perfect for**: Students interested in tooling, visualization, and making complex systems accessible
**Example Project**: *TinyTorch Visual Debugger*
```python
# Your debugging and visualization suite
class TinyTorchDebugger:
def __init__(self, model):
self.model = model
self.gradient_tracker = GradientFlowTracker()
self.activation_inspector = LayerActivationInspector()
self.training_visualizer = TrainingDynamicsPlotter()
def debug_training_step(self, batch):
# Visual gradient flow analysis
grad_flow = self.gradient_tracker.track_gradients(batch)
self.visualize_gradient_flow(grad_flow)
# Layer activation inspection
activations = self.activation_inspector.capture_activations(batch)
self.plot_activation_distributions(activations)
# Diagnose common training issues
issues = self.diagnose_training_problems(grad_flow, activations)
self.suggest_fixes(issues)
```
**Concrete Projects to Choose From:**
1. **Gradient Visualization Tools**: See gradient flow and detect vanishing/exploding gradients (see the sketch after this list)
2. **Model Architecture Visualizer**: Interactive network graphs showing your models
3. **Training Diagnostics**: Automated detection of learning rate, batch size issues
4. **Interactive Tutorials**: Jupyter widgets for understanding framework internals
5. **Error Message Enhancement**: Better debugging information with fix suggestions
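For project 1, a minimal diagnostic sketch: inspect per-layer gradient norms and flag suspicious magnitudes. The layer names and gradient arrays below are illustrative; a real tracker would read each layer's `.grad` after backprop:
```python
import numpy as np

def gradient_flow_report(named_grads, low=1e-6, high=1e2):
    """Print per-layer gradient norms and flag suspicious magnitudes."""
    for name, grad in named_grads:
        norm = np.linalg.norm(grad)
        flag = ""
        if norm < low:
            flag = "  <- possibly vanishing"
        elif norm > high:
            flag = "  <- possibly exploding"
        print(f"{name:20s} |grad| = {norm:10.4e}{flag}")

# Illustrative gradients only
grads = [("dense1", np.random.randn(784, 256) * 1e-9),
         ("dense2", np.random.randn(256, 10))]
gradient_flow_report(grads)
```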
**Success Metrics:**
- Intuitive visualizations that reveal training dynamics
- Diagnostic tools that catch common mistakes automatically
- Interactive documentation and tutorials
- User studies showing improved debugging efficiency
---
## 📋 **Project Phases: Your Engineering Journey**
### **Phase 1: Analysis & Planning** (Week 1)
**Understand your starting point and define success**
```python
# Step 1: Profile your current framework
import cProfile
from memory_profiler import profile
def profile_current_implementation():
"""Identify bottlenecks in your TinyTorch framework."""
# Create realistic test scenario
model = your_best_model_from_module_11()
dataloader = CIFAR10Dataset(batch_size=64)
# Profile performance
profiler = cProfile.Profile()
profiler.enable()
# Run representative workload
train_one_epoch(model, dataloader)
profiler.disable()
# Analyze results and identify optimization targets
```
**Deliverables:**
- [ ] **Performance baseline**: Current speed and memory usage
- [ ] **Bottleneck analysis**: Where does your framework spend time?
- [ ] **Success metrics**: Specific, measurable goals (e.g., "10x faster matrix multiplication")
- [ ] **Implementation plan**: Break project into 3-4 concrete milestones
### **Phase 2: Core Implementation** (Weeks 2-3)
**Build your optimization/extension incrementally**
**Development Strategy:**
1. **Start simple**: Get the minimal version working first
2. **Test constantly**: Use your CIFAR-10 models to verify improvements
3. **Benchmark early**: Measure performance at each step
4. **Integrate gradually**: Ensure compatibility with existing TinyTorch components
**Weekly Check-ins:**
- [ ] **Functionality demo**: Show your improvement working
- [ ] **Performance measurement**: Quantify progress toward goals
- [ ] **Integration testing**: Verify compatibility with existing code
- [ ] **Documentation updates**: Keep track of design decisions
### **Phase 3: Optimization & Polish** (Week 4)
**Refine your implementation and maximize impact**
**Focus Areas:**
- **Performance tuning**: Squeeze out maximum efficiency gains
- **Error handling**: Make your code robust for edge cases
- **API design**: Ensure your improvements are easy to use
- **Testing coverage**: Comprehensive tests for all new functionality
### **Phase 4: Evaluation & Presentation** (Week 5+)
**Demonstrate impact and reflect on engineering trade-offs**
**Final Deliverables:**
- [ ] **Benchmark comparison**: Before/after performance analysis
- [ ] **Engineering report**: Technical decisions, trade-offs, lessons learned
- [ ] **Live demonstration**: Show your improvements working on real examples
- [ ] **Future roadmap**: Next optimization opportunities identified
---
## 🎯 **Success Criteria: Proving Mastery**
Your capstone demonstrates mastery when you achieve:
### **🔬 Technical Excellence**
- [ ] **Measurable improvement**: 20%+ performance gain, significant new functionality, or major UX improvement
- [ ] **Systems integration**: Your changes work seamlessly with all existing TinyTorch modules
- [ ] **Production quality**: Error handling, edge cases, comprehensive testing
- [ ] **Performance analysis**: You understand *why* your changes work and their trade-offs
### **🏗️ Framework Understanding**
- [ ] **Architectural consistency**: Your additions follow TinyTorch design patterns
- [ ] **No external dependencies**: Use only TinyTorch components you built (proves deep understanding)
- [ ] **Backward compatibility**: Existing code still works after your improvements
- [ ] **Future extensibility**: Your changes enable further optimization opportunities
### **💼 Professional Development**
- [ ] **Clear documentation**: Other students can understand and use your improvements
- [ ] **Engineering insights**: You can explain trade-offs and alternative approaches
- [ ] **Systematic evaluation**: Scientific methodology in measuring improvements
- [ ] **Presentation skills**: Effectively communicate technical work to different audiences
---
## 🏆 **Capstone Deliverables**
Submit your completed capstone as a professional portfolio:
### **1. 📊 Technical Report** (`capstone_report.md`)
**Structure:**
```markdown
# [Your Track]: [Project Title]
## Executive Summary
- Problem statement and motivation
- Key technical achievements
- Performance improvements achieved
- Engineering insights gained
## Technical Approach
- Architecture and design decisions
- Implementation methodology
- Tools and techniques used
- Alternative approaches considered
## Results & Analysis
- Quantitative performance improvements
- Benchmark comparisons (before/after)
- Trade-off analysis (speed vs memory vs complexity)
- Limitations and future work
## Engineering Reflection
- What you learned about framework design
- Most challenging technical decisions
- How your work fits into broader ML systems
```
### **2. 💻 Implementation Code** (`src/` directory)
```
src/
├── optimizations/ # Your improved components
│ ├── fast_matmul.py
│ ├── efficient_trainer.py
│ └── advanced_optimizers.py
├── tests/ # Comprehensive test suite
│ ├── test_performance.py
│ ├── test_compatibility.py
│ └── test_edge_cases.py
├── benchmarks/ # Performance measurement tools
│ ├── benchmark_suite.py
│ └── comparison_tools.py
└── demo/ # Working examples
├── demo_improvements.py
└── integration_examples.py
```
### **3. 📈 Performance Analysis** (`benchmarks/` directory)
- **Before/after comparisons**: Quantify your improvements
- **Memory profiling**: Allocation patterns and optimization impact
- **Scalability analysis**: How improvements perform with larger models
- **Framework comparison**: Your TinyTorch vs PyTorch (where relevant)
### **4. 🎥 Live Demonstration** (`demo.py`)
**Requirements:**
- Show your improvements working on real TinyTorch models
- Side-by-side comparison with original implementation
- Quantified performance improvements displayed
- Real use case demonstrating practical value
---
## 💡 **Pro Tips for Capstone Success**
### **🎯 Start With Impact**
```python
# Instead of optimizing everything...
def optimize_everything():
pass # This leads to shallow improvements
# Find the biggest bottleneck first
def profile_and_optimize():
bottleneck = find_biggest_bottleneck() # 80% of runtime
return optimize_specific_operation(bottleneck) # 10x speedup
```
### **🧪 Measure Everything**
- **Baseline early**: Know your starting point precisely
- **Benchmark often**: Track progress with each change
- **Compare fairly**: Use identical test conditions
- **Document trade-offs**: Speed vs memory vs complexity
### **🔗 Use Your Existing Framework**
```python
# Test improvements with models you built in previous modules
cifar_model = load_your_module_10_model() # Real CNN from Module 6
test_your_optimization(cifar_model) # Does it still work?
measure_improvement(cifar_model) # How much faster/better?
```
### **📚 Think Like a Framework Maintainer**
- **API design**: How would other students use your improvements?
- **Documentation**: Can someone else understand and extend your work?
- **Testing**: What could break? How do you prevent it?
- **Compatibility**: Does existing code still work?
---
## 🚀 **Getting Started: Your First Steps**
### **1. Choose Your Track**
Review the 5 tracks above and pick the one that excites you most. Consider:
- What aspect of ML systems interests you most?
- What would you want to optimize in a real job?
- What matches your career goals?
### **2. Run Initial Profiling**
```bash
# Profile your current TinyTorch framework
cd modules/source/16_capstone/
python profile_baseline.py
# This will show you:
# - Where your framework spends time
# - Memory usage patterns
# - Comparison to PyTorch baseline
# - Optimization opportunities ranked by impact
```
### **3. Set Specific Goals**
Based on profiling results, choose concrete, measurable targets:
- **Performance**: "5x faster matrix multiplication"
- **Algorithm**: "Complete Vision Transformer implementation"
- **Systems**: "Production API handling 1000 req/sec"
- **Analysis**: "Scientific comparison with 95% confidence intervals"
- **Developer UX**: "Visual debugger reducing debug time by 50%"
### **4. Start Building**
```python
# Begin with the simplest version that demonstrates your concept
def minimal_viable_optimization():
# Get something working first
# Measure improvement
# Then optimize further
pass
```
---
## 🎓 **Your Capstone Journey Starts Now**
You've built a complete ML framework from scratch. You understand tensors, autograd, optimization, and production systems at the deepest level.
**Now prove it.**
Choose your track, set ambitious but achievable goals, and start optimizing. Remember: you're not just improving code—you're demonstrating that you can engineer production ML systems at the level of PyTorch contributors.
**Your goal**: Become the engineer others turn to when they need to make ML systems better.
### **Ready to start?**
1. **Choose your track** from the 5 options above
2. **Run the profiling script** to understand your baseline
3. **Set specific, measurable goals** for your improvement
4. **Start with the simplest implementation** that shows progress
**🔥 Your TinyTorch framework is waiting to be optimized. Start engineering.**
---
*Remember: The best capstone projects solve real problems you encountered while building TinyTorch. What frustrated you? What was slow? What could be better? Start there.*

File diff suppressed because it is too large


@@ -1,864 +0,0 @@
#| default_exp core.capstone
# %% [markdown]
"""
# Module 16: Capstone - Building Production ML Systems
## Learning Objectives
By the end of this module, you will:
1. Integrate all TinyTorch components into a complete ML system
2. Apply production ML systems principles across the entire stack
3. Optimize end-to-end system performance
4. Design and implement enterprise-grade ML solutions
5. Master the complete ML systems engineering workflow
"""
# %%
import sys
import os
sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))))
import numpy as np
import time
from typing import Dict, List, Tuple, Any, Optional
from dataclasses import dataclass, field
import json
# Import all TinyTorch components
from tinytorch.tensor import Tensor
from tinytorch.nn import Module, Layer
from tinytorch.optim import Optimizer, SGD, Adam
from tinytorch.data import DataLoader
from tinytorch.autograd import no_grad
# %% [markdown]
"""
## Part 1: Module Introduction
This capstone module brings together everything you've learned to build a complete, production-ready ML system. You'll integrate all TinyTorch components while applying ML systems engineering principles at scale.
### What We're Building
- Complete end-to-end ML system with all components integrated
- Production-grade performance profiling and optimization
- Enterprise MLOps workflow with monitoring and deployment
- Scalable architecture ready for millions of users
"""
# %% [markdown]
"""
## Part 2: Mathematical Background
### System-Level Optimization
The complete ML system optimization problem involves multiple objectives:
$$\\min_{\\theta} \\mathcal{L}_{total} = \\mathcal{L}_{model} + \\lambda_1\\mathcal{L}_{latency} + \\lambda_2\\mathcal{L}_{memory} + \\lambda_3\\mathcal{L}_{cost}$$
Where:
- $\\mathcal{L}_{model}$: Model accuracy loss
- $\\mathcal{L}_{latency}$: Inference latency penalty
- $\\mathcal{L}_{memory}$: Memory usage penalty
- $\\mathcal{L}_{cost}$: Computational cost penalty
### End-to-End Performance Model
System throughput is bounded by:
$$\\text{Throughput} \\leq \\min\\left(\\frac{1}{T_{compute}}, \\frac{B}{M_{transfer}}, \\frac{C}{R_{memory}}\\right)$$
Where:
- $T_{compute}$: Computation time per sample
- $B$: Available transfer bandwidth
- $M_{transfer}$: Data transferred per sample
- $C$: Available memory bandwidth
- $R_{memory}$: Memory read per sample
"""
# %% [markdown]
"""
## Part 3: Core Implementation - Production ML System Profiler
"""
# %%
@dataclass
class SystemMetrics:
"""Complete system performance metrics"""
model_accuracy: float
inference_latency_ms: float
throughput_samples_sec: float
memory_usage_mb: float
gpu_utilization: float
cost_per_million_inferences: float
@dataclass
class OptimizationRecommendation:
"""System optimization recommendation"""
component: str
issue: str
impact: str # "high", "medium", "low"
recommendation: str
estimated_improvement: float # percentage
class ProductionMLSystemProfiler:
"""
Complete ML system profiler integrating all components.
85% implementation - students extend with custom systems.
"""
def __init__(self):
self.profiling_data = {}
self.system_config = {
"hardware": self._detect_hardware(),
"deployment": "cloud", # cloud, edge, on-premise
"scale": "enterprise" # prototype, production, enterprise
}
def _detect_hardware(self) -> Dict[str, Any]:
"""Detect available hardware configuration"""
import platform
import psutil
return {
"cpu": platform.processor(),
"cpu_cores": psutil.cpu_count(),
"memory_gb": psutil.virtual_memory().total / (1024**3),
"gpu": "simulated", # Would detect real GPU
"accelerators": []
}
def profile_end_to_end_system(self,
model: 'Module',
dataloader: 'DataLoader',
optimizer: 'Optimizer') -> SystemMetrics:
"""
Profile complete ML system performance.
This integrates profiling from all previous modules:
- Tensor operations (Module 2)
- Activation functions (Module 3)
- Layer computations (Module 4-7)
- Data loading (Module 8)
- Autograd (Module 9)
- Optimization (Module 10)
- Training (Module 11)
"""
print("🔬 Profiling End-to-End ML System...")
# Simulate comprehensive profiling
start_time = time.time()
# Profile inference pipeline
inference_times = []
memory_usage = []
for batch_idx, (data, target) in enumerate(dataloader):
if batch_idx >= 10: # Profile first 10 batches
break
batch_start = time.time()
# Forward pass
with no_grad():
output = model(data)
batch_time = (time.time() - batch_start) * 1000
inference_times.append(batch_time)
# Simulate memory tracking
memory_usage.append(
data.data.nbytes / (1024**2) +
sum(p.data.nbytes / (1024**2) for p in model.parameters())
)
# Calculate metrics
metrics = SystemMetrics(
model_accuracy=0.95, # Would calculate real accuracy
inference_latency_ms=np.mean(inference_times),
throughput_samples_sec=1000 / np.mean(inference_times) * dataloader.batch_size,
memory_usage_mb=np.mean(memory_usage),
gpu_utilization=0.75, # Simulated
cost_per_million_inferences=0.10 # Simulated cloud cost
)
# Store profiling data
self.profiling_data['system_metrics'] = metrics
print(f"✅ System Profiling Complete")
print(f" Latency: {metrics.inference_latency_ms:.2f}ms")
print(f" Throughput: {metrics.throughput_samples_sec:.0f} samples/sec")
print(f" Memory: {metrics.memory_usage_mb:.1f}MB")
print(f" Cost: ${metrics.cost_per_million_inferences:.2f}/1M inferences")
return metrics
def detect_cross_module_optimizations(self) -> List[OptimizationRecommendation]:
"""
Identify optimization opportunities across modules.
This analyzes interactions between:
- Tensor operations and memory layout
- Layer fusion opportunities
- Autograd graph optimization
- Data pipeline and model overlap
"""
print("\n🔍 Detecting Cross-Module Optimization Opportunities...")
recommendations = []
# Kernel fusion opportunity
recommendations.append(OptimizationRecommendation(
component="Layers + Activations",
issue="Separate kernel launches for linear and activation",
impact="high",
recommendation="Fuse linear layer with activation function",
estimated_improvement=15.0
))
# Memory layout optimization
recommendations.append(OptimizationRecommendation(
component="Tensor + Spatial",
issue="Non-contiguous memory access in convolutions",
impact="medium",
recommendation="Use channels-last memory format",
estimated_improvement=10.0
))
# Data pipeline optimization
recommendations.append(OptimizationRecommendation(
component="DataLoader + Training",
issue="CPU-GPU transfer blocking training",
impact="high",
recommendation="Implement data prefetching and pinned memory",
estimated_improvement=20.0
))
# Autograd optimization
recommendations.append(OptimizationRecommendation(
component="Autograd + Optimizer",
issue="Redundant gradient computations",
impact="low",
recommendation="Implement gradient checkpointing for large models",
estimated_improvement=5.0
))
for rec in recommendations:
print(f" [{rec.impact.upper()}] {rec.component}: {rec.recommendation}")
print(f" Estimated improvement: {rec.estimated_improvement}%")
return recommendations
def validate_production_readiness(self) -> Dict[str, bool]:
"""
Validate system readiness for production deployment.
Checks all critical production requirements:
- Performance SLAs
- Scalability requirements
- Monitoring and observability
- Error handling and recovery
- Security and compliance
"""
print("\n✅ Validating Production Readiness...")
checks = {
"performance_sla": self._check_performance_sla(),
"scalability": self._check_scalability(),
"monitoring": self._check_monitoring(),
"error_handling": self._check_error_handling(),
"security": self._check_security(),
"mlops_integration": self._check_mlops()
}
for check, passed in checks.items():
            status = "✅" if passed else "❌"
print(f" {status} {check.replace('_', ' ').title()}")
return checks
def _check_performance_sla(self) -> bool:
"""Check if system meets performance SLAs"""
if 'system_metrics' not in self.profiling_data:
return False
metrics = self.profiling_data['system_metrics']
return metrics.inference_latency_ms < 100 # 100ms SLA
def _check_scalability(self) -> bool:
"""Check scalability requirements"""
# Would test with increasing load
return True # Simulated
def _check_monitoring(self) -> bool:
"""Check monitoring capabilities"""
# Would verify metrics export, logging, etc.
return True # Simulated
def _check_error_handling(self) -> bool:
"""Check error handling and recovery"""
# Would test failure scenarios
return True # Simulated
def _check_security(self) -> bool:
"""Check security requirements"""
# Would verify authentication, encryption, etc.
return True # Simulated
def _check_mlops(self) -> bool:
"""Check MLOps integration"""
# Would verify CI/CD, versioning, etc.
return True # Simulated
def analyze_scalability(self, target_qps: int = 10000) -> Dict[str, Any]:
"""
Analyze system scalability to target QPS.
Determines resource requirements for scaling:
- Horizontal scaling (replica count)
- Vertical scaling (instance size)
- Caching and optimization needs
"""
print(f"\n📈 Analyzing Scalability to {target_qps} QPS...")
if 'system_metrics' not in self.profiling_data:
print(" ⚠️ Run system profiling first")
return {}
metrics = self.profiling_data['system_metrics']
current_qps = metrics.throughput_samples_sec
analysis = {
"current_qps": current_qps,
"target_qps": target_qps,
"scaling_factor": target_qps / current_qps,
"recommended_replicas": int(np.ceil(target_qps / current_qps)),
"estimated_cost_per_hour": (target_qps / current_qps) * 2.50, # Simulated
"bottlenecks": []
}
# Identify bottlenecks
if analysis["scaling_factor"] > 10:
analysis["bottlenecks"].append("Need caching layer")
if analysis["scaling_factor"] > 50:
analysis["bottlenecks"].append("Need load balancing")
if analysis["scaling_factor"] > 100:
analysis["bottlenecks"].append("Consider model optimization")
print(f" Current QPS: {current_qps:.0f}")
print(f" Scaling Factor: {analysis['scaling_factor']:.1f}x")
print(f" Recommended Replicas: {analysis['recommended_replicas']}")
print(f" Estimated Cost: ${analysis['estimated_cost_per_hour']:.2f}/hour")
return analysis
def optimize_cost(self, budget_per_hour: float = 100.0) -> Dict[str, Any]:
"""
Optimize system for cost constraints.
Balances:
- Instance types and sizes
- Batch processing vs real-time
- Caching strategies
- Model compression trade-offs
"""
print(f"\n💰 Optimizing for ${budget_per_hour}/hour budget...")
strategies = {
"instance_optimization": {
"current": "p3.2xlarge",
"recommended": "g4dn.xlarge",
"savings": 0.70
},
"batch_processing": {
"enabled": True,
"batch_window_ms": 50,
"throughput_gain": 2.5
},
"model_compression": {
"quantization": "int8",
"size_reduction": 0.75,
"accuracy_impact": 0.01
},
"caching": {
"cache_hit_rate": 0.30,
"cost_reduction": 0.30
}
}
total_savings = sum(s.get("savings", 0) or s.get("cost_reduction", 0)
for s in strategies.values())
print(f" Total potential savings: {total_savings*100:.0f}%")
for strategy, details in strategies.items():
print(f" - {strategy.replace('_', ' ').title()}: {details}")
return strategies
def generate_deployment_config(self,
deployment_target: str = "kubernetes") -> Dict[str, Any]:
"""
Generate production deployment configuration.
Creates complete deployment specs for:
- Kubernetes
- Docker Swarm
- AWS ECS
- Edge devices
"""
print(f"\n🚀 Generating {deployment_target.title()} Deployment Config...")
if deployment_target == "kubernetes":
config = {
"apiVersion": "apps/v1",
"kind": "Deployment",
"metadata": {
"name": "tinytorch-ml-system",
"labels": {"app": "tinytorch"}
},
"spec": {
"replicas": 3,
"selector": {"matchLabels": {"app": "tinytorch"}},
"template": {
"spec": {
"containers": [{
"name": "ml-inference",
"image": "tinytorch:latest",
"resources": {
"limits": {"memory": "4Gi", "cpu": "2"},
"requests": {"memory": "2Gi", "cpu": "1"}
},
"env": [
{"name": "MODEL_PATH", "value": "/models/latest"},
{"name": "BATCH_SIZE", "value": "32"},
{"name": "MAX_WORKERS", "value": "4"}
]
}]
}
}
}
}
else:
config = {"deployment_target": deployment_target, "status": "not_implemented"}
print(f" ✅ Deployment config generated")
print(f" Replicas: {config.get('spec', {}).get('replicas', 'N/A')}")
return config
# %% [markdown]
"""
## Part 4: Testing the Production System Profiler
Let's test our comprehensive system profiler with a complete ML pipeline.
"""
# %%
def test_production_system_profiler():
"""Test the complete production ML system profiler"""
print("Testing Production ML System Profiler")
print("=" * 50)
# Create mock components
class MockModel(Module):
def __init__(self):
super().__init__()
self.layers = []
def forward(self, x):
return x
def parameters(self):
return [Tensor(np.random.randn(100, 100))]
class MockDataLoader:
def __init__(self):
self.batch_size = 32
def __iter__(self):
for _ in range(10):
yield (Tensor(np.random.randn(32, 784)),
Tensor(np.random.randint(0, 10, 32)))
# Initialize profiler
profiler = ProductionMLSystemProfiler()
# Create mock components
model = MockModel()
dataloader = MockDataLoader()
optimizer = SGD(model.parameters(), lr=0.01)
# Profile system
metrics = profiler.profile_end_to_end_system(model, dataloader, optimizer)
assert metrics.inference_latency_ms > 0
# Detect optimizations
recommendations = profiler.detect_cross_module_optimizations()
assert len(recommendations) > 0
# Validate production readiness
checks = profiler.validate_production_readiness()
assert all(isinstance(v, bool) for v in checks.values())
# Analyze scalability
scalability = profiler.analyze_scalability(target_qps=10000)
assert scalability["scaling_factor"] > 0
# Optimize cost
cost_optimization = profiler.optimize_cost(budget_per_hour=100.0)
assert len(cost_optimization) > 0
# Generate deployment config
deploy_config = profiler.generate_deployment_config("kubernetes")
assert "apiVersion" in deploy_config
print("\n✅ All production system profiler tests passed!")
# Only run tests if executed directly
if __name__ == "__main__":
test_production_system_profiler()
# %% [markdown]
"""
## Part 5: Building Complete ML Systems
Now let's build a complete, production-ready ML system that integrates all TinyTorch components.
"""
# %%
class CompleteMlSystem:
"""
Complete ML system integrating all TinyTorch components.
This represents a production-ready system architecture.
"""
def __init__(self, config: Dict[str, Any]):
self.config = config
self.components = {}
self.metrics = {}
self.profiler = ProductionMLSystemProfiler()
def build_system(self):
"""Build the complete ML system with all components"""
print("🏗️ Building Complete ML System...")
# Initialize all components
self.components["model"] = self._build_model()
self.components["optimizer"] = self._build_optimizer()
self.components["dataloader"] = self._build_dataloader()
self.components["monitor"] = self._build_monitor()
print("✅ System build complete")
def _build_model(self):
"""Build model with all layer types"""
# Would build real model with Dense, Conv, Attention layers
print(" Building model architecture...")
return None # Placeholder
def _build_optimizer(self):
"""Build optimizer with adaptive strategies"""
print(" Configuring optimizer...")
return None # Placeholder
def _build_dataloader(self):
"""Build data pipeline with preprocessing"""
print(" Setting up data pipeline...")
return None # Placeholder
def _build_monitor(self):
"""Build monitoring and observability"""
print(" Configuring monitoring...")
return None # Placeholder
def train(self, epochs: int = 10):
"""Production training loop with all features"""
print(f"\n🎯 Training for {epochs} epochs...")
for epoch in range(epochs):
# Training logic with:
# - Gradient accumulation
# - Mixed precision
# - Checkpointing
# - Early stopping
# - Learning rate scheduling
if epoch % 5 == 0:
print(f" Epoch {epoch}: loss=0.{100-epoch*5:.3f}")
print("✅ Training complete")
def deploy(self, target: str = "production"):
"""Deploy system to production"""
print(f"\n🚀 Deploying to {target}...")
# Deployment steps:
# 1. Model optimization (quantization, pruning)
# 2. Container building
# 3. Service deployment
# 4. Load balancer configuration
# 5. Monitoring setup
print(f"✅ Deployed to {target}")
def monitor_production(self):
"""Monitor production system"""
print("\n📊 Production Monitoring Dashboard")
print(" QPS: 5000")
print(" P99 Latency: 45ms")
print(" Error Rate: 0.01%")
print(" Model Drift: None detected")
# %% [markdown]
"""
## Part 6: System Integration Testing
Let's test how all components work together in a production scenario.
"""
# %%
def test_complete_ml_system():
"""Test the complete ML system integration"""
print("Testing Complete ML System Integration")
print("=" * 50)
# System configuration
config = {
"model": {
"architecture": "transformer",
"layers": 12,
"hidden_dim": 768
},
"training": {
"batch_size": 32,
"learning_rate": 0.001,
"epochs": 10
},
"deployment": {
"target": "kubernetes",
"replicas": 3,
"autoscaling": True
}
}
# Build system
system = CompleteMlSystem(config)
system.build_system()
# Train model
system.train(epochs=10)
# Deploy to production
system.deploy("production")
# Monitor production
system.monitor_production()
print("\n✅ Complete ML system test passed!")
# Only run tests if executed directly
if __name__ == "__main__":
test_complete_ml_system()
# %% [markdown]
"""
## Part 7: ML Systems Thinking Questions
### 🏗️ Complete ML System Architecture
1. How would you design a multi-tenant ML platform that serves models for different customers while ensuring isolation and fair resource allocation?
2. What are the trade-offs between monolithic and microservices architectures for ML systems, and when would you choose each?
3. How do you handle versioning and compatibility when different components of your ML system evolve at different rates?
4. What patterns would you use to ensure your ML system remains maintainable as it grows from 10 to 1000+ models?
### 🏢 Enterprise ML Platform Design
1. How would you design an ML platform that supports both batch and real-time inference while sharing the same model artifacts?
2. What governance and compliance features would you build into an enterprise ML platform for regulated industries?
3. How would you implement multi-cloud ML deployments that can failover between providers seamlessly?
4. What would be your strategy for building an ML platform that supports both centralized and federated learning?
### 🚀 Production System Optimization
1. How would you systematically identify and eliminate bottlenecks in a complex ML system serving millions of requests?
2. What strategies would you employ to reduce cold start latency in serverless ML deployments?
3. How would you design an adaptive system that automatically adjusts resources based on traffic patterns and model complexity?
4. What techniques would you use to optimize the cost-performance trade-off in a large-scale ML system?
### 📈 Scaling to Millions of Users
1. How would you architect an ML system to handle sudden 100x traffic spikes during viral events?
2. What caching strategies would you implement for ML predictions, and how would you handle cache invalidation?
3. How would you design a global ML serving infrastructure that minimizes latency for users worldwide?
4. What patterns would you use to ensure consistency when serving ML models across hundreds of edge locations?
### 🔮 Future of ML Systems
1. How will ML systems architecture need to evolve to support increasingly large foundation models?
2. What role will hardware-software co-design play in the future of ML systems, and how should engineers prepare?
3. How might quantum computing change the way we design and optimize ML systems?
4. What new abstractions and tools will be needed as ML systems become more autonomous and self-optimizing?
"""
# %% [markdown]
"""
## Part 8: Enterprise Deployment Patterns
Let's implement advanced deployment patterns used in production ML systems.
"""
# %%
class EnterpriseDeploymentOrchestrator:
"""
Orchestrates enterprise ML deployments with advanced patterns.
"""
def __init__(self):
self.deployment_strategies = {
"blue_green": self._blue_green_deployment,
"canary": self._canary_deployment,
"shadow": self._shadow_deployment,
"gradual_rollout": self._gradual_rollout
}
def _blue_green_deployment(self, model_v1, model_v2):
"""Blue-green deployment with instant switchover"""
print("🔵🟢 Executing Blue-Green Deployment")
print(" 1. Deploy v2 to green environment")
print(" 2. Run validation tests on green")
print(" 3. Switch traffic from blue to green")
print(" 4. Keep blue as rollback option")
return {"status": "success", "rollback_available": True}
def _canary_deployment(self, model_v1, model_v2, canary_percent=5):
"""Canary deployment with gradual rollout"""
print(f"🐤 Executing Canary Deployment ({canary_percent}% initial)")
print(f" 1. Route {canary_percent}% traffic to v2")
print(" 2. Monitor metrics for 1 hour")
print(" 3. Gradually increase to 100% if healthy")
return {"status": "in_progress", "current_percentage": canary_percent}
def _shadow_deployment(self, model_v1, model_v2):
"""Shadow deployment for risk-free testing"""
print("👤 Executing Shadow Deployment")
print(" 1. Deploy v2 in shadow mode")
print(" 2. Duplicate traffic to v2 (responses ignored)")
print(" 3. Compare v1 and v2 outputs")
print(" 4. Promote v2 when confidence threshold met")
return {"status": "shadowing", "agreement_rate": 0.98}
    def _gradual_rollout(self, model_v1, model_v2, stages=(5, 25, 50, 100)):
"""Multi-stage gradual rollout"""
print(f"📊 Executing Gradual Rollout: {stages}%")
for stage in stages:
print(f" Stage: {stage}% - Monitor for 2 hours")
return {"status": "staged", "stages": stages}
def deploy_with_strategy(self, strategy: str, **kwargs):
"""Deploy using specified strategy"""
if strategy in self.deployment_strategies:
return self.deployment_strategies[strategy](**kwargs)
else:
raise ValueError(f"Unknown strategy: {strategy}")
# Test deployment patterns
def test_enterprise_deployment():
"""Test enterprise deployment patterns"""
print("\nTesting Enterprise Deployment Patterns")
print("=" * 50)
orchestrator = EnterpriseDeploymentOrchestrator()
# Test different strategies
mock_v1 = "model_v1"
mock_v2 = "model_v2"
# Blue-Green
result = orchestrator.deploy_with_strategy("blue_green",
model_v1=mock_v1,
model_v2=mock_v2)
assert result["status"] == "success"
# Canary
result = orchestrator.deploy_with_strategy("canary",
model_v1=mock_v1,
model_v2=mock_v2,
canary_percent=10)
assert result["current_percentage"] == 10
print("\n✅ All deployment patterns tested successfully!")
# Only run tests if executed directly
if __name__ == "__main__":
test_enterprise_deployment()
# %% [markdown]
"""
## Part 9: Comprehensive Testing
Let's run comprehensive tests that validate the entire ML system.
"""
# %%
def run_comprehensive_system_tests():
"""Run comprehensive tests for the complete ML system"""
print("\n🧪 Running Comprehensive System Tests")
print("=" * 50)
test_results = {
"unit_tests": True,
"integration_tests": True,
"performance_tests": True,
"scalability_tests": True,
"security_tests": True,
"mlops_tests": True
}
# Simulate comprehensive testing
for test_type, passed in test_results.items():
        status = "✅" if passed else "❌"
print(f"{status} {test_type.replace('_', ' ').title()}: {'Passed' if passed else 'Failed'}")
# Overall status
all_passed = all(test_results.values())
if all_passed:
print("\n🎉 All comprehensive tests passed!")
print("System is ready for production deployment!")
else:
print("\n⚠️ Some tests failed. Please review and fix issues.")
return all_passed
# Run comprehensive tests only if executed directly
if __name__ == "__main__":
success = run_comprehensive_system_tests()
assert success, "System tests must pass before deployment"
# %% [markdown]
"""
## Part 10: Module Summary
### What We've Built
You've successfully integrated all TinyTorch components into a complete, production-ready ML system:
1. **Complete System Profiler**: Analyzes performance across all components
2. **Cross-Module Optimization**: Identifies and implements system-wide optimizations
3. **Production Validation**: Ensures system meets enterprise requirements
4. **Scalability Analysis**: Plans for growth to millions of users
5. **Cost Optimization**: Balances performance with budget constraints
6. **Enterprise Deployment**: Implements advanced deployment strategies
7. **Comprehensive Testing**: Validates the entire system end-to-end
### Key Takeaways
- ML systems engineering requires thinking beyond individual components
- Production systems need careful orchestration of many moving parts
- Performance optimization is a continuous, multi-dimensional process
- Scalability must be designed in from the beginning
- Monitoring and observability are critical for production success
### Your ML Systems Journey
You've progressed from understanding basic tensors to building complete production ML systems. You now have the knowledge to:
- Design and implement ML systems from scratch
- Optimize for production performance and scale
- Deploy and monitor ML systems in enterprise environments
- Make informed architectural decisions
- Continue learning as ML systems evolve
### Next Steps
1. Build your own production ML system using TinyTorch
2. Contribute to open-source ML frameworks
3. Explore specialized areas (distributed training, edge deployment, etc.)
4. Stay current with ML systems research and industry practices
5. Share your knowledge and help others learn
Congratulations on completing the TinyTorch ML Systems Engineering journey! 🎉
"""


@@ -1,500 +0,0 @@
# 🎯 Capstone Project Guide: Performance Optimization Example
## **Example Project: Vectorized Matrix Operations**
This guide walks through a complete capstone project optimizing TinyTorch's matrix operations. Follow this example to understand the process, then apply it to your chosen optimization track.
---
## **Phase 1: Analysis & Profiling**
### **Step 1: Profile Your Current Implementation**
First, let's identify where TinyTorch spends most of its time:
```python
import cProfile
import pstats
import time
import numpy as np
from memory_profiler import profile
# Import your TinyTorch framework
from tinytorch.core.tensor import Tensor
from tinytorch.core.layers import Dense
from tinytorch.core.networks import Sequential
from tinytorch.core.activations import ReLU
def profile_current_framework():
"""Profile a typical TinyTorch training scenario."""
# Create a realistic model
model = Sequential([
Dense(784, 256),
ReLU(),
Dense(256, 128),
ReLU(),
Dense(128, 10)
])
# Generate realistic data (like MNIST)
batch_size = 64
X = Tensor(np.random.randn(batch_size, 784))
# Profile forward pass
profiler = cProfile.Profile()
profiler.enable()
# Run multiple forward passes
for _ in range(100):
output = model.forward(X)
profiler.disable()
# Analyze results
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)
return stats
# Run profiling
print("🔍 Profiling Current TinyTorch Framework...")
profile_results = profile_current_framework()
```
### **Step 2: Analyze Bottlenecks**
Typical results show:
```
1003 function calls in 2.450 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
100 0.001 0.000 2.449 0.024 networks.py:45(forward)
300 0.002 0.000 2.448 0.008 layers.py:67(forward)
300 2.440 0.008 2.446 0.008 layers.py:34(matmul_naive) ← BOTTLENECK!
200 0.004 0.000 0.004 0.000 activations.py:23(forward)
```
**Finding**: 99.6% of time spent in `matmul_naive`! This is our optimization target.
### **Step 3: Baseline Benchmarks**
```python
def benchmark_current_matmul():
"""Establish baseline performance metrics."""
# Test various matrix sizes
sizes = [(100, 100), (500, 500), (1000, 1000), (2000, 2000)]
for m, n in sizes:
A = np.random.randn(m, n)
B = np.random.randn(n, m)
# Time current implementation
start = time.time()
result = matmul_naive(A, B) # Your current implementation
current_time = time.time() - start
# Time NumPy for comparison
start = time.time()
numpy_result = np.dot(A, B)
numpy_time = time.time() - start
slowdown = current_time / numpy_time
print(f"Size {m}x{n}: TinyTorch={current_time:.3f}s, NumPy={numpy_time:.3f}s, Slowdown={slowdown:.1f}x")
print("📊 Baseline Performance:")
benchmark_current_matmul()
```
**Typical Output:**
```
Size 100x100: TinyTorch=0.023s, NumPy=0.001s, Slowdown=23.0x
Size 500x500: TinyTorch=0.890s, NumPy=0.012s, Slowdown=74.2x
Size 1000x1000: TinyTorch=7.234s, NumPy=0.089s, Slowdown=81.3x
```
**Goal**: Reduce this slowdown from 80x to under 5x.
---
## **Phase 2: Optimization Implementation**
### **Step 4: Implement Optimized Matrix Multiplication**
```python
def matmul_optimized_v1(A, B):
"""
First optimization: Use NumPy's optimized dot product.
This isn't cheating - NumPy is our computational backend,
just like PyTorch uses BLAS/LAPACK under the hood.
"""
# Validate inputs (keep your error checking)
assert A.shape[1] == B.shape[0], f"Cannot multiply {A.shape} and {B.shape}"
# Use NumPy's optimized implementation
return np.dot(A, B)
def matmul_optimized_v2(A, B):
"""
Second optimization: Block-based multiplication for large matrices.
Better cache performance for very large operations.
"""
m, k = A.shape
k2, n = B.shape
assert k == k2
# For small matrices, use simple NumPy
if m * n * k < 1000000: # Threshold tuned empirically
return np.dot(A, B)
# For large matrices, use block multiplication
block_size = 256 # Optimized for L2 cache
C = np.zeros((m, n))
for i in range(0, m, block_size):
for j in range(0, n, block_size):
for l in range(0, k, block_size):
# Extract blocks
A_block = A[i:i+block_size, l:l+block_size]
B_block = B[l:l+block_size, j:j+block_size]
# Multiply blocks
C[i:i+block_size, j:j+block_size] += np.dot(A_block, B_block)
return C
def matmul_optimized_v3(A, B):
"""
Third optimization: Memory layout optimization.
Ensure contiguous memory for better performance.
"""
# Ensure C-contiguous layout for better cache performance
if not A.flags['C_CONTIGUOUS']:
A = np.ascontiguousarray(A)
if not B.flags['C_CONTIGUOUS']:
B = np.ascontiguousarray(B)
# Use the block approach with optimized memory layout
return matmul_optimized_v2(A, B)
```
### **Step 5: Test and Benchmark Optimizations**
```python
def benchmark_optimizations():
"""Compare all optimization versions."""
sizes = [(100, 100), (500, 500), (1000, 1000), (2000, 2000)]
for m, n in sizes:
A = np.random.randn(m, n)
B = np.random.randn(n, m)
# Test correctness first
result_naive = matmul_naive(A, B)
result_v1 = matmul_optimized_v1(A, B)
result_v2 = matmul_optimized_v2(A, B)
result_v3 = matmul_optimized_v3(A, B)
# Verify all produce same results
assert np.allclose(result_naive, result_v1, rtol=1e-10)
assert np.allclose(result_naive, result_v2, rtol=1e-10)
assert np.allclose(result_naive, result_v3, rtol=1e-10)
# Benchmark performance
times = {}
for name, func in [
('naive', matmul_naive),
('v1_numpy', matmul_optimized_v1),
('v2_blocks', matmul_optimized_v2),
('v3_memory', matmul_optimized_v3)
]:
start = time.time()
_ = func(A, B)
times[name] = time.time() - start
print(f"\nSize {m}x{n}:")
baseline = times['naive']
for name, t in times.items():
speedup = baseline / t
print(f" {name:12}: {t:.3f}s (speedup: {speedup:.1f}x)")
print("⚡ Optimization Results:")
benchmark_optimizations()
```
**Typical Results:**
```
Size 1000x1000:
naive : 7.234s (speedup: 1.0x)
v1_numpy : 0.089s (speedup: 81.3x) ← Huge improvement!
v2_blocks : 0.091s (speedup: 79.5x) ← Slight regression for this size
v3_memory : 0.087s (speedup: 83.1x) ← Best overall
```
---
## **Phase 3: Integration & Testing**
### **Step 6: Update Your Dense Layer**
```python
class DenseOptimized:
"""Optimized Dense layer using improved matrix multiplication."""
def __init__(self, input_size, output_size):
self.input_size = input_size
self.output_size = output_size
# Initialize weights (same as before)
self.weight = np.random.randn(input_size, output_size) * 0.1
self.bias = np.zeros(output_size)
def forward(self, x):
"""Forward pass using optimized matrix multiplication."""
# Use our optimized matmul instead of naive version
linear_output = matmul_optimized_v3(x, self.weight)
return linear_output + self.bias
def __call__(self, x):
return self.forward(x)
```
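Before swapping the layer into a full network, run a quick sanity check. A fresh `Dense` and `DenseOptimized` start from different random weights, so copy the parameters first (this sketch assumes the original `Dense` also exposes `weight`/`bias` as NumPy arrays and accepts array inputs):
```python
# Sanity check: the optimized layer must match the original numerically
layer_orig = Dense(784, 256)
layer_opt = DenseOptimized(784, 256)

# Align parameters so both layers compute the same function
layer_opt.weight = layer_orig.weight.copy()
layer_opt.bias = layer_orig.bias.copy()

x = np.random.randn(8, 784)
assert np.allclose(layer_orig.forward(x), layer_opt.forward(x), rtol=1e-10)
print("✅ DenseOptimized matches Dense")
```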
### **Step 7: End-to-End Performance Test**
```python
def test_full_network_improvement():
"""Test the complete training pipeline with optimizations."""
# Create identical networks with different matmul implementations
print("🏗️ Creating test networks...")
# Original network (using naive matmul)
network_original = Sequential([
Dense(784, 256), # Uses matmul_naive
ReLU(),
Dense(256, 128),
ReLU(),
Dense(128, 10)
])
# Optimized network (using optimized matmul)
network_optimized = Sequential([
DenseOptimized(784, 256), # Uses matmul_optimized_v3
ReLU(),
DenseOptimized(256, 128),
ReLU(),
DenseOptimized(128, 10)
])
    # Copy weights so both networks compute identical functions
    # (each layer otherwise gets its own random initialization;
    # assumes Sequential exposes its layer list as `.layers` and that
    # layers store `weight`/`bias` as NumPy arrays)
    for layer_orig, layer_opt in zip(network_original.layers, network_optimized.layers):
        if hasattr(layer_orig, 'weight'):
            layer_opt.weight = layer_orig.weight.copy()
            layer_opt.bias = layer_orig.bias.copy()
    # Test data
    batch_size = 64
    X = np.random.randn(batch_size, 784)
# Benchmark original network
print("⏱️ Benchmarking original network...")
start = time.time()
for _ in range(100):
output_orig = network_original.forward(X)
time_original = time.time() - start
# Benchmark optimized network
print("⚡ Benchmarking optimized network...")
start = time.time()
for _ in range(100):
output_opt = network_optimized.forward(X)
time_optimized = time.time() - start
# Calculate improvement
speedup = time_original / time_optimized
time_saved = time_original - time_optimized
print(f"\n🎉 Results:")
print(f" Original network: {time_original:.3f}s")
print(f" Optimized network: {time_optimized:.3f}s")
print(f" Speedup: {speedup:.1f}x")
print(f" Time saved: {time_saved:.3f}s ({time_saved/time_original*100:.1f}%)")
# Verify outputs are identical (within numerical precision)
assert np.allclose(output_orig, output_opt, rtol=1e-10), "Outputs don't match!"
print(f" ✅ Numerical correctness verified")
test_full_network_improvement()
```
**Expected Results:**
```
🎉 Results:
Original network: 2.450s
Optimized network: 0.035s
Speedup: 70.0x
Time saved: 2.415s (98.6%)
✅ Numerical correctness verified
```
---
## **Phase 4: Documentation & Analysis**
### **Step 8: Document Your Engineering Decisions**
Create `capstone_report.md`:
```markdown
# Performance Optimization Capstone Report
## Problem Analysis
TinyTorch's matrix multiplication was 80x slower than NumPy, making training
impractically slow. Profiling showed 99.6% of computation time in `matmul_naive`.
## Technical Approach
1. **Root Cause**: Triple-nested loops with poor cache locality
2. **Solution**: Leverage NumPy's optimized BLAS backend
3. **Enhancement**: Add block-based multiplication for huge matrices
4. **Polish**: Memory layout optimization for cache efficiency
## Engineering Trade-offs
- **Gained**: 70x speedup in real networks, maintained numerical precision
- **Lost**: Educational visibility into low-level matrix multiplication
- **Justified**: Students learn optimization thinking, not reinventing BLAS
## Performance Results
- Dense layer operations: 80x faster
- Full network training: 70x faster
- Memory usage: Unchanged
- Numerical accuracy: Maintained (1e-10 relative tolerance)
## Future Optimizations
1. GPU acceleration using CuPy/JAX
2. Sparse matrix support for compressed models
3. Mixed-precision training for memory efficiency
```
### **Step 9: Create Demonstration**
Create `demo.py`:
```python
"""
TinyTorch Performance Optimization Demo
This demonstrates the 70x speedup achieved through matrix operation optimization.
Run this to see before/after performance on your machine.
"""
import time
import numpy as np
from tinytorch.core.networks import Sequential
from tinytorch.core.layers import Dense, DenseOptimized
from tinytorch.core.activations import ReLU
def main():
print("🔥 TinyTorch Performance Optimization Demo")
print("=" * 50)
# Create test scenario: MNIST-like classification
print("📊 Scenario: MNIST-like classification (784→256→128→10)")
batch_size = 64
X = np.random.randn(batch_size, 784)
# Original network
network_original = Sequential([
Dense(784, 256), ReLU(),
Dense(256, 128), ReLU(),
Dense(128, 10)
])
# Optimized network
network_optimized = Sequential([
DenseOptimized(784, 256), ReLU(),
DenseOptimized(256, 128), ReLU(),
DenseOptimized(128, 10)
])
# Benchmark
print("\n⏱️ Running 1000 forward passes...")
# Original
start = time.time()
for _ in range(1000):
_ = network_original.forward(X)
time_orig = time.time() - start
# Optimized
start = time.time()
for _ in range(1000):
_ = network_optimized.forward(X)
time_opt = time.time() - start
# Results
speedup = time_orig / time_opt
print(f"\n🎉 Results:")
print(f" Original: {time_orig:.2f}s")
print(f" Optimized: {time_opt:.2f}s")
print(f" Speedup: {speedup:.1f}x")
print(f" Time saved: {time_orig - time_opt:.2f}s")
if speedup > 50:
print(f" 🚀 Excellent optimization!")
elif speedup > 20:
print(f" ⚡ Great improvement!")
else:
print(f" 📈 Good progress, consider further optimization")
if __name__ == "__main__":
main()
```
---
## **🎯 Your Turn: Apply This Process**
This example showed **Performance Engineering**. Now apply this same systematic approach to your chosen track:
### **For Algorithm Extensions:**
1. **Profile**: Which algorithms are missing from your framework?
2. **Plan**: What modern techniques would add most value?
3. **Implement**: Build new layers/optimizers using existing TinyTorch components (see the Dropout sketch after this list)
4. **Test**: Verify they work with your training pipeline
5. **Document**: Explain design decisions and integration patterns
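As a concrete instance of step 3, even a small addition like a `Dropout` layer exercises the full integration path, since it has to compose with `Sequential` and respect train/eval behavior. This is a sketch; the class and method names follow the conventions used above rather than a prescribed API:
```python
import numpy as np

class Dropout:
    """Inverted dropout: scale at train time so inference is a no-op."""

    def __init__(self, p=0.5):
        self.p = p            # probability of dropping a unit
        self.training = True  # toggle off for evaluation

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # Sample a keep-mask and rescale surviving activations
        mask = (np.random.rand(*x.shape) >= self.p) / (1.0 - self.p)
        return x * mask

    def __call__(self, x):
        return self.forward(x)
```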
### **For Systems Optimization:**
1. **Profile**: Where does memory usage spike? What limits parallelization?
2. **Plan**: Which systems improvements would have biggest impact?
3. **Implement**: Add memory profiling, gradient accumulation, checkpointing (gradient accumulation is sketched after this list)
4. **Test**: Verify improvements don't break existing functionality
5. **Document**: Analyze trade-offs between memory, speed, complexity
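For step 3 on this track, gradient accumulation is a good first target: it simulates a large batch under a fixed memory budget. The sketch below assumes PyTorch-style `zero_grad()`/`backward()`/`step()` interfaces on your model, loss, and optimizer, which may differ from what your TinyTorch build exposes:
```python
def train_step_accumulated(model, loss_fn, optimizer, X, y, micro_batch=16):
    """Process one large batch as several micro-batches.

    Peak activation memory scales with micro_batch rather than len(X),
    at the cost of extra forward/backward passes.
    """
    n_micro = (len(X) + micro_batch - 1) // micro_batch
    optimizer.zero_grad()  # assumed API, mirroring PyTorch conventions
    total_loss = 0.0
    for i in range(0, len(X), micro_batch):
        xb, yb = X[i:i + micro_batch], y[i:i + micro_batch]
        loss = loss_fn(model.forward(xb), yb)
        # Scale so accumulated gradients average over the full batch
        (loss / n_micro).backward()
        total_loss += float(loss)
    optimizer.step()  # one update from the accumulated gradients
    return total_loss / n_micro
```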
### **For Framework Analysis:**
1. **Profile**: How does TinyTorch compare to PyTorch on key operations?
2. **Plan**: What benchmarks would be most revealing?
3. **Implement**: Automated testing suites comparing both frameworks (see the timing harness after this list)
4. **Test**: Run comprehensive performance analysis
5. **Document**: Identify specific optimization opportunities
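Step 3 here can start from a small head-to-head timing harness. The sketch below requires `torch` to be installed and reuses the `matmul_optimized_v3` from Phase 2 as the TinyTorch side:
```python
import time
import numpy as np
import torch

def compare_matmul(size=1024, repeats=10):
    """Time TinyTorch's matmul against torch.matmul at one size."""
    A = np.random.randn(size, size)
    B = np.random.randn(size, size)
    At, Bt = torch.from_numpy(A), torch.from_numpy(B)

    start = time.perf_counter()
    for _ in range(repeats):
        _ = matmul_optimized_v3(A, B)
    tiny = (time.perf_counter() - start) / repeats

    start = time.perf_counter()
    for _ in range(repeats):
        _ = torch.matmul(At, Bt)
    ref = (time.perf_counter() - start) / repeats

    print(f"{size}x{size}: TinyTorch={tiny:.4f}s, "
          f"PyTorch={ref:.4f}s, ratio={tiny / ref:.2f}x")

compare_matmul()
```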
### **For Developer Experience:**
1. **Profile**: What makes debugging TinyTorch difficult?
2. **Plan**: Which tools would help developers most?
3. **Implement**: Gradient visualization, error diagnosis, testing utilities (an activation-inspection sketch follows this list)
4. **Test**: Use tools on real debugging scenarios
5. **Document**: Show how tools improve development workflow
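For step 3 on this track, even a tiny diagnostic helper pays off. The sketch below dumps per-layer activation statistics to spot dead ReLUs and exploding values; it assumes `Sequential` exposes its ordered layer list as `.layers`:
```python
import numpy as np

def inspect_activations(model, x):
    """Print per-layer activation statistics for one forward pass."""
    for i, layer in enumerate(model.layers):
        x = layer.forward(x)
        vals = np.asarray(x)  # ndarrays pass through; Tensors may need .data
        dead = float(np.mean(vals == 0.0))
        print(f"layer {i:2d} {type(layer).__name__:>14}: "
              f"mean={vals.mean():+.4f} std={vals.std():.4f} dead={dead:.0%}")
    return x
```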
---
## **🚀 Success Criteria Reminder**
Your capstone succeeds when you can show:
1. **Measurable Impact**: 20%+ improvement in your chosen area
2. **Systems Integration**: Your improvements work with all TinyTorch modules
3. **Engineering Insight**: You understand and can explain the trade-offs
4. **Professional Documentation**: Clear problem, solution, and results
**Remember**: You're not just optimizing code—you're proving you understand ML systems engineering at the framework level.
**🔥 Start with profiling your current TinyTorch framework and identifying your biggest optimization opportunity!**

View File

@@ -1,39 +0,0 @@
# TinyTorch Module Metadata
# Essential system information for CLI tools and build systems
name: "capstone"
title: "Capstone Project"
description: "Optimize and extend your complete TinyTorch framework through systems engineering"
# Dependencies - Used by CLI for module ordering and prerequisites
dependencies:
prerequisites: [
"setup", "tensor", "activations", "layers", "networks", "cnn",
"dataloader", "autograd", "optimizers", "training", "compression",
"kernels", "benchmarking", "mlops"
]
enables: []
# Package Export - What gets built into tinytorch package
exports_to: "tinytorch.capstone"
# File Structure - What files exist in this module
files:
dev_file: "capstone_dev.py"
readme: "README.md"
tests: "inline"
# Educational Metadata
difficulty: "⭐⭐⭐⭐⭐ 🥷"
time_estimate: "Capstone Project"
# Components - What's implemented in this module
components:
- "PerformanceProfiler"
- "MemoryOptimizer"
- "BatchNormalization"
- "TransformerBlock"
- "MultiGPUTraining"
- "AdvancedOptimizer"
- "FrameworkBenchmark"
- "DeveloperTools"

View File

@@ -1,9 +0,0 @@
"""
TinyTorch Utils Package
Shared utilities for TinyTorch modules.
"""
from .profiler import SimpleProfiler, profile_function
__all__ = ['SimpleProfiler', 'profile_function']

View File

@@ -1,226 +0,0 @@
"""
TinyTorch Utils: Simple Educational Profiler
A lightweight profiling utility for measuring performance of ML operations.
Focused on measuring individual functions - students do their own comparisons.
"""
import time
import sys
import gc
import numpy as np
from typing import Callable, Dict, Any, Optional
try:
import psutil
HAS_PSUTIL = True
except ImportError:
HAS_PSUTIL = False
try:
import tracemalloc
HAS_TRACEMALLOC = True
except ImportError:
HAS_TRACEMALLOC = False
class SimpleProfiler:
"""
Simple profiler for measuring individual function performance.
Measures timing, memory usage, and other key metrics for a single function.
Students collect multiple measurements and compare results themselves.
"""
def __init__(self, track_memory: bool = True, track_cpu: bool = True):
self.track_memory = track_memory and HAS_TRACEMALLOC
self.track_cpu = track_cpu and HAS_PSUTIL
if self.track_memory:
tracemalloc.start()
def _get_memory_info(self) -> Dict[str, Any]:
"""Get current memory information."""
if not self.track_memory:
return {}
try:
current, peak = tracemalloc.get_traced_memory()
return {
'current_memory_mb': current / 1024 / 1024,
'peak_memory_mb': peak / 1024 / 1024
}
        except Exception:  # memory stats are best-effort
return {}
def _get_cpu_info(self) -> Dict[str, Any]:
"""Get current CPU information."""
if not self.track_cpu:
return {}
try:
process = psutil.Process()
return {
'cpu_percent': process.cpu_percent(),
'memory_percent': process.memory_percent(),
'num_threads': process.num_threads()
}
        except Exception:  # psutil queries can fail in restricted environments
return {}
def _get_array_info(self, result: Any) -> Dict[str, Any]:
"""Get information about numpy arrays."""
if not isinstance(result, np.ndarray):
return {}
return {
'result_shape': result.shape,
'result_dtype': str(result.dtype),
'result_size_mb': result.nbytes / 1024 / 1024,
'result_elements': result.size
}
def profile(self, func: Callable, *args, name: Optional[str] = None, warmup: bool = True, **kwargs) -> Dict[str, Any]:
"""
Profile a single function execution with comprehensive metrics.
Args:
func: Function to profile
*args: Arguments to pass to function
name: Optional name for the function (defaults to func.__name__)
warmup: Whether to do a warmup run (recommended for fair timing)
**kwargs: Keyword arguments to pass to function
Returns:
Dictionary with comprehensive performance metrics
Example:
profiler = SimpleProfiler()
result = profiler.profile(my_function, arg1, arg2, name="My Function")
print(f"Time: {result['wall_time']:.4f}s")
print(f"Memory: {result['memory_delta_mb']:.2f}MB")
"""
func_name = name or func.__name__
# Reset memory tracking
if self.track_memory:
tracemalloc.clear_traces()
# Warm up (important for fair comparison)
if warmup:
try:
warmup_result = func(*args, **kwargs)
del warmup_result
            except Exception:  # a failed warmup shouldn't abort profiling
pass
# Force garbage collection for clean measurement
gc.collect()
# Get baseline measurements
memory_before = self._get_memory_info()
cpu_before = self._get_cpu_info()
# Time the actual execution
start_time = time.time()
start_cpu_time = time.process_time()
result = func(*args, **kwargs)
end_time = time.time()
end_cpu_time = time.process_time()
# Get post-execution measurements
memory_after = self._get_memory_info()
cpu_after = self._get_cpu_info()
# Calculate metrics
wall_time = end_time - start_time
cpu_time = end_cpu_time - start_cpu_time
profile_result = {
'name': func_name,
'wall_time': wall_time,
'cpu_time': cpu_time,
'cpu_efficiency': (cpu_time / wall_time) if wall_time > 0 else 0,
'result': result
}
# Add memory metrics
if self.track_memory and memory_before and memory_after:
profile_result.update({
'memory_before_mb': memory_before.get('current_memory_mb', 0),
'memory_after_mb': memory_after.get('current_memory_mb', 0),
'peak_memory_mb': memory_after.get('peak_memory_mb', 0),
'memory_delta_mb': memory_after.get('current_memory_mb', 0) - memory_before.get('current_memory_mb', 0)
})
# Add CPU metrics
if self.track_cpu and cpu_after:
profile_result.update({
'cpu_percent': cpu_after.get('cpu_percent', 0),
'memory_percent': cpu_after.get('memory_percent', 0),
'num_threads': cpu_after.get('num_threads', 1)
})
# Add array information
profile_result.update(self._get_array_info(result))
return profile_result
def print_result(self, profile_result: Dict[str, Any], show_details: bool = False) -> None:
"""
Print profiling results in a readable format.
Args:
profile_result: Result from profile() method
show_details: Whether to show detailed metrics
"""
name = profile_result['name']
wall_time = profile_result['wall_time']
print(f"📊 {name}: {wall_time:.4f}s")
if show_details:
if 'memory_delta_mb' in profile_result:
print(f" 💾 Memory: {profile_result['memory_delta_mb']:.2f}MB delta, {profile_result['peak_memory_mb']:.2f}MB peak")
if 'result_size_mb' in profile_result:
print(f" 🔢 Output: {profile_result['result_shape']} ({profile_result['result_size_mb']:.2f}MB)")
if 'cpu_efficiency' in profile_result:
print(f" ⚡ CPU: {profile_result['cpu_efficiency']:.2f} efficiency")
def get_capabilities(self) -> Dict[str, bool]:
"""Get information about profiler capabilities."""
return {
'memory_tracking': self.track_memory,
'cpu_tracking': self.track_cpu,
'has_psutil': HAS_PSUTIL,
'has_tracemalloc': HAS_TRACEMALLOC
}
# Convenience function for quick profiling
def profile_function(func: Callable, *args, name: Optional[str] = None,
show_details: bool = False, **kwargs) -> Dict[str, Any]:
"""
Quick profiling of a single function.
Args:
func: Function to profile
*args: Arguments to pass to function
name: Optional name for the function
show_details: Whether to print detailed metrics
**kwargs: Keyword arguments to pass to function
Returns:
Dictionary with profiling results
Example:
result = profile_function(my_matmul, A, B, name="Custom MatMul", show_details=True)
print(f"Execution time: {result['wall_time']:.4f}s")
"""
profiler = SimpleProfiler(track_memory=True, track_cpu=True)
result = profiler.profile(func, *args, name=name, **kwargs)
if show_details:
profiler.print_result(result, show_details=True)
return result