Mirror of https://github.com/MLSysBook/TinyTorch.git (synced 2026-05-03 20:55:44 -05:00)
MAJOR: Implement beautiful module progression through strategic reordering
This commit implements the pedagogically optimal "inevitable discovery" module progression based on expert validation and educational design principles.

## Module Reordering Summary

**Previous Order (Problems)**:
- 05_losses → 06_autograd → 07_dataloader → 08_optimizers → 09_spatial → 10_training
- Issues: Autograd before optimizers, DataLoader before training, scattered dependencies

**New Order (Beautiful Progression)**:
- 05_losses → 06_optimizers → 07_autograd → 08_training → 09_spatial → 10_dataloader
- Benefits: Each module creates an inevitable need for the next

## Pedagogical Flow Achieved

**05_losses** → "Need systematic weight updates" → **06_optimizers**
**06_optimizers** → "Need automatic gradients" → **07_autograd**
**07_autograd** → "Need systematic training" → **08_training**
**08_training** → "MLPs hit limits on images" → **09_spatial**
**09_spatial** → "Training is too slow" → **10_dataloader**

## Technical Changes

### Module Directory Renaming
- `06_autograd` → `07_autograd`
- `07_dataloader` → `10_dataloader`
- `08_optimizers` → `06_optimizers`
- `10_training` → `08_training`
- `09_spatial` → `09_spatial` (no change)

### System Integration Updates
- **MODULE_TO_CHECKPOINT mapping**: Updated in tito/commands/export.py
- **Test directories**: Renamed module_XX directories to match the new numbers
- **Documentation**: Updated all references in MD files and agent configurations
- **CLI integration**: Updated next-steps suggestions for the proper flow

### Agent Configuration Updates
- **Quality Assurance**: Updated module audit status with the new numbers
- **Module Developer**: Updated work tracking with the new sequence
- **Documentation**: Updated MASTER_PLAN_OF_RECORD.md with the beautiful progression

## Educational Benefits

1. **Inevitable Discovery**: Each module naturally leads to the next
2. **Cognitive Load**: Concepts introduced exactly when needed
3. **Motivation**: Students understand WHY each tool is necessary
4. **Synthesis**: Everything flows toward complete ML systems understanding
5. **Professional Alignment**: Matches real ML engineering workflows

## Quality Assurance

- ✅ All CLI commands still function
- ✅ Checkpoint system mappings updated
- ✅ Documentation consistency maintained
- ✅ Test directory structure aligned
- ✅ Agent configurations synchronized

**Impact**: This reordering transforms TinyTorch from a collection of modules into a coherent educational journey where each step naturally motivates the next, creating optimal conditions for deep learning systems understanding.
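For illustration only, the MODULE_TO_CHECKPOINT update described above amounts to renumbering the mapping's keys along these lines; the checkpoint values here are placeholders, not the actual contents of tito/commands/export.py:

```python
# Hypothetical illustration of the renumbered mapping in tito/commands/export.py.
# Module names follow the new order above; checkpoint values are placeholders.
MODULE_TO_CHECKPOINT = {
    "05_losses":     "checkpoint_05",
    "06_optimizers": "checkpoint_06",  # was 08_optimizers
    "07_autograd":   "checkpoint_07",  # was 06_autograd
    "08_training":   "checkpoint_08",  # was 10_training
    "09_spatial":    "checkpoint_09",  # unchanged
    "10_dataloader": "checkpoint_10",  # was 07_dataloader
}
```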
docs/training-systems-ordering-analysis.md · 184 lines · new file
# Training Systems Module Ordering Analysis

## The Core Question

Should DataLoader come BEFORE or AFTER Training? Let's analyze both directions.

## Option 1: DataLoader BEFORE Training (Current)

```
7. DataLoader → 8. Optimizers → 9. Spatial → 10. Training
```

### Pros ✅

- **Training uses real data from the start** - More satisfying
- **Batching is available** - Training loop can show proper batching
- **Real patterns** - SGD/Adam work on actual data distributions
- **No rework** - Training module uses DataLoader immediately

### Cons ❌

- **DataLoader without purpose** - Students don't know WHY they need it yet
- **Abstract introduction** - Batching/shuffling seems arbitrary without training context
- **Delayed gratification** - Can't train anything after building DataLoader

## Option 2: DataLoader AFTER Training

```
7. Optimizers → 8. Spatial → 9. Training → 10. DataLoader
```

### Pros ✅

- **Clear motivation** - Students hit limits with toy data, THEN get DataLoader
- **Natural progression** - Simple → complex data handling
- **Pedagogical clarity** - "Now let's scale to real datasets"

### Cons ❌

- **Training module is limited** - Can only use toy/synthetic data
- **Rework needed** - Module 10 must update training to use DataLoader
- **Artificial limitation** - Training without batching feels incomplete

## Option 3: Split Approach (RECOMMENDED)

```
7. Optimizers → 8. DataLoader → 9. Spatial → 10. Training
```

### Why This Works Best 🎯
#### Module 7: Optimizers

```python
# Learn algorithms on simple problems
# No need for complex data yet
def optimize_parabola():
    w = 5.0
    for _ in range(100):
        grad = 2 * w  # f(w) = w^2
        w = sgd_step(w, grad)
```
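The `sgd_step` helper above is left undefined; a minimal sketch of what it might look like (an illustrative assumption, not the module's actual implementation) is:

```python
def sgd_step(param, grad, lr=0.01):
    """One step of vanilla gradient descent: move against the gradient."""
    return param - lr * grad
```

With a learning rate of 0.01, repeated steps shrink `w` toward 0, the minimum of f(w) = w^2.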
#### Module 8: DataLoader (RIGHT AFTER OPTIMIZERS)

```python
# Now that we have optimizers, we need data!
# Introduce batching WITH IMMEDIATE USE

# Simple example showing WHY we need batching
dataset = SimpleDataset(10000)  # Too big for memory!
loader = DataLoader(dataset, batch_size=32)

# Immediately use with SGD
for batch in loader:
    # Show how optimizers work with batches
    loss = compute_loss(batch)
    sgd.step(loss)
```
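For concreteness, a minimal batching loader along these lines could back the example above; this is a sketch under assumptions, and the real module's interface may differ:

```python
import numpy as np

class DataLoader:
    """Minimal sketch: yields fixed-size batches from any indexable dataset."""
    def __init__(self, dataset, batch_size=32, shuffle=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        order = np.arange(len(self.dataset))
        if self.shuffle:
            np.random.shuffle(order)
        for start in range(0, len(order), self.batch_size):
            idx = order[start:start + self.batch_size]
            yield [self.dataset[i] for i in idx]
```

Yielding one batch at a time is what keeps only `batch_size` samples in flight, which is the point the example is making.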
#### Module 9: Spatial

```python
# Build CNNs using DataLoader for testing
cifar = CIFAR10Dataset()
loader = DataLoader(cifar, batch_size=1)

# Test convolution on real images
for image, label in loader:
    output = conv2d(image)
    visualize(output)  # See feature maps!
```
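As a point of reference, the `conv2d` above can be thought of as a sliding dot product over the image; a minimal single-channel NumPy sketch (illustrative only, not the module's implementation) is:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid-mode 2D convolution (cross-correlation, as in most DL libraries)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out
```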
#### Module 10: Training (EVERYTHING COMES TOGETHER)

```python
# Full training loop with all components
model = CNN()                           # From Module 9
optimizer = Adam(model.parameters())    # From Module 7
train_loader = DataLoader(cifar_train)  # From Module 8
val_loader = DataLoader(cifar_val)

# Complete training pipeline
for epoch in range(10):
    for batch in train_loader:
        loss = model.forward(batch)
        optimizer.step(loss.backward())
```
## The Winner: Modified Current Order

```
7. Optimizers → 8. DataLoader → 9. Spatial → 10. Training
```

### This is optimal because:

1. **Optimizers (Module 7)**: Learn the algorithms without data complexity
2. **DataLoader (Module 8)**: Introduced right when it's needed for optimizer testing
3. **Spatial (Module 9)**: Use DataLoader to visualize CNN features on real images
4. **Training (Module 10)**: Everything culminates in the complete pipeline

### Key Insight: DataLoader as the Bridge 🌉

DataLoader should come AFTER learning optimizers but BEFORE building architectures. This way:

- Students understand gradient descent first
- Then learn "how do we feed data to optimizers?"
- Then build architectures that process this data
- Finally put it all together in training

## Concrete Examples Showing the Flow
### Module 7 (Optimizers) - No DataLoader Needed

```python
# Optimize simple functions
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

# Students implement SGD, Adam
optimizer = SGD([x, y], lr=0.01)
for _ in range(1000):
    loss = rosenbrock(x, y)
    optimizer.step(loss.backward())
```
### Module 8 (DataLoader) - Immediate Use Case

```python
# NOW we need to handle real data
mnist = MNISTDataset()  # 60,000 images!

# Without DataLoader (bad)
for i in range(60000):  # Memory explosion!
    optimizer.step(mnist[i])

# With DataLoader (good)
loader = DataLoader(mnist, batch_size=32)
for batch in loader:  # Only 32 in memory
    optimizer.step(batch)
```
### Module 9 (Spatial) - DataLoader for Visualization

```python
# Use DataLoader to explore convolutions
loader = DataLoader(CIFAR10(), batch_size=1)
conv = Conv2d(3, 16, kernel_size=3)

for image, _ in loader:
    features = conv(image)
    plot_feature_maps(features)  # See what CNNs learn!
```
### Module 10 (Training) - Full Integration

```python
# Everything they've built comes together
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

trainer = Trainer(
    model=CNN(),                # Module 9
    optimizer=Adam(),           # Module 7
    train_loader=train_loader,  # Module 8
    val_loader=val_loader       # Module 8
)

trainer.fit(epochs=20)  # 75% on CIFAR-10!
```
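Internally, `trainer.fit` is just the epoch/batch loop sketched under "#### Module 10: Training" above; a minimal version might look like the following (the names, constructor arguments, and the backward/step calling convention are all assumptions for illustration, not the module's actual code):

```python
class Trainer:
    """Minimal sketch: wires model, optimizer, and loaders into an epoch loop."""
    def __init__(self, model, optimizer, train_loader, val_loader, loss_fn):
        self.model = model
        self.optimizer = optimizer
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.loss_fn = loss_fn  # e.g. a loss from the losses module

    def fit(self, epochs=1):
        for epoch in range(epochs):
            for inputs, targets in self.train_loader:
                preds = self.model.forward(inputs)   # architecture (Module 9)
                loss = self.loss_fn(preds, targets)  # loss function
                loss.backward()                      # gradients via autograd
                self.optimizer.step()                # update rule (Module 7)
            # a validation pass over self.val_loader would typically follow here
```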
## Final Recommendation

Keep a modified version of the current order, but ensure:

1. **Module 7 (Optimizers)**: Focus on algorithms, not data
2. **Module 8 (DataLoader)**: Immediately show WHY it's needed for optimizers
3. **Module 9 (Spatial)**: Use DataLoader for CNN exploration
4. **Module 10 (Training)**: Grand synthesis of all components

This way DataLoader is introduced exactly when students need it, and they use it throughout Modules 8-10!