# Training Systems Module Ordering Analysis
## The Core Question

Should DataLoader come BEFORE or AFTER Training? Let's analyze both directions.
## Option 1: DataLoader BEFORE Training (Current)

```
7. DataLoader → 8. Optimizers → 9. Spatial → 10. Training
```
### Pros ✅

- **Training uses real data from the start** - more satisfying for students
- **Batching is available** - the training loop can demonstrate proper batching
- **Real patterns** - SGD/Adam work on actual data distributions
- **No rework** - the Training module uses DataLoader immediately
### Cons ❌

- **DataLoader without purpose** - students don't yet know WHY they need it
- **Abstract introduction** - batching and shuffling seem arbitrary without a training context
- **Delayed gratification** - students can't train anything right after building DataLoader
## Option 2: DataLoader AFTER Training

```
7. Optimizers → 8. Spatial → 9. Training → 10. DataLoader
```
### Pros ✅

- **Clear motivation** - students hit the limits of toy data, THEN get DataLoader
- **Natural progression** - simple → complex data handling
- **Pedagogical clarity** - "Now let's scale to real datasets"
### Cons ❌

- **Training module is limited** - it can only use toy/synthetic data (see the sketch below)
- **Rework needed** - Module 10 must go back and update training to use DataLoader
- **Artificial limitation** - training without batching feels incomplete
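To make the first con concrete, the sketch below shows what a pre-DataLoader training loop is stuck with: a synthetic dataset small enough to generate and process in one shot. The logistic-regression setup here is purely illustrative, not taken from the modules:

```python
import numpy as np

# Illustrative only: training under Option 2, before DataLoader exists.
# The data must be synthetic and small enough to fit in memory at once.
X = np.random.randn(100, 2)                  # 100 toy data points
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # synthetic labels

w = np.zeros(2)
lr = 0.1
for _ in range(50):
    pred = 1 / (1 + np.exp(-X @ w))          # logistic model, full batch
    grad = X.T @ (pred - y) / len(y)         # gradient over ALL samples
    w -= lr * grad                           # no batching, no shuffling

# Scaling this to 60,000 MNIST images is exactly the rework that
# Module 10 would then have to absorb.
```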
## Option 3: Split Approach (RECOMMENDED)

```
7. Optimizers → 8. DataLoader → 9. Spatial → 10. Training
```

### Why This Works Best 🎯
#### Module 7: Optimizers

```python
# Learn the algorithms on simple closed-form problems;
# no complex data handling is needed yet.
def optimize_parabola(lr=0.1):
    w = 5.0
    for _ in range(100):
        grad = 2 * w       # derivative of f(w) = w^2
        w -= lr * grad     # one SGD step
    return w               # converges toward the minimum at w = 0
```
#### Module 8: DataLoader (RIGHT AFTER OPTIMIZERS)

```python
# Now that we have optimizers, we need data!
# Introduce batching WITH IMMEDIATE USE: a simple example
# showing WHY we need batching.
dataset = SimpleDataset(10_000)              # too big to process in one shot
loader = DataLoader(dataset, batch_size=32)

# Immediately use it with SGD: one update per mini-batch.
for batch in loader:
    loss = compute_loss(batch)   # forward pass on just this batch
    loss.backward()              # gradients for this batch
    sgd.step()                   # apply the update
```
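For orientation, a minimal version of what students might build here looks like the sketch below. The constructor mirrors the usage above, but the real TinyTorch DataLoader API is not guaranteed to match:

```python
import numpy as np

# Minimal sketch of a batching loader; the actual Module 8 API may differ.
class DataLoader:
    def __init__(self, dataset, batch_size=32, shuffle=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        indices = np.arange(len(self.dataset))
        if self.shuffle:
            np.random.shuffle(indices)       # new sample order each epoch
        for start in range(0, len(indices), self.batch_size):
            batch = indices[start:start + self.batch_size]
            yield [self.dataset[i] for i in batch]
```

The key design point is that only one batch of samples is materialized per iteration, which is what makes the 10,000-example dataset above tractable.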
#### Module 9: Spatial

```python
# Build CNNs, using the DataLoader for testing.
cifar = CIFAR10Dataset()
loader = DataLoader(cifar, batch_size=1)   # one image at a time
conv = Conv2d(3, 16, kernel_size=3)

# Test the convolution on real images.
for image, label in loader:
    output = conv(image)
    visualize(output)  # see the feature maps!
```
#### Module 10: Training (EVERYTHING COMES TOGETHER)

```python
# Full training loop with all the components.
model = CNN()                            # from Module 9
optimizer = Adam(model.parameters())     # from Module 7
train_loader = DataLoader(cifar_train)   # from Module 8
val_loader = DataLoader(cifar_val)

# Complete training pipeline.
for epoch in range(10):
    for images, labels in train_loader:
        loss = compute_loss(model.forward(images), labels)  # assumed loss helper
        optimizer.zero_grad()
        loss.backward()    # compute gradients
        optimizer.step()   # apply the update
```
## The Winner: Modified Current Order

```
7. Optimizers → 8. DataLoader → 9. Spatial → 10. Training
```
This ordering is optimal because:

- **Optimizers (Module 7):** learn the algorithms without data complexity
- **DataLoader (Module 8):** introduced right when it's needed, for optimizer testing
- **Spatial (Module 9):** use the DataLoader to visualize CNN features on real images
- **Training (Module 10):** everything culminates in the complete pipeline
## Key Insight: DataLoader as the Bridge 🌉

DataLoader should come AFTER learning optimizers but BEFORE building architectures. This way:

1. Students understand gradient descent first
2. Then they ask "how do we feed data to optimizers?"
3. Then they build architectures that process this data
4. Finally, they put it all together in training
## Concrete Examples Showing the Flow

### Module 7 (Optimizers) - No DataLoader Needed
```python
# Optimize simple test functions first.
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

# Students implement SGD and Adam; x and y are trainable parameters.
optimizer = SGD([x, y], lr=0.01)
for _ in range(1000):
    loss = rosenbrock(x, y)
    loss.backward()     # gradients with respect to x and y
    optimizer.step()    # one update
```
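The `SGD` interface assumed above can be sketched as follows; this is a simplified stand-in for the students' Module 7 implementation, assuming each parameter object exposes `.data` and `.grad` attributes:

```python
# Simplified sketch of the SGD optimizer assumed above; assumes each
# parameter object carries .data (value) and .grad (current gradient).
class SGD:
    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr

    def step(self):
        for p in self.params:
            p.data -= self.lr * p.grad   # vanilla gradient descent update

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0                 # reset before the next backward()
```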
### Module 8 (DataLoader) - Immediate Use Case

```python
# NOW we need to handle real data.
mnist = MNISTDataset()   # 60,000 images!

# Without DataLoader (bad): a hand-rolled loop with one noisy update
# per sample and no shuffling or batching.
for i in range(len(mnist)):
    loss = compute_loss(mnist[i])
    loss.backward()
    optimizer.step()

# With DataLoader (good): only 32 samples processed per update.
loader = DataLoader(mnist, batch_size=32)
for batch in loader:
    loss = compute_loss(batch)
    loss.backward()
    optimizer.step()
```
### Module 9 (Spatial) - DataLoader for Visualization

```python
# Use the DataLoader to explore convolutions.
loader = DataLoader(CIFAR10(), batch_size=1)
conv = Conv2d(3, 16, kernel_size=3)
for image, _ in loader:
    features = conv(image)
    plot_feature_maps(features)  # see what CNNs learn!
```
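`plot_feature_maps` is left undefined in the module; one plausible implementation, assuming `features` is (or converts to) a NumPy array of shape `(1, channels, height, width)`, is:

```python
import matplotlib.pyplot as plt

# Hypothetical helper: tile the first 16 feature maps in a 4x4 grid.
# Assumes `features` has shape (1, channels, height, width).
def plot_feature_maps(features, n_maps=16):
    maps = features[0]                        # drop the batch dimension
    fig, axes = plt.subplots(4, 4, figsize=(8, 8))
    for ax, fmap in zip(axes.flat, maps[:n_maps]):
        ax.imshow(fmap, cmap="gray")          # one channel per panel
        ax.axis("off")
    plt.show()
```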
### Module 10 (Training) - Full Integration

```python
# Everything they've built comes together.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)

model = CNN()                            # Module 9
trainer = Trainer(
    model=model,
    optimizer=Adam(model.parameters()),  # Module 7
    train_loader=train_loader,           # Module 8
    val_loader=val_loader,               # Module 8
)
trainer.fit(epochs=20)  # 75% on CIFAR-10!
```
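For a sense of what `Trainer.fit` hides, a rough internal sketch follows; the constructor mirrors the call above, while `compute_loss` and `accuracy` are assumed helpers rather than confirmed TinyTorch APIs:

```python
# Rough sketch of what Trainer.fit() might do internally; not the
# confirmed Module 10 implementation.
class Trainer:
    def __init__(self, model, optimizer, train_loader, val_loader):
        self.model = model
        self.optimizer = optimizer
        self.train_loader = train_loader
        self.val_loader = val_loader

    def fit(self, epochs):
        for epoch in range(epochs):
            # One pass over the training set.
            for images, labels in self.train_loader:
                loss = compute_loss(self.model.forward(images), labels)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
            # Mean per-batch accuracy on the validation set after each epoch.
            accs = [accuracy(self.model.forward(x), y)
                    for x, y in self.val_loader]
            print(f"epoch {epoch}: val accuracy {sum(accs) / len(accs):.3f}")
```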
## Final Recommendation

Keep a modified version of the current order, but ensure:

- **Module 7 (Optimizers):** focus on the algorithms, not the data
- **Module 8 (DataLoader):** immediately show WHY it's needed for optimizers
- **Module 9 (Spatial):** use DataLoader for CNN exploration
- **Module 10 (Training):** the grand synthesis of all components

This way DataLoader is introduced exactly when students need it, and they use it throughout Modules 8-10!