mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-28 19:44:47 -05:00
Remove ML Systems Thinking sections from all modules
Cleaned up module structure by removing reflection questions:
- Updated module-developer.md to remove ML Systems Thinking from template
- Removed ML Systems Thinking sections from all 9 modules:
  * Module 01 (Tensor): Removed 113 lines of questions
  * Module 02 (Activations): Removed 24 lines of questions
  * Module 03 (Layers): Removed 84 lines of questions
  * Module 04 (Losses): Removed 93 lines of questions
  * Module 05 (Autograd): Removed 64 lines of questions
  * Module 06 (Optimizers): Removed questions section
  * Module 07 (Training): Removed questions section
  * Module 08 (DataLoader): Removed 35 lines of questions
  * Module 09 (Spatial): Removed 34 lines of questions

Impact:
- Modules now flow directly from tests to summary
- Cleaner, more focused module structure
- Removes assessment burden from implementation modules
- Keeps focus on building and understanding code
This commit is contained in:
@@ -1429,77 +1429,6 @@ if __name__ == "__main__":
print("✅ Module validation complete!")

# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions

Now that you've built sophisticated optimization algorithms, let's reflect on the systems implications of your implementation.
"""

# %% nbgrader={"grade": false, "grade_id": "systems-q1", "solution": true}
# %% [markdown]
"""
### Question 1: Memory Scaling in Large Models

Your Adam optimizer uses 3× the memory of the parameters alone (param + m_buffer + v_buffer).

**a) Model Scale Impact**: For a 7B-parameter model (like a small language model):

- SGD memory overhead: _____ GB (assuming float32 parameters)
- Adam memory overhead: _____ GB
- Total training memory: _____ GB

**b) Memory Optimization**: What strategies could reduce Adam's memory usage while preserving its adaptive benefits?

*Think about: gradient accumulation, mixed precision, gradient checkpointing, and parameter sharing*
"""
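A back-of-envelope sketch for part (a); the helper function and its numbers are illustrative, not part of the module's API:

```python
# Rough memory math for Question 1(a). Assumes plain float32 values and counts
# only parameters + optimizer state; gradients and activations add more on top.

def optimizer_memory_gb(num_params: int, bytes_per_value: int = 4,
                        state_buffers_per_param: int = 0) -> float:
    """Memory for parameters plus per-parameter optimizer buffers, in GB."""
    values = num_params * (1 + state_buffers_per_param)
    return values * bytes_per_value / 1e9

n = 7_000_000_000  # 7B parameters

params_only = optimizer_memory_gb(n)                             # 28.0 GB
sgd_total = optimizer_memory_gb(n, state_buffers_per_param=0)    # SGD keeps no extra state
adam_total = optimizer_memory_gb(n, state_buffers_per_param=2)   # m and v buffers -> 3x params

print(f"params only: {params_only:.0f} GB")  # 28 GB
print(f"Adam total : {adam_total:.0f} GB")   # 84 GB (28 params + 56 optimizer state)
```

The 3× multiple is exactly why optimizer-state sharding and 8-bit optimizer states are attractive at this scale.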

# %% nbgrader={"grade": false, "grade_id": "systems-q2", "solution": true}
# %% [markdown]
"""
### Question 2: AdamW vs Adam Weight Decay

You implemented two different weight-decay approaches.

**a) Mathematical Difference**: In Adam, you add `weight_decay * param` to the gradient before the adaptive update. In AdamW, you apply `param = param * (1 - lr * weight_decay)` after the gradient update. Why does this distinction matter?

**b) Practical Impact**: How might this difference affect:

- Learning rate scheduling?
- Hyperparameter tuning?
- Model regularization effectiveness?

*Consider: how weight decay interacts with adaptive learning rates*
"""
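A minimal sketch of the two weight-decay styles on a single scalar "parameter"; the names (`m`, `v`, `beta1`, `beta2`) follow the usual Adam conventions, not this module's API:

```python
import numpy as np

# Toy comparison of coupled (Adam) vs decoupled (AdamW) weight decay.
lr, wd, beta1, beta2, eps = 0.1, 0.01, 0.9, 0.999, 1e-8

def adam_step(param, grad, m, v, t, decoupled):
    if not decoupled:
        grad = grad + wd * param          # Adam: decay folded into the gradient,
                                          # so it gets rescaled by 1/sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        param = param - lr * wd * param   # AdamW: decay applied directly,
                                          # untouched by the adaptive scaling
    return param, m, v

p_adam, p_adamw = 1.0, 1.0
m1 = v1 = m2 = v2 = 0.0
for t in range(1, 4):
    g = 0.5  # a constant toy gradient
    p_adam, m1, v1 = adam_step(p_adam, g, m1, v1, t, decoupled=False)
    p_adamw, m2, v2 = adam_step(p_adamw, g, m2, v2, t, decoupled=True)

print(p_adam, p_adamw)  # the two schemes diverge even on this toy problem
```

The key observation: in Adam, parameters with large gradient history (large `v_hat`) get their decay shrunk by the adaptive denominator; in AdamW, every parameter decays at the same rate regardless of its gradient statistics.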

# %% nbgrader={"grade": false, "grade_id": "systems-q3", "solution": true}
# %% [markdown]
"""
### Question 3: Optimizer Selection in Production

You built three optimizers with different computational costs.

**a) Training Costs**: Rank SGD, Adam, and AdamW by:

- Memory usage per parameter: _____
- Computation per step: _____
- Convergence speed: _____

**b) Production Decision**: When training a transformer for 1 week on expensive GPUs, what factors would determine your optimizer choice?

*Think about: wall-clock time, hardware utilization, final model quality, and cost per training run*
"""
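A rough per-parameter accounting for part (a); a sketch that assumes vanilla implementations (no SGD momentum, float32 everywhere), not measured data:

```python
# Per-parameter cost summary for the three optimizers built in this module.
optimizer_state = {
    "SGD":   {"state_buffers": 0, "step_cost": "one fused multiply-add per parameter"},
    "Adam":  {"state_buffers": 2, "step_cost": "two EMAs, bias correction, sqrt and divide"},
    "AdamW": {"state_buffers": 2, "step_cost": "Adam's ops plus one decay multiply"},
}

for name, info in optimizer_state.items():
    multiple = 1 + info["state_buffers"]  # the parameter itself + optimizer buffers
    print(f"{name:5s}: {multiple}x parameter memory; per-step cost: {info['step_cost']}")
```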

# %% nbgrader={"grade": false, "grade_id": "systems-q4", "solution": true}
# %% [markdown]
"""
### Question 4: Gradient Processing Patterns

Your optimizers process gradients differently: SGD uses them directly, while Adam smooths them over time.

**a) Gradient Noise**: In mini-batch training, gradients from different batches can vary significantly. How does this affect:

- SGD convergence behavior?
- Adam's moment estimates?
- The batch sizes required for stable training?

**b) Systems Design**: If you had to implement gradient compression (to reduce communication in distributed training), how would it affect each optimizer differently?

*Consider: gradient sparsity, compression error accumulation, and adaptive learning rates*
"""
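A pure-Python toy illustration of part (a), independent of the module: a noisy per-batch gradient stream next to Adam's smoothed first-moment estimate.

```python
import random

# Compare raw batch gradients (what SGD sees) with Adam's exponential moving
# average of them (the bias-corrected first moment m_hat).
random.seed(0)
beta1 = 0.9
true_grad = 1.0

raw, smoothed = [], []
m = 0.0
for t in range(1, 101):
    g = true_grad + random.gauss(0, 0.5)  # batch gradient = signal + noise
    m = beta1 * m + (1 - beta1) * g       # Adam's EMA of the gradient
    raw.append(g)
    smoothed.append(m / (1 - beta1 ** t))  # bias-corrected, comparable scale

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

# SGD follows the raw stream; Adam's update direction follows the smoother m_hat.
print(f"raw gradient variance     : {variance(raw):.3f}")
print(f"smoothed estimate variance: {variance(smoothed[10:]):.3f}")  # skip warm-up
```

The variance reduction is why Adam tolerates smaller batches than plain SGD, and also why compression error fed into the moment buffers can persist across many steps.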

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Optimizers