Remove ML Systems Thinking sections from all modules

Cleaned up module structure by removing reflection questions:
- Updated module-developer.md to remove ML Systems Thinking from template
- Removed ML Systems Thinking sections from all 9 modules:
  * Module 01 (Tensor): Removed 113 lines of questions
  * Module 02 (Activations): Removed 24 lines of questions
  * Module 03 (Layers): Removed 84 lines of questions
  * Module 04 (Losses): Removed 93 lines of questions
  * Module 05 (Autograd): Removed 64 lines of questions
  * Module 06 (Optimizers): Removed questions section
  * Module 07 (Training): Removed questions section
  * Module 08 (DataLoader): Removed 35 lines of questions
  * Module 09 (Spatial): Removed 34 lines of questions

Impact:
- Modules now flow directly from tests to summary
- Cleaner, more focused module structure
- Removes assessment burden from implementation modules
- Keeps focus on building and understanding code
Vijay Janapa Reddi
2025-09-30 06:44:36 -04:00
parent aec4384c5d
commit 1b0f64bb2d
34 changed files with 2 additions and 6246 deletions

@@ -1429,77 +1429,6 @@ if __name__ == "__main__":
print("✅ Module validation complete!")
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions
Now that you've built sophisticated optimization algorithms, let's reflect on the systems implications of your implementation.
"""
# %% nbgrader={"grade": false, "grade_id": "systems-q1", "solution": true}
# %% [markdown]
"""
### Question 1: Memory Scaling in Large Models
Your Adam optimizer uses 3× the memory of parameters (param + m_buffer + v_buffer).
**a) Model Scale Impact**: For a 7B parameter model (like a small language model):
- SGD memory overhead: _____ GB (assuming float32 parameters)
- Adam memory overhead: _____ GB
- Total training memory: _____ GB
**b) Memory Optimization**: What strategies could reduce Adam's memory usage while preserving its adaptive benefits?
*Think about: gradient accumulation, mixed precision, gradient checkpointing, and parameter sharing*
"""
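The arithmetic behind Question 1a can be sketched directly; this is an illustrative back-of-envelope calculation, not the module's official answer key, and it counts only weights, gradients, and optimizer state (activations would add more):

```python
# Rough memory math for a 7B-parameter model in float32 (4 bytes/param).
# Here "GB" means 10^9 bytes for round numbers.
PARAMS = 7e9
BYTES_PER_FLOAT32 = 4
GB = 1e9

param_gb = PARAMS * BYTES_PER_FLOAT32 / GB   # 28 GB of weights
grad_gb = param_gb                           # gradients mirror the weights
sgd_state_gb = 0.0                           # plain SGD keeps no extra buffers
adam_state_gb = 2 * param_gb                 # m_buffer + v_buffer: 56 GB
adam_training_gb = param_gb + grad_gb + adam_state_gb  # 112 GB total

print(f"weights:            {param_gb:6.1f} GB")
print(f"SGD optimizer state:{sgd_state_gb:6.1f} GB")
print(f"Adam optimizer state:{adam_state_gb:5.1f} GB")
print(f"Adam training total:{adam_training_gb:6.1f} GB")
```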
# %% nbgrader={"grade": false, "grade_id": "systems-q2", "solution": true}
# %% [markdown]
"""
### Question 2: AdamW vs Adam Weight Decay
You implemented two different weight decay approaches.
**a) Mathematical Difference**: In Adam, you add `weight_decay * param` to gradients. In AdamW, you apply `param = param * (1 - lr * weight_decay)` after the gradient update. Why does this matter?
**b) Practical Impact**: How might this difference affect:
- Learning rate scheduling?
- Hyperparameter tuning?
- Model regularization effectiveness?
*Consider: how weight decay interacts with adaptive learning rates*
"""
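The two decay styles in Question 2a can be contrasted in a single update function; this is a minimal sketch (the `adam_step` name and `decoupled` flag are illustrative, not the module's actual API):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam (decoupled=False) or AdamW (decoupled=True) step."""
    if weight_decay and not decoupled:
        # Classic Adam / L2 style: decay enters the gradient, so it is
        # later rescaled by the adaptive 1/sqrt(v_hat) term.
        grad = grad + weight_decay * param
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        # AdamW: decay acts on the weights directly, independent of the
        # adaptive statistics, so every weight shrinks at the same rate.
        param = param * (1 - lr * weight_decay)
    return param, m, v
```

Note how the L2-style decay gets divided by `sqrt(v_hat)` along with the rest of the gradient, so its effective strength varies per parameter, while AdamW's shrinkage is uniform.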
# %% nbgrader={"grade": false, "grade_id": "systems-q3", "solution": true}
# %% [markdown]
"""
### Question 3: Optimizer Selection in Production
You built three optimizers with different computational costs.
**a) Training Costs**: Rank SGD, Adam, and AdamW by:
- Memory usage per parameter: _____
- Computation per step: _____
- Convergence speed: _____
**b) Production Decision**: When training a transformer for 1 week on expensive GPUs, what factors would determine your optimizer choice?
*Think about: wall-clock time, hardware utilization, final model quality, and cost per training run*
"""
# %% nbgrader={"grade": false, "grade_id": "systems-q4", "solution": true}
# %% [markdown]
"""
### Question 4: Gradient Processing Patterns
Your optimizers process gradients differently: SGD uses them directly, while Adam smooths them over time.
**a) Gradient Noise**: In batch training, gradients from different batches can vary significantly. How does this affect:
- SGD convergence behavior?
- Adam's moment estimates?
- Required batch sizes for stable training?
**b) Systems Design**: If you had to implement gradient compression (reducing communication in distributed training), how would it affect each optimizer differently?
*Consider: gradient sparsity, compression error accumulation, and adaptive learning rates*
"""
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Optimizers