mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-04-28 19:44:47 -05:00
Remove ML Systems Thinking sections from all modules
Cleaned up module structure by removing reflection questions:
- Updated module-developer.md to remove ML Systems Thinking from template
- Removed ML Systems Thinking sections from all 9 modules:
  * Module 01 (Tensor): Removed 113 lines of questions
  * Module 02 (Activations): Removed 24 lines of questions
  * Module 03 (Layers): Removed 84 lines of questions
  * Module 04 (Losses): Removed 93 lines of questions
  * Module 05 (Autograd): Removed 64 lines of questions
  * Module 06 (Optimizers): Removed questions section
  * Module 07 (Training): Removed questions section
  * Module 08 (DataLoader): Removed 35 lines of questions
  * Module 09 (Spatial): Removed 34 lines of questions

Impact:
- Modules now flow directly from tests to summary
- Cleaner, more focused module structure
- Removes assessment burden from implementation modules
- Keeps focus on building and understanding code
This commit is contained in:
@@ -1429,77 +1429,6 @@ if __name__ == "__main__":
print("✅ Module validation complete!")

# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions

Now that you've built sophisticated optimization algorithms, let's reflect on the systems implications of your implementation.
"""

# %% nbgrader={"grade": false, "grade_id": "systems-q1", "solution": true}
# %% [markdown]
"""
### Question 1: Memory Scaling in Large Models

Your Adam optimizer uses 3× the memory of the parameters alone (param + m_buffer + v_buffer).

**a) Model Scale Impact**: For a 7B-parameter model (like a small language model):

- SGD memory overhead: _____ GB (assuming float32 parameters)
- Adam memory overhead: _____ GB
- Total training memory: _____ GB

**b) Memory Optimization**: What strategies could reduce Adam's memory usage while preserving its adaptive benefits?

*Think about: gradient accumulation, mixed precision, gradient checkpointing, and parameter sharing*
"""
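A back-of-envelope sketch for part (a); the helper function and its numbers are illustrative, not part of the module's API:

```python
# Rough memory math for Question 1(a). Assumes plain float32 values and counts
# only parameters + optimizer state; gradients and activations add more on top.

def optimizer_memory_gb(num_params: int, bytes_per_value: int = 4,
                        state_buffers_per_param: int = 0) -> float:
    """Memory for parameters plus per-parameter optimizer buffers, in GB."""
    values = num_params * (1 + state_buffers_per_param)
    return values * bytes_per_value / 1e9

n = 7_000_000_000  # 7B parameters

params_only = optimizer_memory_gb(n)                             # 28.0 GB
sgd_total = optimizer_memory_gb(n, state_buffers_per_param=0)    # SGD keeps no extra state
adam_total = optimizer_memory_gb(n, state_buffers_per_param=2)   # m and v buffers -> 3x params

print(f"params only: {params_only:.0f} GB")  # 28 GB
print(f"Adam total : {adam_total:.0f} GB")   # 84 GB (28 params + 56 optimizer state)
```

The 3× multiple is exactly why optimizer-state sharding and 8-bit optimizer states are attractive at this scale.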

# %% nbgrader={"grade": false, "grade_id": "systems-q2", "solution": true}
# %% [markdown]
"""
### Question 2: AdamW vs Adam Weight Decay

You implemented two different weight-decay approaches.

**a) Mathematical Difference**: In Adam, you add `weight_decay * param` to the gradient before the adaptive update. In AdamW, you apply `param = param * (1 - lr * weight_decay)` after the gradient update. Why does this distinction matter?

**b) Practical Impact**: How might this difference affect:

- Learning rate scheduling?
- Hyperparameter tuning?
- Model regularization effectiveness?

*Consider: how weight decay interacts with adaptive learning rates*
"""
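A minimal sketch of the two weight-decay styles on a single scalar "parameter"; the names (`m`, `v`, `beta1`, `beta2`) follow the usual Adam conventions, not this module's API:

```python
import numpy as np

# Toy comparison of coupled (Adam) vs decoupled (AdamW) weight decay.
lr, wd, beta1, beta2, eps = 0.1, 0.01, 0.9, 0.999, 1e-8

def adam_step(param, grad, m, v, t, decoupled):
    if not decoupled:
        grad = grad + wd * param          # Adam: decay folded into the gradient,
                                          # so it gets rescaled by 1/sqrt(v_hat)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        param = param - lr * wd * param   # AdamW: decay applied directly,
                                          # untouched by the adaptive scaling
    return param, m, v

p_adam, p_adamw = 1.0, 1.0
m1 = v1 = m2 = v2 = 0.0
for t in range(1, 4):
    g = 0.5  # a constant toy gradient
    p_adam, m1, v1 = adam_step(p_adam, g, m1, v1, t, decoupled=False)
    p_adamw, m2, v2 = adam_step(p_adamw, g, m2, v2, t, decoupled=True)

print(p_adam, p_adamw)  # the two schemes diverge even on this toy problem
```

The key observation: in Adam, parameters with large gradient history (large `v_hat`) get their decay shrunk by the adaptive denominator; in AdamW, every parameter decays at the same rate regardless of its gradient statistics.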

# %% nbgrader={"grade": false, "grade_id": "systems-q3", "solution": true}
# %% [markdown]
"""
### Question 3: Optimizer Selection in Production

You built three optimizers with different computational costs.

**a) Training Costs**: Rank SGD, Adam, and AdamW by:

- Memory usage per parameter: _____
- Computation per step: _____
- Convergence speed: _____

**b) Production Decision**: When training a transformer for 1 week on expensive GPUs, what factors would determine your optimizer choice?

*Think about: wall-clock time, hardware utilization, final model quality, and cost per training run*
"""
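A rough per-parameter accounting for part (a); a sketch that assumes vanilla implementations (no SGD momentum, float32 everywhere), not measured data:

```python
# Per-parameter cost summary for the three optimizers built in this module.
optimizer_state = {
    "SGD":   {"state_buffers": 0, "step_cost": "one fused multiply-add per parameter"},
    "Adam":  {"state_buffers": 2, "step_cost": "two EMAs, bias correction, sqrt and divide"},
    "AdamW": {"state_buffers": 2, "step_cost": "Adam's ops plus one decay multiply"},
}

for name, info in optimizer_state.items():
    multiple = 1 + info["state_buffers"]  # the parameter itself + optimizer buffers
    print(f"{name:5s}: {multiple}x parameter memory; per-step cost: {info['step_cost']}")
```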

# %% nbgrader={"grade": false, "grade_id": "systems-q4", "solution": true}
# %% [markdown]
"""
### Question 4: Gradient Processing Patterns

Your optimizers process gradients differently: SGD uses them directly, while Adam smooths them over time.

**a) Gradient Noise**: In mini-batch training, gradients from different batches can vary significantly. How does this affect:

- SGD convergence behavior?
- Adam's moment estimates?
- The batch sizes required for stable training?

**b) Systems Design**: If you had to implement gradient compression (to reduce communication in distributed training), how would it affect each optimizer differently?

*Consider: gradient sparsity, compression error accumulation, and adaptive learning rates*
"""
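A pure-Python toy illustration of part (a), independent of the module: a noisy per-batch gradient stream next to Adam's smoothed first-moment estimate.

```python
import random

# Compare raw batch gradients (what SGD sees) with Adam's exponential moving
# average of them (the bias-corrected first moment m_hat).
random.seed(0)
beta1 = 0.9
true_grad = 1.0

raw, smoothed = [], []
m = 0.0
for t in range(1, 101):
    g = true_grad + random.gauss(0, 0.5)  # batch gradient = signal + noise
    m = beta1 * m + (1 - beta1) * g       # Adam's EMA of the gradient
    raw.append(g)
    smoothed.append(m / (1 - beta1 ** t))  # bias-corrected, comparable scale

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

# SGD follows the raw stream; Adam's update direction follows the smoother m_hat.
print(f"raw gradient variance     : {variance(raw):.3f}")
print(f"smoothed estimate variance: {variance(smoothed[10:]):.3f}")  # skip warm-up
```

The variance reduction is why Adam tolerates smaller batches than plain SGD, and also why compression error fed into the moment buffers can persist across many steps.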

# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Optimizers