# Milestone 06: MLPerf - The Optimization Era (2018)

## Historical Context

As ML models grew larger and deployment became critical, the community needed systematic optimization methodologies. The MLPerf benchmark suite (launched in 2018, now run by MLCommons) established standardized benchmarking and optimization workflows, shifting the focus from "can we build it?" to "can we deploy it efficiently?"

This milestone teaches production optimization: the systematic process of profiling, compressing, and accelerating models for real-world deployment.
## What You're Building

A complete MLPerf-style optimization pipeline that takes YOUR networks from previous milestones and makes them production-ready!
## Required Modules
| Module | Component | What It Provides |
|---|---|---|
| Module 01-03 | Tensor, Linear, ReLU | YOUR base components |
| Module 11 | Embeddings | YOUR token embeddings |
| Module 12 | Attention | YOUR multi-head attention |
| Module 14 | Profiling | YOUR profiler for measurement |
| Module 15 | Quantization | YOUR INT8/FP16 implementations |
| Module 16 | Compression | YOUR pruning techniques |
| Module 17 | Acceleration | YOUR vectorized operations |
## Milestone Structure

This milestone has two scripts, each covering different optimization techniques:
### 01_optimization_olympics.py
Purpose: Optimize static models (MLP, CNN)
Uses YOUR implementations:
- Module 14 (Profiling): Measure parameters, latency, size
- Module 15 (Quantization): FP32 → INT8 (4× compression)
- Module 16 (Compression): Pruning (remove weights)
Networks from:
- DigitMLP (Milestone 03)
- SimpleCNN (Milestone 04)
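As a rough illustration of how FP32 → INT8 gives the 4× compression, here is a minimal symmetric per-tensor quantizer in NumPy. This is a sketch only; the function names are illustrative and YOUR Module 15 implementation may structure this differently.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: FP32 weights -> INT8 plus one scale."""
    scale = float(np.abs(w).max()) / 127.0  # largest magnitude maps to +/-127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 weights for inference."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
# 1 byte per weight instead of 4 -> the 4x size reduction
assert q.nbytes * 4 == w.nbytes
# rounding error is bounded by half a quantization step
assert np.abs(w - dequantize_int8(q, scale)).max() <= scale / 2 + 1e-7
```

The single shared scale is what makes this "per-tensor"; per-channel scales trade a few extra bytes for lower error.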
### 02_generation_speedup.py
Purpose: Speed up Transformer generation
Uses YOUR implementations:
- Module 11 (Embeddings): Token embeddings
- Module 12 (Attention): Multi-head attention
- Module 14 (Profiling): Measure speedup
- Module 18 (KV Cache): Cache K,V for 6-10× speedup
Networks from:
- MinimalTransformer (Milestone 05)
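The speedup comes from avoiding redundant work: without a cache, every generation step re-runs attention over the full prefix from scratch; with a cache, each step projects K and V only for the newest token and appends them. A toy single-head sketch of the caching pattern follows (Module 18's `enable_kv_cache` wraps YOUR attention differently; the identity "projections" here are purely illustrative):

```python
import numpy as np

def attend_one(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Attention for a single query vector over cached keys/values."""
    scores = K @ q / np.sqrt(len(q))   # (t,)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    w /= w.sum()
    return w @ V                       # (d,)

d, rng = 8, np.random.default_rng(0)
K_cache, V_cache, outputs = [], [], []
for step in range(5):
    x = rng.standard_normal(d)  # hidden state of the newest token
    # with a KV cache we only project the NEW token and append,
    # instead of recomputing k, v for the whole prefix each step
    K_cache.append(x)           # toy identity projection for k
    V_cache.append(x)           # toy identity projection for v
    outputs.append(attend_one(x, np.stack(K_cache), np.stack(V_cache)))
```

Each step's attention cost grows only with the prefix length, while the expensive per-token projections stay constant.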
## Expected Results

### Static Model Optimization (01)
| Optimization | Size | Accuracy | Notes |
|---|---|---|---|
| Baseline | 100% | 85-90% | Full precision |
| + Quantization | 25% | 84-89% | INT8 weights |
| + Pruning | 12.5% | 82-87% | 50% weights removed |
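The pruning row removes the 50% of weights with the smallest magnitudes. A sketch of global magnitude pruning, assuming NumPy weights (Module 16's actual API may differ; the function name is illustrative):

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # the k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

w = np.random.randn(100, 100)
pruned = magnitude_prune(w, 0.5)  # roughly half the weights are now exactly zero
```

Note the size win in the table assumes the zeros are not stored densely, e.g. via a sparse format.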
### Generation Speedup (02)
| Mode | Time/Token | Speedup |
|---|---|---|
| Without Cache | ~10ms | 1× |
| With KV Cache | ~1ms | 6-10× |
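Per-token timings like these are best measured with a warmup pass and a median over many steps, since the median resists timer jitter. A sketch of such a harness (Module 14's profiler likely handles this for you; `generate_step` is a hypothetical stand-in for one decoding step):

```python
import time
import statistics

def time_per_token(generate_step, n_tokens: int = 50, warmup: int = 5) -> float:
    """Median seconds per call to `generate_step`, measured after warmup."""
    for _ in range(warmup):
        generate_step()  # warm caches before measuring
    samples = []
    for _ in range(n_tokens):
        t0 = time.perf_counter()
        generate_step()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# speedup = time_per_token(step_without_cache) / time_per_token(step_with_cache)
```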
## Running the Milestone

```bash
# Optimize MLP/CNN (profiling + quantization + pruning)
python milestones/06_2018_mlperf/01_optimization_olympics.py

# Speed up Transformer generation (KV caching)
python milestones/06_2018_mlperf/02_generation_speedup.py
```

Or via `tito`:

```bash
tito milestone run 06
```
## Key Learning

Unlike earlier milestones where you "build and run," optimization is an iterative loop:

1. Measure - profile to find bottlenecks
2. Optimize - apply targeted techniques
3. Validate - check that accuracy didn't degrade
4. Repeat - iterate until deployment targets are met

This is ML systems engineering: the skill that ships products!
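The measure-optimize-validate loop can be sketched as a small driver. Everything here is illustrative: the `profile`/`evaluate` callables and the dict-based toy "model" are hypothetical stand-ins, not this repo's API.

```python
def optimize_until_target(model, steps, profile, evaluate,
                          target_size, min_accuracy):
    """Measure -> optimize -> validate -> repeat until targets are met."""
    for step in steps:
        if profile(model) <= target_size:       # measure: already at target?
            break
        candidate = step(model)                 # optimize: apply one technique
        if evaluate(candidate) < min_accuracy:  # validate: accuracy gate
            break                               # reject the step, keep the old model
        model = candidate
    return model

# toy example: the "model" is just a dict of stats, each step shrinks it
quantize = lambda m: {"size": m["size"] * 0.25, "acc": m["acc"] - 0.01}
prune    = lambda m: {"size": m["size"] * 0.5,  "acc": m["acc"] - 0.02}
final = optimize_until_target(
    {"size": 100.0, "acc": 0.90}, [quantize, prune],
    profile=lambda m: m["size"], evaluate=lambda m: m["acc"],
    target_size=12.5, min_accuracy=0.80,
)
# final: 12.5% of the original size, with accuracy still above the gate
```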
## Further Reading
- MLPerf: https://mlcommons.org/en/inference-edge-11/
- Deep Compression (Han et al., 2015): https://arxiv.org/abs/1510.00149
- Efficient Transformers Survey: https://arxiv.org/abs/2009.06732