feat: establish Python-to-notebook workflow for Lens colabs

Introduces a new development workflow in which colabs are maintained as Python
source files and converted to Jupyter notebooks for student distribution.

**New Infrastructure:**
- Python source files in colabs/src/ (version controlled, clean git diffs)
- Extracted utilities in colabs/src/utils/ (reusable across colabs)
- Generated notebooks in colabs/notebooks/ (student distribution)
- Complete workflow documentation

**First Colab Implemented:**
- Ch01 AI Triangle colab demonstrating model-data-compute interdependence
- Interactive exploration: students try to improve accuracy from 80% to 90%
- Students discover bottlenecks when scaling only one component
- Uses simplified analytical model (no full Lens toolkit yet)

**Benefits:**
- Clean version control (Python source vs notebook JSON)
- Easier refactoring and code reuse
- Better code quality (linters, formatters work on .py files)
- Standard Jupyter format for students (no workflow change for them)

**Files:**
- colabs/src/ch01_ai_triangle.py - Python source (percent format)
- colabs/src/utils/ai_triangle_sim.py - AITriangleSimulator class
- colabs/notebooks/ch01_ai_triangle.ipynb - Student-facing notebook
- colabs/PYTHON_TO_NOTEBOOK_WORKFLOW.md - Complete workflow guide
- colabs/CONVERSION_STATUS.md - Current status and test plan
- colabs/CHAPTER_MAPPING.md - Progressive Lens module availability
- colabs/MLS_SIMULATOR_BUILD_VS_LEVERAGE.md - Hybrid approach analysis
- colabs/NAMING_DISCUSSION.md - Why "Lens" not "MLS Simulator"

**Next Steps:**
- Test conversion with Jupytext
- Validate in Google Colab
- Apply workflow to remaining chapters
Vijay Janapa Reddi
2025-11-28 07:01:41 +01:00
parent 60dee31ec7
commit c08ca3424a
9 changed files with 2317 additions and 115 deletions

colabs/CHAPTER_MAPPING.md

@@ -0,0 +1,41 @@
# Chapter to Directory Mapping
## Core Chapters (20 total)
| Ch# | Directory | Chapter Name | Lens Modules Available |
|-----|-----------|--------------|------------------------|
| 01 | introduction | Introduction | None (introduces Lens concept) |
| 02 | ml_systems | ML Systems | `lens.hardware`, `lens.network` |
| 03 | dl_primer | Deep Learning Primer | + `lens.workload` |
| 04 | dnn_architectures | DNN Architectures | (same as Ch03) |
| 05 | workflow | Workflow | + `lens.pipeline` |
| 06 | data_engineering | Data Engineering | + `lens.data` |
| 07 | frameworks | Frameworks | + `lens.frameworks` |
| 08 | training | Training | + `lens.distributed` |
| 09 | efficient_ai | Efficient AI | + `lens.compression` |
| 10 | optimizations | Optimizations | (same as Ch09) |
| 11 | hw_acceleration | Hardware Acceleration | + `lens.roofline` ⭐ KEY |
| 12 | benchmarking | Benchmarking | + `lens.benchmarking` |
| 13 | ops | MLOps | + `lens.lifecycle`, `lens.drift` |
| 14 | ondevice_learning | On-Device Learning | + `lens.federated` (Flower) |
| 15 | privacy_security | Privacy & Security | + `lens.security` |
| 16 | robust_ai | Robust AI | + `lens.reliability` |
| 17 | responsible_ai | Responsible AI | + `lens.fairness` |
| 18 | sustainable_ai | Sustainable AI | + `lens.carbon` (electricityMap) |
| 19 | ai_for_good | AI for Good | (all modules available) |
| 20 | conclusion | Conclusion | (all modules available) |
## Key Progressive Introductions
- **Ch02**: First Lens usage - simple hardware/network comparisons
- **Ch11**: Roofline model introduced - becomes central analysis tool
- **Ch14**: Federated learning (Flower integration)
- **Ch18**: Carbon modeling (electricityMap integration)
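
A minimal sketch of what this progressive availability could look like in code: each chapter's colab can use every module introduced so far. The module names come from the table above; the helper itself is purely illustrative, since the Lens package does not exist yet.

```python
# Which Lens modules each chapter introduces (taken from the table above).
NEW_MODULES = {
    2: ["lens.hardware", "lens.network"],
    3: ["lens.workload"],
    5: ["lens.pipeline"],
    6: ["lens.data"],
    7: ["lens.frameworks"],
    8: ["lens.distributed"],
    9: ["lens.compression"],
    11: ["lens.roofline"],
    12: ["lens.benchmarking"],
    13: ["lens.lifecycle", "lens.drift"],
    14: ["lens.federated"],
    15: ["lens.security"],
    16: ["lens.reliability"],
    17: ["lens.fairness"],
    18: ["lens.carbon"],
}

def available_modules(chapter):
    """Every Lens module introduced up to and including the given chapter."""
    return sorted(m for ch, mods in NEW_MODULES.items() if ch <= chapter for m in mods)

print(available_modules(11))  # roofline (and everything before it) unlocks in Ch11
```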
## Frontiers Chapter
| Ch# | Directory | Chapter Name |
|-----|-----------|--------------|
| 21 | frontiers | Frontiers of ML Systems |
Note: Frontiers may not need traditional Lens colabs (more forward-looking)

colabs/CONVERSION_STATUS.md

@@ -0,0 +1,153 @@
# Python to Notebook Conversion - Status
## Completed
### ✅ Directory Structure
```
colabs/
├── src/                                # Python source files
│   ├── ch01_ai_triangle.py             # ✅ Created
│   └── utils/
│       └── ai_triangle_sim.py          # ✅ Extracted utility class
├── notebooks/
│   └── ch01_ai_triangle.ipynb          # ✅ Already exists (finalized)
└── docs/
    └── PYTHON_TO_NOTEBOOK_WORKFLOW.md  # ✅ Complete guide
```
### ✅ Files Created
1. **`src/ch01_ai_triangle.py`** - Python source using percent format (`# %%`)
- Contains all markdown and code cells
- Properly formatted for Jupytext conversion
- Ready for version control
2. **`src/utils/ai_triangle_sim.py`** - Reusable simulator class
- Extracted from colab for maintainability
- Documented with docstrings
- Can be imported by future colabs
3. **`PYTHON_TO_NOTEBOOK_WORKFLOW.md`** - Complete workflow documentation
- Explains percent format
- Shows Jupytext usage
- Includes manual conversion script
- Best practices and automation ideas
### ✅ Documentation Updates
- Updated `colabs/README.md` with:
- New directory structure section
- Workflow for authors
- Links to conversion documentation
- Progress checklist
## Current State
**Source of Truth**: `colabs/src/ch01_ai_triangle.py` (Python file)
**Student Distribution**: `colabs/notebooks/ch01_ai_triangle.ipynb` (already finalized)
**Next Conversion**: When updating the colab, edit `.py` file and regenerate `.ipynb`
## How to Use (For Future Updates)
### Option 1: Using Jupytext (Recommended)
```bash
# Install once
pip install jupytext
# Convert single file
jupytext --to notebook colabs/src/ch01_ai_triangle.py \
--output colabs/notebooks/ch01_ai_triangle.ipynb
# Convert all chapter colabs
jupytext --to notebook colabs/src/ch*.py \
--output-dir colabs/notebooks/
```
### Option 2: Manual Editing
1. Edit `colabs/notebooks/ch01_ai_triangle.ipynb` directly in Jupyter
2. When ready to extract source:
```bash
jupytext --to py:percent colabs/notebooks/ch01_ai_triangle.ipynb \
--output colabs/src/ch01_ai_triangle.py
```
### Option 3: Paired Notebooks (Best for Development)
```bash
# Pair files (changes sync automatically)
jupytext --set-formats ipynb,py:percent colabs/notebooks/ch01_ai_triangle.ipynb
# Now edits to either .ipynb or .py sync to both!
```
## Test Plan (Next Steps)
### 1. Validate Conversion
```bash
# Convert Python → Notebook
jupytext --to notebook colabs/src/ch01_ai_triangle.py \
--output /tmp/test_ch01.ipynb
# Compare with original
diff colabs/notebooks/ch01_ai_triangle.ipynb /tmp/test_ch01.ipynb
```
Expected: Minimal differences (metadata only)
### 2. Test in Google Colab
1. Upload `/tmp/test_ch01.ipynb` to Google Colab
2. Run all cells
3. Verify:
- All imports work
- Simulator runs correctly
- Visualizations display
- Interactive cells allow parameter changes
### 3. Validate Workflow
```bash
# Make small edit to Python source
# Convert to notebook
# Test in Colab
# Commit both files
```
## Future Automation Ideas
### Pre-commit Hook
Auto-convert modified Python colabs to notebooks on commit:
```bash
# .git/hooks/pre-commit
jupytext --to notebook colabs/src/ch*.py --output-dir colabs/notebooks/
git add colabs/notebooks/*.ipynb
```
### CI/CD Pipeline
GitHub Actions workflow to auto-convert on push:
```yaml
# .github/workflows/convert-colabs.yml
- run: pip install jupytext
- run: jupytext --to notebook colabs/src/ch*.py --output-dir colabs/notebooks/
- run: git commit -am "chore: auto-convert colabs"
```
### VSCode Integration
Paired notebooks using Jupytext extension:
- Edit Python or notebook, both stay in sync
- Better git diffs from Python format
- Full Jupyter functionality when needed
## Status: Ready for Testing
- ✅ Python source created and formatted correctly
- ✅ Utility extracted for reusability
- ✅ Workflow documented
- ✅ README updated
- ⏳ Conversion tested (ready for user)
- ⏳ Google Colab validation (ready for user)
**Recommendation**: Test the conversion workflow by:
1. Installing Jupytext: `pip install jupytext`
2. Converting: `jupytext --to notebook colabs/src/ch01_ai_triangle.py`
3. Uploading to Google Colab and running all cells
4. If successful, adopt this workflow for all future colabs

colabs/MLS_SIMULATOR_BUILD_VS_LEVERAGE.md

@@ -0,0 +1,315 @@
# MLS Simulator: Build vs Leverage Analysis
## Executive Summary
After researching existing ML systems simulation tools and frameworks, I recommend a **hybrid approach**: build a lightweight custom MLS Simulator that wraps and integrates existing battle-tested tools where they exist, and fills gaps with simple analytical models where they don't.
**Key Finding**: No single existing framework provides the pedagogically-focused, unified interface we need across all deployment paradigms (Cloud, Edge, Mobile, TinyML) with simple analytical models. However, several excellent tools exist for specific domains that we should leverage rather than rebuild.
---
## Option 1: Build Custom MLS Simulator from Scratch
### What We'd Need to Build
- Hardware performance models (CPU, GPU, TPU, mobile, MCU)
- Network simulation (cloud, edge, mobile, TinyML tiers)
- Workload characterization (training, inference, data loading)
- Lifecycle management (drift, retraining)
- Reliability models (SDC, checkpointing, fault injection)
- Sustainability models (carbon intensity, region-aware scheduling)
- Security simulation (adversarial attacks, model extraction)
- Federated learning framework
### Pros
**Pedagogically optimized**: Simple analytical models designed for learning (not research accuracy)
**Unified API**: One consistent interface across all chapters
**Progressive complexity**: We control exactly when complexity is introduced
**Lightweight**: Fast execution in Colab notebooks (no heavyweight dependencies)
**Systems-first**: Designed around systems trade-offs, not algorithm accuracy
### Cons
**Reinventing wheels**: Many components already exist in mature tools
**Validation burden**: Need to verify analytical models match reality
**Maintenance**: Ongoing updates as hardware/frameworks evolve
**Credibility**: Students may question "toy models" vs real tools
**Time investment**: 3-6 months to build and validate all components
---
## Option 2: Leverage Existing Tools Only
### Available Tools by Domain
#### ML Systems Simulation
- **[ASTRA-sim 2.0](https://astra-sim.github.io/)**: End-to-end training simulation with hierarchical networks
- Pros: Detailed, cycle-accurate, widely used in research
- Cons: C++ based, complex setup, overkill for pedagogy
- **[VIDUR](https://github.com/microsoft/vidur)**: LLM inference performance simulation
- Pros: Production-ready, realistic LLM workloads
- Cons: LLM-specific, heavy dependencies
- **[MLSynth](https://dl.acm.org/doi/10.1145/3748273.3749211)**: Synthetic ML trace generation
- Pros: Realistic workload patterns
- Cons: Trace generation focus, not full system simulation
#### Hardware Accelerator Modeling
- **[SCALE-Sim](https://github.com/ARM-software/SCALE-Sim)**: Systolic CNN accelerator simulator (ARM/Georgia Tech)
- Pros: Cycle-accurate, TPU-like systolic arrays, v3 adds sparse support + DRAM modeling
- Cons: Systolic array specific, cycle-accurate = slow, Python but complex setup
- **Pedagogical fit**: Could use simplified mode for basic accelerator concepts
- **[Timeloop](https://github.com/NVlabs/timeloop)**: NVIDIA/MIT accelerator modeling framework
- Pros: Fast analytical model, mapper for optimal dataflows, supports sparse (v2.0), widely used
- Cons: Complex configuration, research-oriented, steep learning curve
- **Pedagogical fit**: Excellent for advanced students, too complex for intro
- **[MAESTRO](https://github.com/maestro-project/maestro)**: Georgia Tech dataflow cost model
- Pros: Fast analytical (not cycle-accurate), 96% accuracy vs RTL, 20+ statistics
- Cons: Dataflow-specific, requires understanding of mapping directives
- **Pedagogical fit**: Good analytical approach, could inspire simplified wrapper
#### Roofline Analysis
- **[Rooflini](https://github.com/giopaglia/rooflini)**: Python roofline plotting library
- Pros: Pure Python, easy integration, good visualizations
- Cons: Plotting only, doesn't simulate workloads
- **[Perfplot](https://github.com/GeorgOfenbeck/perfplot)**: Roofline visualization
- Pros: Clean API, educational focus
- Cons: Visualization focused, limited to roofline model
#### Federated Learning
- **[Flower](https://flower.ai/)**: Production federated learning framework
- Pros: Battle-tested, active development, great docs
- Cons: Complex for beginners, production-oriented
#### Drift Detection & Monitoring
- **[Evidently AI](https://www.evidentlyai.com/)**: ML monitoring and drift detection
- Pros: Production-ready, comprehensive metrics
- Cons: Heavy framework, complex for pedagogy
#### Carbon/Sustainability
- **[CodeCarbon](https://mlco2.github.io/codecarbon/)**: ML carbon footprint tracking
- **[electricityMap API](https://www.electricitymaps.com/)**: Real-time carbon intensity
- Pros: Real data, credible sources
- Cons: API limits, requires internet connectivity
### Pros of Pure Leverage Approach
**Battle-tested**: Production-ready tools with real validation
**Credibility**: Students learn actual industry tools
**No maintenance**: Tools maintained by their communities
**Rich features**: More capabilities than we'd build
### Cons of Pure Leverage Approach
**Fragmented**: Different APIs, paradigms, languages across tools
**Too complex**: Production tools have steep learning curves
**Missing pieces**: No unified cloud/edge/mobile/TinyML comparison framework
**Heavy dependencies**: Many tools require complex setups
**Pedagogical mismatch**: Tools optimized for research/production, not learning
---
## Option 3: Hybrid Approach (RECOMMENDED)
### Architecture: Lightweight MLS Wrapper + Existing Tools
Build a **thin analytical layer** (MLS Simulator) that provides:
1. **Unified API** across deployment paradigms
2. **Simple analytical models** for core hardware/network trade-offs
3. **Integration wrappers** for existing tools where they excel
### Component Strategy
| Component | Approach | Rationale |
|-----------|----------|-----------|
| **Hardware Models** | **Build analytical** | Need unified cloud/edge/mobile/TinyML comparison; existing tools too heavy |
| **Accelerator Modeling** | **Inspired by MAESTRO/Timeloop** | Use their analytical approach (not tools directly); simplified dataflow cost models |
| **Systolic Arrays** | **Simplified SCALE-Sim concepts** | Teach TPU-like architectures without cycle-accurate complexity |
| **Roofline Analysis** | **Wrap Rooflini** | Excellent Python tool, just need workload characterization layer |
| **Federated Learning** | **Wrap Flower** | Production-ready, too complex to rebuild, just need simplified interface |
| **Drift Detection** | **Build analytical + examples with Evidently** | Simple drift models for pedagogy, show real tool in advanced section |
| **Carbon Modeling** | **Integrate electricityMap API** | Real data is best, wrap in simplified interface |
| **Network Simulation** | **Build analytical** | Simple latency/bandwidth models, existing tools overkill |
| **Reliability (SDC)** | **Build analytical** | Fault injection needs custom pedagogical design |
| **Security (Adversarial)** | **Use existing attacks + wrap** | Use CleverHans/ART, wrap in simplified interface |
### Proposed MLS Simulator Architecture
```python
# Core analytical models (custom)
from mls import hardware, network, workload
# Integrated existing tools
from mls import roofline # Wraps Rooflini
from mls import federated # Wraps Flower
from mls import carbon # Wraps electricityMap
from mls import security # Wraps CleverHans/ART
# Example: Unified interface with analytical backend
cloud = hardware.cloud_tier(gpu_type="A100")
edge = hardware.edge_tier(device="Jetson Xavier")
# Simple analytical model (custom)
cloud_perf = network.simulate_inference(cloud, model="ResNet-50")
edge_perf = network.simulate_inference(edge, model="ResNet-50")
# Roofline analysis (wrapped Rooflini)
roofline_result = roofline.analyze(
hardware=cloud,
workload=workload.characterize("ResNet-50", batch_size=32)
)
# Federated learning (wrapped Flower)
fed_sim = federated.simulate(
clients=10,
data_distribution="iid",
model="MobileNetV2"
)
# Carbon modeling (integrated electricityMap)
carbon_cost = carbon.compare_regions(
workload=cloud_perf,
regions=["US-West", "EU-North", "Asia-East"]
)
```
### What to Build (Custom Analytical Models)
**1. Hardware Performance Models** (~2-3 weeks)
- Simple analytical formulas for FLOPS, memory bandwidth, power (see the sketch after this list)
- Device database: A100, V100, Jetson, iPhone chips, Arduino MCUs
- Validation: ±20% accuracy vs MLPerf benchmarks
**2. Network Tier Models** (~1 week)
- Simple latency/bandwidth models for cloud/edge/mobile/TinyML
- Cost models ($/inference, $/training hour)
- Deployment constraints (offline capability, privacy)
**3. Drift Simulation** (~1 week)
- Synthetic drift injection (covariate, prior, concept)
- Simple statistical tests (KS, PSI)
- Retraining decision logic
**4. Reliability Models** (~2 weeks)
- Silent data corruption injection
- Checkpointing overhead simulation
- Fault tolerance strategy comparison
**Total Build Time**: ~6-8 weeks for core analytical components
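To make "simple analytical formulas" concrete, here is a minimal sketch of the kind of hardware performance model item 1 describes. It assumes a roofline-style bound (latency limited by whichever is slower, compute or memory traffic); the device numbers and the ResNet-50 workload figures are rough, illustrative assumptions, not the validated device database proposed above.

```python
# Minimal analytical hardware model sketch (illustrative, unvalidated numbers).
DEVICES = {
    # name: (peak FLOP/s, memory bandwidth B/s, power W) -- rough figures for illustration
    "A100 (cloud)":  (312e12, 1.6e12, 400),
    "Jetson (edge)": (21e12,  137e9,  30),
    "Phone SoC":     (2e12,   40e9,   5),
    "MCU (TinyML)":  (100e6,  0.5e9,  0.05),
}

def estimate_latency_s(flops, bytes_moved, device):
    """Roofline-style bound: latency is the max of compute time and memory time."""
    peak_flops, bandwidth, _power = DEVICES[device]
    return max(flops / peak_flops, bytes_moved / bandwidth)

# Example workload: a ResNet-50-style inference, ~4 GFLOPs and ~100 MB of traffic (assumed).
for name in DEVICES:
    latency = estimate_latency_s(flops=4e9, bytes_moved=100e6, device=name)
    print(f"{name:15s} ~{latency * 1e3:10.2f} ms per inference")
```

Similarly, a minimal sketch of the "simple statistical tests" in item 3, using a two-sample Kolmogorov-Smirnov test from SciPy to flag covariate drift; the synthetic data and the p-value threshold are illustrative assumptions only.

```python
# Minimal covariate-drift check sketch using a KS test (illustrative threshold).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature distribution at training time
production = rng.normal(loc=0.4, scale=1.0, size=5000)  # same feature with injected drift

result = ks_2samp(reference, production)
drift_detected = result.pvalue < 0.01  # simple decision rule; a real pipeline would tune this
print(f"KS statistic={result.statistic:.3f}, p={result.pvalue:.2e}, drift detected: {drift_detected}")
```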
### What to Wrap (Existing Tools)
**1. Roofline Analysis** (~1 week)
- Wrap [Rooflini](https://github.com/giopaglia/rooflini) for plotting
- Add workload characterization layer
- Create pedagogical examples
**2. Federated Learning** (~2 weeks)
- Simplified Flower wrapper for common scenarios
- Pre-configured scenarios (IID, non-IID, heterogeneous)
- Visualization layer for convergence/communication
**3. Carbon Modeling** (~1 week)
- Wrap electricityMap API with caching (see the sketch after this list)
- Add cost comparison utilities
- Offline fallback with static carbon intensity data
**4. Security/Adversarial** (~1 week)
- Wrap CleverHans or Adversarial Robustness Toolbox
- Simplified attack interfaces (FGSM, PGD)
- Defense evaluation utilities
**Total Integration Time**: ~5 weeks
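As a concrete illustration of item 3's offline-fallback idea, here is a minimal sketch. The endpoint URL, header name, response field, and static intensity values are placeholder assumptions rather than the documented electricityMap API; the real wrapper would follow their API reference and add caching.

```python
import requests

STATIC_CARBON_INTENSITY = {  # gCO2eq/kWh, illustrative averages only
    "US-West": 250, "EU-North": 50, "Asia-East": 550,
}

def carbon_intensity(zone, api_token=None):
    """Grid carbon intensity for a zone, falling back to static data when offline."""
    if api_token:
        try:
            resp = requests.get(
                "https://api.electricitymap.org/v3/carbon-intensity/latest",  # assumed endpoint
                params={"zone": zone},
                headers={"auth-token": api_token},  # assumed header name
                timeout=5,
            )
            resp.raise_for_status()
            return resp.json()["carbonIntensity"]  # assumed response field
        except Exception:
            pass  # no network or API error -> use the static table
    return STATIC_CARBON_INTENSITY[zone]

def compare_regions(energy_kwh, zones, api_token=None):
    """Rank regions by estimated emissions (kg CO2eq) for a given energy budget."""
    emissions = {z: energy_kwh * carbon_intensity(z, api_token) / 1000 for z in zones}
    return dict(sorted(emissions.items(), key=lambda kv: kv[1]))

print(compare_regions(energy_kwh=120, zones=["US-West", "EU-North", "Asia-East"]))
```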
---
## Comparison Matrix
| Criteria | Build Custom | Leverage Only | Hybrid (Recommended) |
|----------|--------------|---------------|---------------------|
| **Pedagogical fit** | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Development time** | 3-6 months | 2-3 weeks | 2-3 months |
| **Credibility** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Maintenance burden** | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| **Unified API** | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Execution speed** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| **Real-world relevance** | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Progressive complexity** | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
---
## Recommendation: Hybrid Approach
### Why Hybrid Wins
1. **Best of both worlds**: Pedagogically optimized analytical models + battle-tested tools where they excel
2. **Credibility**: Students see real tools (Flower, Rooflini, electricityMap) integrated into unified framework
3. **Maintainable**: Core analytical models are simple; complex components maintained by their communities
4. **Unified API**: Single consistent interface for students across all chapters
5. **Progressive**: Start with simple analytical models, introduce real tools as complexity builds
6. **Practical timeline**: 2-3 months vs 6+ months for full custom build
### Implementation Roadmap
**Phase 1: Core Analytical Models (Weeks 1-6)**
- Hardware performance models (cloud, edge, mobile, TinyML)
- Network tier simulation
- Basic workload characterization
- Simple drift injection
**Phase 2: Tool Integration (Weeks 7-11)**
- Roofline wrapper (Rooflini)
- Federated learning wrapper (Flower)
- Carbon API integration (electricityMap)
- Security wrapper (CleverHans/ART)
**Phase 3: Pilot Colabs (Weeks 12-14)**
- Ch02: Deployment paradigm comparison (analytical)
- Ch11: Roofline analysis (wrapped Rooflini)
- Ch14: Federated learning (wrapped Flower)
**Phase 4: Validation & Iteration (Weeks 15-16)**
- Student testing
- Accuracy validation (±20% vs benchmarks)
- Documentation and examples
---
## Decision Criteria
### Choose Full Custom If:
- You have 6+ months development time
- You want complete control over all components
- Analytical simplicity is more important than real-world tools
- You're concerned about external dependencies
### Choose Leverage Only If:
- You're okay with fragmented student experience
- Students can handle production-level complexity
- You have 2-3 weeks for integration only
- You prioritize real-world tool experience over unified learning
### Choose Hybrid If (RECOMMENDED):
- You want pedagogical optimization + real-world credibility
- You have 2-3 months development time
- You value unified API + progressive complexity
- You want maintainable long-term solution
---
## Open Questions for You
1. **Timeline**: Do you have 2-3 months for hybrid development before needing colabs?
2. **Accuracy**: Is ±20% accuracy for analytical models acceptable for pedagogy?
3. **Dependencies**: Are you comfortable with external dependencies (Flower, Rooflini, etc.) in Colab notebooks?
4. **Scope**: Should we start with Phase 1-2 (core + integration) and defer some components?
---
## Conclusion
The MLS Simulator vision **absolutely makes sense**, but you don't need to build everything from scratch. A hybrid approach gives you the pedagogical benefits of a unified analytical framework while leveraging excellent existing tools where they excel.
The key insight: **Wrap, don't replace.** Build the glue layer that gives students a consistent systems-thinking interface, but use battle-tested tools underneath where they exist.
**Recommended Next Step**: Prototype Phase 1 (core analytical models) for Ch02 deployment paradigm colab to validate the approach before committing to full development.

colabs/NAMING_DISCUSSION.md

@@ -0,0 +1,286 @@
# Naming the ML Systems Educational Framework
## The Problem with "MLS Simulator"
You're absolutely right - "MLS Simulator" doesn't quite fit. Here's why:
1. **Not just simulation**: It's analytical modeling, visualization, tool integration, and pedagogical framework
2. **Confusion with existing tools**: SCALE-Sim, ASTRA-sim, Timeloop are "simulators" (cycle-accurate or detailed)
3. **Too narrow**: "Simulator" implies hardware/performance focus, but we cover security, sustainability, drift, etc.
4. **Academic clash**: Researchers will expect gem5-level detail when they hear "simulator"
## What This Actually Is
This is a **pedagogical systems analysis toolkit** that:
- Provides analytical models (not cycle-accurate simulation)
- Integrates existing tools (Flower, Rooflini, etc.)
- Offers interactive exploration of trade-offs
- Builds progressively across chapters
- Focuses on systems thinking, not algorithm accuracy
## Naming Options
### Option 1: **MLSys Workbench**
**Tagline**: "An interactive toolkit for exploring ML systems trade-offs"
**Pros**:
- "Workbench" suggests tools, exploration, learning
- MLSys is established conference/community name
- Not pretending to be a research simulator
- Implies hands-on, practical work
**Cons**:
- Generic feeling
- Could be confused with development tools
**Example usage**:
```python
from mlsys_workbench import hardware, deployment, sustainability
# Compare deployment paradigms
cloud = deployment.CloudTier(gpu="A100")
edge = deployment.EdgeTier(device="Jetson")
```
---
### Option 2: **Lens** (Learning Environment for Network & Systems)
**Tagline**: "See ML systems through a new lens"
**Pros**:
- You already use "Lens colabs" - perfect alignment!
- Short, memorable, unique
- "Lens" = perspective, insight, clarity
- Works as verb: "Let's lens this problem"
- Not technical jargon
**Cons**:
- Might need explanation (but that's okay)
- Could conflict with existing projects named Lens
**Example usage**:
```python
from lens import hardware, network, carbon
# Compare with Lens
cloud_perf = lens.compare_deployments(
model="ResNet-50",
tiers=["cloud", "edge", "mobile", "tiny"]
)
```
**Brand potential**:
- Lens Colabs (already using this!)
- "View through the Lens"
- "Lens into ML systems"
---
### Option 3: **SysLens**
**Tagline**: "A systems lens for ML engineering"
**Pros**:
- Combines "systems" + "lens" concept
- More specific than just "Lens"
- Immediately clear it's about systems analysis
- Short, pronounceable
**Cons**:
- Slightly less elegant than just "Lens"
- Could feel like forced portmanteau
**Example usage**:
```python
from syslens import hardware, roofline, federated
# Analyze with SysLens
analysis = syslens.roofline(
hardware="A100",
workload="transformer_training"
)
```
---
### Option 4: **MLSys Studio**
**Tagline**: "Where ML systems concepts come to life"
**Pros**:
- "Studio" = creative workspace, exploration
- Professional sounding
- Clear it's for learning/exploration
**Cons**:
- Feels more like an IDE/GUI tool
- Less unique
---
### Option 5: **Atlas** (Analytical Toolkit for Learning About Systems)
**Tagline**: "Navigate the landscape of ML systems"
**Pros**:
- Atlas = maps, navigation, exploration
- Nice metaphor for exploring trade-off spaces
- Professional, memorable
- Works standalone without acronym
**Cons**:
- Common name (MongoDB Atlas, ATLAS experiment, etc.)
- Forced acronym (don't need to use it)
**Example usage**:
```python
from atlas import hardware, landscape
# Navigate the design space
landscape.plot_latency_vs_power(
models=["ResNet", "MobileNet", "EfficientNet"],
hardware=["A100", "V100", "Jetson", "iPhone"]
)
```
---
### Option 6: **TinySim** or **LiteSim**
**Tagline**: "Lightweight analytical models for ML systems learning"
**Pros**:
- Honest about being simplified/analytical
- Clear differentiation from SCALE-Sim, ASTRA-sim (heavyweight)
- "Tiny" aligns with TinyML content
**Cons**:
- Still uses "Sim" suffix (simulator confusion)
- Sounds less serious/powerful
---
### Option 7: **Prism**
**Tagline**: "Decompose ML systems complexity into understandable components"
**Pros**:
- Prism = breaking complex light into spectrum
- Beautiful metaphor for analyzing systems from multiple angles
- Short, memorable, elegant
- Visual/pedagogical connotation
**Cons**:
- Might conflict with existing projects
- Less obvious connection to ML systems
**Example usage**:
```python
from prism import analyze, spectrum
# View through multiple lenses
spectrum.deployment(
model="BERT",
aspects=["latency", "cost", "carbon", "privacy"]
)
```
---
## Recommendation: **Lens**
I recommend **Lens** for these reasons:
### 1. Perfect Alignment with Existing Branding
You're already calling them "Lens colabs" - the framework should match! Students will naturally understand that Lens colabs use the Lens toolkit.
### 2. Pedagogical Philosophy Match
- "Lens" emphasizes **perspective** and **insight**
- Not claiming to be authoritative simulation
- Honest about being an analytical/learning tool
- Suggests seeing systems from different angles
### 3. Clean, Memorable, Unique
- Short Python import: `from lens import hardware`
- Not jargon-heavy
- Easy to say and remember
- Doesn't clash with academic simulator terminology
### 4. Scalability
- Works for simple analytical models (Ch02)
- Works for integrated tools (Ch14 Flower wrapper)
- Works for complex multi-perspective analysis (Ch18 sustainability)
### 5. Brand Consistency
```
Lens Colabs → explore ML systems with Lens toolkit
"Let's examine this through the Lens framework"
"Use Lens to compare deployment paradigms"
```
---
## Alternative: **SysLens** if you want more specificity
If "Lens" feels too generic, **SysLens** is the runner-up:
- More explicit about systems focus
- Still short and memorable
- Aligns with Lens colabs branding
---
## Implementation Example: Lens
```python
# Package structure
lens/
├── __init__.py
├── hardware/          # Hardware performance models
│   ├── cloud.py
│   ├── edge.py
│   ├── mobile.py
│   └── tiny.py
├── network/           # Deployment tier models
├── workload/          # Model characterization
├── roofline/          # Wrapper around Rooflini
├── federated/         # Wrapper around Flower
├── carbon/            # Carbon intensity integration
├── reliability/       # SDC, checkpointing
├── security/          # Adversarial attacks
└── viz/               # Visualization utilities
# Student usage
from lens import hardware, roofline, carbon
# Compare hardware
cloud = hardware.CloudGPU("A100")
edge = hardware.EdgeDevice("Jetson Xavier")
# Roofline analysis
perf = roofline.analyze(cloud, workload="ResNet-50")
# Carbon comparison
carbon.compare_regions(workload=perf, regions=["US", "EU", "Asia"])
```
---
## Final Recommendation
**Name**: Lens
**Full name**: Lens - Interactive ML Systems Analysis Toolkit
**Tagline**: "See ML systems trade-offs through a new lens"
**Package name**: `lens` (or `mlsys-lens` if PyPI conflict)
**Branding**:
- Lens Colabs (already using!)
- Lens Toolkit
- View through Lens
- Lens into ML systems
This positions your framework as:
- ✅ Pedagogically focused (not research simulator)
- ✅ Analytical and fast (not cycle-accurate)
- ✅ Multi-perspective (hardware, carbon, security, etc.)
- ✅ Aligned with existing "Lens colabs" branding
- ✅ Distinct from SCALE-Sim, ASTRA-sim, Timeloop, etc.
What do you think?

colabs/PYTHON_TO_NOTEBOOK_WORKFLOW.md

@@ -0,0 +1,372 @@
# Python to Notebook Conversion Workflow
## Overview
This document describes the workflow for maintaining Lens colabs as Python source files (`.py`) and converting them to Jupyter notebooks (`.ipynb`) for student distribution.
## Why Python Source Files?
**Benefits:**
- **Better git diffs**: Line-by-line changes are clear in `.py` format
- **Code quality**: Easier to run linters, formatters, and static analysis
- **Refactoring**: Extract utilities and reusable components cleanly
- **Version control**: Merge conflicts are easier to resolve
- **Testing**: Can import and test functions directly
**Trade-off:**
- Students receive `.ipynb` files (standard Jupyter/Colab format)
- Conversion step required before distribution
## Directory Structure
```
colabs/
├── src/                               # Python source files (version controlled)
│   ├── ch01_ai_triangle.py            # Chapter 1 colab source
│   ├── ch02_deployment.py             # Chapter 2 colab source
│   ├── ...
│   └── utils/                         # Reusable utilities
│       ├── ai_triangle_sim.py         # AI Triangle simulator class
│       └── visualization.py           # Common plotting functions
├── notebooks/                         # Generated Jupyter notebooks (for students)
│   ├── ch01_ai_triangle.ipynb         # Generated from src/ch01_ai_triangle.py
│   ├── ch02_deployment.ipynb          # Generated from src/ch02_deployment.py
│   └── ...
└── docs/                              # Documentation
    ├── PYTHON_TO_NOTEBOOK_WORKFLOW.md # This file
    └── ...
```
## Python Source Format
### Percent Format (Jupytext-Compatible)
Use **percent format** (`# %%`) to define cells in Python files:
```python
# %% [markdown]
# # Colab Title
#
# This is a markdown cell with **bold** and *italic* text.
# %%
import numpy as np
print("This is a code cell")
# %% [markdown]
# Another markdown cell
```
### Cell Types
**Markdown cells:**
```python
# %% [markdown]
# Your markdown content here
# Use standard markdown syntax
```
**Code cells:**
```python
# %%
# Your Python code here
x = 42
```
### Special Directives
**Matplotlib inline (Colab/Jupyter specific):**
```python
# %matplotlib inline
```
This is commented in `.py` files but will work when converted to `.ipynb`.
## Conversion Tools
### Option 1: Jupytext (Recommended)
**Install:**
```bash
pip install jupytext
```
**Convert single file:**
```bash
jupytext --to notebook colabs/src/ch01_ai_triangle.py \
--output colabs/notebooks/ch01_ai_triangle.ipynb
```
**Convert all files:**
```bash
jupytext --to notebook colabs/src/ch*.py \
--output-dir colabs/notebooks/
```
**Set kernel metadata:**
```bash
jupytext --to notebook --set-kernel python3 colabs/src/ch01_ai_triangle.py
```
### Option 2: Manual Script (Custom Control)
Create `tools/scripts/convert_colabs.py`:
```python
#!/usr/bin/env python3
"""Convert Python source files to Jupyter notebooks"""
import json
import re
from pathlib import Path

def parse_py_to_cells(py_content):
    """Parse percent-format Python to notebook cells"""
    cells = []
    current_cell = None
    for line in py_content.split('\n'):
        if line.startswith('# %% [markdown]'):
            if current_cell:
                cells.append(current_cell)
            current_cell = {'cell_type': 'markdown', 'source': []}
        elif line.startswith('# %%'):
            if current_cell:
                cells.append(current_cell)
            current_cell = {'cell_type': 'code', 'source': [], 'outputs': []}
        else:
            if current_cell:
                if current_cell['cell_type'] == 'markdown':
                    # Remove leading "# " from markdown lines
                    clean_line = line[2:] if line.startswith('# ') else line
                    current_cell['source'].append(clean_line + '\n')
                else:
                    current_cell['source'].append(line + '\n')
    if current_cell:
        cells.append(current_cell)
    return cells

def create_notebook(cells):
    """Create Jupyter notebook JSON structure"""
    return {
        'cells': [
            {
                'cell_type': cell['cell_type'],
                'metadata': {},
                'source': cell['source'],
                **(
                    {'execution_count': None, 'outputs': []}
                    if cell['cell_type'] == 'code' else {}
                )
            }
            for cell in cells
        ],
        'metadata': {
            'kernelspec': {
                'display_name': 'Python 3',
                'language': 'python',
                'name': 'python3'
            },
            'language_info': {
                'name': 'python',
                'version': '3.9.0'
            }
        },
        'nbformat': 4,
        'nbformat_minor': 4
    }

def convert_py_to_notebook(src_path, dest_path):
    """Convert Python source to Jupyter notebook"""
    with open(src_path, 'r') as f:
        py_content = f.read()
    cells = parse_py_to_cells(py_content)
    notebook = create_notebook(cells)
    with open(dest_path, 'w') as f:
        json.dump(notebook, f, indent=1)
    print(f"✓ Converted {src_path} → {dest_path}")

if __name__ == '__main__':
    src_dir = Path('colabs/src')
    dest_dir = Path('colabs/notebooks')
    for py_file in src_dir.glob('ch*.py'):
        nb_file = dest_dir / py_file.with_suffix('.ipynb').name
        convert_py_to_notebook(py_file, nb_file)
```
## Workflow for Content Updates
### 1. Edit Python Source
```bash
# Edit the source file
code colabs/src/ch01_ai_triangle.py
```
### 2. Test Locally (Optional)
```bash
# Run the Python file directly to test logic
python3 colabs/src/ch01_ai_triangle.py
# Or use Jupyter to test the notebook
jupytext --to notebook --execute colabs/src/ch01_ai_triangle.py
```
### 3. Convert to Notebook
```bash
# Convert single file
jupytext --to notebook colabs/src/ch01_ai_triangle.py \
--output colabs/notebooks/ch01_ai_triangle.ipynb
# Or convert all
jupytext --to notebook colabs/src/ch*.py --output-dir colabs/notebooks/
```
### 4. Version Control
```bash
# Only commit the .py source files
git add colabs/src/ch01_ai_triangle.py
# Optionally commit generated notebooks (for student access)
git add colabs/notebooks/ch01_ai_triangle.ipynb
# Commit
git commit -m "feat: add AI Triangle interactive colab"
```
## Best Practices
### 1. Keep Utilities Separate
Extract reusable code to `colabs/src/utils/`:
```python
# In colab source: colabs/src/ch01_ai_triangle.py
from utils.ai_triangle_sim import AITriangleSimulator
# Students won't see utils/ - it's packaged differently
```
### 2. Clear Cell Boundaries
Use clear comments and spacing:
```python
# %% [markdown]
# ## Section Title
#
# Description of what we're doing.
# %%
# Code implementing the concept
x = compute_something()
x.plot()
# %% [markdown]
# Explanation of results
```
### 3. Test Before Converting
Run Python file directly to catch syntax errors:
```bash
python3 colabs/src/ch01_ai_triangle.py
```
### 4. Use Descriptive Cell Comments
```python
# %% [markdown]
# ### Your Turn: Open-Ended Exploration
#
# Try different configurations...
# %%
# Student experimentation cell
my_system = AITriangleSimulator(...)
```
## Automation (Future)
### Pre-commit Hook
Create `.git/hooks/pre-commit`:
```bash
#!/bin/bash
# Auto-convert modified Python colabs to notebooks
changed_files=$(git diff --cached --name-only | grep 'colabs/src/ch.*\.py')
for src_file in $changed_files; do
    nb_file="colabs/notebooks/$(basename $src_file .py).ipynb"
    jupytext --to notebook "$src_file" --output "$nb_file"
    git add "$nb_file"
done
```
### CI/CD Pipeline
```yaml
# .github/workflows/convert-colabs.yml
name: Convert Colabs
on:
  push:
    paths:
      - 'colabs/src/*.py'
jobs:
  convert:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - run: pip install jupytext
      - run: jupytext --to notebook colabs/src/ch*.py --output-dir colabs/notebooks/
      - uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: "chore: auto-convert colabs to notebooks"
```
## FAQ
**Q: Should notebooks be version controlled?**
A: Yes, commit both `.py` (source of truth) and `.ipynb` (student distribution). Git will track meaningful changes in `.py`, notebooks are for convenience.
**Q: What about outputs in notebooks?**
A: Clear outputs before committing. Students run fresh notebooks.
**Q: Can students edit notebooks directly?**
A: Yes! Students work with `.ipynb` files. Our workflow is for textbook authors only.
**Q: How do we handle imports from `utils/`?**
A: For distribution, either:
1. Inline the utility code in generated notebooks (see the sketch below)
2. Distribute utilities as a package (`pip install lens-mlsys`)
3. Include utility cells at top of notebook
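For option 1, a minimal sketch of how inlining could work, assuming the `nbformat` package is available and that the notebook's `from utils...` import cell is removed or edited separately; the paths and insertion position are illustrative.

```python
import nbformat
from pathlib import Path

def inline_utils(nb_path, utils_path, position=1):
    """Insert a utility module's source as a code cell near the top of a notebook."""
    nb = nbformat.read(nb_path, as_version=4)
    utils_source = Path(utils_path).read_text()
    nb.cells.insert(position, nbformat.v4.new_code_cell(source=utils_source))
    nbformat.write(nb, nb_path)

inline_utils("colabs/notebooks/ch01_ai_triangle.ipynb",
             "colabs/src/utils/ai_triangle_sim.py")
```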
## Current Status
- ✅ Python source format established (percent format)
- ✅ Directory structure created (`src/`, `notebooks/`, `utils/`)
- ✅ First colab converted: `ch01_ai_triangle.py` → `ch01_ai_triangle.ipynb`
- ✅ Utility extracted: `AITriangleSimulator` → `utils/ai_triangle_sim.py`
- ⏳ Conversion script (manual for now, Jupytext recommended)
- ⏳ Pre-commit hook (future automation)
- ⏳ CI/CD pipeline (future automation)
## Next Steps
1. Install Jupytext: `pip install jupytext`
2. Test conversion: Convert ch01 and upload to Google Colab
3. Refine format based on student testing
4. Document package distribution strategy (Lens toolkit)
5. Create remaining chapter colabs in `.py` format

colabs/README.md

@@ -1,133 +1,140 @@
# MLSysBook Interactive Colabs
# Lens: Interactive ML Systems Analysis Toolkit
This directory contains interactive Google Colab notebooks that complement the MLSysBook textbook, providing hands-on demonstrations of key concepts at strategic learning junctions.
**Tagline**: See ML systems trade-offs through a new lens
## Overview
This directory contains the design documentation and planning for **Lens**, the pedagogical framework for interactive ML systems exploration used throughout the textbook.
Each Colab is designed as a **"Concept Bridge"** that:
- Illuminates a specific concept with minimal, runnable code
- Shows immediate results connecting theory to observable behavior
- Complements (not duplicates) TinyTorch hands-on implementation
- Completes in 5-10 minutes to maintain reading flow
## What is Lens?
Lens is a lightweight analytical modeling toolkit that:
- Provides simple analytical models for hardware/network/deployment trade-offs
- Wraps existing battle-tested tools (Flower, Rooflini, electricityMap)
- Offers unified API across all chapters and deployment paradigms
- Builds progressively from simple (Ch02) to complex (Ch18)
- Focuses on systems thinking and trade-off exploration
**Not a simulator**: Lens uses analytical models (fast, pedagogically focused) rather than cycle-accurate simulation.
## Key Documentation
### Current Planning
- **[NAMING_DISCUSSION.md](NAMING_DISCUSSION.md)** - Why "Lens" instead of "MLS Simulator"
- **[MLS_SIMULATOR_BUILD_VS_LEVERAGE.md](MLS_SIMULATOR_BUILD_VS_LEVERAGE.md)** - Build vs leverage analysis, hybrid approach recommendation
- **[PYTHON_TO_NOTEBOOK_WORKFLOW.md](PYTHON_TO_NOTEBOOK_WORKFLOW.md)** - Python source to Jupyter notebook conversion workflow
- **[CHAPTER_MAPPING.md](CHAPTER_MAPPING.md)** - Progressive Lens module availability across chapters
### Archive
- **[archive/](archive/)** - Previous planning iterations (v1-v3 master plans, original vision)
## Framework Architecture
### Hybrid Approach
**Build analytical models for**:
- Hardware performance (cloud/edge/mobile/TinyML)
- Network tier simulation (latency, bandwidth, cost)
- Drift models and lifecycle management
- Reliability (SDC, fault injection)
**Wrap existing tools for**:
- Roofline analysis → Rooflini
- Federated learning → Flower
- Carbon modeling → electricityMap API
- Adversarial attacks → CleverHans/ART
**Inspired by academic tools**:
- MAESTRO (analytical dataflow cost model)
- Timeloop (accelerator modeling)
- SCALE-Sim (systolic array concepts)
## Lens Colabs Structure
Each Lens colab follows the **OERC framework**:
1. **Observe**: Present a systems scenario/trade-off
2. **Explore**: Interactive exploration with Lens toolkit
3. **Reason**: Guided analysis and critical thinking
4. **Connect**: Link to production systems and research
**Duration**: 20-30 minutes per colab
## Progressive Complexity
- **Ch02**: Simple deployment paradigm comparison (cloud/edge/mobile/TinyML)
- **Ch11**: Roofline model introduction (Lens becomes central analysis tool)
- **Ch14**: Federated learning with Flower integration
- **Ch18**: Multi-dimensional carbon-aware scheduling
## Development Roadmap
### Phase 1: Core Analytical Models (Weeks 1-6)
- Hardware performance models
- Network tier simulation
- Basic workload characterization
### Phase 2: Tool Integration (Weeks 7-11)
- Roofline wrapper (Rooflini)
- Federated wrapper (Flower)
- Carbon API integration
### Phase 3: Pilot Colabs (Weeks 12-14)
- Ch02: Deployment paradigms
- Ch11: Roofline analysis
- Ch14: Federated learning
### Phase 4: Validation (Weeks 15-16)
- Student testing
- Accuracy validation (±20% target)
- Documentation
## Directory Structure
```
colabs/
├── README.md                          # This file
├── docs/                              # Documentation and specifications
│   ├── COLAB_INTEGRATION_PLAN.md      # Complete specifications
│   ├── COLAB_PLACEMENT_MATRIX.md      # Quick reference table
│   ├── COLAB_CHAPTER_OUTLINE.md       # Chapter-by-chapter placement
│   ├── COLAB_TEMPLATE_SPECIFICATION.md # Template standards
│   ├── COLAB_TEMPLATE_EXAMPLE.md      # Working example
│   └── COLAB_STANDARDS_SUMMARY.md     # Quick checklist
├── ch03_dl_primer/                    # Chapter 3 Colabs
├── ch06_data_engineering/             # Chapter 6 Colabs
├── ch08_training/                     # Chapter 8 Colabs
├── ch10_optimizations/                # Chapter 10 Colabs
│   └── quantization_demo.ipynb        # ✓ Quantization demonstration
└── [other chapters...]
├── src/                               # Python source files (version controlled)
│   ├── ch01_ai_triangle.py            # Chapter 1: AI Triangle colab
│   ├── ch02_deployment.py             # Chapter 2: Deployment paradigms
│   └── utils/                         # Reusable utilities
│       ├── ai_triangle_sim.py         # AI Triangle simulator
│       └── visualization.py           # Common plotting functions
├── notebooks/                         # Generated Jupyter notebooks (for students)
│   ├── ch01_ai_triangle.ipynb         # Ready for Google Colab
│   ├── ch02_deployment.ipynb
│   └── ...
├── docs/                              # Planning and design documentation
│   ├── NAMING_DISCUSSION.md
│   ├── PYTHON_TO_NOTEBOOK_WORKFLOW.md
│   └── ...
└── README.md                          # This file
```
## Available Colabs
## Workflow for Authors
### Phase 1 (v0.5.0 MVP) - 5 Colabs
### Creating New Colabs
1. Write colab as Python file in `colabs/src/chXX_topic.py` using percent format
2. Use `# %% [markdown]` for text cells, `# %%` for code cells
3. Extract reusable code to `colabs/src/utils/`
4. Convert to notebook: `jupytext --to notebook src/chXX_topic.py --output notebooks/chXX_topic.ipynb`
5. Test in Google Colab
6. Commit both `.py` (source) and `.ipynb` (distribution)
| Chapter | Colab | Status | Link |
|---------|-------|--------|------|
| Ch 3: DL Primer | Gradient Descent Visualization | 🚧 Planned | - |
| Ch 6: Data Engineering | Data Quality Impact | 🚧 Planned | - |
| Ch 8: Training | Training Dynamics Explorer | 🚧 Planned | - |
| Ch 10: Optimizations | Quantization Demo | ✅ Complete | [Open in Colab](link) |
| Ch 11: Hardware Acceleration | CPU vs GPU vs TPU | 🚧 Planned | - |
See [PYTHON_TO_NOTEBOOK_WORKFLOW.md](PYTHON_TO_NOTEBOOK_WORKFLOW.md) for details.
### Phase 2 (v0.5.1) - 13 Colabs
Status: Planned
## Next Steps
### Phase 3 (v0.5.2) - 10 Colabs
Status: Planned
**Total Planned**: 28 Colabs across 18 chapters
## Using These Colabs
### For Readers
1. **Read the textbook section first** - Colabs complement, not replace, textbook content
2. **Click "Open in Colab"** - Launches notebook in Google Colab (free account sufficient)
3. **Follow the notebook** - Execute cells sequentially
4. **Experiment** - Modify parameters and explore
5. **Connect back to theory** - Review textbook with new insights
### For Contributors
1. **Read the documentation** - Start with `docs/COLAB_STANDARDS_SUMMARY.md`
2. **Follow the template** - Use `docs/COLAB_TEMPLATE_SPECIFICATION.md`
3. **Review examples** - See `docs/COLAB_TEMPLATE_EXAMPLE.md`
4. **Test thoroughly** - Must run in < 10 minutes on Colab Free Tier
5. **Submit PR** - Follow contribution guidelines
## Development Standards
Every MLSysBook Colab must:
- Have ONE clear learning objective
- Complete in < 10 minutes on Colab Free Tier
- Connect explicitly to textbook section
- Include quantitative results
- Follow MLSysBook visual standards
- Be reproducible (seeds set)
- Include MLSysBook branding
See `docs/COLAB_STANDARDS_SUMMARY.md` for complete checklist.
## Quick Start for Development
```bash
# 1. Review template
cat colabs/docs/COLAB_TEMPLATE_SPECIFICATION.md
# 2. Copy and rename template (when available)
cp colabs/TEMPLATE.ipynb colabs/ch##_chapter/notebook_name.ipynb
# 3. Follow the 10-section structure
# 4. Test on Colab Free Tier
# 5. Run pre-publication checklist
# 6. Submit for review
```
## Integration with Textbook
Colabs are referenced in the textbook using special callout blocks:
```markdown
::: {.callout-colab}
## Interactive Exercise: Quantization in Action
Experience INT8 quantization reducing model size and latency.
**Learning Objective**: Understand quantization trade-offs
**Estimated Time**: 6-8 minutes
[![Open In Colab](badge)](link-to-colab)
:::
```
## Support and Feedback
- **Documentation Issues**: [GitHub Issues](https://github.com/harvard-edge/cs249r_book/issues)
- **Questions**: [GitHub Discussions](https://github.com/harvard-edge/cs249r_book/discussions)
- **Book Website**: https://mlsysbook.ai
## License
All Colabs are licensed under **CC BY-NC-SA 4.0** (same as MLSysBook).
1. **Created lens-colab-designer agent** - Expert at designing OERC-structured pedagogical notebooks
2. **Created colab-writer agent** - Transforms designs into executable notebooks
3. **First colab implemented** - Ch01 AI Triangle (Python source + notebook)
4. **Test Ch01 colab** - Upload to Google Colab and validate student experience
5. **Redesign remaining chapters** - Using Lens framework and hybrid tool approach
6. **Prototype Phase 1** - Core analytical models for Ch02 proof-of-concept
7. **Implement Lens package** - `pip install lens-mlsys` or embedded in Colab
---
**Status**: Phase 1 Development (1/5 Colabs complete)
**Last Updated**: November 5, 2025
**Maintainer**: MLSysBook Team
**Total Estimated Colabs**: 45-50 across 20 chapters (consolidated from original 98)
**Package name**: `lens` (or `lens-mlsys` if PyPI conflict)
**Import style**: `from lens import hardware, roofline, federated, carbon`

colabs/notebooks/ch01_ai_triangle.ipynb

@@ -0,0 +1,522 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The AI Triangle: Why You Can't Optimize in Isolation\n",
"\n",
"**Machine Learning Systems: Engineering Intelligence at Scale** \n",
"_Chapter 1: Introduction - Understanding ML as a Systems Discipline_\n",
"\n",
"---\n",
"\n",
"In Chapter 1, you learned that ML systems consist of three tightly coupled components: **models** (algorithms that learn patterns), **data** (examples that guide learning), and **infrastructure** (compute that enables training and inference). The AI Triangle framework shows these aren't independent—they shape each other's possibilities.\n",
"\n",
"Now you'll experience this interdependence hands-on. You'll attempt to improve a medical imaging classifier from 80% to 90% accuracy. As you try different approaches, you'll discover why optimizing one component creates bottlenecks in the others.\n",
"\n",
"**Why this matters**: This pattern repeats across every ML deployment. Google's diabetic retinopathy detector required coordinated scaling of all three components over 3+ years. Tesla's Autopilot balances model sophistication against real-time compute constraints. Understanding these trade-offs is the essence of ML systems engineering."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Run this cell to load the AI Triangle simulator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from IPython.display import display, clear_output\n",
"\n",
"plt.style.use('seaborn-v0_8-whitegrid')\n",
"%matplotlib inline\n",
"\n",
"print(\"✓ Setup complete\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class AITriangleSimulator:\n",
" \"\"\"Simplified simulator showing model-data-compute interdependencies\"\"\"\n",
" \n",
" MODELS = {\n",
" 'Small CNN': 5,\n",
" 'ResNet-50': 25,\n",
" 'ResNet-101': 45,\n",
" 'ResNet-152': 60,\n",
" 'Large Model': 100\n",
" }\n",
" \n",
" def __init__(self, model_name='ResNet-50', dataset_size=10000, num_gpus=1):\n",
" self.model_name = model_name\n",
" self.model_params = self.MODELS[model_name]\n",
" self.dataset_size = dataset_size\n",
" self.num_gpus = num_gpus\n",
" \n",
" def estimate_accuracy(self):\n",
" model_factor = np.log10(self.model_params + 1) * 10\n",
" data_factor = np.log10(self.dataset_size / 1000 + 1) * 15\n",
" compute_factor = np.log10(self.num_gpus + 1) * 5\n",
" \n",
" data_per_param = self.dataset_size / (self.model_params * 1000)\n",
" \n",
" if data_per_param < 10:\n",
" bottleneck = \"⚠️ DATA BOTTLENECK: Not enough data for model size (overfitting risk)\"\n",
" accuracy = min(95, 60 + data_factor)\n",
" elif self.num_gpus < np.log10(self.dataset_size):\n",
" bottleneck = \"⚠️ COMPUTE BOTTLENECK: Training will be very slow or incomplete\"\n",
" accuracy = min(95, 65 + model_factor + compute_factor)\n",
" else:\n",
" bottleneck = \"✓ Balanced system\"\n",
" accuracy = min(95, 70 + model_factor + data_factor + compute_factor)\n",
" \n",
" return round(accuracy, 1), bottleneck\n",
" \n",
" def estimate_costs(self):\n",
" baseline_data = 10000\n",
" new_data = max(0, self.dataset_size - baseline_data)\n",
" data_cost = new_data * 3\n",
" data_collection_months = new_data / 3000\n",
" \n",
" base_hours = (self.model_params * self.dataset_size) / (self.num_gpus * 1000)\n",
" training_hours = max(0.5, base_hours / 100)\n",
" compute_cost_per_run = training_hours * self.num_gpus * 50\n",
" memory_gb = self.model_params * 0.2\n",
" \n",
" return {\n",
" 'data_collection_cost': round(data_cost),\n",
" 'data_collection_months': round(data_collection_months, 1),\n",
" 'compute_cost_per_run': round(compute_cost_per_run),\n",
" 'training_hours': round(training_hours, 1),\n",
" 'memory_gb': round(memory_gb, 1)\n",
" }\n",
" \n",
" def display_status(self):\n",
" accuracy, bottleneck = self.estimate_accuracy()\n",
" costs = self.estimate_costs()\n",
" \n",
" print(\"=\" * 70)\n",
" print(\"AI TRIANGLE - CURRENT SYSTEM\")\n",
" print(\"=\" * 70)\n",
" print(f\"\\nAccuracy: {accuracy}%\")\n",
" print(f\"Status: {bottleneck}\")\n",
" print(f\"\\nModel: {self.model_name} ({self.model_params}M parameters)\")\n",
" print(f\"Dataset: {self.dataset_size:,} images\")\n",
" print(f\"Compute: {self.num_gpus} GPU(s)\")\n",
" print(f\"\\nData Cost: ${costs['data_collection_cost']:,} over {costs['data_collection_months']} months\")\n",
" print(f\"Training: {costs['training_hours']}h at ${costs['compute_cost_per_run']:,}/run\")\n",
" print(f\"Memory: {costs['memory_gb']} GB\")\n",
" print(\"=\" * 70)\n",
" \n",
" return accuracy, costs\n",
"\n",
"print(\"✓ AI Triangle simulator loaded\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## The Challenge\n",
"\n",
"You're an ML engineer at a medical imaging startup. Your team has built an AI system to detect diseases from chest X-rays.\n",
"\n",
"**Current system:**\n",
"- Model: ResNet-50 (25M parameters)\n",
"- Data: 10,000 labeled medical images (6 months to collect)\n",
"- Compute: 1 GPU\n",
"- **Accuracy: 80%**\n",
"\n",
"Your boss: *\"We need 90% accuracy for FDA approval. Fix it.\"*\n",
"\n",
"**The question**: How do you improve accuracy? Should you use a bigger model? Collect more data? Add more GPUs? All three?\n",
"\n",
"Let's find out what happens when you try each approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Baseline System"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"system = AITriangleSimulator(\n",
" model_name='ResNet-50',\n",
" dataset_size=10000,\n",
" num_gpus=1\n",
")\n",
"\n",
"print(\"BASELINE:\")\n",
"baseline_accuracy, baseline_costs = system.display_status()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Attempt 1: Use a Bigger Model\n",
"\n",
"**Intuition**: More parameters should mean better accuracy.\n",
"\n",
"Let's upgrade from ResNet-50 (25M params) to ResNet-152 (60M params)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"system_v2 = AITriangleSimulator(\n",
" model_name='ResNet-152', # Bigger!\n",
" dataset_size=10000, # Same\n",
" num_gpus=1 # Same\n",
")\n",
"\n",
"print(\"ATTEMPT 1: Bigger Model\")\n",
"accuracy_v2, costs_v2 = system_v2.display_status()\n",
"\n",
"print(f\"\\nChange: {baseline_accuracy}% → {accuracy_v2}% ({accuracy_v2 - baseline_accuracy:+.1f}%)\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What happened?** The bigger model hit a data bottleneck—60M parameters need more than 10K examples to avoid overfitting."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Attempt 2: Collect More Data\n",
"\n",
"**Intuition**: The bigger model needs more data. Let's collect 25,000 images (2.5x more)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"system_v3 = AITriangleSimulator(\n",
" model_name='ResNet-152',\n",
" dataset_size=25000, # More data!\n",
" num_gpus=1\n",
")\n",
"\n",
"print(\"ATTEMPT 2: Bigger Model + More Data\")\n",
"accuracy_v3, costs_v3 = system_v3.display_status()\n",
"\n",
"print(f\"\\nChange: {baseline_accuracy}% → {accuracy_v3}% ({accuracy_v3 - baseline_accuracy:+.1f}%)\")\n",
"print(f\"Data cost: ${costs_v3['data_collection_cost']:,} over {costs_v3['data_collection_months']} months\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What happened?** Accuracy improved, but now we hit a compute bottleneck—training 60M parameters on 25K images with 1 GPU takes too long."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Attempt 3: Add More Compute\n",
"\n",
"**Intuition**: Use 8 GPUs to speed up training and enable better hyperparameter search."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"system_v4 = AITriangleSimulator(\n",
" model_name='ResNet-152',\n",
" dataset_size=25000,\n",
" num_gpus=8 # More compute!\n",
")\n",
"\n",
"print(\"ATTEMPT 3: All Three Components Scaled\")\n",
"accuracy_v4, costs_v4 = system_v4.display_status()\n",
"\n",
"print(f\"\\nFinal change: {baseline_accuracy}% → {accuracy_v4}% ({accuracy_v4 - baseline_accuracy:+.1f}%)\")\n",
"print(f\"\\nTotal investment:\")\n",
"print(f\" Data: ${costs_v4['data_collection_cost']:,} over {costs_v4['data_collection_months']} months\")\n",
"print(f\" Compute: ${costs_v4['compute_cost_per_run']:,} per training run\")\n",
"print(f\" Time: {costs_v4['training_hours']}h training\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Key insight**: To improve from ~80% to ~90%, we had to change **all three components**. We couldn't just:\n",
"- Make the model bigger (hit data bottleneck)\n",
"- Add more data (hit compute bottleneck) \n",
"- Add more compute alone (didn't help without model + data)\n",
"\n",
"**This is the AI Triangle in action.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Your Turn: Explore the Trade-offs\n",
"\n",
"Try to find:\n",
"1. The **cheapest** way to get above 85% accuracy\n",
"2. The **fastest** way to get above 88% accuracy\n",
"3. What happens if you max out the model but keep minimal data/compute\n",
"\n",
"**Modify the parameters below and run:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# YOUR EXPERIMENT\n",
"# Models: 'Small CNN', 'ResNet-50', 'ResNet-101', 'ResNet-152', 'Large Model'\n",
"\n",
"my_system = AITriangleSimulator(\n",
" model_name='ResNet-50', # Change this\n",
" dataset_size=10000, # Change this (5000-50000)\n",
" num_gpus=1 # Change this (1-16)\n",
")\n",
"\n",
"my_accuracy, my_costs = my_system.display_status()\n",
"\n",
"if my_accuracy >= 90:\n",
" print(\"\\n🎉 SUCCESS! 90%+ accuracy\")\n",
"elif my_accuracy >= 85:\n",
" print(\"\\n✓ Good progress (85%+)\")\n",
"else:\n",
" print(\"\\nKeep experimenting!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Visualizing the Interdependencies"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))\n",
"\n",
"# Left: Model size vs accuracy for different data amounts\n",
"model_sizes = [5, 25, 45, 60, 100]\n",
"for data_size in [5000, 10000, 25000, 50000]:\n",
" accuracies = []\n",
" for model_name, params in AITriangleSimulator.MODELS.items():\n",
" if params in model_sizes:\n",
" sim = AITriangleSimulator(model_name, data_size, num_gpus=4)\n",
" acc, _ = sim.estimate_accuracy()\n",
" accuracies.append(acc)\n",
" ax1.plot(model_sizes, accuracies, marker='o', label=f'{data_size:,} images')\n",
"\n",
"ax1.axhline(y=90, color='r', linestyle='--', label='Goal: 90%')\n",
"ax1.set_xlabel('Model Size (Million Parameters)')\n",
"ax1.set_ylabel('Accuracy (%)')\n",
"ax1.set_title('Bigger Models Need More Data')\n",
"ax1.legend()\n",
"ax1.grid(True, alpha=0.3)\n",
"\n",
"# Right: Compute vs training time\n",
"gpu_counts = [1, 2, 4, 8, 16]\n",
"training_times = []\n",
"for gpus in gpu_counts:\n",
" sim = AITriangleSimulator('ResNet-152', 25000, gpus)\n",
" costs = sim.estimate_costs()\n",
" training_times.append(costs['training_hours'])\n",
"\n",
"ax2.plot(gpu_counts, training_times, marker='s', color='green', linewidth=2)\n",
"ax2.set_xlabel('Number of GPUs')\n",
"ax2.set_ylabel('Training Time (hours)')\n",
"ax2.set_title('More Compute Reduces Training Time')\n",
"ax2.grid(True, alpha=0.3)\n",
"\n",
"plt.tight_layout()\n",
"plt.show()\n",
"\n",
"print(\"Key insight: All three components must scale together\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Understanding the Interdependencies\n",
"\n",
"**Why does a bigger model need more data?**\n",
"\n",
"A model with 60M parameters has millions of \"knobs to tune.\" With only 10K training examples, the model memorizes the training set (overfitting) rather than learning generalizable patterns. You need roughly 10+ examples per 1000 parameters to avoid this.\n",
"\n",
"**Why does more data need more compute?**\n",
"\n",
"Each training example must be processed through the model multiple times (epochs). If you 2.5x your dataset (10K → 25K images), training time increases proportionally unless you add more compute to parallelize.\n",
"\n",
"**What if you had unlimited budget?**\n",
"\n",
"Even with infinite money, you're constrained by:\n",
"- Time to collect and label data (months to years)\n",
"- Time to train models (physical limits)\n",
"- Available expert labelers (for medical images)\n",
"- Regulatory approval processes\n",
"\n",
"**This is why systems thinking matters**—you're always optimizing under constraints."
]
},
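{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small numeric check of the second point: holding the model and GPU count fixed, the simulator's estimated training time grows roughly in proportion to dataset size (a rough sketch built on `estimate_costs`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Rough sketch: training hours vs dataset size, model and GPU count held fixed.\n",
"# The simulator's cost model scales hours with dataset size (linearly, above a small floor).\n",
"for size in [10000, 25000, 50000]:\n",
"    hours = AITriangleSimulator('ResNet-152', size, num_gpus=4).estimate_costs()['training_hours']\n",
"    print(f'{size:,} images: ~{hours} hours per run')"
]
},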
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Real-World Examples\n",
"\n",
"**Google Health (Diabetic Retinopathy)**\n",
"- Model: Deep CNN with millions of parameters\n",
"- Data: 128,000 retinal images (years to collect)\n",
"- Compute: Weeks of training on TPU clusters\n",
"- Result: 94% sensitivity, but took 3+ years to deploy\n",
"\n",
"**Tesla Autopilot**\n",
"- Model: Billions of parameters\n",
"- Data: Millions of miles from fleet\n",
"- Compute: 10,000+ GPUs ($10M+/year)\n",
"- Trade-off: Massive infrastructure for continuous improvement\n",
"\n",
"**MobileNet (On-Device)**\n",
"- Strategy: Optimized for opposite constraints\n",
"- Model: Tiny (4M params vs 60M)\n",
"- Data: Works with smaller datasets\n",
"- Compute: Runs on phone CPUs\n",
"- Trade-off: Lower accuracy (~85%) but instant, anywhere\n",
"\n",
"**AlexNet (2012) - The Deep Learning Revolution**\n",
"\n",
"The breakthrough didn't come from a new algorithm—CNNs existed since the 1980s. It happened because all three components came together:\n",
"1. Algorithm: CNNs (already known)\n",
"2. Data: ImageNet (1.2M labeled images)\n",
"3. Compute: GPUs made training feasible (2 GPUs, 6 days)\n",
"\n",
"Result: Accuracy jumped from 74% to 85% on ImageNet."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## The Bitter Lesson\n",
"\n",
"Richard Sutton's observation from 70 years of AI research:\n",
"\n",
"> *\"General methods that leverage computation are ultimately the most effective, and by a large margin.\"*\n",
"\n",
"Why? Because scaling compute enables:\n",
"- Bigger models (more parameters for complex patterns)\n",
"- More data processing (train on larger datasets)\n",
"- Which together beat clever algorithms with limited resources\n",
"\n",
"This is why **systems engineering** (knowing how to scale compute + data + models together) matters more than algorithmic tricks. The rest of this book teaches you how."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Summary\n",
"\n",
"**Key takeaways:**\n",
"\n",
"1. **The AI Triangle shows interdependence** - You can't optimize models, data, or compute in isolation. Changes to one create bottlenecks in others.\n",
"\n",
"2. **Every system is a compromise** - Google can spend millions on all three. Startups choose carefully. Mobile apps work with tiny models. Context determines trade-offs.\n",
"\n",
"3. **Systems engineering matters most** - The Bitter Lesson: scaling compute + data beats clever algorithms. Knowing HOW to scale is ML systems engineering.\n",
"\n",
"4. **Constraints propagate** - Limited budget → smaller model → less data → lower accuracy. Every choice has ripple effects.\n",
"\n",
"**What's next:** Throughout this book, you'll learn how to navigate these trade-offs across deployment paradigms (Chapter 2), optimize each component (Chapters 3-10), scale systems (Chapters 11-14), and build robust, fair, sustainable ML systems (Chapters 15-20).\n",
"\n",
"Welcome to ML systems engineering."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,384 @@
# %% [markdown]
# # The AI Triangle: Why You Can't Optimize in Isolation
#
# **Machine Learning Systems: Engineering Intelligence at Scale**
# _Chapter 1: Introduction - Understanding ML as a Systems Discipline_
#
# ---
#
# In Chapter 1, you learned that ML systems consist of three tightly coupled components: **models** (algorithms that learn patterns), **data** (examples that guide learning), and **infrastructure** (compute that enables training and inference). The AI Triangle framework shows these aren't independent—they shape each other's possibilities.
#
# Now you'll experience this interdependence hands-on. You'll attempt to improve a medical imaging classifier from 80% to 90% accuracy. As you try different approaches, you'll discover why optimizing one component creates bottlenecks in the others.
#
# **Why this matters**: This pattern repeats across every ML deployment. Google's diabetic retinopathy detector required coordinated scaling of all three components over 3+ years. Tesla's Autopilot balances model sophistication against real-time compute constraints. Understanding these trade-offs is the essence of ML systems engineering.
# %% [markdown]
# ## Setup
#
# Run this cell to load the AI Triangle simulator.
# %%
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output
plt.style.use('seaborn-v0_8-whitegrid')
# %matplotlib inline
print("✓ Setup complete")
# %%
class AITriangleSimulator:
"""Simplified simulator showing model-data-compute interdependencies"""
MODELS = {
'Small CNN': 5,
'ResNet-50': 25,
'ResNet-101': 45,
'ResNet-152': 60,
'Large Model': 100
}
def __init__(self, model_name='ResNet-50', dataset_size=10000, num_gpus=1):
self.model_name = model_name
self.model_params = self.MODELS[model_name]
self.dataset_size = dataset_size
self.num_gpus = num_gpus
def estimate_accuracy(self):
model_factor = np.log10(self.model_params + 1) * 10
data_factor = np.log10(self.dataset_size / 1000 + 1) * 15
compute_factor = np.log10(self.num_gpus + 1) * 5
data_per_param = self.dataset_size / (self.model_params * 1000)
if data_per_param < 10:
bottleneck = "⚠️ DATA BOTTLENECK: Not enough data for model size (overfitting risk)"
accuracy = min(95, 60 + data_factor)
elif self.num_gpus < np.log10(self.dataset_size):
bottleneck = "⚠️ COMPUTE BOTTLENECK: Training will be very slow or incomplete"
accuracy = min(95, 65 + model_factor + compute_factor)
else:
bottleneck = "✓ Balanced system"
accuracy = min(95, 70 + model_factor + data_factor + compute_factor)
return round(accuracy, 1), bottleneck
def estimate_costs(self):
baseline_data = 10000
new_data = max(0, self.dataset_size - baseline_data)
data_cost = new_data * 3
data_collection_months = new_data / 3000
base_hours = (self.model_params * self.dataset_size) / (self.num_gpus * 1000)
training_hours = max(0.5, base_hours / 100)
compute_cost_per_run = training_hours * self.num_gpus * 50
memory_gb = self.model_params * 0.2
return {
'data_collection_cost': round(data_cost),
'data_collection_months': round(data_collection_months, 1),
'compute_cost_per_run': round(compute_cost_per_run),
'training_hours': round(training_hours, 1),
'memory_gb': round(memory_gb, 1)
}
def display_status(self):
accuracy, bottleneck = self.estimate_accuracy()
costs = self.estimate_costs()
print("=" * 70)
print("AI TRIANGLE - CURRENT SYSTEM")
print("=" * 70)
print(f"\nAccuracy: {accuracy}%")
print(f"Status: {bottleneck}")
print(f"\nModel: {self.model_name} ({self.model_params}M parameters)")
print(f"Dataset: {self.dataset_size:,} images")
print(f"Compute: {self.num_gpus} GPU(s)")
print(f"\nData Cost: ${costs['data_collection_cost']:,} over {costs['data_collection_months']} months")
print(f"Training: {costs['training_hours']}h at ${costs['compute_cost_per_run']:,}/run")
print(f"Memory: {costs['memory_gb']} GB")
print("=" * 70)
return accuracy, costs
print("✓ AI Triangle simulator loaded")
# %% [markdown]
# ---
#
# ## The Challenge
#
# You're an ML engineer at a medical imaging startup. Your team has built an AI system to detect diseases from chest X-rays.
#
# **Current system:**
# - Model: ResNet-50 (25M parameters)
# - Data: 10,000 labeled medical images (6 months to collect)
# - Compute: 1 GPU
# - **Accuracy: 80%**
#
# Your boss: *"We need 90% accuracy for FDA approval. Fix it."*
#
# **The question**: How do you improve accuracy? Should you use a bigger model? Collect more data? Add more GPUs? All three?
#
# Let's find out what happens when you try each approach.
# %% [markdown]
# ---
#
# ## Baseline System
# %%
system = AITriangleSimulator(
model_name='ResNet-50',
dataset_size=10000,
num_gpus=1
)
print("BASELINE:")
baseline_accuracy, baseline_costs = system.display_status()
# %% [markdown]
# ---
#
# ## Attempt 1: Use a Bigger Model
#
# **Intuition**: More parameters should mean better accuracy.
#
# Let's upgrade from ResNet-50 (25M params) to ResNet-152 (60M params).
# %%
system_v2 = AITriangleSimulator(
model_name='ResNet-152', # Bigger!
dataset_size=10000, # Same
num_gpus=1 # Same
)
print("ATTEMPT 1: Bigger Model")
accuracy_v2, costs_v2 = system_v2.display_status()
print(f"\nChange: {baseline_accuracy}% → {accuracy_v2}% ({accuracy_v2 - baseline_accuracy:+.1f}%)")
# %% [markdown]
# **What happened?** The bigger model hit a data bottleneck—60M parameters need more than 10K examples to avoid overfitting.
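# %% [markdown]
# A rough back-of-the-envelope check, using the same parameter counts as the simulator, makes this concrete: with the dataset fixed at 10,000 images, the bigger model spreads those examples across far more parameters.
# %%
# Rough sketch: examples per 1,000 parameters at a fixed 10,000-image dataset.
# model_params is in millions, so model_params * 1000 counts groups of 1,000 parameters.
n_images = 10_000
for name in ['ResNet-50', 'ResNet-152']:
    params_m = AITriangleSimulator.MODELS[name]
    ratio = n_images / (params_m * 1000)
    print(f"{name}: {ratio:.2f} examples per 1,000 parameters")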
# %% [markdown]
# ---
#
# ## Attempt 2: Collect More Data
#
# **Intuition**: The bigger model needs more data. Let's collect 25,000 images (2.5x more).
# %%
system_v3 = AITriangleSimulator(
model_name='ResNet-152',
dataset_size=25000, # More data!
num_gpus=1
)
print("ATTEMPT 2: Bigger Model + More Data")
accuracy_v3, costs_v3 = system_v3.display_status()
print(f"\nChange: {baseline_accuracy}% → {accuracy_v3}% ({accuracy_v3 - baseline_accuracy:+.1f}%)")
print(f"Data cost: ${costs_v3['data_collection_cost']:,} over {costs_v3['data_collection_months']} months")
# %% [markdown]
# **What happened?** Accuracy improved, but now we hit a compute bottleneck—training 60M parameters on 25K images with 1 GPU takes too long.
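# %% [markdown]
# How slow is one GPU here? The simulator's cost model (`estimate_costs`) gives a rough sense of the single-GPU training time and of what extra GPUs would buy.
# %%
# Rough sketch using the simulator's cost model: training hours for
# ResNet-152 on 25,000 images at a few GPU counts.
for gpus in [1, 4, 8]:
    hours = AITriangleSimulator('ResNet-152', 25000, gpus).estimate_costs()['training_hours']
    print(f"{gpus} GPU(s): ~{hours} hours per training run")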
# %% [markdown]
# ---
#
# ## Attempt 3: Add More Compute
#
# **Intuition**: Use 8 GPUs to speed up training and enable better hyperparameter search.
# %%
system_v4 = AITriangleSimulator(
model_name='ResNet-152',
dataset_size=25000,
num_gpus=8 # More compute!
)
print("ATTEMPT 3: All Three Components Scaled")
accuracy_v4, costs_v4 = system_v4.display_status()
print(f"\nFinal change: {baseline_accuracy}% → {accuracy_v4}% ({accuracy_v4 - baseline_accuracy:+.1f}%)")
print(f"\nTotal investment:")
print(f" Data: ${costs_v4['data_collection_cost']:,} over {costs_v4['data_collection_months']} months")
print(f" Compute: ${costs_v4['compute_cost_per_run']:,} per training run")
print(f" Time: {costs_v4['training_hours']}h training")
# %% [markdown]
# **Key insight**: To improve from ~80% to ~90%, we had to change **all three components**. We couldn't just:
# - Make the model bigger (hit data bottleneck)
# - Add more data (hit compute bottleneck)
# - Add more compute alone (didn't help without model + data)
#
# **This is the AI Triangle in action.**
# %% [markdown]
# ---
#
# ## Your Turn: Explore the Trade-offs
#
# Try to find:
# 1. The **cheapest** way to get above 85% accuracy
# 2. The **fastest** way to get above 88% accuracy
# 3. What happens if you max out the model but keep minimal data/compute
#
# **Modify the parameters below and run:**
# %%
# YOUR EXPERIMENT
# Models: 'Small CNN', 'ResNet-50', 'ResNet-101', 'ResNet-152', 'Large Model'
my_system = AITriangleSimulator(
model_name='ResNet-50', # Change this
dataset_size=10000, # Change this (5000-50000)
num_gpus=1 # Change this (1-16)
)
my_accuracy, my_costs = my_system.display_status()
if my_accuracy >= 90:
print("\n🎉 SUCCESS! 90%+ accuracy")
elif my_accuracy >= 85:
print("\n✓ Good progress (85%+)")
else:
print("\nKeep experimenting!")
# %% [markdown]
# ---
#
# ## Visualizing the Interdependencies
# %%
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Left: Model size vs accuracy for different data amounts
model_sizes = [5, 25, 45, 60, 100]
for data_size in [5000, 10000, 25000, 50000]:
accuracies = []
for model_name, params in AITriangleSimulator.MODELS.items():
if params in model_sizes:
sim = AITriangleSimulator(model_name, data_size, num_gpus=4)
acc, _ = sim.estimate_accuracy()
accuracies.append(acc)
ax1.plot(model_sizes, accuracies, marker='o', label=f'{data_size:,} images')
ax1.axhline(y=90, color='r', linestyle='--', label='Goal: 90%')
ax1.set_xlabel('Model Size (Million Parameters)')
ax1.set_ylabel('Accuracy (%)')
ax1.set_title('Bigger Models Need More Data')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Right: Compute vs training time
gpu_counts = [1, 2, 4, 8, 16]
training_times = []
for gpus in gpu_counts:
sim = AITriangleSimulator('ResNet-152', 25000, gpus)
costs = sim.estimate_costs()
training_times.append(costs['training_hours'])
ax2.plot(gpu_counts, training_times, marker='s', color='green', linewidth=2)
ax2.set_xlabel('Number of GPUs')
ax2.set_ylabel('Training Time (hours)')
ax2.set_title('More Compute Reduces Training Time')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Key insight: All three components must scale together")
# %% [markdown]
# ---
#
# ## Understanding the Interdependencies
#
# **Why does a bigger model need more data?**
#
# A model with 60M parameters has millions of "knobs to tune." With only 10K training examples, the model memorizes the training set (overfitting) rather than learning generalizable patterns. You need roughly 10+ examples per 1000 parameters to avoid this.
#
# **Why does more data need more compute?**
#
# Each training example must be processed through the model multiple times (epochs). If you 2.5x your dataset (10K → 25K images), training time increases proportionally unless you add more compute to parallelize.
#
# **What if you had unlimited budget?**
#
# Even with infinite money, you're constrained by:
# - Time to collect and label data (months to years)
# - Time to train models (physical limits)
# - Available expert labelers (for medical images)
# - Regulatory approval processes
#
# **This is why systems thinking matters**—you're always optimizing under constraints.
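# %% [markdown]
# A small numeric check of the second point: holding the model and GPU count fixed, the simulator's estimated training time grows roughly in proportion to dataset size (a rough sketch built on `estimate_costs`).
# %%
# Rough sketch: training hours vs dataset size, model and GPU count held fixed.
# The simulator's cost model scales hours with dataset size (linearly, above a small floor).
for size in [10000, 25000, 50000]:
    hours = AITriangleSimulator('ResNet-152', size, num_gpus=4).estimate_costs()['training_hours']
    print(f"{size:,} images: ~{hours} hours per run")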
# %% [markdown]
# ---
#
# ## Real-World Examples
#
# **Google Health (Diabetic Retinopathy)**
# - Model: Deep CNN with millions of parameters
# - Data: 128,000 retinal images (years to collect)
# - Compute: Weeks of training on TPU clusters
# - Result: 94% sensitivity, but took 3+ years to deploy
#
# **Tesla Autopilot**
# - Model: Billions of parameters
# - Data: Millions of miles from fleet
# - Compute: 10,000+ GPUs ($10M+/year)
# - Trade-off: Massive infrastructure for continuous improvement
#
# **MobileNet (On-Device)**
# - Strategy: Optimized for opposite constraints
# - Model: Tiny (4M params vs 60M)
# - Data: Works with smaller datasets
# - Compute: Runs on phone CPUs
# - Trade-off: Lower accuracy (~85%) but instant, anywhere
#
# **AlexNet (2012) - The Deep Learning Revolution**
#
# The breakthrough didn't come from a new algorithm; CNNs had been around since the 1980s. It happened because all three components came together:
# 1. Algorithm: CNNs (already known)
# 2. Data: ImageNet (1.2M labeled images)
# 3. Compute: GPUs made training feasible (2 GPUs, 6 days)
#
# Result: Top-5 accuracy on ImageNet jumped from about 74% to 85%.
# %% [markdown]
# ---
#
# ## The Bitter Lesson
#
# Richard Sutton's observation from 70 years of AI research:
#
# > *"General methods that leverage computation are ultimately the most effective, and by a large margin."*
#
# Why? Because scaling compute enables:
# - Bigger models (more parameters for complex patterns)
# - More data processing (train on larger datasets)
# - Which together beat clever algorithms with limited resources
#
# This is why **systems engineering** (knowing how to scale compute + data + models together) matters more than algorithmic tricks. The rest of this book teaches you how.
# %% [markdown]
# ---
#
# ## Summary
#
# **Key takeaways:**
#
# 1. **The AI Triangle shows interdependence** - You can't optimize models, data, or compute in isolation. Changes to one create bottlenecks in others.
#
# 2. **Every system is a compromise** - Google can spend millions on all three. Startups choose carefully. Mobile apps work with tiny models. Context determines trade-offs.
#
# 3. **Systems engineering matters most** - The Bitter Lesson: scaling compute + data beats clever algorithms. Knowing HOW to scale is ML systems engineering.
#
# 4. **Constraints propagate** - Limited budget → smaller model → less data → lower accuracy. Every choice has ripple effects.
#
# **What's next:** Throughout this book, you'll learn how to navigate these trade-offs across deployment paradigms (Chapter 2), optimize each component (Chapters 3-10), scale systems (Chapters 11-14), and build robust, fair, sustainable ML systems (Chapters 15-20).
#
# Welcome to ML systems engineering.

View File

@@ -0,0 +1,122 @@
"""
AI Triangle Simulator - Model-Data-Compute Interdependencies
This module provides a simplified analytical model demonstrating how model size,
dataset size, and compute resources interact to determine ML system performance.
Used in Chapter 1 to teach systems thinking fundamentals.
"""
import numpy as np
class AITriangleSimulator:
"""Simplified simulator showing model-data-compute interdependencies"""
MODELS = {
'Small CNN': 5,
'ResNet-50': 25,
'ResNet-101': 45,
'ResNet-152': 60,
'Large Model': 100
}
def __init__(self, model_name='ResNet-50', dataset_size=10000, num_gpus=1):
"""
Initialize AI Triangle simulator
Parameters:
-----------
model_name : str
Model architecture name (from MODELS dict)
dataset_size : int
Number of training examples
num_gpus : int
Number of GPUs available for training
"""
self.model_name = model_name
self.model_params = self.MODELS[model_name]
self.dataset_size = dataset_size
self.num_gpus = num_gpus
def estimate_accuracy(self):
"""
Estimate accuracy and identify bottlenecks
Returns:
--------
tuple: (accuracy, bottleneck_message)
"""
model_factor = np.log10(self.model_params + 1) * 10
data_factor = np.log10(self.dataset_size / 1000 + 1) * 15
compute_factor = np.log10(self.num_gpus + 1) * 5
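        # data_per_param = training examples per 1,000 parameters
        # (model_params is in millions, so model_params * 1000 counts 1,000-parameter groups)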
data_per_param = self.dataset_size / (self.model_params * 1000)
if data_per_param < 10:
bottleneck = "⚠️ DATA BOTTLENECK: Not enough data for model size (overfitting risk)"
accuracy = min(95, 60 + data_factor)
elif self.num_gpus < np.log10(self.dataset_size):
bottleneck = "⚠️ COMPUTE BOTTLENECK: Training will be very slow or incomplete"
accuracy = min(95, 65 + model_factor + compute_factor)
else:
bottleneck = "✓ Balanced system"
accuracy = min(95, 70 + model_factor + data_factor + compute_factor)
return round(accuracy, 1), bottleneck
def estimate_costs(self):
"""
Estimate data collection and compute costs
Returns:
--------
dict: Cost breakdown with keys:
- data_collection_cost (int): Dollar cost for data
- data_collection_months (float): Time to collect data
- compute_cost_per_run (int): Dollar cost per training run
- training_hours (float): Hours per training run
- memory_gb (float): GPU memory required
"""
baseline_data = 10000
new_data = max(0, self.dataset_size - baseline_data)
data_cost = new_data * 3
data_collection_months = new_data / 3000
base_hours = (self.model_params * self.dataset_size) / (self.num_gpus * 1000)
training_hours = max(0.5, base_hours / 100)
compute_cost_per_run = training_hours * self.num_gpus * 50
memory_gb = self.model_params * 0.2
return {
'data_collection_cost': round(data_cost),
'data_collection_months': round(data_collection_months, 1),
'compute_cost_per_run': round(compute_cost_per_run),
'training_hours': round(training_hours, 1),
'memory_gb': round(memory_gb, 1)
}
def display_status(self):
"""
Print current system configuration and performance
Returns:
--------
tuple: (accuracy, costs_dict)
"""
accuracy, bottleneck = self.estimate_accuracy()
costs = self.estimate_costs()
print("=" * 70)
print("AI TRIANGLE - CURRENT SYSTEM")
print("=" * 70)
print(f"\nAccuracy: {accuracy}%")
print(f"Status: {bottleneck}")
print(f"\nModel: {self.model_name} ({self.model_params}M parameters)")
print(f"Dataset: {self.dataset_size:,} images")
print(f"Compute: {self.num_gpus} GPU(s)")
print(f"\nData Cost: ${costs['data_collection_cost']:,} over {costs['data_collection_months']} months")
print(f"Training: {costs['training_hours']}h at ${costs['compute_cost_per_run']:,}/run")
print(f"Memory: {costs['memory_gb']} GB")
print("=" * 70)
return accuracy, costs
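
if __name__ == '__main__':
    # Minimal usage sketch (illustrative, mirrors the Chapter 1 colab): print the
    # simulator's status for the baseline and for one scaled-up configuration.
    baseline = AITriangleSimulator(model_name='ResNet-50', dataset_size=10000, num_gpus=1)
    baseline.display_status()
    scaled = AITriangleSimulator(model_name='ResNet-152', dataset_size=25000, num_gpus=8)
    scaled.display_status()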