Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-04-29 00:59:07 -05:00)
Merge feature/tinytorch-core: fix notebook filename convention in docs and Binder
@@ -18,6 +18,8 @@ echo "📓 Generating student notebooks from source..."
 for module_dir in src/*/; do
     module_name=$(basename "$module_dir")
     py_file="$module_dir/${module_name}.py"
+    # Strip numeric prefix for notebook name (e.g., "01_tensor" -> "tensor")
+    short_name="${module_name#*_}"

     if [ -f "$py_file" ]; then
         # Create output directory
@@ -25,7 +27,7 @@ for module_dir in src/*/; do

         # Convert .py to .ipynb using jupytext
         echo "  📝 Converting $module_name..."
-        jupytext --to notebook "$py_file" --output "modules/$module_name/${module_name}.ipynb" 2>/dev/null || {
+        jupytext --to notebook "$py_file" --output "modules/$module_name/${short_name}.ipynb" 2>/dev/null || {
             echo "  ⚠️  Warning: Could not convert $module_name"
         }
     fi
@@ -193,7 +193,7 @@ TinyTorch/
 │
 ├── modules/ # 📓 Generated notebooks (learners work here)
 │   ├── 01_tensor/ # Auto-generated from src/
-│   │   ├── 01_tensor.ipynb # Jupyter notebook for learning
+│   │   ├── tensor.ipynb # Jupyter notebook for learning
 │   │   ├── README.md # Practical implementation guide
 │   │   └── tensor.py # Your implementation
 │   └── ... # (20 module directories)
@@ -15,6 +15,8 @@ echo "📓 Generating student notebooks from source..."
 for module_dir in src/*/; do
     module_name=$(basename "$module_dir")
     py_file="$module_dir/${module_name}.py"
+    # Strip numeric prefix for notebook name (e.g., "01_tensor" -> "tensor")
+    short_name="${module_name#*_}"

     if [ -f "$py_file" ]; then
         # Create output directory
@@ -22,7 +24,7 @@ for module_dir in src/*/; do

         # Convert .py to .ipynb using jupytext
         echo "  📝 Converting $module_name..."
-        jupytext --to notebook "$py_file" --output "modules/$module_name/${module_name}.ipynb" 2>/dev/null || {
+        jupytext --to notebook "$py_file" --output "modules/$module_name/${short_name}.ipynb" 2>/dev/null || {
             echo "  ⚠️  Warning: Could not convert $module_name"
         }
     fi
@@ -66,7 +66,7 @@ clean:
 install:
	@echo "📦 Installing dependencies..."
	pip install -U pip
-	pip install "jupyter-book<1.0"
+	pip install "jupyter-book>=1.0.0,<2.0.0"
	pip install -r requirements.txt

 test:
@@ -13,10 +13,10 @@ description: >-
   Learn by implementing your own PyTorch-style framework with hands-on coding,
   real datasets, and production-ready practices.

-# Execution settings - disable for PDF
+# Execution settings - cache mode enables {glue} computed values in ABOUT.md files
 execute:
-  execute_notebooks: "off"
-  allow_errors: false
+  execute_notebooks: "cache"
+  allow_errors: true
   timeout: 300

 # Exclude patterns
@@ -57,8 +57,9 @@ sphinx:
    # --pdfFit scales PDF to fit the diagram (not full page)
    # --scale 1.0 keeps diagrams at natural size (1.5 was too large for tall diagrams)
    mermaid_output_format: "pdf"
-    # Width 800 constrains diagram width; scale must be integer (1 = natural size)
-    mermaid_params: ['--pdfFit', '--scale', '1', '--width', '800', '--backgroundColor', 'white']
+    # Width 600 constrains diagram viewport; scale 1 = natural size
+    # Smaller viewport + pdfcrop produces tighter diagrams that don't stretch to full page width
+    mermaid_params: ['--pdfFit', '--scale', '1', '--width', '600', '--backgroundColor', 'white']
    # Use pdfcrop to trim whitespace from mermaid PDFs
    mermaid_pdfcrop: "pdfcrop"
    # Use professional sans-serif font for mermaid diagrams to match document
@@ -91,6 +92,9 @@ sphinx:
      papersize: 'letterpaper'
      pointsize: '10pt'
      figure_align: 'H'
+      # Pass 'export' option to adjustbox before Sphinx loads it (avoids option clash).
+      # This enables max width/height keys in \includegraphics for mermaid figure capping.
+      passoptionstopackages: '\PassOptionsToPackage{export}{adjustbox}'
      fontpkg: |
        % Professional academic font stack (TeX Gyre - available in TeX Live)
        \usepackage{fontspec}
@@ -111,6 +115,27 @@ sphinx:
        \usepackage{hyperref}
        \usepackage{float}

+        % Cap Mermaid diagram width at 75% of text width.
+        % sphinxcontrib-mermaid hardcodes width=\linewidth for all diagrams,
+        % which stretches small flowcharts to full page width. This override
+        % intercepts \includegraphics and uses adjustbox's max width for
+        % mermaid-*.pdf files while passing other images through unchanged.
+        % Note: adjustbox 'export' option passed via passoptionstopackages above.
+        \let\OrigIncludeGraphics\includegraphics
+        \makeatletter
+        \renewcommand{\includegraphics}[2][]{%
+          \begingroup
+          \def\@mermaidtest{mermaid-}%
+          \@expandtwoargs\in@{\@mermaidtest}{#2}%
+          \ifin@
+            \OrigIncludeGraphics[max width=0.75\linewidth,max height=0.45\textheight,keepaspectratio]{#2}%
+          \else
+            \OrigIncludeGraphics[#1]{#2}%
+          \fi
+          \endgroup
+        }
+        \makeatother
+
        % Better figure placement - keep figures inline with text
        \renewcommand{\topfraction}{0.9}
        \renewcommand{\bottomfraction}{0.9}
@@ -104,10 +104,10 @@ This opens the module notebook and tracks your progress.

 ### Work in the notebook

-Edit `modules/01_tensor/01_tensor.ipynb` in Jupyter:
+Edit `modules/01_tensor/tensor.ipynb` in Jupyter:

 ```bash
-jupyter lab modules/01_tensor/01_tensor.ipynb
+jupyter lab modules/01_tensor/tensor.ipynb
 ```

 You'll implement:
@@ -409,11 +409,11 @@ src/ ← Developer source code

 modules/ ← Generated notebooks (students use)
 ├── 01_tensor/
-│   └── 01_tensor.ipynb ← AUTO-GENERATED for students
+│   └── tensor.ipynb ← AUTO-GENERATED for students
 ├── 02_activations/
-│   └── 02_activations.ipynb ← AUTO-GENERATED for students
+│   └── activations.ipynb ← AUTO-GENERATED for students
 └── 03_layers/
-    └── 03_layers.ipynb ← AUTO-GENERATED for students
+    └── layers.ipynb ← AUTO-GENERATED for students
 ```

 ### Where Code Exports
@@ -455,19 +455,19 @@ File → Save File (or Cmd/Ctrl + S)

 **Step 2: Check file permissions**:
 ```bash
-ls -la modules/01_tensor/01_tensor.ipynb
+ls -la modules/01_tensor/tensor.ipynb
 # Should be writable (not read-only)
 ```

 **Step 3: If read-only, fix permissions**:
 ```bash
-chmod u+w modules/01_tensor/01_tensor.ipynb
+chmod u+w modules/01_tensor/tensor.ipynb
 ```

 **Step 4: Verify changes saved**:
 ```bash
 # Check the notebook was updated
-ls -l modules/01_tensor/01_tensor.ipynb
+ls -l modules/01_tensor/tensor.ipynb
 ```

 </div>
@@ -1,3 +1,9 @@
+---
+file_format: mystnb
+kernelspec:
+  name: python3
+---
+
 # Module 01: Tensor

 :::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

 Run interactively in your browser.

-<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F01_tensor%2F01_tensor.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
+<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F01_tensor%2Ftensor.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
 ```

 ```{grid-item-card} 📄 View Source
@@ -502,7 +508,20 @@ The rules are simpler than they look. Compare shapes from right to left. At each
 | `(3, 4)` | `(3,)` | Error | ✗ (3 ≠ 4) |
 | `(2, 3, 4)` | `(3, 4)` | `(2, 3, 4)` | ✓ |

-The memory savings are dramatic. Adding a `(768,)` vector to a `(32, 512, 768)` tensor would require copying the vector 32×512 times without broadcasting, allocating 50 MB of redundant data (12.5 million float32 numbers). With broadcasting, you store just the original 3 KB vector.
+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Broadcasting memory comparison
+broadcast_full_elements = 32 * 512 * 768
+broadcast_full_bytes = broadcast_full_elements * 4
+broadcast_vec_bytes = 768 * 4
+glue("bcast_mb", f"{broadcast_full_bytes / 1024**2:.0f} MB")
+glue("bcast_elements", f"{broadcast_full_elements / 1e6:.1f} million")
+glue("bcast_vec_kb", f"{broadcast_vec_bytes / 1024:.0f} KB")
+```
+
+The memory savings are dramatic. Adding a `(768,)` vector to a `(32, 512, 768)` tensor would require copying the vector 32×512 times without broadcasting, allocating {glue:text}`bcast_mb` of redundant data ({glue:text}`bcast_elements` float32 numbers). With broadcasting, you store just the original {glue:text}`bcast_vec_kb` vector.
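
A quick NumPy check (an illustrative aside, not part of this commit) confirms that broadcasting reuses the vector's small buffer instead of materializing 32×512 copies:

```python
import numpy as np

x = np.zeros((32, 512, 768), dtype=np.float32)
v = np.zeros(768, dtype=np.float32)

# Broadcasting yields a view with stride 0 on the expanded axes — no copies.
bv = np.broadcast_to(v, x.shape)
print(bv.strides)            # (0, 0, 4): every row reuses the same 3 KB buffer
print(v.nbytes)              # 3072 bytes — all the memory the vector occupies
print(x.nbytes / 1024**2)    # 48.0 — the full tensor, in binary MB
```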

 ### Views vs. Copies
@@ -803,16 +822,45 @@ Broadcasting rules, shape semantics, and API design patterns. When you debug PyT

 ### Why Tensors Matter at Scale

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+
+# LLM parameter storage (fp16 = 2 bytes per param)
+llm_params = 175_000_000_000
+llm_bytes = llm_params * 2
+glue("llm_gb", f"{llm_bytes / 1024**3:.0f} GB")
+
+# Batch of images (float32)
+batch_128_bytes = 128 * 3 * 224 * 224 * 4
+glue("batch128_mb", f"{batch_128_bytes / 1024**2:.1f} MB")
+```
+
 To appreciate why tensor operations matter, consider the scale of modern ML systems:

-- **Large language models**: 175 billion numbers stored as tensors = **350 GB** (like storing 70,000 full-resolution photos)
-- **Image processing**: A batch of 128 images = **77 MB** of tensor data
+- **Large language models**: 175 billion numbers stored as tensors = **{glue:text}`llm_gb`** (like storing 70,000 full-resolution photos)
+- **Image processing**: A batch of 128 images = **{glue:text}`batch128_mb`** of tensor data
 - **Self-driving cars**: Process tensor operations at **36 FPS** across multiple cameras (each frame = millions of operations in 28 milliseconds)

 A single matrix multiplication can consume **90% of computation time** in neural networks. Understanding tensor operations isn't just academic; it's essential for building and debugging real ML systems.
 ## Check Your Understanding

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+
+# Q1: Batch memory
+q1_bytes = 32 * 3 * 224 * 224 * 4
+glue("q1_bytes", f"{q1_bytes:,}")
+glue("q1_mb", f"{q1_bytes / 1024**2:.1f} MB")
+
+# Q2: Broadcasting
+q2_full_bytes = 32 * 512 * 768 * 4
+q2_vec_bytes = 768 * 4
+glue("q2_full_mb", f"{q2_full_bytes / 1024**2:.1f} MB")
+glue("q2_vec_kb", f"{q2_vec_bytes / 1024:.0f} KB")
+glue("q2_savings_mb", f"~{q2_full_bytes / 1024**2:.0f} MB")
+```
+
 Test yourself with these systems thinking questions. They're designed to build intuition for the performance characteristics you'll encounter in production ML.

 **Q1: Memory Calculation**
@@ -822,7 +870,7 @@ A batch of 32 RGB images (224×224 pixels) stored as float32. How much memory?

 ```{admonition} Answer
 :class: dropdown

-32 × 3 × 224 × 224 × 4 = **19,267,584 bytes ≈ 19.3 MB**
+32 × 3 × 224 × 224 × 4 = **{glue:text}`q1_bytes` bytes ≈ {glue:text}`q1_mb`**

 This is why batch size matters - double the batch, double the memory!
 ```
@@ -834,11 +882,11 @@ Adding a vector `(768,)` to a 3D tensor `(32, 512, 768)`. How much memory does b

 ```{admonition} Answer
 :class: dropdown

-Without broadcasting: 32 × 512 × 768 × 4 = **50.3 MB**
+Without broadcasting: 32 × 512 × 768 × 4 = **{glue:text}`q2_full_mb`**

-With broadcasting: 768 × 4 = **3 KB**
+With broadcasting: 768 × 4 = **{glue:text}`q2_vec_kb`**

-Savings: **~50 MB per operation** - this adds up across hundreds of operations in a neural network!
+Savings: **{glue:text}`q2_savings_mb` per operation** - this adds up across hundreds of operations in a neural network!
 ```

 **Q3: Matmul Scaling**
@@ -1,3 +1,9 @@
+---
+file_format: mystnb
+kernelspec:
+  name: python3
+---
+
 # Module 02: Activations

 :::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

 Run interactively in your browser.

-<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F02_activations%2F02_activations.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
+<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F02_activations%2Factivations.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
 ```

 ```{grid-item-card} 📄 View Source
@@ -693,16 +699,47 @@ Let's walk through the key similarities and differences:
 Mathematical functions, numerical stability techniques (max subtraction in softmax), and the concept of element-wise transformations. When you debug PyTorch activation issues, you'll understand exactly what's happening because you implemented the same logic.
 ```

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Prose: "Why Activations Matter at Scale"
+prose_gelu_ops = 96 * 2
+glue("prose_gelu_ops", f"{prose_gelu_ops:,}")
+
+prose_daily_activations = 1000 * 86400
+glue("prose_daily_activations", f"{prose_daily_activations / 1e6:.0f} million")
+```
+
 ### Why Activations Matter at Scale

 To appreciate why activation choice matters, consider the scale of modern ML systems:

-- **Large language models**: GPT-3 has 96 transformer layers, each with 2 GELU activations. That's **192 GELU operations per forward pass** on billions of parameters.
+- **Large language models**: GPT-3 has 96 transformer layers, each with 2 GELU activations. That's **{glue:text}`prose_gelu_ops` GELU operations per forward pass** on billions of parameters.
 - **Image classification**: ResNet-50 has 49 convolutional layers, each followed by ReLU. Processing a batch of 256 images at 224×224 resolution means **12 billion ReLU operations** per batch.
-- **Production serving**: A model serving 1000 requests per second performs **86 million activation computations per day**. A 20% speedup from ReLU vs GELU saves hours of compute time.
+- **Production serving**: A model serving 1000 requests per second performs **{glue:text}`prose_daily_activations` activation computations per day**. A 20% speedup from ReLU vs GELU saves hours of compute time.

 Activation functions account for **5-15% of total training time** in typical networks (the rest is matrix multiplication). But in transformer models with many layers and small matrix sizes, activations can account for **20-30% of compute time**. This is why GELU vs ReLU is a real trade-off: slower computation but potentially better accuracy.
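
As a rough, machine-dependent illustration of that trade-off (an aside, not part of this commit), timing ReLU against the tanh-approximation GELU in NumPy:

```python
import time
import numpy as np

x = np.random.randn(1_000_000).astype(np.float32)

t0 = time.perf_counter()
for _ in range(100):
    relu = np.maximum(x, 0)            # one comparison per element
t1 = time.perf_counter()
for _ in range(100):
    # tanh-approximation GELU: several multiplies plus a tanh per element
    gelu = 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
t2 = time.perf_counter()
print(f"ReLU: {t1 - t0:.3f}s  GELU: {t2 - t1:.3f}s")
```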
+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Q1: Memory calculation
+q1_bytes = 32 * 4096 * 4
+glue("q1_bytes", f"{q1_bytes:,}")
+glue("q1_kb", f"{q1_bytes / 1024:.0f} KB")
+
+q1_100layer_kb = 100 * (q1_bytes / 1024)
+glue("q1_100layer_mb", f"{q1_100layer_kb / 1024:.0f} MB")
+
+# Q4: Sparsity analysis
+q4_total = 128 * 1024
+q4_zeros = q4_total // 2
+glue("q4_total", f"{q4_total:,}")
+glue("q4_zeros", f"≈ {q4_zeros:,}")
+```
+
 ## Check Your Understanding

 Test yourself with these systems thinking questions. They're designed to build intuition for how activations behave in real neural networks.
@@ -714,9 +751,9 @@ A batch of 32 samples passes through a hidden layer with 4096 neurons and ReLU a

 ```{admonition} Answer
 :class: dropdown

-32 × 4096 × 4 bytes = **524,288 bytes ≈ 512 KB**
+32 × 4096 × 4 bytes = **{glue:text}`q1_bytes` bytes ≈ {glue:text}`q1_kb`**

-This is the activation memory for ONE layer. A 100-layer network needs 50 MB just to store activations for one forward pass. This is why activation memory dominates training memory usage — activations must be cached for backpropagation.
+This is the activation memory for ONE layer. A 100-layer network needs {glue:text}`q1_100layer_mb` just to store activations for one forward pass. This is why activation memory dominates training memory usage — activations must be cached for backpropagation.
 ```

 **Q2: Computational Cost**
@@ -764,8 +801,8 @@ For a standard normal distribution N(0, 1), approximately **50% of values are ne

 ReLU zeros all negative values, so approximately **50% of outputs will be exactly zero**.

-Total elements: 128 × 1024 = 131,072
-Zeros: ≈ 65,536
+Total elements: 128 × 1024 = {glue:text}`q4_total`
+Zeros: {glue:text}`q4_zeros`

 This sparsity has major implications:
 - **Speed**: Multiplying by zero is free, so downstream computations can skip ~50% of operations
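
A small NumPy experiment (illustrative, not part of this commit) reproduces the ~50% sparsity figure for the batch above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 1024)).astype(np.float32)  # pre-activations ~ N(0, 1)
relu_out = np.maximum(x, 0)

total = relu_out.size                 # 131,072 elements
zeros = int((relu_out == 0).sum())    # ≈ 65,500 — roughly half the outputs
print(total, zeros, f"{zeros / total:.1%}")
```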
@@ -839,7 +876,7 @@ Implement Linear layers that combine your Tensor operations with your activation

 ```{tip} Interactive Options

-- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/02_activations/02_activations.ipynb)** - Run interactively in browser, no setup required
+- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/02_activations/activations.ipynb)** - Run interactively in browser, no setup required
 - **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/02_activations/02_activations.py)** - Browse the implementation code
 ```
@@ -1,3 +1,9 @@
+---
+file_format: mystnb
+kernelspec:
+  name: python3
+---
+
 # Module 03: Layers

 :::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

 Run interactively in your browser.

-<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F03_layers%2F03_layers.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
+<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F03_layers%2Flayers.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
 ```

 ```{grid-item-card} 📄 View Source
@@ -637,20 +643,56 @@ The forward pass chains computations, and `parameters()` collects all trainable

 Understanding the memory and computational costs of layers is essential for building efficient networks. Linear layers dominate both parameter memory and computation time in fully connected architectures.

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Parameter memory for Linear(784, 256)
+mem_weight_bytes = 784 * 256 * 4
+mem_weight_kb = mem_weight_bytes / 1024
+glue("mem_weight_bytes", f"{mem_weight_bytes:,}")
+glue("mem_weight_kb", f"{mem_weight_kb:.0f}")
+
+mem_bias_bytes = 256 * 4
+mem_bias_kb = mem_bias_bytes / 1024
+glue("mem_bias_bytes", f"{mem_bias_bytes:,}")
+glue("mem_bias_kb", f"{mem_bias_kb:.0f}")
+
+mem_total_kb = (mem_weight_bytes + mem_bias_bytes) / 1024
+glue("mem_total_kb", f"{mem_total_kb:.0f}")
+
+# Activation memory for batch=32
+mem_input_bytes = 32 * 784 * 4
+mem_input_kb = mem_input_bytes / 1024
+glue("mem_input_bytes", f"{mem_input_bytes:,}")
+glue("mem_input_kb", f"{mem_input_kb:.0f}")
+
+mem_output_bytes = 32 * 256 * 4
+mem_output_kb = mem_output_bytes / 1024
+glue("mem_output_bytes", f"{mem_output_bytes:,}")
+glue("mem_output_kb", f"{mem_output_kb:.0f}")
+
+# 3-layer FLOPs
+flops_l1 = 32 * 784 * 256
+flops_l2 = 32 * 256 * 128
+flops_l3 = 32 * 128 * 10
+flops_total = flops_l1 + flops_l2 + flops_l3
+glue("flops_l1", f"{flops_l1:,}")
+glue("flops_l2", f"{flops_l2:,}")
+glue("flops_l3", f"{flops_l3:,}")
+glue("flops_total", f"{flops_total / 1e6:.1f}")
+```
+
 Parameter memory for a Linear layer is straightforward: `in_features × out_features × 4 bytes` for weights, plus `out_features × 4 bytes` for bias (assuming float32). For Linear(784, 256):

-```
-Weights: 784 × 256 × 4 = 802,816 bytes ≈ 803 KB
-Bias: 256 × 4 = 1,024 bytes ≈ 1 KB
-Total: ≈ 804 KB
-```
+Weights: 784 × 256 × 4 = {glue:text}`mem_weight_bytes` bytes ≈ {glue:text}`mem_weight_kb` KB
+Bias: 256 × 4 = {glue:text}`mem_bias_bytes` bytes ≈ {glue:text}`mem_bias_kb` KB
+Total: ≈ {glue:text}`mem_total_kb` KB

 Activation memory depends on batch size. For batch size 32 and the same layer:

-```
-Input: 32 × 784 × 4 = 100,352 bytes ≈ 100 KB
-Output: 32 × 256 × 4 = 32,768 bytes ≈ 33 KB
-```
+Input: 32 × 784 × 4 = {glue:text}`mem_input_bytes` bytes ≈ {glue:text}`mem_input_kb` KB
+Output: 32 × 256 × 4 = {glue:text}`mem_output_bytes` bytes ≈ {glue:text}`mem_output_kb` KB

 The computational cost of the forward pass is dominated by matrix multiplication. For input shape `(batch, in_features)` and weight shape `(in_features, out_features)`, the operation requires `batch × in_features × out_features` multiplications and the same number of additions. Bias addition is just `batch × out_features` additions, negligible compared to matrix multiplication.
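
A minimal NumPy sketch (illustrative, not part of this commit) reproduces the parameter-memory and operation counts above directly from array sizes:

```python
import numpy as np

batch, n_in, n_out = 32, 784, 256
W = np.zeros((n_in, n_out), dtype=np.float32)
b = np.zeros(n_out, dtype=np.float32)

matmul_flops = batch * n_in * n_out   # multiplications (same count of additions)
bias_adds = batch * n_out
print(W.nbytes, b.nbytes)             # 802,816 and 1,024 bytes of parameters
print(matmul_flops, bias_adds)        # 6,422,528 vs 8,192 — matmul dominates
```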
@@ -662,12 +704,10 @@ The computational cost of the forward pass is dominated by matrix multiplication

 For a 3-layer network (784→256→128→10) with batch size 32:

-```
-Layer 1: 32 × 784 × 256 = 6,422,528 FLOPs
-Layer 2: 32 × 256 × 128 = 1,048,576 FLOPs
-Layer 3: 32 × 128 × 10 = 40,960 FLOPs
-Total: ≈ 7.5 million FLOPs per forward pass
-```
+Layer 1: 32 × 784 × 256 = {glue:text}`flops_l1` FLOPs
+Layer 2: 32 × 256 × 128 = {glue:text}`flops_l2` FLOPs
+Layer 3: 32 × 128 × 10 = {glue:text}`flops_l3` FLOPs
+Total: ≈ {glue:text}`flops_total` million FLOPs per forward pass

 The first layer dominates because it has the largest input dimension. This is why production networks often use dimension reduction early to save computation in later layers.
@@ -843,34 +883,85 @@ Test yourself with these systems thinking questions. They're designed to build i

 A Linear layer has `in_features=784` and `out_features=256`. How many parameters does it have? If you double `out_features` to 512, how many parameters now?

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Q1: Parameter Scaling
+q1_orig_params = 784 * 256 + 256
+q1_doubled_params = 784 * 512 + 512
+glue("q1_orig_params", f"{q1_orig_params:,}")
+glue("q1_doubled_params", f"{q1_doubled_params:,}")
+
+q1_orig_weights = 784 * 256
+q1_doubled_weights = 784 * 512
+glue("q1_orig_weights", f"{q1_orig_weights:,}")
+glue("q1_doubled_weights", f"{q1_doubled_weights:,}")
+
+q1_orig_bytes = q1_orig_params * 4
+q1_orig_kb = q1_orig_bytes / 1024
+glue("q1_orig_bytes", f"{q1_orig_bytes:,}")
+glue("q1_orig_kb", f"{q1_orig_kb:.0f}")
+
+q1_doubled_bytes = q1_doubled_params * 4
+q1_doubled_mb = q1_doubled_bytes / 1024**2
+glue("q1_doubled_bytes", f"{q1_doubled_bytes:,}")
+glue("q1_doubled_mb", f"{q1_doubled_mb:.2f}")
+```
+
 ```{admonition} Answer
 :class: dropdown

-**Original**: 784 × 256 + 256 = 200,960 parameters
+**Original**: 784 × 256 + 256 = {glue:text}`q1_orig_params` parameters

-**Doubled**: 784 × 512 + 512 = 401,920 parameters
+**Doubled**: 784 × 512 + 512 = {glue:text}`q1_doubled_params` parameters

-Doubling `out_features` approximately doubles the parameter count because weights dominate (200,704 vs 401,408 for weights alone). This shows parameter count scales linearly with layer width.
+Doubling `out_features` approximately doubles the parameter count because weights dominate ({glue:text}`q1_orig_weights` vs {glue:text}`q1_doubled_weights` for weights alone). This shows parameter count scales linearly with layer width.

-**Memory**: 200,960 × 4 = 803,840 bytes ≈ 804 KB (original) vs 401,920 × 4 = 1,607,680 bytes ≈ 1.6 MB (doubled)
+**Memory**: {glue:text}`q1_orig_params` × 4 = {glue:text}`q1_orig_bytes` bytes ≈ {glue:text}`q1_orig_kb` KB (original) vs {glue:text}`q1_doubled_params` × 4 = {glue:text}`q1_doubled_bytes` bytes ≈ {glue:text}`q1_doubled_mb` MB (doubled)
 ```

 **Q2: Multi-layer Memory**
 A 3-layer network has architecture 784→256→128→10. Calculate total parameter count and memory usage (assume float32).

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Q2: Multi-layer Memory
+q2_l1 = 784 * 256 + 256
+q2_l2 = 256 * 128 + 128
+q2_l3 = 128 * 10 + 10
+q2_total = q2_l1 + q2_l2 + q2_l3
+glue("q2_l1", f"{q2_l1:,}")
+glue("q2_l2", f"{q2_l2:,}")
+glue("q2_l3", f"{q2_l3:,}")
+glue("q2_total", f"{q2_total:,}")
+
+q2_mem_bytes = q2_total * 4
+q2_mem_kb = q2_mem_bytes / 1024
+glue("q2_mem_bytes", f"{q2_mem_bytes:,}")
+glue("q2_mem_kb", f"{q2_mem_kb:.0f}")
+
+# Activation memory for batch size 32
+q2_act_bytes = 32 * (784 + 256 + 128 + 10) * 4
+q2_act_kb = q2_act_bytes / 1024
+glue("q2_act_kb", f"{q2_act_kb:.0f}")
+```
+
 ```{admonition} Answer
 :class: dropdown

-**Layer 1**: 784 × 256 + 256 = 200,960 parameters
-**Layer 2**: 256 × 128 + 128 = 32,896 parameters
-**Layer 3**: 128 × 10 + 10 = 1,290 parameters
+**Layer 1**: 784 × 256 + 256 = {glue:text}`q2_l1` parameters
+**Layer 2**: 256 × 128 + 128 = {glue:text}`q2_l2` parameters
+**Layer 3**: 128 × 10 + 10 = {glue:text}`q2_l3` parameters

-**Total**: 235,146 parameters
+**Total**: {glue:text}`q2_total` parameters

-**Memory**: 235,146 × 4 = 940,584 bytes ≈ 940 KB
+**Memory**: {glue:text}`q2_total` × 4 = {glue:text}`q2_mem_bytes` bytes ≈ {glue:text}`q2_mem_kb` KB

-This is parameter memory only. Add activation memory for batch processing: for batch size 32, you need space for intermediate tensors at each layer (32×784, 32×256, 32×128, 32×10 = approximately 260 KB more).
+This is parameter memory only. Add activation memory for batch processing: for batch size 32, you need space for intermediate tensors at each layer (32×784, 32×256, 32×128, 32×10 = approximately {glue:text}`q2_act_kb` KB more).
 ```

 **Q3: Dropout Scaling**
@@ -891,6 +982,19 @@ Why do we scale surviving values by `1/(1-p)` during training? What happens if w

 For Linear layer forward pass `y = xW + b`, which operation dominates: matrix multiply or bias addition?

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Q4: Computational Bottleneck
+q4_matmul = 32 * 784 * 256
+q4_bias = 32 * 256
+q4_ratio = q4_matmul / q4_bias
+glue("q4_matmul", f"{q4_matmul:,}")
+glue("q4_bias", f"{q4_bias:,}")
+glue("q4_ratio", f"{q4_ratio:.0f}")
+```
+
 ```{admonition} Answer
 :class: dropdown

@@ -898,10 +1002,10 @@ For Linear layer forward pass `y = xW + b`, which operation dominates: matrix mu
 **Bias addition**: O(batch × out_features) operations

 For Linear(784, 256) with batch size 32:
-- **Matmul**: 32 × 784 × 256 = 6,422,528 operations
-- **Bias**: 32 × 256 = 8,192 operations
+- **Matmul**: 32 × 784 × 256 = {glue:text}`q4_matmul` operations
+- **Bias**: 32 × 256 = {glue:text}`q4_bias` operations

-Matrix multiply dominates by ~783x. This is why optimizing matmul (using BLAS, GPU kernels) is critical for neural network performance.
+Matrix multiply dominates by ~{glue:text}`q4_ratio`x. This is why optimizing matmul (using BLAS, GPU kernels) is critical for neural network performance.
 ```

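To see that dominance empirically, a rough timing sketch (illustrative, not part of this commit; absolute numbers vary by machine):

```python
import time
import numpy as np

x = np.random.rand(32, 784).astype(np.float32)
W = np.random.rand(784, 256).astype(np.float32)
b = np.random.rand(256).astype(np.float32)

t0 = time.perf_counter()
for _ in range(1000):
    y = x @ W            # 32 × 784 × 256 ≈ 6.4M multiply-adds per call
t1 = time.perf_counter()
for _ in range(1000):
    z = y + b            # 32 × 256 = 8,192 additions per call
t2 = time.perf_counter()
print(f"matmul: {t1 - t0:.3f}s  bias add: {t2 - t1:.3f}s")
```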

 **Q5: Initialization Impact**
@@ -975,7 +1079,7 @@ Implement loss functions (MSELoss, CrossEntropyLoss) that measure prediction err

 ```{tip} Interactive Options

-- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/03_layers/03_layers.ipynb)** - Run interactively in browser, no setup required
+- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/03_layers/layers.ipynb)** - Run interactively in browser, no setup required
 - **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/03_layers/03_layers.py)** - Browse the implementation code
 ```
@@ -1,3 +1,9 @@
+---
+file_format: mystnb
+kernelspec:
+  name: python3
+---
+
 # Module 04: Losses

 :::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

 Run interactively in your browser.

-<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F04_losses%2F04_losses.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
+<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F04_losses%2Flosses.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
 ```

 ```{grid-item-card} 📄 View Source
|
||||
|
||||
### Why Loss Functions Matter at Scale
|
||||
|
||||
```{code-cell} python3
|
||||
:tags: [remove-input, remove-output]
|
||||
from myst_nb import glue
|
||||
|
||||
# Scale section: exponential operations for language model loss
|
||||
scale_exp_ops = 50_000 * 128
|
||||
glue("scale_exp_ops", f"{scale_exp_ops / 1e6:.1f}M")
|
||||
|
||||
# Cross-entropy memory: 3 tensors (logits, softmax, log-softmax)
|
||||
scale_ce_bytes = 128 * 50_000 * 4 * 3
|
||||
scale_ce_mb = scale_ce_bytes / 1024**2
|
||||
glue("scale_ce_mb", f"{scale_ce_mb:.1f} MB")
|
||||
|
||||
# FP16 cross-entropy memory
|
||||
scale_fp16_bytes = 128 * 50_000 * 2 * 3
|
||||
scale_fp16_mb = scale_fp16_bytes / 1024**2
|
||||
glue("scale_fp16_mb", f"{scale_fp16_mb:.1f} MB")
|
||||
```
|
||||
|
||||
To appreciate why loss functions matter in production, consider the scale of modern ML systems:
|
||||
|
||||
- **Language models**: 50,000 token vocabulary × 128 batch size = **6.4M exponential operations per loss computation**. With sampled softmax, this reduces to ~128K operations (50× speedup).
|
||||
- **Language models**: 50,000 token vocabulary × 128 batch size = **{glue:text}`scale_exp_ops` exponential operations per loss computation**. With sampled softmax, this reduces to ~128K operations (50× speedup).
|
||||
- **Computer vision**: ImageNet with 1,000 classes processes **256,000 softmax computations** per batch. Fused CUDA kernels reduce this from 15ms to 0.5ms.
|
||||
- **Recommendation systems**: Billions of items require specialized loss functions. YouTube's recommendation system uses **sampled softmax over 1M+ videos**, making loss computation the primary bottleneck.
|
||||
|
||||
Memory pressure is equally significant. A language model forward pass might consume 8GB for activations, 2GB for parameters, but **768MB just for the cross-entropy loss computation** (B=128, C=50000, float32). Using FP16 cuts this to 384MB. Using hierarchical softmax eliminates the materialization entirely.
|
||||
Memory pressure is equally significant. A language model forward pass might consume 8GB for activations, 2GB for parameters, but **{glue:text}`scale_ce_mb` just for the cross-entropy loss computation** (B=128, C=50000, float32). Using FP16 cuts this to {glue:text}`scale_fp16_mb`. Using hierarchical softmax eliminates the materialization entirely.
|
||||
|
||||
The loss computation typically accounts for **5-10% of total training time** in well-optimized systems, but can dominate (30-50%) for large vocabularies without optimization. This is why production frameworks invest heavily in fused kernels, specialized data structures, and algorithmic improvements like hierarchical softmax.
|
||||
|
||||
@@ -689,31 +714,60 @@ Test yourself with these systems thinking questions. They're designed to build i

 A language model with 50,000 token vocabulary uses CrossEntropyLoss with batch size 128. Using float32, how much memory does the loss computation require for logits, softmax probabilities, and log-probabilities?

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Q1: Per-tensor memory (B × C × 4 bytes, binary MB)
+q1_per_tensor_bytes = 128 * 50_000 * 4
+q1_per_tensor_mb = q1_per_tensor_bytes / 1024**2
+glue("q1_per_tensor_mb", f"{q1_per_tensor_mb:.1f} MB")
+
+# Q1: Total for 3 tensors
+q1_total_bytes = q1_per_tensor_bytes * 3
+q1_total_mb = q1_total_bytes / 1024**2
+glue("q1_total_mb", f"{q1_total_mb:.1f} MB")
+
+# Q1: FP16 total
+q1_fp16_bytes = 128 * 50_000 * 2 * 3
+q1_fp16_mb = q1_fp16_bytes / 1024**2
+glue("q1_fp16_mb", f"{q1_fp16_mb:.1f} MB")
+```
+
 ```{admonition} Answer
 :class: dropdown

 **Calculation:**
-- Logits: 128 × 50,000 × 4 bytes = 25.6 MB
-- Softmax probabilities: 128 × 50,000 × 4 bytes = 25.6 MB
-- Log-softmax: 128 × 50,000 × 4 bytes = 25.6 MB
+- Logits: 128 × 50,000 × 4 bytes = {glue:text}`q1_per_tensor_mb`
+- Softmax probabilities: 128 × 50,000 × 4 bytes = {glue:text}`q1_per_tensor_mb`
+- Log-softmax: 128 × 50,000 × 4 bytes = {glue:text}`q1_per_tensor_mb`

-**Total: 76.8 MB** just for loss computation (before model activations!)
+**Total: {glue:text}`q1_total_mb`** just for loss computation (before model activations!)

 **Key insight**: Memory scales as B×C. Doubling vocabulary doubles loss computation memory. This is why large language models use techniques like sampled softmax - they literally can't afford to materialize the full vocabulary every forward pass.

-**Production solution**: Switch to FP16 (cuts to 38.4 MB) or use hierarchical/sampled softmax (reduces C from 50,000 to ~1,000).
+**Production solution**: Switch to FP16 (cuts to {glue:text}`q1_fp16_mb`) or use hierarchical/sampled softmax (reduces C from 50,000 to ~1,000).
 ```

 **Q2: Complexity Analysis - Softmax Bottleneck**

 Your training profile shows: Forward pass 80ms, Loss computation 120ms, Backward pass 150ms. Your model has 1,000 output classes and batch size 64. Why is loss computation so expensive, and what's the fix?
+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Q2: exp/log operations
+q2_ops = 64 * 1_000
+glue("q2_ops", f"{q2_ops:,}")
+```
+
 ```{admonition} Answer
 :class: dropdown

 **Problem**: Loss taking 120ms (34% of iteration time) is unusually high. Normal ratio is 5-10%.

-**Root cause**: CrossEntropyLoss is O(B×C). With B=64 and C=1,000, that's 64,000 exp/log operations. If implemented naively in Python loops (not vectorized), this becomes a bottleneck.
+**Root cause**: CrossEntropyLoss is O(B×C). With B=64 and C=1,000, that's {glue:text}`q2_ops` exp/log operations. If implemented naively in Python loops (not vectorized), this becomes a bottleneck.

 **Diagnosis steps**:
 1. Profile within loss: Is `log_softmax` the bottleneck? (Likely yes)
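
A vectorized reference implementation (an illustrative sketch, not part of this commit; TinyTorch's own API may differ) shows the O(B×C) work collapsing into a few array operations:

```python
import numpy as np

def cross_entropy(logits, targets):
    # Stable log-softmax over the whole batch, then pick out the target column.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.random.rand(64, 1000).astype(np.float32)   # B=64, C=1,000
targets = np.random.randint(0, 1000, size=64)
print(cross_entropy(logits, targets))  # ≈ ln(1000) ≈ 6.9 for near-uniform logits
```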
@@ -732,7 +786,7 @@ Your training profile shows: Forward pass 80ms, Loss computation 120ms, Backward

 Your model outputs logits `[50, 100, 150]`. Without the log-sum-exp trick, what happens when you compute softmax? With the trick, what values are actually computed?

-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown

 **Without the trick (naive softmax):**
@@ -756,13 +810,13 @@ log_softmax = shifted - log_sum_exp = [-100, -50, 0]

 **Result**: Valid log-probabilities, stable training.

 **Key insight**: Subtracting max makes largest value 0, so `exp(0) = 1.0` is always safe. Smaller values underflow to 0, but that's fine - they contribute negligibly anyway. This is why **you must use log-sum-exp for any softmax computation**.
-```
+````
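
A minimal NumPy sketch of the trick (illustrative, not part of this commit):

```python
import numpy as np

def log_softmax(logits):
    # Log-sum-exp trick: shifting by the max keeps every exp() argument <= 0.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.exp(shifted).sum())

logits = np.array([50.0, 100.0, 150.0], dtype=np.float32)
print(np.exp(logits))        # naive path: exp(100) and exp(150) overflow float32 to inf
print(log_softmax(logits))   # stable path: [-100. -50. 0.]
```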

 **Q4: Loss Function Selection - Classification Problem**

 You're building a medical diagnosis system with 5 disease categories. Should you use BinaryCrossEntropyLoss or CrossEntropyLoss? What if the categories aren't mutually exclusive (patient can have multiple diseases)?

-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown

 **Case 1: Mutually exclusive diseases** (patient has exactly one)
@@ -789,13 +843,13 @@ loss = BinaryCrossEntropyLoss()(probs, targets)
 ```

 **Critical medical consideration**: Multi-label is more realistic - patients often have comorbidities!
-```
+````

 **Q5: Batch Size Impact - Memory and Gradients**

 You train with batch size 32, using 4GB GPU memory. You want to increase to batch size 128. Will memory usage be 16GB? What happens to the loss value and gradient quality?

-```{admonition} Answer
+````{admonition} Answer
 :class: dropdown

 **Memory usage**: Yes, approximately **16GB** (4× increase)
@@ -828,7 +882,7 @@ optimizer.step() # Update once with accumulated gradients (4×32 = 128 effectiv
 ```

 This gives you the gradient quality of batch 128 with only the memory cost of batch 32!
-```
+````

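A self-contained PyTorch-style sketch of that gradient-accumulation pattern (illustrative, not part of this commit; the tiny model and random batches are stand-ins):

```python
import torch
from torch import nn

model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accum_steps = 4  # 4 micro-batches of 32 ≈ one effective batch of 128

optimizer.zero_grad()
for step in range(8):
    x, y = torch.randn(32, 10), torch.randn(32, 2)   # stand-in micro-batch
    loss = loss_fn(model(x), y) / accum_steps        # scale so gradients average
    loss.backward()                                  # accumulates into .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                             # one update per 4×32 samples
        optimizer.zero_grad()
```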

 ## Further Reading
@@ -867,7 +921,7 @@ Build efficient data pipelines that handle batching, shuffling, and iteration ov

 ```{tip} Interactive Options

-- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/04_losses/04_losses.ipynb)** - Run interactively in browser, no setup required
+- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/04_losses/losses.ipynb)** - Run interactively in browser, no setup required
 - **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/04_losses/04_losses.py)** - Browse the implementation code
 ```
@@ -1,3 +1,9 @@
+---
+file_format: mystnb
+kernelspec:
+  name: python3
+---
+
 # Module 05: DataLoader

 :::{admonition} Module Info
@@ -25,7 +31,7 @@ Listen to an AI-generated overview.

 Run interactively in your browser.

-<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F05_dataloader%2F05_dataloader.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
+<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F05_dataloader%2Fdataloader.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
 ```

 ```{grid-item-card} 📄 View Source
@@ -557,11 +563,24 @@ def __iter__(self) -> Iterator:
         yield self._collate_batch(batch)
 ```

-The key insight: `random.shuffle(indices)` randomizes a list of integers, not actual data. For 50,000 samples, this shuffles 50,000 integers (400 KB) instead of 50,000 images (potentially gigabytes). The actual data stays in place; only the access order changes.
+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Shuffling: index memory for 50K samples (SI — illustrative)
+shuffle_50k_bytes = 50_000 * 8
+glue("shuffle_50k_kb", f"{shuffle_50k_bytes // 1000:,} KB")
+
+# Shuffling: index memory for 1M samples (SI — illustrative)
+shuffle_1m_bytes = 1_000_000 * 8
+glue("shuffle_1m_mb", f"{shuffle_1m_bytes // 10**6} MB")
+```
+
+The key insight: `random.shuffle(indices)` randomizes a list of integers, not actual data. For 50,000 samples, this shuffles 50,000 integers ({glue:text}`shuffle_50k_kb`) instead of 50,000 images (potentially gigabytes). The actual data stays in place; only the access order changes.

 Each epoch generates a fresh shuffle, so the same samples appear in different batches. If sample 42 and sample 1337 were in the same batch in epoch 1, they're likely in different batches in epoch 2. This decorrelation is essential for generalization.

-The memory cost of shuffling is `8 bytes × dataset_size`. For 1 million samples, that's 8 MB, negligible compared to the actual data. The time cost is O(n) for generating and shuffling indices, which happens once per epoch, not per batch.
+The memory cost of shuffling is `8 bytes × dataset_size`. For 1 million samples, that's {glue:text}`shuffle_1m_mb`, negligible compared to the actual data. The time cost is O(n) for generating and shuffling indices, which happens once per epoch, not per batch.
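
A minimal sketch of index-based shuffling (illustrative, not part of this commit):

```python
import random

dataset_size, batch_size = 50_000, 128
indices = list(range(dataset_size))
random.shuffle(indices)  # shuffles 50,000 ints; the samples themselves never move

# Each batch is a slice of the shuffled index list; data is fetched lazily by index.
batches = [indices[i:i + batch_size] for i in range(0, dataset_size, batch_size)]
print(len(batches), batches[0][:5])
```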

 ### Iterator Protocol and Generator Pattern
@@ -774,13 +793,27 @@ The Dataset abstraction, DataLoader interface, and batching semantics are identi

 ### Why DataLoaders Matter at Scale

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Scale section: batch memory (SI units — illustrative values)
+scale_batch_bytes = 256 * 150_000  # 256 images × 150 KB (SI) per image
+scale_batch_mb = scale_batch_bytes / 10**6
+glue("scale_batch_mb", f"{scale_batch_mb:.0f} MB")
+
+# Scale section: I/O throughput (use rounded MB value to match illustrative style)
+scale_io_time_ms = int(scale_batch_mb) / 500 * 1000  # 38 MB / 500 MB/s
+glue("scale_io_ms", f"{scale_io_time_ms:.0f} ms")
+```
+
 To appreciate why data loading infrastructure matters, consider the scale of production training:

 - **ImageNet training**: 1.2 million images at 224×224×3 pixels = **600 GB** of uncompressed data
-- **Batch memory**: batch_size=256 with 150 KB per image = **38 MB** per batch
-- **I/O throughput**: Loading from SSD at 500 MB/s = **76 ms per batch** just for disk reads
+- **Batch memory**: batch_size=256 with 150 KB per image = **{glue:text}`scale_batch_mb`** per batch
+- **I/O throughput**: Loading from SSD at 500 MB/s = **{glue:text}`scale_io_ms` per batch** just for disk reads

-Without proper batching and prefetching, data loading would dominate training time. A forward and backward pass might take 50 ms, but loading the data takes 76 ms. The GPU sits idle 60% of the time waiting for data.
+Without proper batching and prefetching, data loading would dominate training time. A forward and backward pass might take 50 ms, but loading the data takes {glue:text}`scale_io_ms`. The GPU sits idle 60% of the time waiting for data.

 Production solutions:
@@ -792,6 +825,57 @@ Your DataLoader provides the interface that enables these optimizations. Add `nu

 ## Check Your Understanding

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Q1: Memory Calculation — CIFAR-10 batch
+q1_image_bytes = 32 * 32 * 3 * 4
+q1_image_kb = q1_image_bytes / 1024
+q1_batch_bytes = 128 * q1_image_bytes
+q1_batch_kb = q1_batch_bytes / 1024
+q1_batch_mb = q1_batch_bytes / 1024**2
+glue("q1_image_bytes", f"{q1_image_bytes:,}")
+glue("q1_image_kb", f"{q1_image_kb:.0f} KB")
+glue("q1_batch_kb", f"{q1_batch_kb:,.0f} KB")
+glue("q1_batch_mb", f"{q1_batch_mb:.1f} MB")
+
+# Q2: Throughput Analysis
+q2_data_ms = 45
+q2_total_ms = 120
+q2_pct = q2_data_ms / q2_total_ms * 100
+q2_compute_ms = 30 + 35 + 10
+q2_speedup = q2_total_ms / q2_compute_ms
+glue("q2_pct", f"{q2_pct:.1f}%")
+glue("q2_compute_ms", f"{q2_compute_ms}")
+glue("q2_speedup", f"{q2_speedup:.1f}")
+
+# Q3: Shuffle Memory Overhead (SI units — illustrative)
+q3_num_samples = 10_000_000
+q3_index_bytes = q3_num_samples * 8
+q3_index_mb = q3_index_bytes / 10**6
+q3_dataset_bytes = q3_num_samples * 10_000  # 10 KB per sample (SI)
+q3_dataset_gb = q3_dataset_bytes / 10**9
+q3_overhead_pct = q3_index_bytes / q3_dataset_bytes * 100
+glue("q3_index_mb", f"{q3_index_mb:.0f} MB")
+glue("q3_dataset_gb", f"{q3_dataset_gb:.0f} GB")
+glue("q3_overhead_pct", f"{q3_overhead_pct:.2f}%")
+
+# Q5: Collation Cost (binary units for KB/MB)
+q5_sample_bytes = 3 * 224 * 224 * 4
+q5_sample_kb = q5_sample_bytes / 1024
+q5_batch_bytes = 128 * q5_sample_bytes
+q5_batch_kb = q5_batch_bytes / 1024
+q5_batch_mb = q5_batch_bytes / 1024**2
+# Copy time: use rounded binary MB / SI bandwidth for ~3.7 ms (matches text)
+q5_copy_time_ms = q5_batch_mb / (20 * 1000) * 1000
+glue("q5_sample_bytes", f"{q5_sample_bytes:,}")
+glue("q5_sample_kb", f"{q5_sample_kb:.0f} KB")
+glue("q5_batch_kb", f"{q5_batch_kb:,.0f} KB")
+glue("q5_batch_mb", f"{q5_batch_mb:.1f} MB")
+glue("q5_copy_ms", f"~{q5_copy_time_ms:.1f} milliseconds")
+```
+
 Test your understanding with these systems thinking questions. Focus on quantitative analysis and performance trade-offs.

 **Q1: Memory Calculation**
@@ -801,9 +885,9 @@ You're training on CIFAR-10 with 50,000 RGB images (32×32×3 pixels, float32).

 ```{admonition} Answer
 :class: dropdown

-Each image: 32 × 32 × 3 × 4 bytes = 12,288 bytes ≈ 12 KB
+Each image: 32 × 32 × 3 × 4 bytes = {glue:text}`q1_image_bytes` bytes ≈ {glue:text}`q1_image_kb`

-Batch of 128 images: 128 × 12 KB = **1,536 KB ≈ 1.5 MB**
+Batch of 128 images: 128 × {glue:text}`q1_image_kb` = **{glue:text}`q1_batch_kb` ≈ {glue:text}`q1_batch_mb`**

 This is the minimum memory just for the input batch. Add activations, gradients, and model parameters, and peak memory might be 50-100× higher. But the **batch size directly controls the baseline memory consumption**.
 ```
@@ -821,11 +905,11 @@ Total: 120ms per batch. Where's the bottleneck? How much faster could training b

 ```{admonition} Answer
 :class: dropdown

-Data loading takes 45ms out of 120ms = **37.5% of total time**.
+Data loading takes 45ms out of 120ms = **{glue:text}`q2_pct` of total time**.

-If data loading were instant (via prefetching or caching), total time would be 30+35+10 = **75ms per batch**.
+If data loading were instant (via prefetching or caching), total time would be 30+35+10 = **{glue:text}`q2_compute_ms`ms per batch**.

-Speedup: 120ms → 75ms = **1.6× faster training** just by fixing data loading!
+Speedup: 120ms → {glue:text}`q2_compute_ms`ms = **{glue:text}`q2_speedup`× faster training** just by fixing data loading!

 This shows why production systems use prefetching with `num_workers`: while the GPU computes batch N, the CPU loads batch N+1. Data loading and computation overlap, hiding the I/O latency.
 ```
@@ -837,11 +921,11 @@ You're training on a dataset with 10 million samples. How much extra memory does

 ```{admonition} Answer
 :class: dropdown

-Shuffling requires storing the index array: 10,000,000 indices × 8 bytes = **80 MB**
+Shuffling requires storing the index array: 10,000,000 indices × 8 bytes = **{glue:text}`q3_index_mb`**

 This is the complete overhead. The actual data isn't copied or moved, only the index array is shuffled.

-For comparison, if each sample is 10 KB, the full dataset is 100 GB. Shuffling adds 80 MB to randomize access to 100 GB of data, **0.08% overhead**. This is why index-based shuffling scales to massive datasets.
+For comparison, if each sample is 10 KB, the full dataset is {glue:text}`q3_dataset_gb` GB. Shuffling adds {glue:text}`q3_index_mb` to randomize access to {glue:text}`q3_dataset_gb` GB of data, **{glue:text}`q3_overhead_pct` overhead**. This is why index-based shuffling scales to massive datasets.
 ```

 **Q4: Batch Size Trade-offs**
@@ -878,11 +962,11 @@ Your DataLoader collates batches using `np.stack()`. For batch_size=128 with sam

 ```{admonition} Answer
 :class: dropdown

-Each sample: 3 × 224 × 224 × 4 bytes = 602,112 bytes ≈ 588 KB
+Each sample: 3 × 224 × 224 × 4 bytes = {glue:text}`q5_sample_bytes` bytes ≈ {glue:text}`q5_sample_kb`

-Batch of 128 samples: 128 × 588 KB = **75,264 KB ≈ 73.5 MB**
+Batch of 128 samples: 128 × {glue:text}`q5_sample_kb` = **{glue:text}`q5_batch_kb` ≈ {glue:text}`q5_batch_mb`**

-`np.stack()` allocates a new array of this size and copies all 128 samples into contiguous memory. On a modern CPU with 20 GB/s memory bandwidth, this copy takes approximately **3.7 milliseconds**.
+`np.stack()` allocates a new array of this size and copies all 128 samples into contiguous memory. On a modern CPU with 20 GB/s memory bandwidth, this copy takes approximately **{glue:text}`q5_copy_ms`**.

 This is why larger batch sizes can have higher absolute collation costs (more data to copy), but the per-sample overhead decreases because you're copying 128 samples in one operation instead of processing 128 tiny batches separately.
 ```
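
A quick NumPy check of the collation size (illustrative, not part of this commit):

```python
import numpy as np

samples = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(128)]
batch = np.stack(samples)         # one contiguous allocation plus 128 copies
print(batch.shape, batch.nbytes)  # (128, 3, 224, 224) 77,070,336 bytes ≈ 73.5 MiB
```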
@@ -923,7 +1007,7 @@ Implement automatic differentiation that computes gradients through computation

 ```{tip} Interactive Options

-- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/05_dataloader/05_dataloader.ipynb)** - Run interactively in browser, no setup required
+- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/05_dataloader/dataloader.ipynb)** - Run interactively in browser, no setup required
 - **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/05_dataloader/05_dataloader.py)** - Browse the implementation code
 ```
@@ -1,3 +1,9 @@
+---
+file_format: mystnb
+kernelspec:
+  name: python3
+---
+
 # Module 06: Autograd

 :::{admonition} Module Info
@@ -32,7 +38,7 @@ Listen to an AI-generated overview.

 Run interactively in your browser.

-<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F06_autograd%2F06_autograd.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
+<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F06_autograd%2Fautograd.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
 ```

 ```{grid-item-card} 📄 View Source
@@ -620,20 +626,51 @@ Consider a simple linear layer: `y = x @ W + b`
 - grad_W (same shape as W)
 - grad_b (same shape as b)

+```{code-cell} python3
+:tags: [remove-input, remove-output]
+from myst_nb import glue
+
+# Memory Management: linear layer with batch=32, input=512, output=768
+mem_x_bytes = 32 * 512 * 4
+mem_x_kb = mem_x_bytes / 1024
+glue("mem_x_bytes", f"{mem_x_bytes:,}")
+glue("mem_x_kb", f"{mem_x_kb:,.0f}")
+
+mem_W_bytes = 512 * 768 * 4
+mem_W_kb = mem_W_bytes / 1024
+glue("mem_W_bytes", f"{mem_W_bytes:,}")
+glue("mem_W_kb", f"{mem_W_kb:,.0f}")
+
+mem_grad_x_kb = mem_x_kb
+glue("mem_grad_x_bytes", f"{mem_x_bytes:,}")
+glue("mem_grad_x_kb", f"{mem_grad_x_kb:,.0f}")
+
+mem_grad_W_kb = mem_W_kb
+glue("mem_grad_W_bytes", f"{mem_W_bytes:,}")
+glue("mem_grad_W_kb", f"{mem_grad_W_kb:,.0f}")
+
+mem_grad_b_bytes = 768 * 4
+mem_grad_b_kb = mem_grad_b_bytes / 1024
+glue("mem_grad_b_bytes", f"{mem_grad_b_bytes:,}")
+glue("mem_grad_b_kb", f"{mem_grad_b_kb:,.0f}")
+
+mem_total_bytes = mem_x_bytes + mem_W_bytes + mem_x_bytes + mem_W_bytes + mem_grad_b_bytes
+mem_total_mb = mem_total_bytes / 1024**2
+glue("mem_total_mb", f"{mem_total_mb:.1f}")
+```
+
 For a batch of 32 samples through a (512, 768) linear layer, the memory breakdown is:

-```
 Forward storage:
-x: 32 × 512 × 4 bytes = 64 KB
-W: 512 × 768 × 4 bytes = 1,572 KB
+x: 32 × 512 × 4 bytes = {glue:text}`mem_x_kb` KB
+W: 512 × 768 × 4 bytes = {glue:text}`mem_W_kb` KB

 Backward storage:
-grad_x: 32 × 512 × 4 bytes = 64 KB
-grad_W: 512 × 768 × 4 bytes = 1,572 KB
-grad_b: 768 × 4 bytes = 3 KB
+grad_x: 32 × 512 × 4 bytes = {glue:text}`mem_grad_x_kb` KB
+grad_W: 512 × 768 × 4 bytes = {glue:text}`mem_grad_W_kb` KB
+grad_b: 768 × 4 bytes = {glue:text}`mem_grad_b_kb` KB

-Total: ~3.3 MB for one layer (2× parameter size + activation size)
-```
+Total: ~{glue:text}`mem_total_mb` MB for one layer (2× parameter size + activation size)

 Multiply by network depth and you see why memory limits batch size. A 100-layer transformer stores 100× the activations, which can easily exceed GPU memory.
@@ -763,16 +800,48 @@ Test yourself with these systems thinking questions. They're designed to build i
|
||||
|
||||
A 5-layer MLP processes a batch of 64 samples. Each layer stores its input activation for backward pass. Layer dimensions are: 784 → 512 → 256 → 128 → 10. How much memory (in MB) is used to store activations for one batch?
|
||||
|
||||
```{code-cell} python3
|
||||
:tags: [remove-input, remove-output]
|
||||
from myst_nb import glue
|
||||
|
||||
# Q1: Computation Graph Memory — 5-layer MLP, batch=64, float32 (4 bytes)
|
||||
q1_l1_bytes = 64 * 784 * 4
|
||||
q1_l1_kb = q1_l1_bytes / 1024
|
||||
glue("q1_l1_kb", f"{q1_l1_kb:,.0f}")
|
||||
|
||||
q1_l2_bytes = 64 * 512 * 4
|
||||
q1_l2_kb = q1_l2_bytes / 1024
|
||||
glue("q1_l2_kb", f"{q1_l2_kb:,.0f}")
|
||||
|
||||
q1_l3_bytes = 64 * 256 * 4
|
||||
q1_l3_kb = q1_l3_bytes / 1024
|
||||
glue("q1_l3_kb", f"{q1_l3_kb:,.0f}")
|
||||
|
||||
q1_l4_bytes = 64 * 128 * 4
|
||||
q1_l4_kb = q1_l4_bytes / 1024
|
||||
glue("q1_l4_kb", f"{q1_l4_kb:,.0f}")
|
||||
|
||||
q1_l5_bytes = 64 * 10 * 4
|
||||
q1_l5_kb = q1_l5_bytes / 1024
|
||||
glue("q1_l5_kb", f"{q1_l5_kb:.1f}")
|
||||
|
||||
q1_total_bytes = q1_l1_bytes + q1_l2_bytes + q1_l3_bytes + q1_l4_bytes + q1_l5_bytes
|
||||
q1_total_kb = q1_total_bytes / 1024
|
||||
q1_total_mb = q1_total_bytes / 1024**2
|
||||
glue("q1_total_kb", f"{q1_total_kb:.1f}")
|
||||
glue("q1_total_mb", f"{q1_total_mb:.2f}")
|
||||
```
|
||||
|
||||
```{admonition} Answer
|
||||
:class: dropdown
|
||||
|
||||
Layer 1 input: 64 × 784 × 4 bytes = 200 KB
|
||||
Layer 2 input: 64 × 512 × 4 bytes = 131 KB
|
||||
Layer 3 input: 64 × 256 × 4 bytes = 66 KB
|
||||
Layer 4 input: 64 × 128 × 4 bytes = 33 KB
|
||||
Layer 5 input: 64 × 10 × 4 bytes = 3 KB
|
||||
Layer 1 input: 64 × 784 × 4 bytes = {glue:text}`q1_l1_kb` KB
|
||||
Layer 2 input: 64 × 512 × 4 bytes = {glue:text}`q1_l2_kb` KB
|
||||
Layer 3 input: 64 × 256 × 4 bytes = {glue:text}`q1_l3_kb` KB
|
||||
Layer 4 input: 64 × 128 × 4 bytes = {glue:text}`q1_l4_kb` KB
|
||||
Layer 5 input: 64 × 10 × 4 bytes = {glue:text}`q1_l5_kb` KB
|
||||
|
||||
**Total: ~433 KB ≈ 0.43 MB**
|
||||
**Total: ~{glue:text}`q1_total_kb` KB = {glue:text}`q1_total_mb` MB**
|
||||
|
||||
This is per forward pass! A 100-layer transformer would store 100× this amount, which is why gradient checkpointing trades compute for memory by recomputing activations during backward pass.
|
||||
```
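The checkpointing trade-off is easy to sanity-check with arithmetic. A rough sketch (assumptions: uniform per-layer activation cost and simple segment-based checkpointing; the numbers reuse Q1's ~433 KB per layer):

```python
# Back-of-the-envelope sketch of gradient checkpointing: keep activations
# only every k-th layer and recompute the rest during the backward pass.
layers = 100
act_kb_per_layer = 433   # per-layer activation memory, from Q1 above
k = 10                   # checkpoint interval (assumed)

store_all_mb = layers * act_kb_per_layer / 1024
# ~N/k stored checkpoints plus one live segment of ~k recomputed activations
checkpointed_mb = (layers / k + k) * act_kb_per_layer / 1024

print(f"store everything:     {store_all_mb:.1f} MB")
print(f"checkpoint every {k}:  {checkpointed_mb:.1f} MB "
      f"(+ roughly one extra forward pass of compute)")
```

With k near √N the activation memory drops from O(N) to O(√N), which is why this scheme is often called sqrt-checkpointing.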
@@ -799,14 +868,32 @@ This is why training (forward + backward) takes roughly 3× inference time. GPU

You have 16GB GPU memory and a model with 1B parameters (float32). How much memory is available for activations and gradients during training?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q3: Gradient Accumulation Memory — 1B params, float32
# Using decimal GB (1 GB = 10^9 bytes) for clean round numbers
q3_params = 1_000_000_000
q3_model_gb = q3_params * 4 / 10**9
q3_grad_gb = q3_params * 4 / 10**9
q3_opt_gb = q3_params * 8 / 10**9
q3_total_gb = q3_model_gb + q3_grad_gb + q3_opt_gb

glue("q3_model_gb", f"{q3_model_gb:.0f}")
glue("q3_grad_gb", f"{q3_grad_gb:.0f}")
glue("q3_opt_gb", f"{q3_opt_gb:.0f}")
glue("q3_total_gb", f"{q3_total_gb:.0f}")
```

```{admonition} Answer
:class: dropdown

Model parameters: 1B × 4 bytes = 4 GB
Gradients: 1B × 4 bytes = 4 GB
Optimizer state (Adam): 1B × 8 bytes = 8 GB (momentum + variance)
Model parameters: 1B × 4 bytes = {glue:text}`q3_model_gb` GB
Gradients: 1B × 4 bytes = {glue:text}`q3_grad_gb` GB
Optimizer state (Adam): 1B × 8 bytes = {glue:text}`q3_opt_gb` GB (momentum + variance)

**Total framework overhead: 16 GB**
**Total framework overhead: {glue:text}`q3_total_gb` GB**

**Available for activations: 0 GB** - you've already exceeded memory!

@@ -817,6 +904,21 @@ This is why large models use gradient accumulation across multiple forward passe

A typical training batch has: 32 images (input), 10M parameter tensors (weights), 50 intermediate activation tensors. If requires_grad defaults to True for all tensors, how many tensors unnecessarily track gradients?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q4: requires_grad Performance — batch of 32 images, 3×224×224, float32
q4_per_image = 3 * 224 * 224
q4_batch_values = 32 * q4_per_image
q4_batch_bytes = q4_batch_values * 4
q4_batch_mb = q4_batch_bytes / 1024**2

glue("q4_batch_values", f"{q4_batch_values / 1e6:.1f}M")
glue("q4_batch_bytes", f"{q4_batch_bytes:,}")
glue("q4_batch_mb", f"{q4_batch_mb:.1f}")
```

```{admonition} Answer
:class: dropdown

@@ -829,7 +931,7 @@ Tensors that DON'T need gradients:

**32 input tensors unnecessarily track gradients** if requires_grad defaults to True.

This is why PyTorch defaults requires_grad=False for new tensors and requires explicit opt-in for parameters. For image inputs with 32×3×224×224 = 4.8M values each, tracking gradients wastes 4.8M × 4 bytes = 19 MB per image × 32 = 608 MB for the batch!
This is why PyTorch defaults requires_grad=False for new tensors and requires explicit opt-in for parameters. For a batch of 32 images with 3×224×224 pixels each, tracking gradients wastes {glue:text}`q4_batch_values` values × 4 bytes = {glue:text}`q4_batch_mb` MB for the batch!
```

**Q5: Graph Retention**

@@ -890,7 +992,7 @@ Implement SGD, Adam, and other optimization algorithms that use your autograd gr

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/06_autograd/06_autograd.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/06_autograd/autograd.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/06_autograd/06_autograd.py)** - Browse the implementation code
```


@@ -1,3 +1,9 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 07: Optimizers

:::{admonition} Module Info
@@ -31,7 +37,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F07_optimizers%2F07_optimizers.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F07_optimizers%2Foptimizers.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -686,9 +692,27 @@ The optimizer API, update algorithms, and memory patterns are identical. When yo

### Why Optimizers Matter at Scale

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# 175B-parameter model: optimizer state with Adam
scale_params = 175_000_000_000
scale_bytes_per_param = 4
scale_param_bytes = scale_params * scale_bytes_per_param
scale_param_gb = scale_param_bytes / 1024**3
scale_state_bytes = 2 * scale_param_bytes  # 2 Adam buffers (m, v)
scale_state_tb = scale_state_bytes / 1024**4
scale_multiplier = 3  # params + 2 state buffers

glue("scale_param_gb", f"{scale_param_gb:.1f} GB")
glue("scale_state_tb", f"{scale_state_tb:.2f} TB")
glue("scale_multiplier", f"{scale_multiplier}x")
```

To appreciate optimizer importance, consider production training scenarios:

- **Large language models (175B parameters)**: Optimizer state alone consumes **1.4 TB** with Adam (3x × 700 GB parameters), requiring multi-GPU state sharding
- **Large language models (175B parameters)**: Optimizer state alone consumes **{glue:text}`scale_state_tb`** with Adam ({glue:text}`scale_multiplier` x {glue:text}`scale_param_gb` parameters), requiring multi-GPU state sharding
- **Transformer training**: AdamW with weight_decay=0.01 is standard, improving generalization over plain Adam by 2-5% accuracy (sketched below)
- **Convergence speed**: Adam typically converges in **30-50% fewer steps** than SGD on vision and language tasks, saving hours of GPU time despite higher memory cost
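A hedged sketch of what that `weight_decay` parameter does in AdamW (decoupled weight decay, per Loshchilov & Hutter; the function below is illustrative, not TinyTorch's actual optimizer API):

```python
import numpy as np

# One AdamW-style parameter update. Unlike plain Adam + L2 regularization,
# the decay term is applied directly to the weights, decoupled from the
# moment estimates.
def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g              # first moment (EMA of grads)
    v = beta2 * v + (1 - beta2) * g ** 2         # second moment
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)  # Adam step
    p = p - lr * weight_decay * p                # decoupled weight decay
    return p, m, v
```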

@@ -696,6 +720,74 @@ The optimizer choice directly impacts training feasibility. For models that bare

## Check Your Understanding

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q1: Memory calculation for 10B-parameter model (float32)
q1_params = 10_000_000_000
q1_bytes_per_param = 4
q1_param_bytes = q1_params * q1_bytes_per_param
q1_param_gb = q1_param_bytes / 1024**3

q1_adam_state_gb = 2 * q1_param_gb  # 2 buffers (m, v)
q1_total_adam_gb = q1_param_gb + q1_adam_state_gb

q1_sgd_state_gb = q1_param_gb  # 1 buffer (velocity)
q1_total_sgd_gb = q1_param_gb + q1_sgd_state_gb

q1_diff_gb = q1_total_adam_gb - q1_total_sgd_gb

glue("q1_param_gb", f"{q1_param_gb:.2f} GB")
glue("q1_adam_state_gb", f"{q1_adam_state_gb:.2f} GB")
glue("q1_total_adam_gb", f"{q1_total_adam_gb:.2f} GB")
glue("q1_sgd_state_gb", f"{q1_sgd_state_gb:.2f} GB")
glue("q1_total_sgd_gb", f"{q1_total_sgd_gb:.2f} GB")
glue("q1_diff_gb", f"{q1_diff_gb:.2f} GB")
```

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q2: Convergence trade-off
q2_adam_steps = 100_000
q2_adam_overhead = 1.2
q2_sgd_steps = 200_000
q2_sgd_overhead = 1.0

q2_adam_time = q2_adam_steps * q2_adam_overhead
q2_sgd_time = q2_sgd_steps * q2_sgd_overhead
q2_speedup = q2_sgd_time / q2_adam_time

glue("q2_adam_time", f"{q2_adam_time:,.0f}")
glue("q2_sgd_time", f"{q2_sgd_time:,.0f}")
glue("q2_speedup", f"{q2_speedup:.2f}x")
```

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q3: Bias correction impact
q3_beta1 = 0.9

q3_corr_step1 = 1 - q3_beta1 ** 1
q3_corr_step10 = 1 - q3_beta1 ** 10
q3_corr_step100 = 1 - q3_beta1 ** 100

q3_mult_step1 = 1 / q3_corr_step1
q3_mult_step10 = 1 / q3_corr_step10
q3_mult_step100 = 1 / q3_corr_step100

glue("q3_corr_step1", f"{q3_corr_step1:.1f}")
glue("q3_corr_step10", f"{q3_corr_step10:.3f}")
glue("q3_corr_step100", f"{q3_corr_step100:.4f}")
glue("q3_mult_step1", f"{q3_mult_step1:.0f}x")
glue("q3_mult_step10", f"{q3_mult_step10:.2f}x")
glue("q3_mult_step100", f"{q3_mult_step100:.1f}x")
```

Test yourself with these systems thinking questions designed to build intuition for optimization trade-offs in production ML.

**Q1: Memory Calculation**
@@ -705,15 +797,15 @@ A language model has 10 billion float32 parameters. Using Adam optimizer, how mu

```{admonition} Answer
:class: dropdown

**Parameters:** 10B × 4 bytes = **40 GB**
**Parameters:** 10B x 4 bytes = **{glue:text}`q1_param_gb`**

**Adam state:** 2 buffers (m, v) = 2 × 40 GB = **80 GB**
**Total with Adam:** 40 GB (params) + 80 GB (state) = **120 GB**
**Adam state:** 2 buffers (m, v) = 2 x {glue:text}`q1_param_gb` = **{glue:text}`q1_adam_state_gb`**
**Total with Adam:** {glue:text}`q1_param_gb` (params) + {glue:text}`q1_adam_state_gb` (state) = **{glue:text}`q1_total_adam_gb`**

**SGD with momentum:** 1 buffer (velocity) = **40 GB**
**Total with SGD:** 40 GB (params) + 40 GB (state) = **80 GB**
**SGD with momentum:** 1 buffer (velocity) = **{glue:text}`q1_sgd_state_gb`**
**Total with SGD:** {glue:text}`q1_param_gb` (params) + {glue:text}`q1_sgd_state_gb` (state) = **{glue:text}`q1_total_sgd_gb`**

**Difference:** Adam uses **40 GB more** than SGD (50% increase). This might force you to use fewer GPUs or implement optimizer state sharding.
**Difference:** Adam uses **{glue:text}`q1_diff_gb` more** than SGD (50% increase). This might force you to use fewer GPUs or implement optimizer state sharding.
```

**Q2: Convergence Trade-off**
@@ -723,24 +815,24 @@ If Adam converges in 100,000 steps and SGD needs 200,000 steps, but Adam's per-s

```{admonition} Answer
:class: dropdown

**Adam:** 100,000 steps × 1.2 = **120,000 time units**
**SGD:** 200,000 steps × 1.0 = **200,000 time units**
**Adam:** 100,000 steps x 1.2 = **{glue:text}`q2_adam_time` time units**
**SGD:** 200,000 steps x 1.0 = **{glue:text}`q2_sgd_time` time units**

**Adam finishes 1.67x faster** despite higher per-step cost. The convergence advantage (2x fewer steps) outweighs the computational overhead (1.2x slower steps).
**Adam finishes {glue:text}`q2_speedup` faster** despite higher per-step cost. The convergence advantage (2x fewer steps) outweighs the computational overhead (1.2x slower steps).

This illustrates why Adam is popular despite higher memory and compute: wall-clock time to convergence often matters more than per-step efficiency.
```

**Q3: Bias Correction Impact**

In Adam, bias correction divides first moment by (1 - β₁^t). At step 1 with β₁=0.9, this correction factor is 0.1. At step 10, it's 0.651. How does this affect early vs late training?
In Adam, bias correction divides first moment by (1 - β₁^t). At step 1 with β₁=0.9, this correction factor is {glue:text}`q3_corr_step1`. At step 10, it's {glue:text}`q3_corr_step10`. How does this affect early vs late training?

```{admonition} Answer
:class: dropdown

**Step 1:** Divide by 0.1 = multiply by **10x** (huge correction)
**Step 10:** Divide by 0.651 = multiply by **1.54x** (moderate correction)
**Step 100:** Divide by 0.9999 ≈ multiply by **1.0x** (negligible correction)
**Step 1:** Divide by {glue:text}`q3_corr_step1` = multiply by **{glue:text}`q3_mult_step1`** (huge correction)
**Step 10:** Divide by {glue:text}`q3_corr_step10` = multiply by **{glue:text}`q3_mult_step10`** (moderate correction)
**Step 100:** Divide by {glue:text}`q3_corr_step100` ≈ multiply by **{glue:text}`q3_mult_step100`** (negligible correction)

**Early training:** Large corrections amplify small moment estimates to reasonable magnitudes, enabling effective learning from the first step.
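
The correction is mechanical to verify. A minimal sketch (standard Adam bookkeeping; the constant gradient is just for illustration):

```python
# Bias-corrected first moment: with a constant gradient of 1.0, the raw EMA
# starts near zero, but dividing by (1 - beta1**t) recovers 1.0 at every step.
beta1, m, g = 0.9, 0.0, 1.0

for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g   # raw EMA
    m_hat = m / (1 - beta1 ** t)      # bias-corrected estimate
    if t in (1, 10):
        print(t, round(m, 3), round(m_hat, 3))
# t=1:  m = 0.1   -> m_hat = 1.0  (the 10x correction)
# t=10: m = 0.651 -> m_hat = 1.0
```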

@@ -826,7 +918,7 @@ Combine optimizers with training loops to actually train neural networks. You'll

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/07_optimizers/07_optimizers.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/07_optimizers/optimizers.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/07_optimizers/07_optimizers.py)** - Browse the implementation code
```


@@ -1,3 +1,9 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 08: Training

:::{admonition} Module Info
@@ -25,7 +31,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F08_training%2F08_training.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F08_training%2Ftraining.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -515,11 +521,30 @@ The `accumulation_steps` parameter enables a clever memory trick: if you want an

### Epochs and Iterations

```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue

# Iterations per epoch: 10,000 samples / batch size 32
epoch_iters_10k = math.ceil(10_000 / 32)
glue("epoch_iters_10k", f"{epoch_iters_10k:,}")

# ImageNet iterations: 1.2M images / batch 256 * 90 epochs
imagenet_iters_per_epoch = math.ceil(1_200_000 / 256)
imagenet_total_iters = imagenet_iters_per_epoch * 90
glue("imagenet_total_iters", f"{imagenet_total_iters:,}")

# ImageNet wall-clock time at 250ms/iter
imagenet_hours = imagenet_total_iters * 0.250 / 3600
glue("imagenet_hours", f"{imagenet_hours:.0f}")
```

Training operates on two timescales: iterations (single batch updates) and epochs (complete passes through the dataset). Understanding this hierarchy helps you reason about training progress and resource requirements.

An iteration processes one batch: forward pass, backward pass, optimizer step. If your dataset has 10,000 samples and batch size is 32, one epoch requires 313 iterations (10,000 ÷ 32, rounded up). Training a model to convergence typically requires dozens or hundreds of epochs, meaning tens of thousands of iterations.
An iteration processes one batch: forward pass, backward pass, optimizer step. If your dataset has 10,000 samples and batch size is 32, one epoch requires {glue:text}`epoch_iters_10k` iterations (10,000 ÷ 32, rounded up). Training a model to convergence typically requires dozens or hundreds of epochs, meaning tens of thousands of iterations.

The mathematics is straightforward but the implications are significant. Training ImageNet with 1.2 million images, batch size 256, for 90 epochs requires 421,875 iterations (1,200,000 ÷ 256 × 90). At 250ms per iteration, that's 29 hours of compute. Understanding this arithmetic helps you estimate training costs and debug slow convergence.
The mathematics is straightforward but the implications are significant. Training ImageNet with 1.2 million images, batch size 256, for 90 epochs requires {glue:text}`imagenet_total_iters` iterations (1,200,000 ÷ 256 × 90). At 250ms per iteration, that's {glue:text}`imagenet_hours` hours of compute. Understanding this arithmetic helps you estimate training costs and debug slow convergence.

Your Trainer tracks both: `self.step` counts total iterations across all epochs, while `self.epoch` counts how many complete dataset passes you've completed. Schedulers typically operate on epoch boundaries (learning rate changes each epoch), while monitoring systems track loss per iteration.
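
A schematic sketch of that two-timescale bookkeeping (illustrative pseudocode-style Python; method and attribute names are assumptions, not TinyTorch's exact Trainer API):

```python
class TrainerSkeleton:
    """Minimal shape of a trainer that tracks iterations and epochs."""

    def __init__(self, model, optimizer, scheduler, loader, epochs):
        self.model, self.optimizer = model, optimizer
        self.scheduler, self.loader = scheduler, loader
        self.epochs = epochs
        self.step = 0    # total iterations across all epochs
        self.epoch = 0   # completed passes through the dataset

    def fit(self):
        for self.epoch in range(self.epochs):
            for batch in self.loader:                   # one iteration per batch
                loss = self.model.training_step(batch)  # forward + backward (assumed helper)
                self.optimizer.step()
                self.step += 1                          # monitoring tracks loss per iteration
            self.scheduler.step(self.epoch)             # LR changes on the epoch boundary
```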

@@ -584,15 +609,42 @@ def get_lr(self, epoch: int) -> float:

The mathematics creates a smooth curve. At epoch 0, `np.cos(0) = 1`, so `cosine_factor = (1+1)/2 = 1.0`, giving `max_lr`. At the final epoch, `np.cos(π) = -1`, so `cosine_factor = (1-1)/2 = 0.0`, giving `min_lr`. Between these extremes, the cosine function creates a smooth descent.

```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue

# Cosine annealing schedule: max_lr=0.1, min_lr=0.01, total_epochs=100
max_lr = 0.1
min_lr = 0.01
total_epochs = 100

def cosine_lr(epoch):
    if epoch >= total_epochs:
        return min_lr
    cosine_factor = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return min_lr + (max_lr - min_lr) * cosine_factor

lr_epoch_0 = cosine_lr(0)
lr_epoch_25 = cosine_lr(25)
lr_epoch_50 = cosine_lr(50)
lr_epoch_75 = cosine_lr(75)
lr_epoch_100 = cosine_lr(100)

glue("cosine_lr_0", f"{lr_epoch_0:.3f}")
glue("cosine_lr_25", f"{lr_epoch_25:.3f}")
glue("cosine_lr_50", f"{lr_epoch_50:.3f}")
glue("cosine_lr_75", f"{lr_epoch_75:.3f}")
glue("cosine_lr_100", f"{lr_epoch_100:.3f}")
```

Visualizing the schedule for `max_lr=0.1`, `min_lr=0.01`, `total_epochs=100`:

```
Epoch 0:   0.100 (aggressive learning)
Epoch 25:  0.085 (still fast)
Epoch 50:  0.055 (slowing down)
Epoch 75:  0.025 (fine-tuning)
Epoch 100: 0.010 (stable convergence)
```
Epoch 0: {glue:text}`cosine_lr_0` (aggressive learning)
Epoch 25: {glue:text}`cosine_lr_25` (still fast)
Epoch 50: {glue:text}`cosine_lr_50` (slowing down)
Epoch 75: {glue:text}`cosine_lr_75` (fine-tuning)
Epoch 100: {glue:text}`cosine_lr_100` (stable convergence)

Your Trainer applies the schedule automatically after each epoch:

@@ -606,6 +658,24 @@ This updates the optimizer's learning rate before the next epoch begins, creatin

### Gradient Clipping

```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue

# Gradient clipping example: gradients [100, 200, 50]
grads = [100, 200, 50]
clip_norm_prose = math.sqrt(sum(g**2 for g in grads))
clip_coef_prose = 1.0 / clip_norm_prose
clipped_prose = [g * clip_coef_prose for g in grads]

glue("clip_norm_prose", f"{clip_norm_prose:.0f}")
glue("clip_coef_prose", f"{clip_coef_prose:.5f}")
glue("clip_g1_prose", f"{clipped_prose[0]:.3f}")
glue("clip_g2_prose", f"{clipped_prose[1]:.3f}")
glue("clip_g3_prose", f"{clipped_prose[2]:.3f}")
```

Gradient clipping prevents exploding gradients that destroy training progress. During backpropagation, gradients sometimes become extremely large (thousands or even infinity), causing parameter updates that jump far from the optimal solution or create numerical overflow (NaN values). Clipping rescales large gradients to a safe maximum while preserving their direction.

The key insight is clipping by global norm rather than individual gradients. Computing the norm across all parameters `√(Σ g²)` and scaling uniformly preserves the relative magnitudes between different parameters:
@@ -635,7 +705,7 @@ def clip_grad_norm(parameters: List, max_norm: float = 1.0) -> float:
    return float(total_norm)
```

Consider gradients `[100, 200, 50]` with global norm `√(100² + 200² + 50²) = 230`. With `max_norm=1.0`, we compute `clip_coef = 1.0 / 230 = 0.00435` and scale all gradients: `[0.435, 0.870, 0.217]`. The new norm is exactly 1.0, but the relative magnitudes are preserved (the second gradient is still twice the first).
Consider gradients [100, 200, 50] with global norm √(100² + 200² + 50²) = {glue:text}`clip_norm_prose`. With max_norm=1.0, we compute clip_coef = 1.0 / {glue:text}`clip_norm_prose` = {glue:text}`clip_coef_prose` and scale all gradients: [{glue:text}`clip_g1_prose`, {glue:text}`clip_g2_prose`, {glue:text}`clip_g3_prose`]. The new norm is exactly 1.0, but the relative magnitudes are preserved (the second gradient is still twice the first).

This uniform scaling is crucial. If we clipped each gradient independently to 1.0, we'd get `[1.0, 1.0, 1.0]`, destroying the information that the second parameter needs larger updates than the first. Global norm clipping prevents explosions while respecting the gradient's message about relative importance.
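
The contrast is quick to demonstrate. A small sketch (assuming NumPy) of both strategies on the gradients from the example:

```python
import numpy as np

g = np.array([100.0, 200.0, 50.0])

# (a) individual clipping: every component saturates at 1.0
per_element = np.clip(g, -1.0, 1.0)        # [1. 1. 1.]

# (b) global-norm clipping: scale uniformly so the total norm equals max_norm
max_norm = 1.0
coef = min(1.0, max_norm / np.linalg.norm(g))
global_clipped = g * coef                  # [0.436 0.873 0.218]

print(per_element, global_clipped, np.linalg.norm(global_clipped))
```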

@@ -691,6 +761,44 @@ After loading, training resumes as if the interruption never happened. The next

### Computational Complexity

```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue

# Computational complexity: 2-layer network (d=512), 10k samples, batch 32, 100 epochs
d = 512
L = 2
N = 10_000
B = 32
E = 100

d_sq_L = d**2 * L
glue("comp_d_sq_L", f"{d_sq_L:,}")

batch_ops = d_sq_L * B
batch_ops_M = batch_ops / 1e6
glue("comp_batch_ops", f"{batch_ops_M:.1f}")

iters_per_epoch = math.ceil(N / B)
glue("comp_iters_per_epoch", f"{iters_per_epoch:,}")

total_iters = iters_per_epoch * E
glue("comp_total_iters", f"{total_iters:,}")

total_ops = total_iters * batch_ops
total_ops_B = total_ops / 1e9
glue("comp_total_ops_B", f"{total_ops_B:.0f}")

cpu_seconds = total_ops / 1e9
cpu_minutes = cpu_seconds / 60
glue("comp_cpu_seconds", f"{cpu_seconds:.0f}")
glue("comp_cpu_minutes", f"{cpu_minutes:.0f}")

gpu_seconds = total_ops / 1e12
glue("comp_gpu_seconds", f"{gpu_seconds:.1f}")
```

Training complexity depends on model architecture and dataset size. For a simple fully connected network with L layers of size d, each forward pass is O(d² × L) (matrix multiplications dominate). Backward pass has the same complexity (automatic differentiation revisits each operation). With N training samples and batch size B, one epoch requires N/B iterations.

Total training cost for E epochs:
@@ -704,15 +812,13 @@ Total complexity: O((N × E × d² × L) / B)

Real numbers make this concrete. Training a 2-layer network (d=512) on 10,000 samples (batch size 32) for 100 epochs:

```
d² × L = 512² × 2 = 524,288 operations per sample
Batch operations = 524,288 × 32 = 16.8 million ops
Iterations per epoch = 10,000 / 32 = 313
Total iterations = 313 × 100 = 31,300
Total operations = 31,300 × 16.8M = 525 billion operations
```
d² × L = 512² × 2 = {glue:text}`comp_d_sq_L` operations per sample
Batch operations = {glue:text}`comp_d_sq_L` × 32 = {glue:text}`comp_batch_ops` million ops
Iterations per epoch = 10,000 / 32 = {glue:text}`comp_iters_per_epoch`
Total iterations = {glue:text}`comp_iters_per_epoch` × 100 = {glue:text}`comp_total_iters`
Total operations = {glue:text}`comp_total_iters` × {glue:text}`comp_batch_ops`M = {glue:text}`comp_total_ops_B` billion operations

At 1 billion operations per second (typical CPU), that's 525 seconds (9 minutes). This arithmetic explains why GPUs matter: a GPU at 1 trillion ops/second (1000× faster) completes this in 0.5 seconds.
At 1 billion operations per second (typical CPU), that's {glue:text}`comp_cpu_seconds` seconds ({glue:text}`comp_cpu_minutes` minutes). This arithmetic explains why GPUs matter: a GPU at 1 trillion ops/second (1000× faster) completes this in {glue:text}`comp_gpu_seconds` seconds.

Memory complexity is simpler but just as important:

@@ -801,11 +907,23 @@ The core training loop pattern: forward pass → loss → backward → gradient

### Why Training Infrastructure Matters at Scale

```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue

# ImageNet training time: 1.2M images, batch 256, 90 epochs, 250ms/iter
imagenet_iters_per_epoch = math.ceil(1_200_000 / 256)
imagenet_total_iters = imagenet_iters_per_epoch * 90
imagenet_hours = imagenet_total_iters * 0.250 / 3600
glue("prod_imagenet_hours", f"{imagenet_hours:.0f}")
```

To appreciate the engineering behind training systems, consider production scale:

- **GPT-3 training**: 175 billion parameters, trained on 300 billion tokens, cost ~$4.6 million in compute time. A single checkpoint is **350 GB** (larger than most hard drives). Checkpoint frequency must balance fault tolerance against storage costs.

- **ImageNet training**: 1.2 million images, 90 epochs standard. At 250ms per iteration (batch size 256), that's **29 hours** on one GPU. Learning rate scheduling is the difference between 75% accuracy (poor) and 76.5% accuracy (state-of-the-art).
- **ImageNet training**: 1.2 million images, 90 epochs standard. At 250ms per iteration (batch size 256), that's **{glue:text}`prod_imagenet_hours` hours** on one GPU. Learning rate scheduling is the difference between 75% accuracy (poor) and 76.5% accuracy (state-of-the-art).

- **Training instability**: Without gradient clipping, 1 in 50 training runs randomly diverges (gradients explode, model outputs NaN, all progress lost). Production systems can't tolerate 2% failure rates when runs cost thousands of dollars.

@@ -819,22 +937,47 @@ Test yourself with these systems thinking questions. They build intuition for th

You have a model with 10 million parameters (float32) and use Adam optimizer. Estimate total training memory required: parameters + gradients + optimizer state. Then compare with SGD optimizer.

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q1: Training memory for 10M param model
num_params = 10_000_000
bytes_per_param = 4  # float32

param_mb = num_params * bytes_per_param / 1e6
glue("q1_param_mb", f"{param_mb:.0f}")

adam_moments_mb = num_params * 2 * bytes_per_param / 1e6
glue("q1_adam_moments_mb", f"{adam_moments_mb:.0f}")

adam_total_mb = param_mb + param_mb + adam_moments_mb  # params + grads + 2 moments
glue("q1_adam_total_mb", f"{adam_total_mb:.0f}")

sgd_momentum_mb = param_mb  # one momentum buffer
sgd_total_mb = param_mb + param_mb + sgd_momentum_mb  # params + grads + momentum
glue("q1_sgd_total_mb", f"{sgd_total_mb:.0f}")

mem_diff_pct = (adam_total_mb - sgd_total_mb) / sgd_total_mb * 100
glue("q1_mem_diff_pct", f"{mem_diff_pct:.0f}")
```

```{admonition} Answer
:class: dropdown

**Adam optimizer:**
- Parameters: 10M × 4 bytes = **40 MB**
- Gradients: 10M × 4 bytes = **40 MB**
- Adam state (two moments): 10M × 2 × 4 bytes = **80 MB**
- **Total: 160 MB** (4× parameter size)
- Parameters: 10M × 4 bytes = **{glue:text}`q1_param_mb` MB**
- Gradients: 10M × 4 bytes = **{glue:text}`q1_param_mb` MB**
- Adam state (two moments): 10M × 2 × 4 bytes = **{glue:text}`q1_adam_moments_mb` MB**
- **Total: {glue:text}`q1_adam_total_mb` MB** (4× parameter size)

**SGD with momentum:**
- Parameters: 10M × 4 bytes = **40 MB**
- Gradients: 10M × 4 bytes = **40 MB**
- Momentum buffer: 10M × 4 bytes = **40 MB**
- **Total: 120 MB** (3× parameter size)
- Parameters: 10M × 4 bytes = **{glue:text}`q1_param_mb` MB**
- Gradients: 10M × 4 bytes = **{glue:text}`q1_param_mb` MB**
- Momentum buffer: 10M × 4 bytes = **{glue:text}`q1_param_mb` MB**
- **Total: {glue:text}`q1_sgd_total_mb` MB** (3× parameter size)

**Key insight:** Optimizer choice affects memory by 33%. For large models near GPU memory limits, SGD may be the only option.
**Key insight:** Optimizer choice affects memory by {glue:text}`q1_mem_diff_pct`%. For large models near GPU memory limits, SGD may be the only option.
```

**Q2: Gradient Accumulation Trade-off**
@@ -879,9 +1022,26 @@ Starting high (0.1) provides fast early progress. Gradual decay (0.1 → 0.01) a

**Q4: Checkpoint Storage Strategy**

You're training for 100 epochs. Each checkpoint is 1 GB. Checkpointing every epoch creates 100 GB of storage. Checkpointing every 10 epochs risks losing 10 epochs of work if training crashes. Design a checkpointing strategy that balances fault tolerance and storage costs.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

```{admonition} Answer
# Q4: Checkpoint storage strategy
epochs = 100
ckpt_size_gb = 1
total_every_epoch_gb = epochs * ckpt_size_gb
glue("q4_total_every_epoch_gb", f"{total_every_epoch_gb:,}")

last_n = 3
best = 1
milestones = 3  # every 25 epochs: 25, 50, 75
smart_total_gb = (last_n + best + milestones) * ckpt_size_gb
glue("q4_smart_total_gb", f"{smart_total_gb:,}")
```

You're training for 100 epochs. Each checkpoint is 1 GB. Checkpointing every epoch creates {glue:text}`q4_total_every_epoch_gb` GB of storage. Checkpointing every 10 epochs risks losing 10 epochs of work if training crashes. Design a checkpointing strategy that balances fault tolerance and storage costs.

````{admonition} Answer
:class: dropdown

**Strategy: Keep last N + best + milestones**
@@ -890,7 +1050,7 @@ You're training for 100 epochs. Each checkpoint is 1 GB. Checkpointing every epo
2. **Keep best checkpoint** (lowest validation loss): `best_epoch_72.pkl` (1 GB)
3. **Keep milestone checkpoints** (every 25 epochs): `epoch_25.pkl`, `epoch_50.pkl`, `epoch_75.pkl` (3 GB)

**Total storage: 7 GB** (vs 100 GB for every epoch)
**Total storage: {glue:text}`q4_smart_total_gb` GB** (vs {glue:text}`q4_total_every_epoch_gb` GB for every epoch)

**Fault tolerance:**
- Last 3 checkpoints: Lose at most 1 epoch of work
@@ -908,11 +1068,33 @@ if is_best_validation: # Best
```

**Production systems** use this strategy plus cloud storage for off-site backup.
```
````
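
A sketch of how the retention policy might be automated (the file names and helper below are hypothetical; the "best" checkpoint survives because it lives under a different prefix):

```python
import os

def prune_checkpoints(ckpt_dir, epoch, keep_last=3, milestone_every=25):
    """Keep the last N epoch checkpoints plus every milestone; drop the rest."""
    keep = {f"epoch_{e}.pkl" for e in range(max(1, epoch - keep_last + 1), epoch + 1)}
    keep |= {f"epoch_{e}.pkl" for e in range(milestone_every, epoch + 1, milestone_every)}
    for name in os.listdir(ckpt_dir):
        if name.startswith("epoch_") and name not in keep:
            os.remove(os.path.join(ckpt_dir, name))
```

Called after saving each epoch's checkpoint, this keeps storage at the ~7 GB level computed above instead of growing linearly with epochs.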

**Q5: Global Norm Clipping Analysis**

Two training runs: (A) clips each gradient individually to max 1.0, (B) clips by global norm (max_norm=1.0). Both encounter gradients `[50, 100, 5]` with global norm `√(50² + 100² + 5²) ≈ 112`. What are the clipped gradients in each case? Which preserves gradient direction better?
```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue

# Q5: Global norm clipping analysis with gradients [50, 100, 5]
grads_q5 = [50, 100, 5]
global_norm_q5 = math.sqrt(sum(g**2 for g in grads_q5))
glue("q5_global_norm", f"{global_norm_q5:.0f}")

scale_factor_q5 = 1.0 / global_norm_q5
glue("q5_scale_factor", f"{scale_factor_q5:.4f}")

clipped_q5 = [g * scale_factor_q5 for g in grads_q5]
glue("q5_clipped_g1", f"{clipped_q5[0]:.2f}")
glue("q5_clipped_g2", f"{clipped_q5[1]:.2f}")
glue("q5_clipped_g3", f"{clipped_q5[2]:.2f}")

verify_norm_q5 = math.sqrt(sum(g**2 for g in clipped_q5))
glue("q5_verify_norm", f"{verify_norm_q5:.1f}")
```

Two training runs: (A) clips each gradient individually to max 1.0, (B) clips by global norm (max_norm=1.0). Both encounter gradients [50, 100, 5] with global norm √(50² + 100² + 5²) ≈ {glue:text}`q5_global_norm`. What are the clipped gradients in each case? Which preserves gradient direction better?

```{admonition} Answer
:class: dropdown
@@ -923,16 +1105,16 @@ Two training runs: (A) clips each gradient individually to max 1.0, (B) clips by
- **Result:** All parameters get equal updates (destroys relative importance information)

**(B) Global norm clipping** (scale uniformly):
- Original: `[50, 100, 5]`, global norm ≈ 112
- Scale factor: `1.0 / 112 ≈ 0.0089`
- Clipped: `[0.45, 0.89, 0.04]`
- New global norm: **1.0** (exactly max_norm)
- Original: [50, 100, 5], global norm ≈ {glue:text}`q5_global_norm`
- Scale factor: 1.0 / {glue:text}`q5_global_norm` ≈ {glue:text}`q5_scale_factor`
- Clipped: [{glue:text}`q5_clipped_g1`, {glue:text}`q5_clipped_g2`, {glue:text}`q5_clipped_g3`]
- New global norm: **{glue:text}`q5_verify_norm`** (exactly max_norm)
- **Result:** Relative magnitudes preserved (second parameter still gets 2× update of first)

**Why (B) is better:**
Gradients encode relative importance: parameter 2 needs larger updates than parameter 1. Global norm clipping prevents explosion while respecting this information. Individual clipping destroys it, effectively treating all parameters as equally important.

**Verification:** `√(0.45² + 0.89² + 0.04²) ≈ 1.0` ✓
**Verification:** √({glue:text}`q5_clipped_g1`² + {glue:text}`q5_clipped_g2`² + {glue:text}`q5_clipped_g3`²) ≈ {glue:text}`q5_verify_norm` ✓
```

## Further Reading
@@ -971,7 +1153,7 @@ Implement Conv2d, MaxPool2d, and Flatten layers to build convolutional neural ne

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/08_training/08_training.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/08_training/training.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/08_training/08_training.py)** - Browse the implementation code
```


@@ -1,3 +1,9 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 09: Convolutions

:::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F09_convolutions%2F09_convolutions.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F09_convolutions%2Fconvolutions.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -512,7 +518,16 @@ def forward(self, x):
                    output[b, out_ch, out_h, out_w] = conv_sum
```

The seven nested loops reveal where the computational cost comes from. For a typical CNN layer processing a batch of 32 RGB images (224×224) with 64 output channels and 3×3 kernels, this structure executes **2.8 billion multiply-accumulate operations** per forward pass. This is why optimized implementations matter.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Prose: typical CNN layer ops (batch=32, 3 RGB channels, 224x224, 64 output channels, 3x3 kernel)
conv_ops_prose = 32 * 64 * 224 * 224 * 3 * 3 * 3
glue("conv_ops_billions", f"{conv_ops_prose / 1e9:.1f} billion")
```

The seven nested loops reveal where the computational cost comes from. For a typical CNN layer processing a batch of 32 RGB images (224×224) with 64 output channels and 3×3 kernels, this structure executes **{glue:text}`conv_ops_billions` multiply-accumulate operations** per forward pass. This is why optimized implementations matter.

Each output pixel summarizes information from a local neighborhood in the input. A 3×3 convolution looks at 9 pixels to produce each output value, enabling the network to detect local patterns like edges, corners, and textures.
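
Concretely, each output value is a dot product between the kernel and one input neighborhood. A tiny illustrative sketch (assuming NumPy):

```python
import numpy as np

# One output pixel of a 3×3 convolution: multiply the kernel against the
# 3×3 patch it covers and sum the products.
patch = np.arange(9, dtype=np.float32).reshape(3, 3)   # local neighborhood
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=np.float32)      # vertical edge detector

out_pixel = float((patch * kernel).sum())
print(out_pixel)  # -6.0: responds to the left-to-right intensity ramp
```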

@@ -542,6 +557,17 @@ Output: 3×3              Output: 5×5
5×5 output preserved
```

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Stride and padding output dimension examples
stride1_out = (224 + 2 * 1 - 3) // 1 + 1
stride2_out = (224 + 2 * 1 - 3) // 2 + 1
glue("stride1_out", f"{stride1_out}")
glue("stride2_out", f"{stride2_out}")
```

The formula connecting these parameters is:

```
@@ -549,14 +575,12 @@ output_size = (input_size + 2×padding - kernel_size) / stride + 1
```

For a 224×224 input with kernel=3, padding=1, stride=1:
```
output_size = (224 + 2×1 - 3) / 1 + 1 = 224
```

output_size = (224 + 2×1 - 3) / 1 + 1 = {glue:text}`stride1_out`

For the same input with stride=2:
```
output_size = (224 + 2×1 - 3) / 2 + 1 = 112
```

output_size = (224 + 2×1 - 3) / 2 + 1 = {glue:text}`stride2_out`
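
Since this formula comes up at every layer, a tiny helper is worth keeping around (a hypothetical convenience function that just mirrors the arithmetic above):

```python
def conv_output_size(input_size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Apply output_size = (input + 2*padding - kernel) // stride + 1."""
    return (input_size + 2 * padding - kernel) // stride + 1

print(conv_output_size(224, kernel=3, padding=1, stride=1))  # 224
print(conv_output_size(224, kernel=3, padding=1, stride=2))  # 112
```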

### Receptive Fields

@@ -616,7 +640,20 @@ def forward(self, x):
                    output[b, c, out_h, out_w] = max_val
```

A 2×2 max pooling with stride=2 divides spatial dimensions by 2, reducing memory and computation by 4×. For a 224×224×64 feature map (12.8 MB), pooling produces 112×112×64 (3.2 MB), saving 9.6 MB.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Pooling memory example: 224x224x64 feature map -> 112x112x64 after 2x2 pool
pool_before_bytes = 224 * 224 * 64 * 4
pool_after_bytes = 112 * 112 * 64 * 4
pool_saved_bytes = pool_before_bytes - pool_after_bytes
glue("pool_before_mb", f"{pool_before_bytes / 1024**2:.1f} MB")
glue("pool_after_mb", f"{pool_after_bytes / 1024**2:.1f} MB")
glue("pool_saved_mb", f"{pool_saved_bytes / 1024**2:.1f} MB")
```

A 2×2 max pooling with stride=2 divides spatial dimensions by 2, reducing memory and computation by 4×. For a 224×224×64 feature map ({glue:text}`pool_before_mb`), pooling produces 112×112×64 ({glue:text}`pool_after_mb`), saving {glue:text}`pool_saved_mb`.

Max pooling provides translation invariance: if a cat's ear moves one pixel, the max in that region remains roughly the same, making the network robust to small shifts. This is crucial for object recognition where precise pixel alignment doesn't matter.
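
That robustness is easy to see in miniature (illustrative NumPy sketch): shifting the strongest activation one pixel inside a pooling window leaves the pooled value unchanged.

```python
import numpy as np

window = np.array([[0.1, 0.9],
                   [0.2, 0.3]])
shifted = np.array([[0.9, 0.1],
                    [0.3, 0.2]])   # same values, moved one pixel

print(window.max(), shifted.max())  # 0.9 0.9, identical pooled output
```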

@@ -635,26 +672,41 @@ W_out = ⌊(W_in + 2×padding - kernel_w) / stride⌋ + 1

The floor operation (⌊⌋) ensures integer dimensions. If the calculation doesn't divide evenly, the rightmost and bottommost regions get ignored.

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Output shape example 1: Conv2d(3, 64, k=3, p=1, s=1) on 224x224
ex1_h = (224 + 2 * 1 - 3) // 1 + 1
glue("ex1_h", f"{ex1_h}")

# Output shape example 2: MaxPool2d(k=2, s=2) on 224x224
ex2_h = (224 + 0 - 2) // 2 + 1
glue("ex2_h", f"{ex2_h}")

# Output shape example 3: Conv2d(64, 128, k=3, p=0, s=2) on 112x112
ex3_h = (112 + 0 - 3) // 2 + 1
glue("ex3_h", f"{ex3_h}")
```

**Example calculations:**

```
Input: (32, 3, 224, 224)  [batch=32, RGB channels, 224×224 image]

Conv2d(3, 64, kernel_size=3, padding=1, stride=1):
  H_out = (224 + 2×1 - 3) / 1 + 1 = 224
  W_out = (224 + 2×1 - 3) / 1 + 1 = 224
  Output: (32, 64, 224, 224)
  H_out = (224 + 2×1 - 3) / 1 + 1 = {glue:text}`ex1_h`
  W_out = (224 + 2×1 - 3) / 1 + 1 = {glue:text}`ex1_h`
  Output: (32, 64, {glue:text}`ex1_h`, {glue:text}`ex1_h`)

MaxPool2d(kernel_size=2, stride=2):
  H_out = (224 + 0 - 2) / 2 + 1 = 112
  W_out = (224 + 0 - 2) / 2 + 1 = 112
  Output: (32, 64, 112, 112)
  H_out = (224 + 0 - 2) / 2 + 1 = {glue:text}`ex2_h`
  W_out = (224 + 0 - 2) / 2 + 1 = {glue:text}`ex2_h`
  Output: (32, 64, {glue:text}`ex2_h`, {glue:text}`ex2_h`)

Conv2d(64, 128, kernel_size=3, padding=0, stride=2):
  H_out = (112 + 0 - 3) / 2 + 1 = 55
  W_out = (112 + 0 - 3) / 2 + 1 = 55
  Output: (32, 128, 55, 55)
```
  H_out = (112 + 0 - 3) / 2 + 1 = {glue:text}`ex3_h`
  W_out = (112 + 0 - 3) / 2 + 1 = {glue:text}`ex3_h`
  Output: (32, 128, {glue:text}`ex3_h`, {glue:text}`ex3_h`)

**Common patterns:**
- **Same convolution** (padding=1, stride=1, kernel=3): Preserves spatial dimensions
@@ -670,15 +722,37 @@ For a single Conv2d forward pass:
Operations = B × C_out × H_out × W_out × C_in × K_h × K_w
```

**Example:** Batch=32, Input=(3, 224, 224), Conv2d(3→64, kernel=3, padding=1, stride=1)
```
Operations = 32 × 64 × 224 × 224 × 3 × 3 × 3
           = 32 × 64 × 50,176 × 27
           = 2,764,800,000 multiply-accumulate operations
           ≈ 2.8 billion operations per forward pass!
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Complexity example: Batch=32, Input=(3, 224, 224), Conv2d(3->64, k=3, p=1, s=1)
complexity_h_out = 224  # (224 + 2*1 - 3) // 1 + 1
complexity_hw = complexity_h_out * complexity_h_out
complexity_kernel = 3 * 3 * 3
complexity_ops = 32 * 64 * complexity_hw * complexity_kernel
glue("complexity_hw", f"{complexity_hw:,}")
glue("complexity_kernel", f"{complexity_kernel}")
glue("complexity_ops", f"{complexity_ops:,}")
glue("complexity_approx", f"{complexity_ops / 1e9:.1f}")

# 7x7 vs 3x3 kernel ratio
kernel_ratio = (7 * 7) / (3 * 3)
glue("kernel_ratio", f"{kernel_ratio:.1f}")

# Memory for (32, 64, 224, 224) float32 tensor
mem_bytes = 32 * 64 * 224 * 224 * 4
glue("tensor_mem_mb", f"{mem_bytes / 1024**2:.0f} MB")
```

This is why kernel size matters enormously. A 7×7 kernel requires (7×7)/(3×3) = 5.4× more computation than 3×3. Modern architectures favor stacking multiple 3×3 convolutions instead of using large kernels.
**Example:** Batch=32, Input=(3, 224, 224), Conv2d(3→64, kernel=3, padding=1, stride=1)

Operations = 32 × 64 × 224 × 224 × 3 × 3 × 3
= 32 × 64 × {glue:text}`complexity_hw` × {glue:text}`complexity_kernel`
= {glue:text}`complexity_ops` multiply-accumulate operations
≈ {glue:text}`complexity_approx` billion operations per forward pass!

This is why kernel size matters enormously. A 7×7 kernel requires (7×7)/(3×3) = {glue:text}`kernel_ratio`× more computation than 3×3. Modern architectures favor stacking multiple 3×3 convolutions instead of using large kernels.

Pooling operations are cheap by comparison: no learnable parameters, just comparison or addition operations. A 2×2 max pooling visits each output position once and compares 4 values, requiring only 4× comparisons per output.

@@ -689,9 +763,8 @@ Pooling operations are cheap by comparison: no learnable parameters, just compar

| AvgPool2d (K×K) | O(B×C×H×W×K²) | Same as MaxPool but with addition |

Memory consumption follows the output shape. A (32, 64, 224, 224) float32 tensor requires:
```
32 × 64 × 224 × 224 × 4 bytes = 411 MB
```

32 × 64 × 224 × 224 × 4 bytes = {glue:text}`tensor_mem_mb`

This is why batch size matters: doubling batch size doubles memory usage. GPUs have limited memory (typically 8-24 GB), constraining how large your batches and feature maps can be.

@@ -713,12 +786,20 @@ Conv2d requires 4D input: (batch, channels, height, width). If you forget the ba

The floor operation in output dimension calculation can surprise you. If `(input + 2×padding - kernel) / stride` doesn't divide evenly, the result gets floored.

**Example**:
```python
# Input: 224×224, kernel=3, padding=0, stride=2
output_size = (224 + 0 - 3) // 2 + 1 = 221 // 2 + 1 = 110 + 1 = 111
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Dimension error example: 224x224, k=3, p=0, s=2
dim_err_result = (224 + 0 - 3) // 2 + 1
glue("dim_err_result", f"{dim_err_result}")
```

**Example**:

Input: 224×224, kernel=3, padding=0, stride=2
output_size = (224 + 0 - 3) // 2 + 1 = 221 // 2 + 1 = 110 + 1 = {glue:text}`dim_err_result`

**Fix**: Use calculators or test with dummy data to verify dimensions before building full architecture.

### Padding Value Confusion
@@ -744,7 +825,16 @@ By convention, pooling uses non-overlapping windows: `stride = kernel_size`. If

**Error**: `RuntimeError: CUDA out of memory` or system hangs

Large feature maps consume enormous memory. A batch of 64 images at 224×224×64 channels = 1.3 GB for a single layer's output. Deep networks with many layers can exceed GPU memory.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Memory overflow example: 64 images at 224x224x64 channels
overflow_bytes = 64 * 224 * 224 * 64 * 4
glue("overflow_gb", f"{overflow_bytes / 1024**3:.1f} GB")
```

Large feature maps consume enormous memory. A batch of 64 images at 224×224×64 channels = {glue:text}`overflow_gb` for a single layer's output. Deep networks with many layers can exceed GPU memory.

**Fix**: Reduce batch size, use smaller images, or add more pooling layers to reduce spatial dimensions faster.

@@ -838,6 +928,59 @@ Modern frameworks achieve this through:

## Check Your Understanding

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q1: Conv2d(3, 64, k=5, p=2, s=2) on (32, 3, 128, 128)
q1_h = (128 + 2 * 2 - 5) // 2 + 1
glue("q1_h", f"{q1_h}")

# Q2: Conv2d(3, 64, k=3, bias=True) parameter count
q2_weight = 64 * 3 * 3 * 3
q2_bias = 64
q2_total = q2_weight + q2_bias
glue("q2_weight", f"{q2_weight:,}")
glue("q2_bias", f"{q2_bias}")
glue("q2_total", f"{q2_total:,}")

# Q2: Dense layer comparison
q2_dense_input = 224 * 224 * 3
q2_dense_params = q2_dense_input * 64
q2_ratio = q2_dense_params // q2_total
glue("q2_dense_input", f"{q2_dense_input:,}")
glue("q2_dense_params", f"{q2_dense_params:,}")
glue("q2_ratio", f"{q2_ratio:,}")

# Q3: Conv2d(64, 128, k=3, p=1, s=1) on (16, 64, 56, 56)
q3_h = (56 + 2 * 1 - 3) // 1 + 1
q3_hw = q3_h * q3_h
q3_kernel_block = 64 * 3 * 3
q3_ops = 16 * 128 * q3_hw * q3_kernel_block
glue("q3_h", f"{q3_h}")
glue("q3_hw", f"{q3_hw:,}")
glue("q3_kernel_block", f"{q3_kernel_block}")
glue("q3_ops", f"{q3_ops:,}")
glue("q3_approx", f"{q3_ops / 1e9:.1f}")

# Q4: Conv2d(3, 256, k=7, s=2, p=3) on (64, 3, 224, 224)
q4_h = (224 + 2 * 3 - 7) // 2 + 1
q4_mem_bytes = 64 * 256 * q4_h * q4_h * 4
glue("q4_h", f"{q4_h}")
glue("q4_mem_bytes", f"{q4_mem_bytes:,}")
glue("q4_mem_mb", f"{q4_mem_bytes / 1024**2:.0f} MB")

# Q5: Receptive field growth
q5_rf1 = 3
q5_rf2 = q5_rf1 + (2 - 1) * 1  # MaxPool(2x2, s=2)
q5_rf3 = q5_rf2 + (3 - 1) * 2  # Conv(3x3, s=1), accumulated stride = 2
q5_rf4 = q5_rf3 + (3 - 1) * 2  # Conv(3x3, s=1), accumulated stride = 2
glue("q5_rf1", f"{q5_rf1}")
glue("q5_rf2", f"{q5_rf2}")
glue("q5_rf3", f"{q5_rf3}")
glue("q5_rf4", f"{q5_rf4}")
```

Test yourself with these systems thinking questions. They're designed to build intuition for the spatial operations and performance characteristics you'll encounter in real CNN architectures.
|
||||
|
||||
**Q1: Output Shape Calculation**
|
||||
@@ -848,12 +991,11 @@ Given input (32, 3, 128, 128), what's the output shape after Conv2d(3, 64, kerne
|
||||
:class: dropdown
|
||||
|
||||
Calculate height and width:
|
||||
```
|
||||
H_out = (128 + 2×2 - 5) / 2 + 1 = (128 + 4 - 5) / 2 + 1 = 127 / 2 + 1 = 63 + 1 = 64
|
||||
W_out = (128 + 2×2 - 5) / 2 + 1 = 64
|
||||
```
|
||||
|
||||
Output shape: **(32, 64, 64, 64)**
|
||||
H_out = (128 + 2×2 - 5) / 2 + 1 = (128 + 4 - 5) / 2 + 1 = 127 / 2 + 1 = 63 + 1 = {glue:text}`q1_h`
|
||||
W_out = (128 + 2×2 - 5) / 2 + 1 = {glue:text}`q1_h`
|
||||
|
||||
Output shape: **(32, 64, {glue:text}`q1_h`, {glue:text}`q1_h`)**
|
||||
|
||||
Batch and channels change (3→64), spatial dimensions halve due to stride=2.
|
||||
```

@@ -866,18 +1008,16 @@ How many parameters in Conv2d(3, 64, kernel_size=3, bias=True)?

:class: dropdown

Weight parameters: out_channels × in_channels × kernel_h × kernel_w
```
Weight: 64 × 3 × 3 × 3 = 1,728 parameters
Bias: 64 parameters
Total: 1,792 parameters
```

Weight: 64 × 3 × 3 × 3 = {glue:text}`q2_weight` parameters
Bias: {glue:text}`q2_bias` parameters
Total: {glue:text}`q2_total` parameters

Compare this to a fully connected layer for 224×224 RGB images:
```
Dense(224×224×3, 64) = 150,528 × 64 = 9,633,792 parameters!
```

Convolution achieves **5,373× fewer parameters** through parameter sharing!
Dense(224×224×3, 64) = {glue:text}`q2_dense_input` × 64 = {glue:text}`q2_dense_params` parameters!

Convolution achieves **{glue:text}`q2_ratio`× fewer parameters** through parameter sharing!
```

**Q3: Computational Complexity**

@@ -890,18 +1030,16 @@ For input (16, 64, 56, 56) and Conv2d(64, 128, kernel_size=3, padding=1, stride=

Operations = B × C_out × H_out × W_out × C_in × K_h × K_w

First calculate output dimensions:
```
H_out = (56 + 2×1 - 3) / 1 + 1 = 56
W_out = (56 + 2×1 - 3) / 1 + 1 = 56
```

H_out = (56 + 2×1 - 3) / 1 + 1 = {glue:text}`q3_h`
W_out = (56 + 2×1 - 3) / 1 + 1 = {glue:text}`q3_h`

Then total operations:
```
16 × 128 × 56 × 56 × 64 × 3 × 3
= 16 × 128 × 3,136 × 576
= 3,707,764,736 operations
≈ 3.7 billion operations per forward pass!
```

16 × 128 × {glue:text}`q3_h` × {glue:text}`q3_h` × 64 × 3 × 3
= 16 × 128 × {glue:text}`q3_hw` × {glue:text}`q3_kernel_block`
= {glue:text}`q3_ops` operations
≈ {glue:text}`q3_approx` billion operations per forward pass!

This is why batch size directly impacts training time: doubling the batch size doubles the operations.
```
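The operation count follows mechanically from the formula above; a minimal sketch of the arithmetic (helper name ours, counting one multiply-accumulate as one operation):

```python
def conv2d_ops(batch, c_in, c_out, h_out, w_out, k_h, k_w):
    # Dense 2D convolution forward pass: every output element needs
    # c_in * k_h * k_w multiply-accumulates.
    return batch * c_out * h_out * w_out * c_in * k_h * k_w

# (16, 64, 56, 56) input through Conv2d(64, 128, kernel_size=3, padding=1, stride=1)
print(conv2d_ops(16, 64, 128, 56, 56, 3, 3))  # 3699376128, i.e. ~3.7 billion
```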

@@ -914,18 +1052,16 @@ What's the memory requirement for storing the output of Conv2d(3, 256, kernel_si

:class: dropdown

First calculate output dimensions:
```
H_out = (224 + 2×3 - 7) / 2 + 1 = (224 + 6 - 7) / 2 + 1 = 223 / 2 + 1 = 111 + 1 = 112
W_out = 112
```

Output shape: (64, 256, 112, 112)
H_out = (224 + 2×3 - 7) / 2 + 1 = (224 + 6 - 7) / 2 + 1 = 223 / 2 + 1 = 111 + 1 = {glue:text}`q4_h`
W_out = {glue:text}`q4_h`

Output shape: (64, 256, {glue:text}`q4_h`, {glue:text}`q4_h`)

Memory (float32 = 4 bytes):
```
64 × 256 × 112 × 112 × 4 = 825,753,600 bytes
≈ 826 MB for a single layer's output!
```

64 × 256 × {glue:text}`q4_h` × {glue:text}`q4_h` × 4 = {glue:text}`q4_mem_bytes` bytes
≈ {glue:text}`q4_mem_mb` for a single layer's output!

This is why deep CNNs require GPUs with large memory (16+ GB). Storing activations for backpropagation across 50+ layers quickly exceeds memory limits.
```
@@ -939,14 +1075,14 @@ Starting with 224×224 input, you stack: Conv(3×3, stride=1) → MaxPool(2×2,

Track receptive field growth through each layer:

Layer 1 - Conv(3×3, stride=1): RF = 3
Layer 2 - MaxPool(2×2, stride=2): RF = 3 + (2-1)×1 = 4
Layer 3 - Conv(3×3, stride=1): RF = 4 + (3-1)×2 = 8 (stride accumulates)
Layer 4 - Conv(3×3, stride=1): RF = 8 + (3-1)×2 = 12
Layer 1 - Conv(3×3, stride=1): RF = {glue:text}`q5_rf1`
Layer 2 - MaxPool(2×2, stride=2): RF = {glue:text}`q5_rf1` + (2-1)×1 = {glue:text}`q5_rf2`
Layer 3 - Conv(3×3, stride=1): RF = {glue:text}`q5_rf2` + (3-1)×2 = {glue:text}`q5_rf3` (stride accumulates)
Layer 4 - Conv(3×3, stride=1): RF = {glue:text}`q5_rf3` + (3-1)×2 = {glue:text}`q5_rf4`

**Receptive field = 12×12**
**Receptive field = {glue:text}`q5_rf4`×{glue:text}`q5_rf4`**

Each neuron in the final layer sees a 12×12 region of the original input. This is why stacking layers with stride/pooling is crucial: it grows the receptive field so deeper layers can detect larger patterns.
Each neuron in the final layer sees a {glue:text}`q5_rf4`×{glue:text}`q5_rf4` region of the original input. This is why stacking layers with stride/pooling is crucial: it grows the receptive field so deeper layers can detect larger patterns.

Formula: RF_new = RF_old + (kernel_size - 1) × stride_product
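The recurrence rolls up into a few lines of Python; a minimal sketch (the `(kernel, stride)` tuples and function name are ours):

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, applied in order.
    rf, jump = 1, 1  # jump = accumulated stride (input pixels per feature-map step)
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Conv(3,s=1) -> MaxPool(2,s=2) -> Conv(3,s=1) -> Conv(3,s=1)
print(receptive_field([(3, 1), (2, 2), (3, 1), (3, 1)]))  # 12
```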

@@ -992,7 +1128,7 @@ Shift from spatial processing (images) to sequential processing (text). You'll i

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/09_convolutions/09_convolutions.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/09_convolutions/convolutions.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/09_convolutions/09_convolutions.py)** - Browse the implementation code
```

@@ -1,3 +1,9 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 10: Tokenization

:::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F10_tokenization%2F10_tokenization.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F10_tokenization%2Ftokenization.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -512,7 +518,18 @@ def build_vocab(self, corpus: List[str]) -> None:

The special `<UNK>` token at position 0 handles characters not in the vocabulary. When encoding text with unknown characters, they all map to ID 0. This graceful degradation prevents crashes while signaling that information was lost.
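A toy encoder makes the fallback concrete; a minimal sketch (the vocabulary literal and `encode` helper are illustrative, not the module's actual `build_vocab` output):

```python
vocab = {"<UNK>": 0, "h": 1, "i": 2, "!": 3}

def encode(text):
    # Unknown characters degrade gracefully to <UNK> (ID 0) instead of raising.
    return [vocab.get(ch, 0) for ch in text]

print(encode("hi!"))  # [1, 2, 3]
print(encode("hi?"))  # [1, 2, 0] -- '?' is unknown: information lost, no crash
```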

Character vocabularies are tiny (typically 50-200 tokens depending on language), which means small embedding tables. A 100-character vocabulary with 512-dimensional embeddings requires only 51,200 parameters, about 200 KB of memory. This is dramatically smaller than word-level vocabularies with 100,000+ entries.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Vocabulary building: 100 chars * 512 dim
vocab_params = 100 * 512
vocab_bytes = vocab_params * 4
glue("vocab_char_params", f"{vocab_params:,}")
glue("vocab_char_mem", f"{vocab_bytes / 1024:.0f} KB")
```

Character vocabularies are tiny (typically 50-200 tokens depending on language), which means small embedding tables. A 100-character vocabulary with 512-dimensional embeddings requires only {glue:text}`vocab_char_params` parameters, about {glue:text}`vocab_char_mem` of memory. This is dramatically smaller than word-level vocabularies with 100,000+ entries.

### Byte Pair Encoding (BPE)

@@ -657,11 +674,30 @@ Memory and computation scale oppositely:
**Embedding table memory** = vocabulary size × embedding dimension × bytes per parameter
**Sequence processing cost** = sequence length² × embedding dimension (for attention)

A character tokenizer with vocabulary 100 and embedding dimension 512 needs 100 × 512 × 4 = 204 KB for embeddings. But a 50-word sentence produces roughly 250 character tokens, requiring 250² = 62,500 attention computations per layer.

A BPE tokenizer with vocabulary 50,000 and embedding dimension 512 needs 50,000 × 512 × 4 = 102 MB for embeddings. But that same 50-word sentence might produce only 75 BPE tokens, requiring 75² = 5,625 attention computations per layer.

The attention cost savings (62,500 vs 5,625) dwarf the embedding memory cost (204 KB vs 102 MB) for models with multiple layers. This is why production language models use large vocabularies: the embedding table fits easily in memory, while shorter sequences dramatically reduce training and inference time.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Character tokenizer: vocab 100, dim 512, float32
char_embed_bytes = 100 * 512 * 4
char_seq_len = 250
char_attn = char_seq_len ** 2
glue("tradeoff_char_embed", f"{char_embed_bytes / 1024:.0f} KB")
glue("tradeoff_char_attn", f"{char_attn:,}")

# BPE tokenizer: vocab 50,000, dim 512, float32
bpe_embed_bytes = 50_000 * 512 * 4
bpe_seq_len = 75
bpe_attn = bpe_seq_len ** 2
glue("tradeoff_bpe_embed", f"{bpe_embed_bytes / 1024**2:.1f} MB")
glue("tradeoff_bpe_attn", f"{bpe_attn:,}")
```

A character tokenizer with vocabulary 100 and embedding dimension 512 needs 100 × 512 × 4 = {glue:text}`tradeoff_char_embed` for embeddings. But a 50-word sentence produces roughly 250 character tokens, requiring 250² = {glue:text}`tradeoff_char_attn` attention computations per layer.

A BPE tokenizer with vocabulary 50,000 and embedding dimension 512 needs 50,000 × 512 × 4 = {glue:text}`tradeoff_bpe_embed` for embeddings. But that same 50-word sentence might produce only 75 BPE tokens, requiring 75² = {glue:text}`tradeoff_bpe_attn` attention computations per layer.

The attention cost savings ({glue:text}`tradeoff_char_attn` vs {glue:text}`tradeoff_bpe_attn`) dwarf the embedding memory cost ({glue:text}`tradeoff_char_embed` vs {glue:text}`tradeoff_bpe_embed`) for models with multiple layers. This is why production language models use large vocabularies: the embedding table fits easily in memory, while shorter sequences dramatically reduce training and inference time.
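The two cost formulas are simple enough to evaluate side by side; a small sketch of the arithmetic (function name ours):

```python
def tokenizer_costs(vocab_size, embed_dim, seq_len, bytes_per_param=4):
    # Returns (embedding table size in bytes, attention scores per layer).
    return vocab_size * embed_dim * bytes_per_param, seq_len ** 2

char_embed, char_attn = tokenizer_costs(100, 512, seq_len=250)
bpe_embed, bpe_attn = tokenizer_costs(50_000, 512, seq_len=75)
print(char_embed, char_attn)  # 204800 62500   (~200 KB table, 62,500 scores)
print(bpe_embed, bpe_attn)    # 102400000 5625 (~100 MB table, 5,625 scores)
```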

Modern language models balance these factors:

@@ -748,11 +784,25 @@ The BPE algorithm, merge rule learning, vocabulary structure, and encode/decode

### Why Tokenization Matters at Scale

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# GPT-3 embedding table: 50,000 vocab * 12,288 dim * 4 bytes
gpt3_embed_bytes = 50_000 * 12_288 * 4
gpt3_embed_gb = gpt3_embed_bytes / 1024**3
gpt3_total_params = 175_000_000_000
gpt3_embed_params = 50_000 * 12_288
gpt3_pct = gpt3_embed_params / gpt3_total_params * 100
glue("gpt3_embed_gb", f"{gpt3_embed_gb:.2f} GB")
glue("gpt3_embed_pct", f"{gpt3_pct:.2f}%")
```

To appreciate why tokenization choices matter, consider the scale of modern systems:

- **GPT-3 training**: Processing 300 billion tokens required careful vocabulary selection. Using character tokenization would have increased sequence lengths by 3-4×, multiplying training time by 9-16× (quadratic attention cost).

- **Embedding table memory**: A 50,000 token vocabulary with 12,288-dimensional embeddings (GPT-3 size) requires 50,000 × 12,288 × 4 bytes = **2.4 GB** just for the embedding layer. This is ~0.14% of GPT-3's 175 billion total parameters, a reasonable fraction.
- **Embedding table memory**: A 50,000 token vocabulary with 12,288-dimensional embeddings (GPT-3 size) requires 50,000 × 12,288 × 4 bytes = **{glue:text}`gpt3_embed_gb`** just for the embedding layer. This is ~{glue:text}`gpt3_embed_pct` of GPT-3's 175 billion total parameters, a reasonable fraction.

- **Real-time inference**: Chatbots must tokenize user input in milliseconds. Python tokenizers take 5-20 ms per sentence; Rust tokenizers take 0.05-0.2 ms. At 1 million requests per day, this saves ~5 hours of compute time daily.

@@ -764,19 +814,42 @@ Test yourself with these systems thinking questions. They're designed to build i

You train a BPE tokenizer with `vocab_size=30,000` for a production model. If using 768-dimensional embeddings with float32 precision, how much memory does the embedding table require?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q1: 30,000 vocab * 768 dim * 4 bytes
q1_bytes = 30_000 * 768 * 4
q1_mb = q1_bytes / 1024**2
q1_params = 30_000 * 768
q1_params_m = q1_params / 1e6
q1_mem_mb = q1_params * 4 / 1024**2

# Doubling to 60K
q1_double_bytes = 60_000 * 768 * 4
q1_double_mb = q1_double_bytes / 1024**2

glue("q1_bytes", f"{q1_bytes:,}")
glue("q1_mb", f"{q1_mb:.2f} MB")
glue("q1_params", f"{q1_params:,}")
glue("q1_params_m", f"{q1_params_m:.2f}M")
glue("q1_mem_mb", f"{q1_mem_mb:.2f} MB")
glue("q1_double_mb", f"~{q1_double_mb:.0f} MB")
```

```{admonition} Answer
:class: dropdown

30,000 × 768 × 4 bytes = **92,160,000 bytes ≈ 92.16 MB**
30,000 × 768 × 4 bytes = **{glue:text}`q1_bytes` bytes ≈ {glue:text}`q1_mb`**

Breakdown:
- Vocabulary size: 30,000 tokens
- Embedding dimension: 768 (BERT-base size)
- Float32: 4 bytes per parameter
- Total parameters: 30,000 × 768 = 23,040,000
- Memory: 23.04M × 4 = 92.16 MB
- Total parameters: 30,000 × 768 = {glue:text}`q1_params`
- Memory: {glue:text}`q1_params_m` × 4 = {glue:text}`q1_mem_mb`

This is why vocabulary size matters! Doubling to 60K vocab would double embedding memory to ~184 MB.
This is why vocabulary size matters! Doubling to 60K vocab would double embedding memory to {glue:text}`q1_double_mb`.
```

**Q2: Sequence Length Trade-offs**

@@ -786,37 +859,80 @@ A sentence contains 200 characters. With character tokenization it produces 200
- How many attention computations for character tokenization per batch?
- How many for BPE tokenization per batch?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q2: attention computations
q2_char_seq = 200
q2_char_attn = q2_char_seq ** 2
q2_batch = 32
q2_char_total = q2_batch * q2_char_attn

q2_bpe_seq = 50
q2_bpe_attn = q2_bpe_seq ** 2
q2_bpe_total = q2_batch * q2_bpe_attn

q2_speedup = q2_char_attn // q2_bpe_attn

glue("q2_char_attn", f"{q2_char_attn:,}")
glue("q2_char_total", f"{q2_char_total:,}")
glue("q2_bpe_attn", f"{q2_bpe_attn:,}")
glue("q2_bpe_total", f"{q2_bpe_total:,}")
glue("q2_speedup", f"{q2_speedup}×")
```

```{admonition} Answer
:class: dropdown

**Character tokenization:**
- Sequence length: 200 tokens
- Attention per sequence: 200² = 40,000 operations
- Attention per sequence: 200² = {glue:text}`q2_char_attn` operations
- Batch size: 32
- Total: 32 × 40,000 = **1,280,000 attention operations**
- Total: 32 × {glue:text}`q2_char_attn` = **{glue:text}`q2_char_total` attention operations**

**BPE tokenization:**
- Sequence length: 50 tokens (200 chars ÷ 4)
- Attention per sequence: 50² = 2,500 operations
- Attention per sequence: 50² = {glue:text}`q2_bpe_attn` operations
- Batch size: 32
- Total: 32 × 2,500 = **80,000 attention operations**
- Total: 32 × {glue:text}`q2_bpe_attn` = **{glue:text}`q2_bpe_total` attention operations**

BPE is **16× faster** for attention! This is why modern models use subword tokenization despite larger embedding tables.
BPE is **{glue:text}`q2_speedup` faster** for attention! This is why modern models use subword tokenization despite larger embedding tables.
```

**Q3: Unknown Token Handling**

Your BPE tokenizer encounters the word "supercalifragilistic" (not in training corpus). Character tokenizer maps it to 22 known tokens. BPE tokenizer decomposes it into subwords like `['super', 'cal', 'ifr', 'ag', 'il', 'istic']` (6 tokens). Which is better?
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q3: 'supercalifragilistic' character analysis
q3_word = "supercalifragilistic"
q3_char_count = len(q3_word)
q3_bpe_tokens = 6
q3_compression = q3_char_count / q3_bpe_tokens
q3_char_attn = q3_char_count ** 2
q3_bpe_attn = q3_bpe_tokens ** 2
q3_attn_ratio = q3_char_attn / q3_bpe_attn

glue("q3_char_count", f"{q3_char_count}")
glue("q3_compression", f"{q3_compression:.1f}×")
glue("q3_char_attn", f"{q3_char_attn}")
glue("q3_bpe_attn", f"{q3_bpe_attn}")
glue("q3_attn_ratio", f"{q3_attn_ratio:.0f}×")
```

Your BPE tokenizer encounters the word "supercalifragilistic" (not in training corpus). Character tokenizer maps it to {glue:text}`q3_char_count` known tokens. BPE tokenizer decomposes it into subwords like `['super', 'cal', 'ifr', 'ag', 'il', 'istic']` (6 tokens). Which is better?

```{admonition} Answer
:class: dropdown

**BPE is better for production:**

- **Efficiency**: 6 tokens vs 22 tokens = 3.7× shorter sequence
- **Efficiency**: 6 tokens vs {glue:text}`q3_char_count` tokens = {glue:text}`q3_compression` shorter sequence
- **Semantics**: Subwords like "super" and "istic" carry meaning; individual characters don't
- **Generalization**: Model learns that "super" prefix modifies meaning (superman, supermarket)
- **Memory**: 6² = 36 attention computations vs 22² = 484 (13× faster)
- **Memory**: {glue:text}`q3_bpe_attn` attention computations vs {glue:text}`q3_char_attn` ({glue:text}`q3_attn_ratio` faster)

**Character tokenization advantages:**
- **Perfect coverage**: Never maps to `<UNK>`, always recovers original text
@@ -833,16 +949,35 @@ You analyze two tokenizers on a 10,000 character corpus:

What's the compression ratio, and what does it tell you about efficiency?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q4: compression ratio
q4_char_tokens = 10_000
q4_bpe_tokens = 2_500
q4_ratio = q4_char_tokens / q4_bpe_tokens
q4_attn_speedup = int(q4_ratio ** 2)
q4_bpe_context_chars = 512 * int(q4_ratio)
# Context coverage at max length 512: the character tokenizer spans 512 chars
# (~100 words); BPE spans q4_bpe_context_chars (~400 words). Word counts are
# rough reference estimates, not computed values.

glue("q4_ratio", f"{q4_ratio:.1f}")
glue("q4_avg_chars", f"{int(q4_ratio)}")
glue("q4_attn_speedup", f"{q4_attn_speedup}×")
glue("q4_bpe_context_chars", f"{q4_bpe_context_chars:,}")
```

```{admonition} Answer
:class: dropdown

**Compression ratio: 10,000 ÷ 2,500 = 4.0**
**Compression ratio: 10,000 ÷ 2,500 = {glue:text}`q4_ratio`**

This means each BPE token represents an average of 4 characters.
This means each BPE token represents an average of {glue:text}`q4_avg_chars` characters.

**Efficiency implications:**
- **Sequence processing**: 4× shorter sequences = 16× faster attention (quadratic scaling)
- **Context window**: With max length 512, character tokenizer handles 512 chars (~100 words); BPE handles 2,048 chars (~400 words)
- **Sequence processing**: {glue:text}`q4_ratio`× shorter sequences = {glue:text}`q4_attn_speedup` faster attention (quadratic scaling)
- **Context window**: With max length 512, character tokenizer handles 512 chars (~100 words); BPE handles {glue:text}`q4_bpe_context_chars` chars (~400 words)
- **Information density**: Each BPE token carries more semantic information (subword vs character)

**Trade-off**: BPE vocabulary is ~100× larger (10K tokens vs 100), increasing embedding memory from ~200 KB to ~20 MB. This trade-off heavily favors BPE for models with multiple transformer layers where attention cost dominates.
@@ -852,14 +987,40 @@ This means each BPE token represents an average of 4 characters.

Training BPE on 1,000 words takes 100 ms. How long will 10,000 words take? What about 100,000 words?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q5: O(n^2) scaling
q5_base_ms = 100
q5_base_words = 1_000

q5_10k_scale = (10_000 / q5_base_words) ** 2
q5_10k_ms = q5_base_ms * q5_10k_scale
q5_10k_sec = q5_10k_ms / 1000

q5_100k_scale = (100_000 / q5_base_words) ** 2
q5_100k_ms = q5_base_ms * q5_100k_scale
q5_100k_sec = q5_100k_ms / 1000
q5_100k_min = q5_100k_sec / 60

glue("q5_10k_ms", f"{q5_10k_ms:,.0f}")
glue("q5_10k_sec", f"{q5_10k_sec:.0f}")
glue("q5_10k_factor", f"{q5_10k_scale:.0f}×")
glue("q5_100k_ms", f"{q5_100k_ms:,.0f}")
glue("q5_100k_sec", f"{q5_100k_sec:,.0f}")
glue("q5_100k_min", f"{q5_100k_min:.1f}")
glue("q5_100k_factor", f"{q5_100k_scale:,.0f}×")
```

```{admonition} Answer
:class: dropdown

BPE training scales approximately **O(n²)** where n is corpus size (due to repeated pair counting across the corpus).

- **1,000 words**: 100 ms (baseline)
- **10,000 words**: ~10,000 ms = 10 seconds (100× longer, due to 10² scaling)
- **100,000 words**: ~1,000,000 ms = 1,000 seconds ≈ **16.7 minutes** (10,000× longer)
- **10,000 words**: ~{glue:text}`q5_10k_ms` ms = {glue:text}`q5_10k_sec` seconds ({glue:text}`q5_10k_factor` longer, due to 10² scaling)
- **100,000 words**: ~{glue:text}`q5_100k_ms` ms = {glue:text}`q5_100k_sec` seconds ≈ **{glue:text}`q5_100k_min` minutes** ({glue:text}`q5_100k_factor` longer)

**Production strategies to handle this:**
- Sample a representative subset (~50K-100K sentences is usually sufficient)
@@ -908,7 +1069,7 @@ Convert your token IDs into learnable dense vector representations. You'll imple

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/10_tokenization/10_tokenization.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/10_tokenization/tokenization.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/10_tokenization/10_tokenization.py)** - Browse the implementation code
```

@@ -1,3 +1,102 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# --- Embedding Dimension Trade-offs section ---

# Prose: 50,000 vocab x 512 embed_dim memory (approximate)
tradeoffs_50k_512_bytes = 50000 * 512 * 4
tradeoffs_50k_512_mb = tradeoffs_50k_512_bytes / 1024**2
glue("tradeoffs_50k_512_mb", f"{tradeoffs_50k_512_mb:.0f} MB")

# Prose: doubled to 1024 embed_dim
tradeoffs_50k_1024_bytes = 50000 * 1024 * 4
tradeoffs_50k_1024_mb = tradeoffs_50k_1024_bytes / 1024**2
glue("tradeoffs_50k_1024_mb", f"{tradeoffs_50k_1024_mb:.0f} MB")

# GPT-3 embedding memory in prose
gpt3_embed_bytes = 50257 * 12288 * 4
gpt3_embed_gb = gpt3_embed_bytes / 1024**3
glue("tradeoffs_gpt3_embed_gb", f"{gpt3_embed_gb:.1f} GB")

# --- Production table ---

# Small BERT: 30,000 x 768
table_bert_bytes = 30000 * 768 * 4
table_bert_mb = table_bert_bytes / 1024**2
glue("table_bert_mb", f"{table_bert_mb:.0f} MB")

# GPT-2: 50,257 x 1,024
table_gpt2_bytes = 50257 * 1024 * 4
table_gpt2_mb = table_gpt2_bytes / 1024**2
glue("table_gpt2_mb", f"{table_gpt2_mb:.0f} MB")

# GPT-3: 50,257 x 12,288
table_gpt3_bytes = 50257 * 12288 * 4
table_gpt3_mb = table_gpt3_bytes / 1024**2
glue("table_gpt3_mb", f"{table_gpt3_mb:,.0f} MB")

# Large Transformer: 100,000 x 1,024
table_large_bytes = 100000 * 1024 * 4
table_large_mb = table_large_bytes / 1024**2
glue("table_large_mb", f"{table_large_mb:.0f} MB")

# --- Scale section ---

# GPT-3 parameter count
gpt3_params = 50257 * 12288
glue("scale_gpt3_params", f"{gpt3_params / 1e6:.0f} million parameters")
glue("scale_gpt3_gb", f"{gpt3_embed_gb:.1f} GB")

# Batch lookups: 32 sequences x 2048 tokens
batch_lookups = 32 * 2048
glue("scale_batch_lookups", f"{batch_lookups:,}")

# --- Q1: Memory Calculation ---

q1_params = 50000 * 512
q1_bytes = q1_params * 4
q1_mb = q1_bytes / 1024**2
glue("q1_total_bytes", f"{q1_bytes:,} bytes")
glue("q1_mb", f"{q1_mb:.1f} MB")
glue("q1_params", f"{q1_params:,}")
glue("q1_bytes_full", f"{q1_bytes:,}")

# --- Q2: Positional Encoding Memory ---

q2_params = 2048 * 512
q2_bytes = q2_params * 4
q2_mb = q2_bytes / 1024**2
glue("q2_bytes", f"{q2_bytes:,} bytes")
glue("q2_mb", f"{q2_mb:.1f} MB")
glue("q2_params", f"{q2_params:,}")

# GPT-3 learned PE memory
q2_gpt3_bytes = 2048 * 12288 * 4
q2_gpt3_mb = q2_gpt3_bytes / 1024**2
glue("q2_gpt3_pe_mb", f"{q2_gpt3_mb:.0f} MB")

# --- Q3: Lookup Complexity ---

q3_total = 32 * 128
glue("q3_total_lookups", f"{q3_total:,}")

# --- Q4: Embedding Dimension Scaling ---

q4_original_bytes = 50000 * 512 * 4
q4_original_mb = q4_original_bytes / 1024**2
q4_doubled_bytes = 50000 * 1024 * 4
q4_doubled_mb = q4_doubled_bytes / 1024**2
glue("q4_original_mb", f"{q4_original_mb:.0f} MB")
glue("q4_doubled_mb", f"{q4_doubled_mb:.0f} MB")
```

# Module 11: Embeddings

:::{admonition} Module Info
@@ -30,7 +129,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F11_embeddings%2F11_embeddings.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F11_embeddings%2Fembeddings.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -613,7 +712,7 @@ The trigonometric identity enables learning relative positions: `PE(pos+k)` can

The embedding dimension D controls the capacity of your learned representations. Larger D provides more expressiveness but costs memory and compute. The choice involves several interacting factors.

**Memory scaling**: Embedding tables scale as `vocab_size × embed_dim × 4 bytes` (for float32). A vocabulary of 50,000 tokens with 512-dimensional embeddings requires 100 MB. Double the dimension to 1024, and memory doubles to 200 MB. For large vocabularies, the embedding table often dominates total model memory. GPT-3's 50,257 token vocabulary with 12,288-dimensional embeddings uses approximately 2.4 GB just for token embeddings.
**Memory scaling**: Embedding tables scale as `vocab_size × embed_dim × 4 bytes` (for float32). A vocabulary of 50,000 tokens with 512-dimensional embeddings requires {glue:text}`tradeoffs_50k_512_mb`. Double the dimension to 1024, and memory doubles to {glue:text}`tradeoffs_50k_1024_mb`. For large vocabularies, the embedding table often dominates total model memory. GPT-3's 50,257 token vocabulary with 12,288-dimensional embeddings uses approximately {glue:text}`tradeoffs_gpt3_embed_gb` just for token embeddings.
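The scaling rule is easy to verify directly; a quick sketch (helper name ours):

```python
def embedding_table_mb(vocab_size, embed_dim, bytes_per_param=4):
    # Table size in binary megabytes: vocab_size x embed_dim x bytes per parameter.
    return vocab_size * embed_dim * bytes_per_param / 1024**2

print(embedding_table_mb(50_000, 512))   # ~97.7 MB
print(embedding_table_mb(50_000, 1024))  # ~195.3 MB -- doubles with embed_dim
```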

**Semantic capacity**: Higher dimensions allow finer-grained semantic distinctions. With 64 dimensions, you might capture basic categories (animals, actions, objects). With 512 dimensions, you can encode subtle relationships (synonyms, antonyms, part-of-speech, contextual variations). With 1024+ dimensions, you have capacity for highly nuanced semantic features discovered through training.

@@ -623,10 +722,10 @@ The embedding dimension D controls the capacity of your learned representations.

| Model | Vocabulary | Embed Dim | Embedding Memory |
|-------|-----------|-----------|------------------|
| Small BERT | 30,000 | 768 | 92 MB |
| GPT-2 | 50,257 | 1,024 | 206 MB |
| GPT-3 | 50,257 | 12,288 | 2,471 MB |
| Large Transformer | 100,000 | 1,024 | 410 MB |
| Small BERT | 30,000 | 768 | {glue:text}`table_bert_mb` |
| GPT-2 | 50,257 | 1,024 | {glue:text}`table_gpt2_mb` |
| GPT-3 | 50,257 | 12,288 | {glue:text}`table_gpt3_mb` |
| Large Transformer | 100,000 | 1,024 | {glue:text}`table_large_mb` |

The embedding dimension typically matches the model's hidden dimension since embeddings feed directly into the first transformer layer. You rarely see models with embedding dimension different from hidden dimension (though it's technically possible with a projection layer).

@@ -795,10 +894,10 @@ Embedding lookup semantics, gradient flow patterns, and the addition of position

To appreciate embedding systems, consider the scale of modern language models:

- **GPT-3 embeddings**: 50,257 token vocabulary × 12,288 dimensions = **618 million parameters** = 2.4 GB of memory (just for token embeddings, not counting position embeddings)
- **Lookup throughput**: Processing 32 sequences of 2048 tokens requires **65,536 embedding lookups** per batch. At 1000 batches per second (typical training), that's 65 million lookups per second.
- **GPT-3 embeddings**: 50,257 token vocabulary × 12,288 dimensions = **{glue:text}`scale_gpt3_params`** = {glue:text}`scale_gpt3_gb` of memory (just for token embeddings, not counting position embeddings)
- **Lookup throughput**: Processing 32 sequences of 2048 tokens requires **{glue:text}`scale_batch_lookups` embedding lookups** per batch. At 1000 batches per second (typical training), that's 65 million lookups per second.
- **Memory bandwidth**: Each lookup transfers 512-1024 dimensions × 4 bytes = **2-4 KB from RAM to cache**. At scale, memory bandwidth (not compute) becomes the bottleneck.
- **Gradient sparsity**: In a batch with 65,536 tokens, only a small fraction of the 50,257 vocabulary is accessed. Efficient training exploits this sparsity, updating only the accessed embeddings' gradients.
- **Gradient sparsity**: In a batch with {glue:text}`scale_batch_lookups` tokens, only a small fraction of the 50,257 vocabulary is accessed. Efficient training exploits this sparsity, updating only the accessed embeddings' gradients.

Modern transformer training spends approximately **10-15% of total time** in embedding operations (lookup + position encoding). The remaining 85-90% goes to attention and feedforward layers. However, embeddings consume **30-40% of model memory** for models with large vocabularies, making them critical for deployment.

@@ -813,12 +912,12 @@ An embedding layer has `vocab_size=50000` and `embed_dim=512`. How much memory d
```{admonition} Answer
:class: dropdown

50,000 × 512 × 4 bytes = **102,400,000 bytes = 97.7 MB**
50,000 × 512 × 4 bytes = **{glue:text}`q1_total_bytes` = {glue:text}`q1_mb`**

Calculation breakdown:
- Parameters: 50,000 × 512 = 25,600,000
- Memory: 25,600,000 × 4 bytes (float32) = 102,400,000 bytes
- In MB: 102,400,000 / (1024 × 1024) = 97.7 MB
- Parameters: 50,000 × 512 = {glue:text}`q1_params`
- Memory: {glue:text}`q1_params` × 4 bytes (float32) = {glue:text}`q1_bytes_full` bytes
- In MB: {glue:text}`q1_bytes_full` / (1024 × 1024) = {glue:text}`q1_mb`

This is why vocabulary size matters for model deployment!
```
@@ -830,11 +929,11 @@ Compare memory requirements for learned vs sinusoidal positional encoding with `
```{admonition} Answer
:class: dropdown

**Learned PE**: 2,048 × 512 × 4 = **4,194,304 bytes = 4.0 MB** (1,048,576 parameters)
**Learned PE**: 2,048 × 512 × 4 = **{glue:text}`q2_bytes` = {glue:text}`q2_mb`** ({glue:text}`q2_params` parameters)

**Sinusoidal PE**: **0 bytes** (0 parameters - computed mathematically)

For large models, learned PE adds significant memory. GPT-3 uses learned PE with 2048 positions × 12,288 dimensions = 100 MB additional memory. Some models use sinusoidal to save this memory.
For large models, learned PE adds significant memory. GPT-3 uses learned PE with 2048 positions × 12,288 dimensions = {glue:text}`q2_gpt3_pe_mb` additional memory. Some models use sinusoidal to save this memory.
```

**Q3: Lookup Complexity**
@@ -844,25 +943,25 @@ What is the time complexity of looking up embeddings for a batch of 32 sequences
```{admonition} Answer
:class: dropdown

**O(1) per token**, or **O(batch_size × seq_len)** = O(32 × 128) = O(4096) total
**O(1) per token**, or **O(batch_size × seq_len)** = O(32 × 128) = O({glue:text}`q3_total_lookups`) total

The lookup operation is constant time per token because it's just array indexing: `weight[token_id]`. For 4,096 tokens, you perform 4,096 constant-time lookups.
The lookup operation is constant time per token because it's just array indexing: `weight[token_id]`. For {glue:text}`q3_total_lookups` tokens, you perform {glue:text}`q3_total_lookups` constant-time lookups.

Importantly, vocabulary size does NOT affect lookup time. Looking up tokens from a 1,000 word vocabulary is the same speed as from a 100,000 word vocabulary (assuming cache effects are comparable). The memory access is direct indexing, not search.
```
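The constant-time claim is visible in NumPy fancy indexing, which is one way to sketch a minimal embedding lookup (shapes chosen to match the question; an illustration, not the module's implementation):

```python
import numpy as np

vocab_size, embed_dim = 50_000, 512
weight = np.random.randn(vocab_size, embed_dim).astype(np.float32)

token_ids = np.random.randint(0, vocab_size, size=(32, 128))  # batch of 32 sequences
embedded = weight[token_ids]  # one direct array index per token, no search
print(embedded.shape)  # (32, 128, 512)
```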

**Q4: Embedding Dimension Scaling**

You have an embedding layer with `vocab_size=50000, embed_dim=512` using 100 MB. If you double `embed_dim` to 1024, what happens to memory?
You have an embedding layer with `vocab_size=50000, embed_dim=512` using {glue:text}`q4_original_mb`. If you double `embed_dim` to 1024, what happens to memory?

```{admonition} Answer
:class: dropdown

Memory **doubles to 200 MB**
Memory **doubles to {glue:text}`q4_doubled_mb`**

Embedding memory scales linearly with embedding dimension:
- Original: 50,000 × 512 × 4 = 100 MB
- Doubled: 50,000 × 1,024 × 4 = 200 MB
- Original: 50,000 × 512 × 4 = {glue:text}`q4_original_mb`
- Doubled: 50,000 × 1,024 × 4 = {glue:text}`q4_doubled_mb`

This is why you can't arbitrarily increase embedding dimensions. Each doubling doubles memory and memory bandwidth requirements. Large models carefully balance embedding dimension against available memory.
```
@@ -871,7 +970,7 @@ This is why you can't arbitrarily increase embedding dimensions. Each doubling d

You trained a model with sinusoidal positional encoding and `max_seq_len=512`. Can you process sequences of length 1024 at inference time? What about with learned positional encoding?

```{admonition} Answer
````{admonition} Answer
:class: dropdown

**Sinusoidal PE: Yes** - can extrapolate to length 1024 (or any length)
@@ -889,7 +988,7 @@ Learned PE creates a fixed embedding table of shape `(max_seq_len, embed_dim)`.
- Truncate sequences to 512 tokens

This is why many production models use sinusoidal or relative positional encodings that can handle variable lengths.
```
````

## Further Reading

@@ -927,7 +1026,7 @@ Implement attention mechanisms that let embeddings interact with each other. You

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/11_embeddings/11_embeddings.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/11_embeddings/embeddings.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/11_embeddings/11_embeddings.py)** - Browse the implementation code
```

@@ -1,3 +1,121 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue

# --- Multi-Head Attention: head dimension ---
mha_embed_dim = 512
mha_num_heads = 8
mha_head_dim = mha_embed_dim // mha_num_heads
glue("mha_head_dim", f"{mha_embed_dim}/{mha_num_heads}={mha_head_dim}")

# --- Computational Complexity (prose): GPT-3 scale ---
complexity_seq = 2048
complexity_elements = complexity_seq ** 2
complexity_bytes = complexity_elements * 4
complexity_mb = complexity_bytes / 1024**2
complexity_gpt3_layers = 96
complexity_gpt3_attn_gb = complexity_gpt3_layers * complexity_mb / 1024

glue("complexity_elements", f"{complexity_elements:,}")
glue("complexity_mb", f"{complexity_mb:.0f} MB")
glue("complexity_gpt3_attn_gb", f"{complexity_gpt3_attn_gb:.1f} GB")

# --- Computational Complexity (prose): GPT-3 training (5x inference) ---
complexity_train_multiplier = 5
complexity_gpt3_train_gb = complexity_train_multiplier * complexity_gpt3_attn_gb
glue("complexity_gpt3_train_gb", f"~{complexity_gpt3_train_gb:.1f} GB")

# --- Computational Complexity (prose): GPT-4 estimate ---
complexity_gpt4_layers = 120
complexity_gpt4_ctx = 32768
complexity_gpt4_gb = (complexity_gpt4_layers * (complexity_gpt4_ctx ** 2) * 4) / 1024**3
glue("complexity_gpt4_gb", f"~{complexity_gpt4_gb:.0f} GB")

# --- Q1: Memory Calculation ---
q1_seq_a = 1024
q1_elements_a = q1_seq_a ** 2
q1_bytes_a = q1_elements_a * 4
q1_mb_a = q1_bytes_a / 1024**2

q1_seq_b = 2048
q1_elements_b = q1_seq_b ** 2
q1_bytes_b = q1_elements_b * 4
q1_mb_b = q1_bytes_b / 1024**2

q1_scale_factor = (q1_seq_b // q1_seq_a) ** 2
q1_gpt3_layers = 96
q1_gpt3_gb = q1_gpt3_layers * q1_mb_b / 1024

glue("q1_elements_a", f"{q1_elements_a:,}")
glue("q1_mb_a", f"{q1_mb_a:.1f} MB")
glue("q1_elements_b", f"{q1_elements_b:,}")
glue("q1_mb_b", f"{q1_mb_b:.1f} MB")
glue("q1_scale_factor", f"{q1_scale_factor}")
glue("q1_gpt3_layers", f"{q1_gpt3_layers}")
glue("q1_gpt3_mb_b", f"{q1_mb_b:.1f} MB")
glue("q1_gpt3_total_gb", f"{q1_gpt3_gb:.1f} GB")

# --- Q2: Attention Bottleneck ---
q2_d = 512
q2_d_squared = q2_d ** 2
q2_crossover = q2_d_squared // q2_d

glue("q2_d_squared", f"{q2_d_squared:,}")
glue("q2_crossover", f"{q2_crossover}")

# --- Q5: Gradient Memory ---
q5_multiplier = 5
q5_layers = 96
q5_attn_mb = q1_mb_b # reuse 2048-context value: 16.0 MB
q5_inference_gb = q5_layers * q5_attn_mb / 1024
q5_training_gb = q5_layers * q5_attn_mb * q5_multiplier / 1024

glue("q5_multiplier", f"{q5_multiplier}")
glue("q5_attn_mb", f"{q5_attn_mb:.0f} MB")
glue("q5_inference_gb", f"{q5_inference_gb:.1f} GB")
glue("q5_training_gb", f"{q5_training_gb:.1f} GB")
```

# Module 12: Attention

:::{admonition} Module Info
@@ -32,7 +117,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F12_attention%2F12_attention.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F12_attention%2Fattention.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -525,7 +610,7 @@ The mask addition is clever: for positions where `mask=0` (masked), we add -1e9

Single-head attention learns one similarity function between queries and keys. But sequences have multiple types of relationships: syntactic dependencies, semantic similarity, positional patterns, long-range coreference. Multi-head attention addresses this by running multiple attention mechanisms in parallel, each with different learned projections.

The key insight is splitting the embedding dimension across heads rather than duplicating it. For `embed_dim=512` and `num_heads=8`, each head operates on `512/8=64` dimensions. This keeps parameter count constant while allowing diverse specialization. One head might learn to focus on adjacent tokens (local syntax), another on semantically similar words (meaning), another on specific positional offsets (structured patterns).
The key insight is splitting the embedding dimension across heads rather than duplicating it. For `embed_dim=512` and `num_heads=8`, each head operates on {glue:text}`mha_head_dim` dimensions. This keeps parameter count constant while allowing diverse specialization. One head might learn to focus on adjacent tokens (local syntax), another on semantically similar words (meaning), another on specific positional offsets (structured patterns).

Your implementation handles this through reshape and transpose operations:
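The diff elides the code here; as a sketch of the idea in NumPy, assuming the (batch, seq, embed) layout used throughout these modules:

```python
import numpy as np

batch, seq, embed_dim, num_heads = 2, 10, 512, 8
head_dim = embed_dim // num_heads  # 64

x = np.random.randn(batch, seq, embed_dim).astype(np.float32)
# (batch, seq, embed) -> (batch, seq, heads, head_dim) -> (batch, heads, seq, head_dim)
heads = x.reshape(batch, seq, num_heads, head_dim).transpose(0, 2, 1, 3)
print(heads.shape)  # (2, 8, 10, 64) -- each head attends over its own 64 dimensions
```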
|
||||
|
||||
@@ -579,7 +664,7 @@ When combined with the masking logic in attention (adding -1e9 to masked scores
|
||||
|
||||
Attention's power comes from all-to-all connectivity: every position can attend to every other position. But this creates quadratic scaling in both computation and memory. For sequence length n, the attention matrix has n² elements. The vectorized `Q @ K^T` operation computes all n² similarity scores in one matrix multiplication, softmax normalizes n² values, and applying attention to values multiplies n² weights by the value vectors.
|
||||
|
||||
The memory cost is particularly severe. For GPT-3 with 2048-token context, a single attention matrix stores 2048² = 4,194,304 float32 values, requiring 16 MB. With 96 layers, attention matrices alone need 1.5 GB, excluding activations, gradients, and other tensors. This quadratic wall is why long-context AI remains an active research challenge.
|
||||
The memory cost is particularly severe. For GPT-3 with 2048-token context, a single attention matrix stores 2048² = {glue:text}`complexity_elements` float32 values, requiring {glue:text}`complexity_mb`. With 96 layers, attention matrices alone need {glue:text}`complexity_gpt3_attn_gb`, excluding activations, gradients, and other tensors. This quadratic wall is why long-context AI remains an active research challenge.
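The n² term shows up directly in the score matrix's shape; a small single-head sketch (NumPy, names ours):

```python
import numpy as np

n, d = 2048, 64  # sequence length, per-head dimension
Q = np.random.randn(n, d).astype(np.float32)
K = np.random.randn(n, d).astype(np.float32)

scores = (Q @ K.T) / np.sqrt(d)  # all-pairs similarities in one matmul
print(scores.shape)              # (2048, 2048)
print(scores.nbytes / 1024**2)   # 16.0 -- MB for one layer's attention matrix
```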

| Operation | Time Complexity | Memory Complexity | Dominates When |
|-----------|----------------|-------------------|----------------|
@@ -716,8 +801,8 @@ The mathematical operations, architectural patterns, and shape conventions are i

To appreciate why attention research is crucial, consider the scaling characteristics of modern language models:

- **GPT-3** (96 layers, 2048 context): ~1.5 GB just for attention matrices during forward pass, ~6 GB with gradients during training
- **GPT-4** (estimated 120 layers, 32K context): Would require ~480 GB for attention alone without optimization, exceeding single-GPU memory
- **GPT-3** (96 layers, 2048 context): ~{glue:text}`complexity_gpt3_attn_gb` just for attention matrices during forward pass, {glue:text}`complexity_gpt3_train_gb` with gradients during training
- **GPT-4** (estimated 120 layers, 32K context): Would require {glue:text}`complexity_gpt4_gb` for attention alone without optimization, exceeding single-GPU memory
- **Long-context models** (100K+ tokens): Attention becomes computationally prohibitive without algorithmic improvements

These constraints drive modern attention research:
@@ -741,17 +826,17 @@ For sequence length 1024, how much memory does a single attention matrix require
:class: dropdown

**Sequence length 1024:**
- Attention matrix: 1024 × 1024 = 1,048,576 elements
- Memory: 1,048,576 × 4 bytes = **4.2 MB**
- Attention matrix: 1024 × 1024 = {glue:text}`q1_elements_a` elements
- Memory: {glue:text}`q1_elements_a` × 4 bytes = **{glue:text}`q1_mb_a`**

**Sequence length 2048:**
- Attention matrix: 2048 × 2048 = 4,194,304 elements
- Memory: 4,194,304 × 4 bytes = **16.8 MB**
- Attention matrix: 2048 × 2048 = {glue:text}`q1_elements_b` elements
- Memory: {glue:text}`q1_elements_b` × 4 bytes = **{glue:text}`q1_mb_b`**

**Scaling factor:** Doubling sequence length quadruples memory (2² = 4×)
**Scaling factor:** Doubling sequence length quadruples memory (2² = {glue:text}`q1_scale_factor`×)

For GPT-3 (96 layers, 2048 context):
- 96 layers × 16.8 MB = **1.6 GB** just for attention matrices!
For GPT-3 ({glue:text}`q1_gpt3_layers` layers, 2048 context):
- {glue:text}`q1_gpt3_layers` layers × {glue:text}`q1_gpt3_mb_b` = **{glue:text}`q1_gpt3_total_gb`** just for attention matrices!
- This excludes Q/K/V projections, gradients, and all other tensors.
```

@@ -764,12 +849,12 @@ A transformer layer has attention (O(n² × d)) and feed-forward network (O(n ×

**Complexity comparison:**
- Attention: O(n² × d) = O(n² × 512)
- FFN: O(n × d²) = O(n × 512²) = O(n × 262,144)
- FFN: O(n × d²) = O(n × 512²) = O(n × {glue:text}`q2_d_squared`)

**Crossover point:** n² × 512 > n × 262,144
- Simplify: n > 262,144 / 512 = **512**
**Crossover point:** n² × 512 > n × {glue:text}`q2_d_squared`
- Simplify: n > {glue:text}`q2_d_squared` / 512 = **{glue:text}`q2_crossover`**

**When n > 512**, attention becomes the memory bottleneck.
**When n > {glue:text}`q2_crossover`**, attention becomes the memory bottleneck.

**Real-world implications:**
- Short sequences (n=128): FFN dominates, 262K vs 8K operations
@@ -815,7 +900,7 @@ Why use 8 heads of 64 dimensions instead of 1 head of 512 dimensions? Parameters

Causal masking zeros out the upper triangle (roughly half the attention matrix). Do we save computation, or just ensure correctness?

```{admonition} Answer
````{admonition} Answer
:class: dropdown

**In your implementation: NO computation saved**
@@ -842,7 +927,7 @@ scores = scores + adder_mask_tensor # Masking happens after
- Sparse attention (BigBird, Longformer): Actually skips computation for sparse patterns

**Memory could be saved:** Store only lower triangle (n²/2 elements), but requires custom indexing
```
````

**Q5: Gradient Memory**

@@ -867,11 +952,11 @@ Training attention requires storing activations for backpropagation. How much me
- Forward: 1× (attention weights)
- Backward: +2× (gradients)
- Optimizer: +2× (Adam state)
- **Total: 5× inference memory**
- **Total: {glue:text}`q5_multiplier`× inference memory**

**For GPT-3 scale (96 layers, 2048 context):**
- Inference: 96 × 16 MB = 1.5 GB
- Training: 96 × 16 MB × 5 = **7.5 GB** just for attention gradients and optimizer state!
- Inference: 96 × {glue:text}`q5_attn_mb` = {glue:text}`q5_inference_gb`
- Training: 96 × {glue:text}`q5_attn_mb` × {glue:text}`q5_multiplier` = **{glue:text}`q5_training_gb`** just for attention gradients and optimizer state!

This excludes Q/K/V matrices, feed-forward networks, embeddings, and activations from other layers. Full GPT-3 training requires 350+ GB.
```
@@ -915,7 +1000,7 @@ Build complete transformer blocks by combining your attention mechanism with fee

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/12_attention/12_attention.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/12_attention/attention.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/12_attention/12_attention.py)** - Browse the implementation code
```
|
||||
|
||||
|
||||
@@ -1,3 +1,121 @@
|
||||
---
|
||||
file_format: mystnb
|
||||
kernelspec:
|
||||
name: python3
|
||||
---
|
||||
|
||||
```{code-cell} python3
|
||||
:tags: [remove-input, remove-output]
|
||||
from myst_nb import glue
|
||||
|
||||
# --- MLP section: embed_dim=512, hidden_dim=2048 ---
|
||||
mlp_linear1 = 512 * 2048 + 2048
|
||||
mlp_linear2 = 2048 * 512 + 512
|
||||
mlp_total = mlp_linear1 + mlp_linear2
|
||||
mlp_total_12layers = 12 * mlp_total
|
||||
|
||||
glue("mlp_linear1", f"{mlp_linear1:,}")
|
||||
glue("mlp_linear1_approx", f"{mlp_linear1 / 1e6:.2f}M")
|
||||
glue("mlp_linear2", f"{mlp_linear2:,}")
|
||||
glue("mlp_linear2_approx", f"{mlp_linear2 / 1e6:.2f}M")
|
||||
glue("mlp_total", f"{mlp_total:,}")
|
||||
glue("mlp_total_approx", f"{mlp_total / 1e6:.1f}M")
|
||||
glue("mlp_12layer_approx", f"{mlp_total_12layers / 1e6:.1f}M")
|
||||
|
||||
# --- Parameter table: embed_dim=512, num_heads=8 ---
|
||||
attn_params_512 = 4 * (512 * 512)
|
||||
ln_params_512 = 2 * 512
|
||||
mlp_params_512 = (512 * 2048 + 2048) + (2048 * 512 + 512)
|
||||
block_total_512 = attn_params_512 + ln_params_512 + mlp_params_512 + ln_params_512
|
||||
|
||||
glue("attn_params_512", f"~{attn_params_512 / 1e6:.2f}M")
|
||||
glue("attn_params_512_raw", f"{attn_params_512:,}")
|
||||
glue("ln_params_512", f"{ln_params_512:,}")
|
||||
glue("ln_params_512_approx", f"{ln_params_512 / 1e3:.0f}K")
|
||||
glue("mlp_params_512", f"~{mlp_params_512 / 1e6:.1f}M")
|
glue("block_total_512", f"~{block_total_512 / 1e6:.1f}M")

# --- GPT model totals: vocab=50000, embed_dim=512, seq=2048, layers=12 ---
tok_emb_512 = 50000 * 512
pos_emb_512 = 2048 * 512
blocks_total_512 = 12 * block_total_512
gpt_total_512 = tok_emb_512 + pos_emb_512 + blocks_total_512

glue("tok_emb_512", f"{tok_emb_512 / 1e6:.1f}M")
glue("tok_emb_512_raw", f"{tok_emb_512:,}")
glue("pos_emb_512", f"{pos_emb_512 / 1e6:.1f}M")
glue("pos_emb_512_raw", f"{pos_emb_512:,}")
glue("blocks_total_512", f"{blocks_total_512 / 1e6:.1f}M")
glue("blocks_total_512_formula", f"12 x {block_total_512 / 1e6:.1f}M = {blocks_total_512 / 1e6:.1f}M")
glue("gpt_total_512", f"~{gpt_total_512 / 1e6:.0f}M")

# --- Attention memory table: batch=4, heads=8, float32 ---
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

def attn_mem_mb(batch, heads, seq):
    return batch * heads * seq * seq * 4 / MB

attn_512 = attn_mem_mb(4, 8, 512)
attn_1024 = attn_mem_mb(4, 8, 1024)
attn_2048 = attn_mem_mb(4, 8, 2048)
attn_4096 = attn_mem_mb(4, 8, 4096)

glue("attn_mem_512", f"{attn_512:.1f}")
glue("attn_mem_1024", f"{attn_1024:.1f}")
glue("attn_mem_2048", f"{attn_2048:.1f}")
glue("attn_mem_4096", f"{attn_4096:.1f}")

# --- Q1: Attention memory calc: batch=8, heads=16, seq=2048/4096 ---
q1_elements_2048 = 8 * 16 * 2048 * 2048
q1_bytes_2048 = q1_elements_2048 * 4
q1_gb_2048 = q1_bytes_2048 / GB

q1_elements_4096 = 8 * 16 * 4096 * 4096
q1_bytes_4096 = q1_elements_4096 * 4
q1_gb_4096 = q1_bytes_4096 / GB

glue("q1_elements_2048", f"{q1_elements_2048:,}")
glue("q1_bytes_2048", f"{q1_bytes_2048:,}")
glue("q1_gb_2048", f"{q1_gb_2048:.1f}")
glue("q1_elements_4096", f"{q1_elements_4096:,}")
glue("q1_gb_4096", f"{q1_gb_4096:.1f}")

# --- Q2: Parameter distribution: vocab=50000, embed=768, layers=12, heads=12 ---
q2_tok_emb = 50000 * 768
q2_pos_emb = 2048 * 768
q2_attn_per_block = 4 * (768 * 768)
q2_mlp_per_block = (768 * 3072 + 3072) + (3072 * 768 + 768)
q2_per_block = q2_attn_per_block + q2_mlp_per_block
q2_total_blocks = 12 * q2_per_block
q2_total_emb = q2_tok_emb + q2_pos_emb
q2_grand_total = q2_tok_emb + q2_pos_emb + q2_total_blocks

glue("q2_tok_emb", f"{q2_tok_emb / 1e6:.1f}M")
glue("q2_tok_emb_raw", f"{q2_tok_emb:,}")
glue("q2_pos_emb", f"{q2_pos_emb / 1e6:.1f}M")
glue("q2_pos_emb_raw", f"{q2_pos_emb:,}")
glue("q2_attn_per_block", f"~{q2_attn_per_block / 1e6:.1f}M")
glue("q2_attn_per_block_raw", f"{q2_attn_per_block:,}")
glue("q2_mlp_per_block", f"~{q2_mlp_per_block / 1e6:.1f}M")
glue("q2_mlp_per_block_raw", f"{q2_mlp_per_block:,}")
glue("q2_per_block", f"{q2_per_block / 1e6:.1f}M")
glue("q2_total_blocks", f"{q2_total_blocks / 1e6:.0f}M")
glue("q2_total_blocks_raw", f"{q2_total_blocks:,}")
glue("q2_total_emb", f"{q2_total_emb / 1e6:.0f}M")
glue("q2_grand_total", f"~{q2_grand_total / 1e6:.0f}M")

# --- Q4: Generation efficiency: prompt=50, gen=100 ---
q4_total_processings = sum(range(50, 150))
q4_optimized = 50 + 100
q4_speedup = q4_total_processings / q4_optimized

glue("q4_total_processings", f"{q4_total_processings:,}")
glue("q4_optimized", f"{q4_optimized:,}")
glue("q4_speedup", f"{q4_speedup:.0f}")
```

# Module 13: Transformers

:::{admonition} Module Info
@@ -25,7 +143,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F13_transformers%2F13_transformers.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F13_transformers%2Ftransformers.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -556,7 +674,7 @@ class MLP:

GELU (Gaussian Error Linear Unit) activation replaced ReLU in transformer models because it provides smoother gradients. Where ReLU has a hard cutoff at zero, GELU smoothly gates values based on their magnitude, creating better training dynamics for language modeling.
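
To make the contrast concrete, here is a minimal NumPy sketch of the widely used tanh approximation of GELU next to ReLU (an illustration, not necessarily the exact form used in this module's implementation):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-3.0, 3.0, 7)
print(gelu(x))  # small negative inputs pass through slightly attenuated
print(relu(x))  # hard cutoff at zero
```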

The parameter count in the MLP is substantial. For `embed_dim = 512`, the first layer has `512 × 2048 + 2048 ≈ 1.05M` parameters, and the second has `2048 × 512 + 512 ≈ 1.05M`, totaling 2.1M parameters per block. In a 12-layer model, MLPs alone contribute 25M parameters.
The parameter count in the MLP is substantial. For `embed_dim = 512`, the first layer has `512 x 2048 + 2048 =` {glue:text}`mlp_linear1` ({glue:text}`mlp_linear1_approx`) parameters, and the second has `2048 x 512 + 512 =` {glue:text}`mlp_linear2` ({glue:text}`mlp_linear2_approx`) parameters, totaling {glue:text}`mlp_total_approx` parameters per block. In a 12-layer model, MLPs alone contribute {glue:text}`mlp_12layer_approx` parameters.

### Causal Masking for Autoregressive Generation

@@ -620,22 +738,20 @@ For a single transformer block with `embed_dim = 512` and `num_heads = 8`:

| Component | Parameters | Calculation |
|-----------|------------|-------------|
| Multi-Head Attention | ~1.5M | 4 × (512 × 512) for Q, K, V, O projections |
| Layer Norm 1 | 1K | 2 × 512 for gamma, beta |
| MLP | ~2.1M | (512 × 2048 + 2048) + (2048 × 512 + 512) |
| Layer Norm 2 | 1K | 2 × 512 for gamma, beta |
| **Total per block** | **~3.6M** | Dominated by MLP and attention |
| Multi-Head Attention | {glue:text}`attn_params_512` | 4 x (512 x 512) for Q, K, V, O projections |
| Layer Norm 1 | {glue:text}`ln_params_512_approx` | 2 x 512 for gamma, beta |
| MLP | {glue:text}`mlp_params_512` | (512 x 2048 + 2048) + (2048 x 512 + 512) |
| Layer Norm 2 | {glue:text}`ln_params_512_approx` | 2 x 512 for gamma, beta |
| **Total per block** | **{glue:text}`block_total_512`** | Dominated by MLP and attention |

For a complete GPT model, add embeddings and output projection:

```
Embeddings: vocab_size × embed_dim (e.g., 50000 × 512 = 25.6M)
Position Embeddings: max_seq_len × embed_dim (e.g., 2048 × 512 = 1M)
Transformer Blocks: num_layers × 3.6M (e.g., 12 × 3.6M = 43.2M)
Output Projection: embed_dim × vocab_size (often tied to embeddings)
Embeddings: vocab_size x embed_dim (e.g., 50000 x 512 = {glue:text}`tok_emb_512`)
Position Embeddings: max_seq_len x embed_dim (e.g., 2048 x 512 = {glue:text}`pos_emb_512`)
Transformer Blocks: num_layers x {glue:text}`block_total_512` (e.g., {glue:text}`blocks_total_512_formula`)
Output Projection: embed_dim x vocab_size (often tied to embeddings)

Total: ~70M parameters for this configuration
```
Total: {glue:text}`gpt_total_512` parameters for this configuration

Memory requirements have three components:

@@ -647,10 +763,10 @@ The attention memory wall explains why extending context length is expensive. Fo

| Sequence Length | Attention Matrix Size | Memory (MB) |
|-----------------|----------------------|-------------|
| 512 | 4 × 8 × 512 × 512 | 33.6 |
| 1024 | 4 × 8 × 1024 × 1024 | 134.2 |
| 2048 | 4 × 8 × 2048 × 2048 | 536.9 |
| 4096 | 4 × 8 × 4096 × 4096 | 2147.5 |
| 512 | 4 x 8 x 512 x 512 | {glue:text}`attn_mem_512` |
| 1024 | 4 x 8 x 1024 x 1024 | {glue:text}`attn_mem_1024` |
| 2048 | 4 x 8 x 2048 x 2048 | {glue:text}`attn_mem_2048` |
| 4096 | 4 x 8 x 4096 x 4096 | {glue:text}`attn_mem_4096` |

Doubling sequence length quadruples attention memory. This quadratic scaling drove innovations like sparse attention, linear attention, and FlashAttention that make long context tractable.
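
The quadratic growth in the table is easy to reproduce; a quick sketch using the same batch=4, heads=8, float32 assumptions:

```python
# Attention score memory: batch * heads * seq^2 * 4 bytes (float32)
for seq in (512, 1024, 2048, 4096):
    mb = 4 * 8 * seq * seq * 4 / 1024**2
    print(f"seq={seq}: {mb:.1f} MB")  # doubling seq quadruples memory
```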

@@ -739,15 +855,15 @@ A transformer with `batch_size=8`, `num_heads=16`, `seq_len=2048` computes atten
```{admonition} Answer
:class: dropdown

Attention matrix size: `batch_size × num_heads × seq_len × seq_len`
= `8 × 16 × 2048 × 2048 = 536,870,912 elements`
Attention matrix size: `batch_size x num_heads x seq_len x seq_len`
= `8 x 16 x 2048 x 2048 = ` {glue:text}`q1_elements_2048` elements

Memory: `536,870,912 × 4 bytes (float32) = 2,147,483,648 bytes ≈ 2.15 GB`
Memory: {glue:text}`q1_elements_2048` ` x 4 bytes (float32) = ` {glue:text}`q1_bytes_2048` ` bytes =` {glue:text}`q1_gb_2048` GB

Doubling sequence length to 4096:
= `8 × 16 × 4096 × 4096 = 2,147,483,648 elements ≈ 8.6 GB`
= `8 x 16 x 4096 x 4096 = ` {glue:text}`q1_elements_4096` ` elements =` {glue:text}`q1_gb_4096` GB

**Scaling**: Doubling sequence length quadruples memory (4× increase). This quadratic scaling is why long context is expensive and drove innovations like sparse attention.
**Scaling**: Doubling sequence length quadruples memory (4x increase). This quadratic scaling is why long context is expensive and drove innovations like sparse attention.
```

**Q2: Parameter Distribution Analysis**

@@ -757,22 +873,22 @@ For a GPT model with `vocab_size=50000`, `embed_dim=768`, `num_layers=12`, `num_
```{admonition} Answer
:class: dropdown

**Token Embeddings**: `50000 × 768 = 38.4M`
**Token Embeddings**: `50000 x 768 = ` {glue:text}`q2_tok_emb`

**Position Embeddings**: `2048 × 768 = 1.6M` (assuming max_seq_len=2048)
**Position Embeddings**: `2048 x 768 = ` {glue:text}`q2_pos_emb` (assuming max_seq_len=2048)

**Transformer Blocks**: Each block has approximately 3.6M parameters with embed_dim=768
- Attention: `4 × (768 × 768) ≈ 2.4M`
- MLP: `(768 × 3072 + 3072) + (3072 × 768 + 768) ≈ 4.7M`
**Transformer Blocks**: Each block has approximately {glue:text}`q2_per_block` parameters with embed_dim=768
- Attention: `4 x (768 x 768) = ` {glue:text}`q2_attn_per_block`
- MLP: `(768 x 3072 + 3072) + (3072 x 768 + 768) = ` {glue:text}`q2_mlp_per_block`
- Layer norms: negligible
- **Per block**: approximately 7.1M
- **Total blocks**: `12 × 7.1M ≈ 85M`
- **Per block**: approximately {glue:text}`q2_per_block`
- **Total blocks**: `12 x ` {glue:text}`q2_per_block` ` = ` {glue:text}`q2_total_blocks`

**Output Projection**: Usually tied to embeddings (0 additional)

**Total**: `38.4M + 1.6M + 85M ≈ 125M parameters`
**Total**: {glue:text}`q2_tok_emb` ` + ` {glue:text}`q2_pos_emb` ` + ` {glue:text}`q2_total_blocks` ` = ` {glue:text}`q2_grand_total` parameters

**Dominant component**: Transformer blocks (85M) > Embeddings (40M). As models scale, transformer blocks dominate because they scale with `embed_dim²` while embeddings scale linearly.
**Dominant component**: Transformer blocks ({glue:text}`q2_total_blocks`) > Embeddings ({glue:text}`q2_total_emb`). As models scale, transformer blocks dominate because they scale with `embed_dim²` while embeddings scale linearly.
```

**Q3: Residual Connection Benefits**

@@ -810,16 +926,16 @@ Your `generate()` method processes the entire sequence for each new token. For g
- ...
- Token 100: Process 149 tokens

**Total forward passes**: `50 + 51 + 52 + ... + 149 = Σ(50 to 149) = 9,950 token processings`
**Total forward passes**: `50 + 51 + 52 + ... + 149 = ` {glue:text}`q4_total_processings` token processings

**Why inefficient**: Attention recomputes key/value projections for all previous tokens every step, even though they don't change. For position 50, we recompute the same key/value vectors 100 times.

**KV Caching optimization**: Store computed key/value projections for previous tokens
- Each new token only computes its own key/value
- Attention uses cached keys/values from previous tokens
- Total computation: `50 (initial) + 100 (new tokens) = 150 token processings`
- Total computation: `50 (initial) + 100 (new tokens) = ` {glue:text}`q4_optimized` token processings

**Speedup**: `9,950 / 150 ≈ 66× faster` for this example. The speedup increases with generation length, making KV caching essential for production systems.
**Speedup**: {glue:text}`q4_total_processings` ` / ` {glue:text}`q4_optimized` ` = ` {glue:text}`q4_speedup` `x faster` for this example. The speedup increases with generation length, making KV caching essential for production systems.
```
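
The bookkeeping in that answer is small enough to verify directly. A sketch of the counting argument only (not a real `generate()` implementation); `prompt_len` and `gen_len` mirror the question's 50 and 100:

```python
def naive_token_processings(prompt_len, gen_len):
    # Without caching, step i reprocesses the whole sequence of length prompt_len + i
    return sum(range(prompt_len, prompt_len + gen_len))

def cached_token_processings(prompt_len, gen_len):
    # With KV caching: process the prompt once, then one new token per step
    return prompt_len + gen_len

naive = naive_token_processings(50, 100)    # 9,950
cached = cached_token_processings(50, 100)  # 150
print(f"speedup: {naive / cached:.0f}x")    # ~66x
```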

**Q5: Layer Normalization vs Batch Normalization**

@@ -842,8 +958,8 @@ Why do transformers use layer normalization instead of batch normalization? Cons
- Works naturally with variable-length sequences

**Example**: For a tensor `(batch=3, seq=10, features=768)`:
- Batch norm: Compute 10 × 768 statistics across batch dimension (problematic)
- Layer norm: Compute 3 × 10 statistics across feature dimension (independent)
- Batch norm: Compute 10 x 768 statistics across batch dimension (problematic)
- Layer norm: Compute 3 x 10 statistics across feature dimension (independent)

**Why it matters**: Transformers process variable-length sequences. Layer norm treats each position independently, making it robust to sequence length variation and batch composition.
```

@@ -887,7 +1003,7 @@ Profile your transformer to identify performance bottlenecks. You'll learn to me

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/13_transformers/13_transformers.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/13_transformers/transformers.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/13_transformers/13_transformers.py)** - Browse the implementation code
```


@@ -1,3 +1,9 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 14: Profiling

:::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F14_profiling%2F14_profiling.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F14_profiling%2Fprofiling.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -337,8 +343,17 @@ Profiling (14) → Model-Level (15-16) → Runtime (17-18) → Benchmarking (19)
"What's slow?" "Shrink the model" "Speed up execution" "Did it work?"
```

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Quantization compression ratio: FP32 (32 bits) -> INT8 (8 bits)
quant_compression = 32 // 8
glue("tier_quant_compression", f"{quant_compression}")
```

**Model-Level Optimizations (15-16)**: Change the model itself
- Quantization: FP32 → INT8 for 4× compression
- Quantization: FP32 → INT8 for {glue:text}`tier_quant_compression`× compression
- Compression: Prune unnecessary weights

**Runtime Optimizations (17-18)**: Change how execution happens

@@ -538,11 +553,23 @@ def measure_memory(self, model, input_shape: Tuple[int, ...]) -> Dict[str, float
}
```

Parameter memory is persistent and constant regardless of batch size. A model with 125 million parameters uses 500 MB (125M × 4 bytes per float32) whether you're processing one sample or a thousand.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# 125M parameter model memory: 125,000,000 params * 4 bytes/float32
mem_125m_params = 125_000_000
mem_125m_bytes = mem_125m_params * 4
mem_125m_mb = mem_125m_bytes / (1024 ** 2)
glue("mem_125m_params", f"{mem_125m_params // 1_000_000}")
glue("mem_125m_mb", f"{round(mem_125m_mb)}")
```

Parameter memory is persistent and constant regardless of batch size. A model with {glue:text}`mem_125m_params` million parameters uses {glue:text}`mem_125m_mb` MB ({glue:text}`mem_125m_params`M × 4 bytes per float32) whether you're processing one sample or a thousand.

Activation memory scales with batch size. Doubling the batch doubles activation memory. This is why large batch training requires more GPU memory than inference.

Gradient memory matches parameter memory exactly. Every parameter needs a gradient during training, adding another 500 MB for a 125M parameter model.
Gradient memory matches parameter memory exactly. Every parameter needs a gradient during training, adding another {glue:text}`mem_125m_mb` MB for a {glue:text}`mem_125m_params`M parameter model.
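
Putting the components together, a rough estimator in the spirit of the memory breakdown above (a sketch with assumed float32 sizes and a hypothetical helper name, not the module's actual `measure_memory`):

```python
def rough_training_memory_mb(num_params, activation_mb=0.0, optimizer="adam"):
    # float32: 4 bytes per parameter; gradients match parameters;
    # Adam adds two moment buffers of the same size (covered later in Q4)
    param_mb = num_params * 4 / 1024**2
    grad_mb = param_mb
    opt_mb = 2 * param_mb if optimizer == "adam" else 0.0
    return param_mb + grad_mb + opt_mb + activation_mb

print(f"{rough_training_memory_mb(125_000_000):.0f} MB of model state")  # ~1907 MB
```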

### Bottleneck Identification

@@ -642,9 +669,21 @@ The profiling workflow: measure parameters, FLOPs, memory, and latency to identi

### Why Profiling Matters at Scale

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# GPT-3: 175B parameters at FP32 (4 bytes each)
gpt3_params_b = 175
gpt3_bytes = gpt3_params_b * 1_000_000_000 * 4
gpt3_gb = gpt3_bytes / (1024 ** 3)
glue("scale_gpt3_params", f"{gpt3_params_b}")
glue("scale_gpt3_gb", f"{round(gpt3_gb)}")
```

To appreciate profiling's importance, consider production ML systems:

- **GPT-3 (175B parameters)**: 700 GB model size at FP32. Profiling reveals which layers to quantize for deployment.
- **GPT-3 ({glue:text}`scale_gpt3_params`B parameters)**: {glue:text}`scale_gpt3_gb` GB model size at FP32. Profiling reveals which layers to quantize for deployment.
- **BERT training**: 80% of time in self-attention. Profiling identifies FlashAttention as the optimization to implement.
- **Image classification**: Batch size 256 uses 12 GB GPU memory. Profiling shows 10 GB is activations, suggesting gradient checkpointing.

@@ -658,17 +697,49 @@ Test yourself with these systems thinking questions about profiling and performa

A transformer model has 12 layers, each with a feed-forward network containing two Linear layers: Linear(768, 3072) and Linear(3072, 768). How much memory do the feed-forward network parameters consume across all layers?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q1: Feed-forward network parameter memory calculation
q1_first_weights = 768 * 3072
q1_first_bias = 3072
q1_first_total = q1_first_weights + q1_first_bias

q1_second_weights = 3072 * 768
q1_second_bias = 768
q1_second_total = q1_second_weights + q1_second_bias

q1_per_layer = q1_first_total + q1_second_total
q1_num_layers = 12
q1_all_layers = q1_num_layers * q1_per_layer

q1_bytes = q1_all_layers * 4
q1_mb = round(q1_bytes / (1024 ** 2))

glue("q1_first_weights", f"{q1_first_weights:,}")
glue("q1_first_bias", f"{q1_first_bias:,}")
glue("q1_first_total", f"{q1_first_total:,}")
glue("q1_second_weights", f"{q1_second_weights:,}")
glue("q1_second_bias", f"{q1_second_bias:,}")
glue("q1_second_total", f"{q1_second_total:,}")
glue("q1_per_layer", f"{q1_per_layer:,}")
glue("q1_all_layers", f"{q1_all_layers:,}")
glue("q1_bytes", f"{q1_bytes:,}")
glue("q1_mb", f"{q1_mb}")
```

```{admonition} Answer
:class: dropdown

Each feed-forward network:
- First layer: (768 × 3072) + 3072 = 2,362,368 parameters
- Second layer: (3072 × 768) + 768 = 2,360,064 parameters
- Total per layer: 4,722,432 parameters
- First layer: (768 × 3072) + 3072 = {glue:text}`q1_first_total` parameters
- Second layer: (3072 × 768) + 768 = {glue:text}`q1_second_total` parameters
- Total per layer: {glue:text}`q1_per_layer` parameters

Across 12 layers: 12 × 4,722,432 = 56,669,184 parameters
Across 12 layers: 12 × {glue:text}`q1_per_layer` = {glue:text}`q1_all_layers` parameters

Memory: 56,669,184 × 4 bytes = 226,676,736 bytes ≈ **227 MB**
Memory: {glue:text}`q1_all_layers` × 4 bytes = {glue:text}`q1_bytes` bytes ≈ **{glue:text}`q1_mb` MB**

This is just the feed-forward networks. Attention adds more parameters.
```
@@ -677,16 +748,37 @@ This is just the feed-forward networks. Attention adds more parameters.

A Linear(512, 512) layer processes a batch of 64 samples. Your profiler's `count_flops()` method returns FLOPs per sample (batch-size independent). How many FLOPs are required for one sample? For the whole batch, if each sample is processed independently?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q2: FLOP counting for Linear(512, 512)
q2_in_features = 512
q2_out_features = 512
q2_per_sample = q2_in_features * q2_out_features * 2
q2_batch_size = 64
q2_batch_total = q2_batch_size * q2_per_sample

# Latency at 50 GFLOP/s
q2_gflops = 50
q2_latency_s = q2_batch_total / (q2_gflops * 1e9)
q2_latency_ms = q2_latency_s * 1000

glue("q2_per_sample", f"{q2_per_sample:,}")
glue("q2_batch_total", f"{q2_batch_total:,}")
glue("q2_latency_ms", f"{q2_latency_ms:.2f}")
```

```{admonition} Answer
:class: dropdown

Per-sample FLOPs (what `count_flops()` returns): 512 × 512 × 2 = **524,288 FLOPs**
Per-sample FLOPs (what `count_flops()` returns): 512 × 512 × 2 = **{glue:text}`q2_per_sample` FLOPs**

Note: The `count_flops()` method is batch-size independent. It returns per-sample FLOPs whether you pass input_shape=(1, 512) or (64, 512).

If processing a batch of 64 samples: 64 × 524,288 = 33,554,432 total FLOPs
If processing a batch of 64 samples: 64 × {glue:text}`q2_per_sample` = {glue:text}`q2_batch_total` total FLOPs

Minimum latency at 50 GFLOP/s: 33,554,432 FLOPs ÷ 50 GFLOP/s = **0.67 ms** for the full batch
Minimum latency at 50 GFLOP/s: {glue:text}`q2_batch_total` FLOPs ÷ 50 GFLOP/s = **{glue:text}`q2_latency_ms` ms** for the full batch

This assumes perfect computational efficiency (100%). Real latency is higher due to memory bandwidth and overhead.
```
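
The per-sample convention the answer uses can be written down directly. A sketch of such a counter (`linear_flops_per_sample` is a hypothetical helper mirroring the counting convention, not the profiler's actual API):

```python
def linear_flops_per_sample(in_features, out_features):
    # One multiply plus one add per weight: 2 * in * out
    # (bias adds are ignored in this convention)
    return 2 * in_features * out_features

per_sample = linear_flops_per_sample(512, 512)
print(per_sample)        # 524,288 FLOPs per sample
print(64 * per_sample)   # 33,554,432 FLOPs for a batch of 64
```
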
@@ -695,10 +787,22 @@ This assumes perfect computational efficiency (100%). Real latency is higher due

A model achieves 5 GFLOP/s on hardware with 100 GFLOP/s peak compute. The memory bandwidth is 50 GB/s. Is this workload compute-bound or memory-bound?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q3: Computational efficiency
q3_achieved = 5
q3_peak = 100
q3_efficiency_pct = (q3_achieved / q3_peak) * 100

glue("q3_efficiency_pct", f"{q3_efficiency_pct:.0f}")
```

```{admonition} Answer
:class: dropdown

Computational efficiency: 5 GFLOP/s ÷ 100 GFLOP/s = **5% efficiency**
Computational efficiency: 5 GFLOP/s ÷ 100 GFLOP/s = **{glue:text}`q3_efficiency_pct`% efficiency**

This extremely low efficiency suggests the workload is **memory-bound**. The hardware can compute 100 GFLOP/s but only achieves 5 GFLOP/s because it spends most of the time waiting for data transfers.

@@ -707,17 +811,38 @@ Optimization strategy: Focus on reducing memory transfers, improving cache local

**Q4: Training Memory Estimation**

A model has 125M parameters (500 MB). You're training with Adam optimizer. What's the total memory requirement during training, including gradients and optimizer state?
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q4: Training memory with Adam optimizer
# 125M params at FP32 = 500 MB for parameters
q4_param_mb = 500
q4_grad_mb = q4_param_mb # gradients match parameters
q4_adam_m_mb = q4_param_mb # first moment (momentum)
q4_adam_v_mb = q4_param_mb # second moment (velocity)
q4_total_mb = q4_param_mb + q4_grad_mb + q4_adam_m_mb + q4_adam_v_mb
q4_total_gb = q4_total_mb / 1000

glue("q4_param_mb", f"{q4_param_mb}")
glue("q4_grad_mb", f"{q4_grad_mb}")
glue("q4_adam_m_mb", f"{q4_adam_m_mb}")
glue("q4_adam_v_mb", f"{q4_adam_v_mb}")
glue("q4_total_mb", f"{q4_total_mb:,}")
glue("q4_total_gb", f"{q4_total_gb:.0f}")
```

A model has 125M parameters ({glue:text}`q4_param_mb` MB). You're training with Adam optimizer. What's the total memory requirement during training, including gradients and optimizer state?

```{admonition} Answer
:class: dropdown

- Parameters: 500 MB
- Gradients: 500 MB (same as parameters)
- Adam momentum: 500 MB (first moment estimates)
- Adam velocity: 500 MB (second moment estimates)
- Parameters: {glue:text}`q4_param_mb` MB
- Gradients: {glue:text}`q4_grad_mb` MB (same as parameters)
- Adam momentum: {glue:text}`q4_adam_m_mb` MB (first moment estimates)
- Adam velocity: {glue:text}`q4_adam_v_mb` MB (second moment estimates)

Total: 500 + 500 + 500 + 500 = **2,000 MB (2 GB)**
Total: {glue:text}`q4_param_mb` + {glue:text}`q4_grad_mb` + {glue:text}`q4_adam_m_mb` + {glue:text}`q4_adam_v_mb` = **{glue:text}`q4_total_mb` MB ({glue:text}`q4_total_gb` GB)**

This is just model state. Activations add more memory that scales with batch size. A typical training run might use 4-8 GB total including activations.
```
@@ -773,7 +898,7 @@ Implement quantization to reduce model size and accelerate inference. You'll use

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/14_profiling/14_profiling.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/14_profiling/profiling.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/14_profiling/14_profiling.py)** - Browse the implementation code
```


@@ -1,3 +1,9 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 15: Quantization

:::{admonition} Module Info
@@ -31,7 +37,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F15_quantization%2F15_quantization.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F15_quantization%2Fquantization.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -322,7 +328,27 @@ initSlideViewer('15_quantization', '../_static/slides/15_quantization.pdf');

## Overview

Modern neural networks face a memory wall problem. A BERT model requires 440 MB, GPT-2 needs 6 GB, and GPT-3 demands 700 GB, yet mobile devices have only 4-8 GB of RAM. The culprit? Every parameter uses 4 bytes of FP32 precision, representing values with 32-bit accuracy when 8 bits often suffice. Quantization solves this by converting FP32 weights to INT8, achieving 4× memory reduction with less than 1% accuracy loss.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Model sizes in FP32 (4 bytes per parameter)
bert_params = 110_000_000
gpt2_params = 1_500_000_000
gpt3_params = 175_000_000_000
bert_mb = bert_params * 4 / 1024**2
gpt2_gb = gpt2_params * 4 / 1024**3
gpt3_gb = gpt3_params * 4 / 1024**3
glue("bert_mb", f"{bert_mb:.0f} MB")
glue("gpt2_gb", f"{gpt2_gb:.1f} GB")
glue("gpt3_gb", f"{gpt3_gb:.0f} GB")

# Quantized BERT (INT8 = 1 byte per param)
bert_int8_mb = bert_params * 1 / 1024**2
glue("bert_int8_mb", f"{bert_int8_mb:.0f} MB")
```

Modern neural networks face a memory wall problem. A BERT model requires {glue:text}`bert_mb`, GPT-2 needs {glue:text}`gpt2_gb`, and GPT-3 demands {glue:text}`gpt3_gb`, yet mobile devices have only 4-8 GB of RAM. The culprit? Every parameter uses 4 bytes of FP32 precision, representing values with 32-bit accuracy when 8 bits often suffice. Quantization solves this by converting FP32 weights to INT8, achieving 4× memory reduction with less than 1% accuracy loss.

In this module, you'll build a production-quality INT8 quantization system. You'll implement the core quantization algorithm, create quantized layer classes, and develop calibration techniques that optimize quantization parameters for minimal accuracy degradation. By the end, you'll compress entire neural networks from hundreds of megabytes to a fraction of their original size, enabling deployment on memory-constrained devices.

@@ -448,7 +474,7 @@ Neural networks use FP32 (32-bit floating point) by default, which can represent

INT8 quantization maps this continuous FP32 range to just 256 discrete values (from -128 to 127). The key insight is that we can preserve model accuracy by carefully choosing how to map these 256 levels across the actual range of values in each tensor. A tensor with values in [-0.5, 0.5] needs different quantization parameters than one with values in [-10, 10].
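
A minimal sketch of one common affine (asymmetric) mapping makes this concrete; the scale and zero-point formulas here are a standard choice and may differ in detail from this module's implementation:

```python
import numpy as np

def quantize_int8(x):
    # Map [x_min, x_max] onto the 256 INT8 levels [-128, 127]
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.uniform(-0.5, 0.5, size=(4, 4)).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize_int8(q, scale, zp)
print(np.abs(w - w_hat).max())  # bounded by ~scale/2 per value
```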

Consider the storage implications. A single FP32 parameter requires 4 bytes, while INT8 uses 1 byte. For a model with 100 million parameters, this is the difference between 400 MB (FP32) and 100 MB (INT8). The 4× compression ratio is consistent across all model sizes because we're always reducing from 32 bits to 8 bits per value.
Consider the storage implications. A single FP32 parameter requires 4 bytes, while INT8 uses 1 byte. For a model with 100 million parameters, this is the difference between {glue:text}`q4_fp32_mb` (FP32) and {glue:text}`q4_int8_mb` (INT8). The 4× compression ratio is consistent across all model sizes because we're always reducing from 32 bits to 8 bits per value.

### Quantization Schemes

@@ -661,13 +687,66 @@ The core quantization mathematics: scale calculation, zero-point mapping, INT8 r

To appreciate why quantization is critical for production ML, consider these deployment scenarios:

- **Mobile AI**: iPhone has 6 GB RAM shared across all apps. A quantized BERT (110 MB) fits comfortably; FP32 version (440 MB) causes memory pressure and swapping.
- **Mobile AI**: iPhone has 6 GB RAM shared across all apps. A quantized BERT ({glue:text}`bert_int8_mb`) fits comfortably; FP32 version ({glue:text}`bert_mb`) causes memory pressure and swapping.
- **Edge computing**: IoT devices often have 512 MB RAM. Quantization enables on-device inference for privacy-sensitive applications (medical devices, security cameras).
- **Data centers**: Serving 1000 requests/second requires multiple model replicas. With 4× memory reduction, you fit 4× more models per GPU, reducing serving costs by 75%.
- **Battery life**: INT8 operations consume 2-4× less energy than FP32 on mobile processors. Quantized models drain battery slower, improving user experience.

## Check Your Understanding

```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue  # needed for the glue() calls below

# Q1: 3-layer network parameter counting
q1_l1 = 784 * 256 + 256
q1_l2 = 256 * 128 + 128
q1_l3 = 128 * 10 + 10
q1_total = q1_l1 + q1_l2 + q1_l3
q1_fp32_bytes = q1_total * 4
q1_int8_bytes = q1_total * 1
q1_savings = q1_fp32_bytes - q1_int8_bytes
glue("q1_l1", f"{q1_l1:,}")
glue("q1_l2", f"{q1_l2:,}")
glue("q1_l3", f"{q1_l3:,}")
glue("q1_total", f"{q1_total:,}")
glue("q1_fp32_bytes", f"{q1_fp32_bytes:,}")
glue("q1_fp32_mb", f"{q1_fp32_bytes / 1024**2:.2f} MB")
glue("q1_int8_bytes", f"{q1_int8_bytes:,}")
glue("q1_int8_mb", f"{q1_int8_bytes / 1024**2:.2f} MB")
glue("q1_savings_mb", f"{q1_savings / 1024**2:.2f} MB")

# Q2: Quantization error and SNR
q2_range = 1.0
q2_levels = 255
q2_scale = q2_range / q2_levels
q2_max_error = q2_scale / 2
q2_snr = 20 * math.log10(q2_range / q2_scale)
glue("q2_scale", f"{q2_scale:.6f}")
glue("q2_max_error", f"±{q2_max_error:.6f}")
glue("q2_snr", f"{q2_snr:.0f} dB")

# Q4: Loading time
q4_fp32_mb = 100_000_000 * 4 / 1024**2
q4_int8_mb = 100_000_000 * 1 / 1024**2
q4_bandwidth = 500 # MB/s
q4_fp32_time = q4_fp32_mb / q4_bandwidth
q4_int8_time = q4_int8_mb / q4_bandwidth
glue("q4_fp32_mb", f"{q4_fp32_mb:.0f} MB")
glue("q4_int8_mb", f"{q4_int8_mb:.0f} MB")
glue("q4_fp32_time", f"{q4_fp32_time:.1f} seconds")
glue("q4_int8_time", f"{q4_int8_time:.2f} seconds")
glue("q4_time_saved", f"{q4_fp32_time - q4_int8_time:.1f}s")

# Q5: SIMD register capacity
simd_bits = 512
fp32_per_reg = simd_bits // 32
int8_per_reg = simd_bits // 8
glue("q5_fp32", f"{fp32_per_reg}")
glue("q5_int8", f"{int8_per_reg}")
glue("q5_ratio", f"{int8_per_reg // fp32_per_reg}×")
```

Test your quantization knowledge with these systems thinking questions. They're designed to build intuition for memory, precision, and performance trade-offs.

**Q1: Memory Calculation**
@@ -678,15 +757,15 @@ A neural network has three Linear layers: 784→256, 256→128, 128→10. How mu
:class: dropdown

**Parameter count:**
- Layer 1: (784 × 256) + 256 = 200,960
- Layer 2: (256 × 128) + 128 = 32,896
- Layer 3: (128 × 10) + 10 = 1,290
- **Total: 235,146 parameters**
- Layer 1: (784 × 256) + 256 = {glue:text}`q1_l1`
- Layer 2: (256 × 128) + 128 = {glue:text}`q1_l2`
- Layer 3: (128 × 10) + 10 = {glue:text}`q1_l3`
- **Total: {glue:text}`q1_total` parameters**

**Memory usage:**
- FP32: 235,146 × 4 bytes = **940,584 bytes ≈ 0.92 MB**
- INT8: 235,146 × 1 byte = **235,146 bytes ≈ 0.23 MB**
- **Savings: 0.69 MB (75% reduction, 4× compression)**
- FP32: {glue:text}`q1_total` × 4 bytes = **{glue:text}`q1_fp32_bytes` bytes ≈ {glue:text}`q1_fp32_mb`**
- INT8: {glue:text}`q1_total` × 1 byte = **{glue:text}`q1_int8_bytes` bytes ≈ {glue:text}`q1_int8_mb`**
- **Savings: {glue:text}`q1_savings_mb` (75% reduction, 4× compression)**

This shows why quantization matters: even small models benefit significantly.
```
@@ -700,14 +779,14 @@ For FP32 weights uniformly distributed in [-0.5, 0.5], what is the maximum quant

**Quantization error:**
- Range: 0.5 - (-0.5) = 1.0
- Scale: 1.0 / 255 = **0.003922**
- Max error: scale / 2 = **±0.001961** (half step size)
- Scale: 1.0 / 255 = **{glue:text}`q2_scale`**
- Max error: scale / 2 = **{glue:text}`q2_max_error`** (half step size)

**Signal-to-noise ratio:**
- SNR = 20 × log₁₀(signal_range / quantization_step)
- SNR = 20 × log₁₀(1.0 / 0.003922)
- SNR = 20 × log₁₀(1.0 / {glue:text}`q2_scale`)
- SNR = 20 × log₁₀(255)
- SNR ≈ **48 dB**
- SNR ≈ **{glue:text}`q2_snr`**

This is sufficient for neural networks (typical requirement: >40 dB). The 8-bit quantization provides approximately 6 dB per bit, matching the theoretical limit.
```
@@ -743,16 +822,16 @@ A model has 100M parameters. Loading from SSD to RAM at 500 MB/s, how long does
:class: dropdown

**Loading time:**
- FP32 size: 100M × 4 bytes = 400 MB
- INT8 size: 100M × 1 byte = 100 MB
- FP32 load time: 400 MB / 500 MB/s = **0.8 seconds**
- INT8 load time: 100 MB / 500 MB/s = **0.2 seconds**
- FP32 size: 100M × 4 bytes = {glue:text}`q4_fp32_mb`
- INT8 size: 100M × 1 byte = {glue:text}`q4_int8_mb`
- FP32 load time: {glue:text}`q4_fp32_mb` / 500 MB/s = **{glue:text}`q4_fp32_time`**
- INT8 load time: {glue:text}`q4_int8_mb` / 500 MB/s = **{glue:text}`q4_int8_time`**
- **Speedup: 4× faster loading**

**User experience impact:**
- Mobile app launch: 0.8s → 0.2s (**0.6s faster startup**)
- Cloud inference: 0.8s latency → 0.2s latency (**4× better throughput**)
- Model updates: 400 MB download → 100 MB download (**75% less data usage**)
- Mobile app launch: {glue:text}`q4_fp32_time` → {glue:text}`q4_int8_time` (**{glue:text}`q4_time_saved` faster startup**)
- Cloud inference: {glue:text}`q4_fp32_time` latency → {glue:text}`q4_int8_time` latency (**4× better throughput**)
- Model updates: {glue:text}`q4_fp32_mb` download → {glue:text}`q4_int8_mb` download (**75% less data usage**)

**Key insight**: Quantization reduces not just RAM usage, but also disk I/O, network transfer, and cold-start latency. The 4× reduction applies to all memory movement operations.
```
@@ -765,9 +844,9 @@ Modern CPUs have AVX-512 VNNI instructions that can perform INT8 matrix multiply
:class: dropdown

**SIMD capacity:**
- 512-bit register with FP32: 512 / 32 = **16 values**
- 512-bit register with INT8: 512 / 8 = **64 values**
- **Theoretical speedup: 64/16 = 4×**
- 512-bit register with FP32: 512 / 32 = **{glue:text}`q5_fp32` values**
- 512-bit register with INT8: 512 / 8 = **{glue:text}`q5_int8` values**
- **Theoretical speedup: {glue:text}`q5_int8`/{glue:text}`q5_fp32` = {glue:text}`q5_ratio`**

**Why actual speedup is 2-3× (not 4×):**

@@ -823,7 +902,7 @@ Implement model pruning and weight compression techniques. You'll build structur

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/15_quantization/15_quantization.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/15_quantization/quantization.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/15_quantization/15_quantization.py)** - Browse the implementation code
```


@@ -1,3 +1,9 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 16: Compression

:::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F16_compression%2F16_compression.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F16_compression%2Fcompression.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -494,7 +500,20 @@ Result: Keep only weights >= 0.087 (top 10%)

The critical insight is that weight distributions in trained networks are heavily skewed toward zero. Most weights contribute minimally, so removing them preserves the essential computation while dramatically reducing storage and compute.

The memory impact is immediate. A model with 10 million parameters at 90% sparsity has only 1 million active weights. With sparse storage formats (like scipy's CSR matrix), this translates directly to 90% memory reduction. The compute savings come from skipping zero multiplications, though realizing this speedup requires sparse computation libraries.
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Pruning fundamentals: 10M parameter model at 90% sparsity
prune_total = 10_000_000
prune_sparsity = 0.9
prune_active = round(prune_total * (1 - prune_sparsity))  # round() avoids float truncation (int() would give 999,999)
glue("prune_total", f"{prune_total / 1_000_000:.0f} million")
glue("prune_active", f"{prune_active / 1_000_000:.0f} million")
glue("prune_pct", f"{prune_sparsity * 100:.0f}%")
```

The memory impact is immediate. A model with {glue:text}`prune_total` parameters at {glue:text}`prune_pct` sparsity has only {glue:text}`prune_active` active weights. With sparse storage formats (like scipy's CSR matrix), this translates directly to 90% memory reduction. The compute savings come from skipping zero multiplications, though realizing this speedup requires sparse computation libraries.
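
A small sketch ties magnitude pruning to the CSR storage mentioned above (illustrative only; the index arrays mean real sparse formats save somewhat less than the raw sparsity suggests):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(512, 256)).astype(np.float32)

# Magnitude pruning: zero the 90% of weights with the smallest |value|
threshold = np.quantile(np.abs(w), 0.9)
pruned = np.where(np.abs(w) >= threshold, w, 0.0)

csr = sparse.csr_matrix(pruned)
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(pruned.nbytes, csr_bytes)  # dense vs sparse storage, in bytes
```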

### Structured vs Unstructured Pruning

@@ -602,9 +621,48 @@ The combined loss balances two objectives. The soft loss (with `alpha=0.7`) teac
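
The hunk above cites the combined distillation loss with `alpha=0.7`. A hedged sketch of the standard formulation (the temperature `T=4.0` here is an illustrative assumption, not taken from the module):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft loss: cross-entropy against the teacher's temperature-softened
    # outputs, scaled by T^2 to keep gradient magnitudes comparable
    soft = -(softmax(teacher_logits, T)
             * np.log(softmax(student_logits, T) + 1e-12)).sum(axis=-1).mean() * T * T
    # Hard loss: ordinary cross-entropy against the true labels
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard
```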

### Low-Rank Approximation Theory

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Low-rank example 1: (512, 256) matrix with rank_ratio=0.5
lr1_m, lr1_n = 512, 256
lr1_original = lr1_m * lr1_n
lr1_rank_ratio = 0.5
lr1_k = int(lr1_rank_ratio * min(lr1_m, lr1_n))
lr1_compressed = (lr1_m * lr1_k) + lr1_k + (lr1_k * lr1_n)
lr1_ratio = lr1_original / lr1_compressed
lr1_reduction_pct = (1 - lr1_compressed / lr1_original) * 100
glue("lr1_original", f"{lr1_original:,}")
glue("lr1_k", f"{lr1_k}")
glue("lr1_u", f"{lr1_m * lr1_k:,}")
glue("lr1_s", f"{lr1_k:,}")
glue("lr1_v", f"{lr1_k * lr1_n:,}")
glue("lr1_compressed", f"{lr1_compressed:,}")
glue("lr1_ratio", f"{lr1_ratio:.2f}x")
glue("lr1_reduction_pct", f"{lr1_reduction_pct:.0f}%")

# Low-rank example 2: (1024, 1024) matrix with rank_ratio=0.1
lr2_m, lr2_n = 1024, 1024
lr2_original = lr2_m * lr2_n
lr2_rank_ratio = 0.1
lr2_k = int(lr2_rank_ratio * min(lr2_m, lr2_n))
lr2_compressed = (lr2_m * lr2_k) + lr2_k + (lr2_k * lr2_n)
lr2_ratio = lr2_original / lr2_compressed
lr2_reduction_pct = (1 - lr2_compressed / lr2_original) * 100
glue("lr2_original", f"{lr2_original:,}")
glue("lr2_k", f"{lr2_k}")
glue("lr2_u", f"{lr2_m * lr2_k:,}")
glue("lr2_s", f"{lr2_k:,}")
glue("lr2_v", f"{lr2_k * lr2_n:,}")
glue("lr2_compressed", f"{lr2_compressed:,}")
glue("lr2_ratio", f"{lr2_ratio:.1f}x")
glue("lr2_reduction_pct", f"{lr2_reduction_pct:.0f}%")
```

Weight matrices in neural networks often contain redundancy that can be captured through low-rank approximations. Singular Value Decomposition (SVD) provides the mathematically optimal way to approximate a matrix with fewer parameters while minimizing reconstruction error.

The core idea is matrix factorization. Instead of storing a full (512, 256) weight matrix with 131,072 parameters, you decompose it into smaller factors that capture the essential structure:
The core idea is matrix factorization. Instead of storing a full (512, 256) weight matrix with {glue:text}`lr1_original` parameters, you decompose it into smaller factors that capture the essential structure:

```python
def low_rank_approximate(weight_matrix, rank_ratio=0.5):
@@ -629,14 +687,14 @@ def low_rank_approximate(weight_matrix, rank_ratio=0.5):
SVD identifies the most important "directions" in the weight matrix through singular values. Larger singular values capture more variance, so keeping only the top k values preserves most of the matrix's information while dramatically reducing parameters.

For a (512, 256) matrix with rank_ratio=0.5:
- Original: 512 × 256 = 131,072 parameters
- Compressed: (512 × 128) + 128 + (128 × 256) = 98,432 parameters
- Compression ratio: 1.33x (25% reduction)
- Original: 512 × 256 = {glue:text}`lr1_original` parameters
- Compressed: (512 × {glue:text}`lr1_k`) + {glue:text}`lr1_s` + ({glue:text}`lr1_k` × 256) = {glue:text}`lr1_compressed` parameters
- Compression ratio: {glue:text}`lr1_ratio` ({glue:text}`lr1_reduction_pct` reduction)

The compression ratio improves with larger matrices. For a (1024, 1024) matrix at rank_ratio=0.1:
- Original: 1,048,576 parameters
- Compressed: (1024 × 102) + 102 + (102 × 1024) = 209,046 parameters
- Compression ratio: 5.0x (80% reduction)
- Original: {glue:text}`lr2_original` parameters
- Compressed: (1024 × {glue:text}`lr2_k`) + {glue:text}`lr2_s` + ({glue:text}`lr2_k` × 1024) = {glue:text}`lr2_compressed` parameters
- Compression ratio: {glue:text}`lr2_ratio` ({glue:text}`lr2_reduction_pct` reduction)

Low-rank approximation trades accuracy for size. The reconstruction error depends on the discarded singular values. Choosing the right rank_ratio balances compression and accuracy preservation.
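
A runnable sketch of the factorization itself (NumPy SVD; the counts match the (512, 256), rank_ratio=0.5 example above):

```python
import numpy as np

def low_rank_factors(w, rank_ratio=0.5):
    # w ~= U[:, :k] @ diag(S[:k]) @ Vt[:k, :], keeping the top-k singular values
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    k = int(rank_ratio * min(w.shape))
    return u[:, :k], s[:k], vt[:k, :]

w = np.random.randn(512, 256).astype(np.float32)
u, s, vt = low_rank_factors(w, rank_ratio=0.5)
print(u.size + s.size + vt.size)  # 98,432 stored values vs 131,072
w_hat = (u * s) @ vt              # reconstruction from the factors
print(np.linalg.norm(w - w_hat) / np.linalg.norm(w))  # relative error
```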

@@ -731,6 +789,18 @@ The core algorithms for magnitude thresholding, L2 norm channel ranking, and kno

### Why Compression Matters at Scale

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Deployment example: 100MB model pruned to 90% sparsity
deploy_model_mb = 100
deploy_sparsity = 0.9
deploy_sparse_mb = deploy_model_mb * (1 - deploy_sparsity)
glue("deploy_model_mb", f"{deploy_model_mb} MB")
glue("deploy_sparse_mb", f"{deploy_sparse_mb:.0f} MB")
```

To appreciate compression's impact, consider real deployment constraints:

- **Mobile apps**: Models must fit in <10MB for reasonable download sizes and <50MB runtime memory
@@ -739,10 +809,66 @@ To appreciate compression's impact, consider real deployment constraints:
- **Latency targets**: Self-driving cars need <100ms inference time; compression enables real-time decisions
- **Energy efficiency**: Smartphones have ~3000mAh batteries; model size directly impacts battery life

A 100MB model pruned to 90% sparsity becomes 10MB with sparse storage, fitting mobile constraints. The same model distilled to a 1MB student runs 10x faster, meeting latency requirements. These aren't theoretical gains; they're necessary for deployment.
A {glue:text}`deploy_model_mb` model pruned to 90% sparsity becomes {glue:text}`deploy_sparse_mb` with sparse storage, fitting mobile constraints. The same model distilled to a 1MB student runs 10x faster, meeting latency requirements. These aren't theoretical gains; they're necessary for deployment.

## Check Your Understanding

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q1: Sparsity calculation for (512, 256) layer at 80% pruning
q1_rows, q1_cols = 512, 256
q1_total = q1_rows * q1_cols
q1_sparsity = 0.8
q1_active = int(q1_total * (1 - q1_sparsity))
q1_zeroed = q1_total - q1_active  # complement of active, so the two always sum to q1_total
glue("q1_total", f"{q1_total:,}")
glue("q1_active", f"{q1_active:,}")
glue("q1_zeroed", f"{q1_zeroed:,}")

# Q2: Sequential pruning (magnitude 90% then structured 50%)
q2_after_mag = 0.10 # 10% active after 90% magnitude
q2_after_struct = 0.50 # 50% channels remain
q2_final_active_pct = q2_after_mag * q2_after_struct * 100
q2_final_sparse_pct = 100 - q2_final_active_pct
glue("q2_final_active", f"{q2_final_active_pct:.0f}%")
glue("q2_final_sparse", f"{q2_final_sparse_pct:.0f}%")

# Q3: Knowledge distillation efficiency
q3_teacher_params = 100_000_000
q3_student_params = 10_000_000
q3_teacher_acc = 95
q3_student_acc = 92
q3_teacher_ms = 500
q3_student_ms = 50
q3_compression = q3_teacher_params / q3_student_params
q3_speedup = q3_teacher_ms / q3_student_ms
q3_acc_loss = q3_teacher_acc - q3_student_acc
glue("q3_compression", f"{q3_compression:.0f}x")
glue("q3_speedup", f"{q3_speedup:.0f}x")
glue("q3_acc_loss", f"{q3_acc_loss:.0f}%")

# Q4: Low-rank decomposition for (1000, 1000) with rank=100
q4_m, q4_n = 1000, 1000
q4_original = q4_m * q4_n
q4_rank = 100
q4_u_params = q4_m * q4_rank
q4_s_params = q4_rank
q4_v_params = q4_rank * q4_n
q4_compressed = q4_u_params + q4_s_params + q4_v_params
q4_ratio = q4_original / q4_compressed
q4_savings_bytes = (q4_original - q4_compressed) * 4
q4_savings_mb = q4_savings_bytes / 1024**2
glue("q4_original", f"{q4_original:,}")
glue("q4_u_params", f"{q4_u_params:,}")
glue("q4_s_params", f"{q4_s_params:,}")
glue("q4_v_params", f"{q4_v_params:,}")
glue("q4_compressed", f"{q4_compressed:,}")
glue("q4_ratio", f"~{q4_ratio:.0f}x")
glue("q4_savings_mb", f"{q4_savings_mb:.1f} MB")
```

Test yourself with these systems thinking questions. They're designed to build intuition for compression trade-offs you'll encounter in production.

**Q1: Sparsity Calculation**
@@ -752,11 +878,11 @@ A Linear layer with shape (512, 256) undergoes 80% magnitude pruning. How many w
```{admonition} Answer
:class: dropdown

Total parameters: 512 × 256 = **131,072**
Total parameters: 512 × 256 = **{glue:text}`q1_total`**

After 80% pruning: 20% remain active = 131,072 × 0.2 = **26,214 active weights**
After 80% pruning: 20% remain active = {glue:text}`q1_total` × 0.2 = **{glue:text}`q1_active` active weights**

Zeroed weights: 131,072 × 0.8 = **104,858 zeros**
Zeroed weights: {glue:text}`q1_total` × 0.8 = **{glue:text}`q1_zeroed` zeros**

This is why sparsity creates memory savings - 80% of parameters are literally zero!
```
@@ -773,7 +899,7 @@ You apply magnitude pruning (90% sparsity) and structured pruning (50% channels)
Approximation:
- After magnitude: 90% sparse → 10% active weights
- Structured removes 50% of channels → removes 50% of rows/columns
- Final active weights ≈ 10% × 50% = **5% active → 95% sparse**
- Final active weights ≈ 10% × 50% = **{glue:text}`q2_final_active` active → {glue:text}`q2_final_sparse` sparse**

Actual result depends on which channels structured pruning removes. If it removes already-sparse channels, sparsity increases less.
```
@@ -788,15 +914,15 @@ What's the compression ratio and speedup?
```{admonition} Answer
:class: dropdown

**Compression ratio**: 100M / 10M = **10x smaller**
**Compression ratio**: 100M / 10M = **{glue:text}`q3_compression` smaller**

**Speedup**: 500ms / 50ms = **10x faster**
**Speedup**: 500ms / 50ms = **{glue:text}`q3_speedup` faster**

**Accuracy loss**: 95% - 92% = **3% degradation**
**Accuracy loss**: 95% - 92% = **{glue:text}`q3_acc_loss` degradation**

Why speedup matches compression: Student has 10x fewer parameters, so 10x fewer operations. Linear scaling!

Is this good? **Yes** - 10x compression with only 3% accuracy loss is excellent for mobile deployment.
Is this good? **Yes** - {glue:text}`q3_compression` compression with only {glue:text}`q3_acc_loss` accuracy loss is excellent for mobile deployment.
```

**Q4: Low-Rank Decomposition Math**
@@ -806,18 +932,18 @@ A (1000, 1000) weight matrix gets low-rank approximation with rank=100. Calculat
```{admonition} Answer
:class: dropdown

Original: 1000 × 1000 = **1,000,000 parameters**
Original: 1000 × 1000 = **{glue:text}`q4_original` parameters**

SVD decomposition: W ≈ U @ S @ V
- U: (1000, 100) = 100,000 parameters
- S: (100,) = 100 parameters (diagonal)
- V: (100, 1000) = 100,000 parameters
- U: (1000, 100) = {glue:text}`q4_u_params` parameters
- S: (100,) = {glue:text}`q4_s_params` parameters (diagonal)
- V: (100, 1000) = {glue:text}`q4_v_params` parameters

Compressed: 100,000 + 100 + 100,000 = **200,100 parameters**
Compressed: {glue:text}`q4_u_params` + {glue:text}`q4_s_params` + {glue:text}`q4_v_params` = **{glue:text}`q4_compressed` parameters**

Compression ratio: 1,000,000 / 200,100 = **~5x reduction**
Compression ratio: {glue:text}`q4_original` / {glue:text}`q4_compressed` = **{glue:text}`q4_ratio` reduction**

Memory savings: (1,000,000 - 200,100) × 4 bytes = **3.2 MB saved** (float32)
Memory savings: ({glue:text}`q4_original` - {glue:text}`q4_compressed`) × 4 bytes = **{glue:text}`q4_savings_mb` saved** (float32)
```

**Q5: Structured vs Unstructured Trade-offs**
@@ -846,7 +972,7 @@ For students who want to understand the academic foundations and explore compres

- **Learning both Weights and Connections for Efficient Neural Networks** - Han et al. (2015). Introduced magnitude-based pruning and demonstrated 90% sparsity with minimal accuracy loss. Foundation for modern pruning research. [arXiv:1506.02626](https://arxiv.org/abs/1506.02626)

- **The Lottery Ticket Hypothesis** - Frankle & Carbin (2019). Showed that dense networks contain sparse subnetworks trainable to full accuracy from initialization. Changed how we think about pruning and network over-parameterization. [arXiv:1803.03635](https://arxiv.org/abs/1803.03635)
- **The Lottery Ticket Hypothesis** - Frankle & Carbin (2019). Showed that dense networks contain sparse subnetworks trainable to full accuracy from initialization. Changed how we think about pruning and network over-parameterization. [arXiv:1803.03635](https://arxiv.org/abs/1803.03635)

- **Distilling the Knowledge in a Neural Network** - Hinton et al. (2015). Introduced knowledge distillation with temperature scaling. Enables training compact models that match large model accuracy. [arXiv:1503.02531](https://arxiv.org/abs/1503.02531)

@@ -877,7 +1003,7 @@ Implement hardware-aware optimization techniques including vectorized matrix ope

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/16_compression/16_compression.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/16_compression/compression.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/16_compression/16_compression.py)** - Browse the implementation code
```


@@ -1,3 +1,9 @@
---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 17: Acceleration

:::{admonition} Module Info
@@ -31,7 +37,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F17_acceleration%2F17_acceleration.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F17_acceleration%2Facceleration.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -469,13 +475,30 @@ The magic happens inside `np.matmul`. NumPy delegates to BLAS (Basic Linear Alge

### BLAS and LAPACK

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# GEMM for N=1024: FLOPs = 2 * N^3, memory = 3 * N^2 elements * 4 bytes (float32)
N_blas = 1024
blas_flops = 2 * N_blas**3
blas_elements = 3 * N_blas**2
blas_bytes = blas_elements * 4
blas_data_mb = blas_bytes / 1024**2
blas_ai = blas_flops / blas_bytes

glue("blas_flops_billions", f"{blas_flops / 1e9:.1f}")
glue("blas_data_mb", f"{blas_data_mb:.0f}")
glue("blas_ai", f"{blas_ai:.0f}")
```

BLAS provides three levels of operations, each with different performance characteristics:

- **Level 1**: Vector operations (AXPY: y = αx + y). These are memory-bound with low arithmetic intensity.
- **Level 2**: Matrix-vector operations (GEMV: y = αAx + βy). Better arithmetic intensity but still memory-limited.
- **Level 3**: Matrix-matrix operations (GEMM: C = αAB + βC). High arithmetic intensity, compute-bound.

Matrix multiplication (GEMM) dominates neural network training because every linear layer, every attention mechanism, and every convolution ultimately reduces to matrix multiplication. GEMM performs 2N³ floating-point operations while reading only 3N² elements from memory. For a 1024×1024 matrix, that's 2.1 billion operations on just 12 MB of data - an arithmetic intensity of 170 FLOPs/byte. This high ratio of computation to memory access makes GEMM perfect for hardware acceleration.
Matrix multiplication (GEMM) dominates neural network training because every linear layer, every attention mechanism, and every convolution ultimately reduces to matrix multiplication. GEMM performs 2N³ floating-point operations while reading only 3N² elements from memory. For a 1024×1024 matrix, that's {glue:text}`blas_flops_billions` billion operations on just {glue:text}`blas_data_mb` MB of data - an arithmetic intensity of {glue:text}`blas_ai` FLOPs/byte. This high ratio of computation to memory access makes GEMM perfect for hardware acceleration.
|
||||
|
||||
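To make this concrete, here is a minimal, illustrative sketch (not part of the module's graded code) that measures the GFLOPS your BLAS achieves for a single GEMM call; the 2N³ FLOP count is the standard GEMM convention, and the numbers you see will depend on your BLAS build and hardware.

```python
import time
import numpy as np

N = 1024
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

A @ B  # warmup: let BLAS initialize its thread pool

start = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - start

flops = 2 * N**3             # GEMM convention: N^3 multiplies + N^3 adds
bytes_moved = 3 * N * N * 4  # read A, read B, write C (float32)
print(f"Achieved: {flops / elapsed / 1e9:.1f} GFLOPS")
print(f"Arithmetic intensity: {flops / bytes_moved:.0f} FLOPs/byte")
```
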
### Memory Layout Optimization

@@ -512,6 +535,48 @@ def tiled_matmul(a: Tensor, b: Tensor, tile_size: int = 64) -> Tensor:

### Kernel Fusion

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Kernel fusion memory traffic analysis
# 4 million element tensor, float32 (4 bytes each)
num_elements = 4_000_000
bytes_per_element = 4
tensor_bytes = num_elements * bytes_per_element
tensor_mb = tensor_bytes / 1024**2

# Unfused: 7 intermediate arrays, each written once and then read once.
# The prose counts the intermediates only: 7 reads + 7 writes per element = 14 ops.
# (Q2 below additionally counts the input read and output write, giving 16.)
unfused_mem_ops = 14
unfused_traffic_bytes = unfused_mem_ops * tensor_bytes
unfused_traffic_mb = unfused_traffic_bytes / 1024**2

# Fused: 1 read + 1 write = 2 memory operations
fused_mem_ops = 2
fused_traffic_bytes = fused_mem_ops * tensor_bytes
fused_traffic_mb = fused_traffic_bytes / 1024**2

# Bandwidth calculation: 50 GB/s
bandwidth_gb_s = 50
bandwidth_bytes_s = bandwidth_gb_s * 1e9
unfused_time_ms = (unfused_traffic_bytes / bandwidth_bytes_s) * 1000
fused_time_ms = (fused_traffic_bytes / bandwidth_bytes_s) * 1000
fusion_speedup = unfused_time_ms / fused_time_ms

glue("fusion_tensor_elements", f"{num_elements:,}")
glue("fusion_tensor_mb", f"{tensor_mb:.0f}")
glue("fusion_unfused_mem_ops", f"{num_elements * unfused_mem_ops / 1e6:.0f}")
glue("fusion_unfused_traffic_mb", f"{unfused_traffic_mb:.0f}")
glue("fusion_unfused_time_ms", f"{unfused_time_ms:.2f}")
glue("fusion_fused_traffic_mb", f"{fused_traffic_mb:.0f}")
glue("fusion_fused_time_ms", f"{fused_time_ms:.2f}")
glue("fusion_speedup", f"{fusion_speedup:.1f}")
```

Element-wise operations like GELU activation are memory-bound: they spend more time loading and storing data than computing results. Consider the GELU formula:

```
@@ -537,7 +602,7 @@ def unfused_gelu(x: Tensor) -> Tensor:
    return result
```

Each temporary array allocation writes to memory, and each subsequent operation reads from memory. For a 4 million element tensor, this unfused version performs 28 million memory operations (7 reads + 7 writes per element). Memory bandwidth on a typical CPU is around 50 GB/s, so moving 112 MB takes 2.24 milliseconds - just for memory traffic, before any computation.
Each temporary array allocation writes to memory, and each subsequent operation reads from memory. For a {glue:text}`fusion_tensor_elements` element tensor, this unfused version performs {glue:text}`fusion_unfused_mem_ops` million memory operations (7 reads + 7 writes per element). Memory bandwidth on a typical CPU is around 50 GB/s, so moving {glue:text}`fusion_unfused_traffic_mb` MB takes {glue:text}`fusion_unfused_time_ms` milliseconds - just for memory traffic, before any computation.

Kernel fusion combines all operations into a single expression:

@@ -554,7 +619,7 @@ def fused_gelu(x: Tensor) -> Tensor:
    return Tensor(result_data)
```

Now there are only two memory operations: read the input, write the output. For the same 4 million element tensor, that's just 32 MB of memory traffic, completing in 0.64 milliseconds. The fused version is 3.5x faster purely from memory bandwidth reduction, even though both versions perform the same arithmetic.
Now there are only two memory operations: read the input, write the output. For the same {glue:text}`fusion_tensor_elements` element tensor, that's just {glue:text}`fusion_fused_traffic_mb` MB of memory traffic, completing in {glue:text}`fusion_fused_time_ms` milliseconds. The fused version is {glue:text}`fusion_speedup`x faster purely from memory bandwidth reduction, even though both versions perform the same arithmetic.

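As a rough, NumPy-only illustration of the bandwidth effect (independent of the TinyTorch `Tensor` class, and only an approximation: NumPy still allocates internal temporaries even in the single-expression form), you can time a step-by-step tanh-approximation GELU against the combined expression:

```python
import time
import numpy as np

x = np.random.rand(4_000_000).astype(np.float32)

def unfused(x):
    # Each step materializes a temporary array in memory
    a = x ** 3
    b = 0.044715 * a
    c = x + b
    d = np.sqrt(2 / np.pi) * c
    e = np.tanh(d)
    f = 1 + e
    return 0.5 * x * f

def fused(x):
    # One expression; true kernel fusion would avoid all temporaries,
    # so treat any measured gap here as a lower bound on the effect
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

for fn in (unfused, fused):
    fn(x)  # warmup
    start = time.perf_counter()
    fn(x)
    print(f"{fn.__name__}: {(time.perf_counter() - start) * 1000:.2f} ms")
```
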
### Parallel Processing

@@ -580,6 +645,24 @@ The acceleration techniques you implement in this module - vectorization, fusion

### Arithmetic Intensity and the Roofline Model

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Element-wise addition: AI = N / (3N * 4) = 1/12
ai_elemwise = 1 / 12

# Matrix multiplication: AI = 2N^3 / (3N^2 * 4) = N/6
N_roof = 1024
ai_matmul_formula = "N/6"
ai_matmul_1024 = N_roof / 6
ai_ratio = ai_matmul_1024 / ai_elemwise

glue("roof_ai_elemwise", f"{ai_elemwise:.3f}")
glue("roof_ai_matmul_1024", f"{ai_matmul_1024:.0f}")
glue("roof_ai_ratio", f"{ai_ratio:.0f}")
```

Not all operations are created equal. The roofline model helps predict whether an operation will be limited by memory bandwidth or computational throughput. Arithmetic intensity is the ratio of floating-point operations to bytes transferred:

```
@@ -589,21 +672,21 @@ Arithmetic Intensity (AI) = FLOPs / Bytes

For element-wise addition of two N-element arrays:
- FLOPs: N (one addition per element)
- Bytes: 3N × 4 = 12N (read A, read B, write C, each 4 bytes for float32)
- AI = N / 12N = 0.083 FLOPs/byte
- AI = N / 12N = {glue:text}`roof_ai_elemwise` FLOPs/byte

For matrix multiplication of N×N matrices:
- FLOPs: 2N³ (N³ multiplications + N³ additions)
- Bytes: 3N² × 4 = 12N² (read A, read B, write C)
- AI = 2N³ / 12N² = N/6 FLOPs/byte

For a 1024×1024 matrix: AI = 170 FLOPs/byte. Matrix multiplication performs 2000x more computation per byte transferred than element-wise addition. This is why GPUs excel at matrix operations but struggle with element-wise ops.
For a 1024×1024 matrix: AI = {glue:text}`roof_ai_matmul_1024` FLOPs/byte. Matrix multiplication performs {glue:text}`roof_ai_ratio`x more computation per byte transferred than element-wise addition. This is why GPUs excel at matrix operations but struggle with element-wise ops.

| Operation | Arithmetic Intensity | Bottleneck | Optimization Strategy |
|-----------|---------------------|------------|----------------------|
| Element-wise add | ~0.08 FLOPs/byte | Memory bandwidth | Kernel fusion |
| Element-wise multiply | ~0.08 FLOPs/byte | Memory bandwidth | Kernel fusion |
| GELU activation | ~1.0 FLOPs/byte | Memory bandwidth | Kernel fusion |
| Matrix multiply (1024×1024) | ~170 FLOPs/byte | Compute throughput | Vectorization, tiling |
| Matrix multiply (1024×1024) | ~{glue:text}`roof_ai_matmul_1024` FLOPs/byte | Compute throughput | Vectorization, tiling |

The roofline model plots achievable performance against arithmetic intensity. Your hardware has a peak memory bandwidth (horizontal line) and peak computational throughput (diagonal line). The minimum of these two lines is your performance ceiling.

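The ceiling itself is easy to compute once you pick hardware numbers. The sketch below uses illustrative figures (50 GB/s bandwidth and 500 GFLOPS peak are assumptions, not measurements of any particular chip) to show which regime each operation falls into:

```python
PEAK_GFLOPS = 500   # assumed compute roof
PEAK_BW_GBS = 50    # assumed memory bandwidth roof

def roofline_gflops(ai_flops_per_byte):
    # Attainable performance = min(compute roof, bandwidth roof at this AI)
    return min(PEAK_GFLOPS, PEAK_BW_GBS * ai_flops_per_byte)

for name, ai in [("element-wise add", 1 / 12),
                 ("GELU activation", 1.0),
                 ("matmul 1024x1024", 1024 / 6)]:
    bound = "memory" if PEAK_BW_GBS * ai < PEAK_GFLOPS else "compute"
    print(f"{name:>18}: ceiling {roofline_gflops(ai):6.1f} GFLOPS ({bound}-bound)")
```
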
@@ -741,37 +824,115 @@ Small percentage improvements at this scale translate to millions in savings and

Test your understanding of acceleration techniques with these quantitative questions.

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q1: Arithmetic Intensity for 1024x1024 float32 matmul
N_q1 = 1024
q1_flops = 2 * N_q1**3
q1_matrix_bytes = N_q1 * N_q1 * 4  # float32 = 4 bytes
q1_matrix_mb = q1_matrix_bytes / 1024**2
q1_total_bytes = 3 * q1_matrix_bytes  # read A + read B + write C
q1_total_mb = 3 * q1_matrix_mb
q1_read_mb = 2 * q1_matrix_mb  # read A + read B only
q1_ai = q1_flops / q1_total_bytes

glue("q1_flops", f"{q1_flops:,}")
glue("q1_matrix_mb", f"{q1_matrix_mb:.0f}")
glue("q1_read_mb", f"{q1_read_mb:.0f}")
glue("q1_total_mb", f"{q1_total_mb:.0f}")
glue("q1_total_bytes", f"{q1_total_bytes:,}")
glue("q1_ai", f"{q1_ai:.0f}")
```

**Q1: Arithmetic Intensity**

Matrix multiplication of two 1024×1024 float32 matrices performs 2,147,483,648 FLOPs. It reads 8 MB (matrix A) + 8 MB (matrix B) = 16 MB and writes 8 MB (matrix C) = 24 MB total. What is the arithmetic intensity?
Matrix multiplication of two 1024×1024 float32 matrices performs {glue:text}`q1_flops` FLOPs. It reads {glue:text}`q1_matrix_mb` MB (matrix A) + {glue:text}`q1_matrix_mb` MB (matrix B) = {glue:text}`q1_read_mb` MB and writes {glue:text}`q1_matrix_mb` MB (matrix C) = {glue:text}`q1_total_mb` MB total. What is the arithmetic intensity?

```{admonition} Answer
:class: dropdown

Arithmetic Intensity = 2,147,483,648 FLOPs / 24,000,000 bytes = **~89 FLOPs/byte**
Arithmetic Intensity = {glue:text}`q1_flops` FLOPs / {glue:text}`q1_total_bytes` bytes = **~{glue:text}`q1_ai` FLOPs/byte**

This high arithmetic intensity (compared to ~0.08 for element-wise ops) is why matrix multiplication is ideal for GPUs and why it dominates neural network training time.
```

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q2: Memory bandwidth savings from kernel fusion
q2_elements = 1_000_000
q2_tensor_bytes = q2_elements * 4  # float32
q2_tensor_mb = q2_tensor_bytes / 1024**2

# Unfused: 7 intermediate arrays => 7 reads + 7 writes + 1 input read + 1 output write = 16 ops
q2_unfused_ops = 16
q2_unfused_mb = q2_unfused_ops * q2_tensor_mb

# Fused: 1 input read + 1 output write = 2 ops
q2_fused_ops = 2
q2_fused_mb = q2_fused_ops * q2_tensor_mb

q2_savings_mb = q2_unfused_mb - q2_fused_mb
q2_savings_pct = (q2_savings_mb / q2_unfused_mb) * 100

glue("q2_elements", f"{q2_elements:,}")
glue("q2_tensor_mb", f"{q2_tensor_mb:.0f}")
glue("q2_unfused_mb", f"{q2_unfused_mb:.0f}")
glue("q2_fused_mb", f"{q2_fused_mb:.0f}")
glue("q2_savings_mb", f"{q2_savings_mb:.0f}")
glue("q2_savings_pct", f"{q2_savings_pct:.1f}")
```

**Q2: Memory Bandwidth Savings**

Your fused GELU processes a tensor with 1,000,000 elements (4 MB as float32). The unfused version creates 7 intermediate arrays. How much memory bandwidth does fusion save?
Your fused GELU processes a tensor with {glue:text}`q2_elements` elements ({glue:text}`q2_tensor_mb` MB as float32). The unfused version creates 7 intermediate arrays. How much memory bandwidth does fusion save?

```{admonition} Answer
:class: dropdown

**Unfused**: 7 reads + 7 writes + 1 input read + 1 output write = 16 memory operations × 4 MB = **64 MB**
**Unfused**: 7 reads + 7 writes + 1 input read + 1 output write = 16 memory operations × {glue:text}`q2_tensor_mb` MB = **{glue:text}`q2_unfused_mb` MB**

**Fused**: 1 input read + 1 output write = 2 memory operations × 4 MB = **8 MB**
**Fused**: 1 input read + 1 output write = 2 memory operations × {glue:text}`q2_tensor_mb` MB = **{glue:text}`q2_fused_mb` MB**

**Savings**: 64 - 8 = **56 MB saved (87.5% reduction)**
**Savings**: {glue:text}`q2_unfused_mb` - {glue:text}`q2_fused_mb` = **{glue:text}`q2_savings_mb` MB saved ({glue:text}`q2_savings_pct`% reduction)**

For typical CPUs with ~50 GB/s bandwidth, this saves ~1 millisecond per GELU call. In a transformer with 96 GELU activations per forward pass, that's 96ms saved - enough to improve throughput by 10-20%.
```

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue
import math

# Q3: Cache tiling for 2048x2048 float32 with 256 KB L2
q3_cache_kb = 256
q3_cache_bytes = q3_cache_kb * 1024  # binary: 262,144 bytes
q3_matrix_dim = 2048
q3_matrix_bytes = q3_matrix_dim * q3_matrix_dim * 4
q3_matrix_mb = q3_matrix_bytes / 1024**2

# 3 tiles must fit in cache: 3 * tile^2 * 4 <= cache_bytes
q3_max_tile_sq = q3_cache_bytes / 12  # 3 tiles * 4 bytes
q3_max_tile = math.isqrt(int(q3_max_tile_sq))  # integer sqrt (floor)
# Practical power-of-2 tile size
q3_practical_tile = 128
q3_practical_bytes = 3 * q3_practical_tile**2 * 4
q3_practical_kb = q3_practical_bytes / 1024

glue("q3_cache_kb", f"{q3_cache_kb}")
glue("q3_matrix_dim", f"{q3_matrix_dim}")
glue("q3_matrix_mb", f"{q3_matrix_mb:.0f}")
glue("q3_cache_bytes", f"{q3_cache_bytes:,}")
glue("q3_max_tile_sq", f"{q3_max_tile_sq:,.0f}")
glue("q3_max_tile", f"{q3_max_tile}")
glue("q3_practical_tile", f"{q3_practical_tile}")
glue("q3_practical_kb", f"{q3_practical_kb:.0f}")
```

**Q3: Cache Tiling**

A CPU has 256 KB L2 cache. You're multiplying two 2048×2048 float32 matrices (16 MB each). What tile size keeps the working set in L2 cache?
A CPU has {glue:text}`q3_cache_kb` KB L2 cache. You're multiplying two {glue:text}`q3_matrix_dim`×{glue:text}`q3_matrix_dim` float32 matrices ({glue:text}`q3_matrix_mb` MB each). What tile size keeps the working set in L2 cache?

```{admonition} Answer
:class: dropdown
@@ -781,13 +942,28 @@ For tiled multiplication, we need 3 tiles in cache simultaneously:
- Tile from matrix B: tile_size × tile_size × 4 bytes
- Output tile: tile_size × tile_size × 4 bytes

Total: 3 × tile_size² × 4 bytes ≤ 256 KB
Total: 3 × tile_size² × 4 bytes ≤ {glue:text}`q3_cache_kb` KB

Solving: tile_size² ≤ 256,000 / 12 = 21,333
Solving: tile_size² ≤ {glue:text}`q3_cache_bytes` / 12 = {glue:text}`q3_max_tile_sq`

**tile_size ≈ 146**
**tile_size ≈ {glue:text}`q3_max_tile`**

In practice, use powers of 2: **128 works well** (3 × 128² × 4 = 196 KB, leaving room for other data).
In practice, use powers of 2: **{glue:text}`q3_practical_tile` works well** (3 × {glue:text}`q3_practical_tile`² × 4 = {glue:text}`q3_practical_kb` KB, leaving room for other data).
```

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q4: BLAS performance calculation
q4_flops = 2.15e9
q4_time_s = 0.01  # 10ms
q4_gflops = q4_flops / (q4_time_s * 1e9)
q4_peak_gflops = 500
q4_efficiency_pct = (q4_gflops / q4_peak_gflops) * 100

glue("q4_gflops", f"{q4_gflops:.0f}")
glue("q4_efficiency_pct", f"{q4_efficiency_pct:.0f}")
```

**Q4: BLAS Performance**
@@ -797,14 +973,29 @@ Your vectorized matmul completes a 1024×1024 multiplication in 10ms. The operat

```{admonition} Answer
:class: dropdown

GFLOPS = 2,150,000,000 FLOPs / (0.01 seconds × 1,000,000,000) = **215 GFLOPS**
GFLOPS = 2,150,000,000 FLOPs / (0.01 seconds × 1,000,000,000) = **{glue:text}`q4_gflops` GFLOPS**

For reference:
- Modern CPU peak: 500-1000 GFLOPS (AVX-512)
- Your efficiency: 215/500 = **43% of peak** (typical for real code)
- Your efficiency: {glue:text}`q4_gflops`/500 = **{glue:text}`q4_efficiency_pct`% of peak** (typical for real code)
- GPU equivalent: ~50 TFLOPS (230x faster than single CPU core)
```

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q5: Speedup from fusion
q5_unfused_ms = 8.0
q5_fused_ms = 2.5
q5_speedup = q5_unfused_ms / q5_fused_ms
q5_mem_overhead_pct = ((q5_unfused_ms - q5_fused_ms) / q5_unfused_ms) * 100

glue("q5_speedup", f"{q5_speedup:.1f}")
glue("q5_mem_overhead_pct", f"{q5_mem_overhead_pct:.2f}")
glue("q5_mem_overhead_approx", f"{round(q5_mem_overhead_pct)}")
```

**Q5: Speedup from Fusion**

Unfused GELU takes 8ms on a 2000×2000 tensor. Fused GELU takes 2.5ms. What percentage of the unfused time was memory overhead?
@@ -812,12 +1003,12 @@ Unfused GELU takes 8ms on a 2000×2000 tensor. Fused GELU takes 2.5ms. What perc
```{admonition} Answer
:class: dropdown

Speedup = 8ms / 2.5ms = **3.2x faster**
Speedup = 8ms / 2.5ms = **{glue:text}`q5_speedup`x faster**

Assuming both versions do the same computation, the difference is memory bandwidth:
- Memory overhead = (8 - 2.5) / 8 = **68.75%**
- Memory overhead = (8 - 2.5) / 8 = **{glue:text}`q5_mem_overhead_pct`%**

Nearly **70% of the unfused version's time** was spent waiting for memory! This is typical for element-wise operations with low arithmetic intensity.
Nearly **{glue:text}`q5_mem_overhead_approx`% of the unfused version's time** was spent waiting for memory! This is typical for element-wise operations with low arithmetic intensity.
```

## Further Reading
@@ -858,7 +1049,7 @@ Implement caching and memoization strategies to eliminate redundant computations

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/17_acceleration/17_acceleration.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/17_acceleration/acceleration.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/17_acceleration/17_acceleration.py)** - Browse the implementation code
```

@@ -1,3 +1,110 @@

---
file_format: mystnb
kernelspec:
  name: python3
---

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# === Overview section: 100-token generation ===
overview_n = 100
overview_without = overview_n * (overview_n + 1) // 2
overview_with = overview_n
overview_speedup = overview_without / overview_with
glue("overview_without_cache", f"{overview_without:,}")
glue("overview_with_cache", f"{overview_with:,}")
glue("overview_speedup", f"{overview_without // overview_with}x")

# === Gradient checkpointing: 96-layer transformer ===
ckpt_layers = 96
ckpt_interval = 12
ckpt_stored = ckpt_layers // ckpt_interval
ckpt_recomputed = ckpt_interval - 1
ckpt_reduction = ckpt_layers // ckpt_stored
glue("ckpt_stored", f"{ckpt_stored}")
glue("ckpt_recomputed", f"{ckpt_recomputed}")
glue("ckpt_reduction", f"{ckpt_reduction}x")

# === Memory-Compute Trade-offs: GPT-2 Small cache ===
L, H, S, D = 12, 12, 1024, 64
bytes_per_element = 4
gpt2_cache_bytes = 2 * L * H * S * D * bytes_per_element
gpt2_cache_mb = gpt2_cache_bytes / (1024 ** 2)
gpt2_model_mb = 500
gpt2_overhead_pct = gpt2_cache_mb / gpt2_model_mb * 100
glue("gpt2_cache_bytes", f"{gpt2_cache_bytes:,}")
glue("gpt2_cache_mb", f"{gpt2_cache_mb:.0f}")
glue("gpt2_overhead_pct", f"{gpt2_overhead_pct:.0f}%")

# === Compute reduction examples (inline prose) ===
for n in [100, 1000]:
    without = n * (n + 1) / 2
    reduction = without / n
    glue(f"compute_red_{n}", f"{reduction:.0f}x")

# === Compute reduction table ===
for n in [10, 50, 100, 500]:
    without = n * (n + 1) / 2
    reduction = without / n
    glue(f"table_red_{n}", f"{reduction:.1f}x")

# === Production context: concurrent users ===
prod_cache_per_user_mb = 75
prod_users = 10
prod_total_cache_mb = prod_cache_per_user_mb * prod_users
prod_model_mb = 500
prod_total_gb = (prod_total_cache_mb + prod_model_mb) / 1000
glue("prod_total_cache_mb", f"{prod_total_cache_mb:,}")
glue("prod_total_gb", f"{prod_total_gb:.2f}")

# === Q1: Cache Memory Calculation ===
q1_batch, q1_heads, q1_seq, q1_dim, q1_layers = 4, 8, 1024, 64, 12
q1_elements_per_tensor = q1_batch * q1_heads * q1_seq * q1_dim
q1_elements_per_layer = 2 * q1_elements_per_tensor
q1_total_elements = q1_layers * q1_elements_per_layer
q1_total_bytes = q1_total_elements * bytes_per_element
q1_total_mb = q1_total_bytes / (1024 ** 2)
glue("q1_per_tensor", f"{q1_elements_per_tensor:,}")
glue("q1_per_layer", f"{q1_elements_per_layer:,}")
glue("q1_total_elements", f"{q1_total_elements:,}")
glue("q1_total_bytes", f"{q1_total_bytes:,}")
glue("q1_total_mb", f"{q1_total_mb:.0f}")

# === Q2: Complexity Reduction (200 tokens) ===
q2_n = 200
q2_without = q2_n * (q2_n + 1) // 2
q2_with = q2_n
q2_reduction = q2_without / q2_with
glue("q2_without", f"{q2_without:,}")
glue("q2_with", f"{q2_with:,}")
glue("q2_reduction", f"{q2_reduction:.1f}x")

# === Q3: Memory-Compute Trade-off ===
q3_cache_mb = 300
q3_model_mb = 2000
q3_overhead_pct = q3_cache_mb / q3_model_mb * 100
glue("q3_overhead_pct", f"{q3_overhead_pct:.0f}%")

# === Q4: Cache Hit Rate ===
for pos in [50, 100, 500]:
    hit_rate = (pos - 1) / pos * 100
    glue(f"q4_hit_{pos}", f"{hit_rate:.0f}%" if hit_rate == int(hit_rate) else f"{hit_rate:.1f}%")

# === Q5: Batch Inference Scaling ===
q5_base_mb = 75
q5_batch = 8
q5_total_mb = q5_base_mb * q5_batch
q5_gpu_mb = 16 * 1000  # 16 GB in decimal MB (as used in the text)
q5_model_mb = 2000
q5_avail_mb = q5_gpu_mb - q5_model_mb
q5_max_seqs = int(q5_avail_mb / q5_base_mb)
glue("q5_total_mb", f"{q5_total_mb:,}")
glue("q5_avail_gb", f"{q5_avail_mb / 1000:.0f}")
glue("q5_max_seqs", f"{q5_max_seqs:,}")
```

# Module 18: Memoization

:::{admonition} Module Info
@@ -30,7 +137,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F18_memoization%2F18_memoization.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F18_memoization%2Fmemoization.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -470,9 +577,9 @@ def update(self, layer_idx: int, key: Tensor, value: Tensor) -> None:

This O(1) update operation writes directly to a pre-allocated position in the cache. No array resizing, no data copying, just an indexed assignment. The use of `.data` accesses the underlying NumPy array directly, avoiding gradient tracking overhead since caching is inference-only.

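The pattern is easy to see in isolation. Here is a minimal NumPy sketch of the idea (shapes, names, and the `update` signature are illustrative assumptions, not the module's actual API):

```python
import numpy as np

class SimpleKVCache:
    """Illustrative sketch: pre-allocated K/V buffers with O(1) indexed updates."""

    def __init__(self, layers, heads, max_seq, dim):
        shape = (layers, heads, max_seq, dim)
        # Pre-allocate once so updates never resize or copy
        self.k = np.zeros(shape, dtype=np.float32)
        self.v = np.zeros(shape, dtype=np.float32)

    def update(self, layer, pos, k_new, v_new):
        # O(1) indexed assignment into pre-allocated storage
        self.k[layer, :, pos, :] = k_new
        self.v[layer, :, pos, :] = v_new

cache = SimpleKVCache(layers=12, heads=12, max_seq=1024, dim=64)
k = np.random.rand(12, 64).astype(np.float32)  # (heads, dim) for one token
cache.update(layer=0, pos=0, k_new=k, v_new=k)
```
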
The computational savings compound across generation steps. For a 100-token sequence:
- Without caching: 1 + 2 + 3 + ... + 100 = 5,050 K,V computations
- With caching: 100 K,V computations (one per token)
- Speedup: 50x reduction in K,V computation alone
- Without caching: 1 + 2 + 3 + ... + 100 = {glue:text}`overview_without_cache` K,V computations
- With caching: {glue:text}`overview_with_cache` K,V computations (one per token)
- Speedup: {glue:text}`overview_speedup` reduction in K,V computation alone

### KV Cache in Transformers

@@ -516,8 +623,8 @@ The technique works by discarding some intermediate activations during the forwa

For a transformer with 96 layers:
- Without checkpointing: Store 96 sets of activations
- With checkpointing every 12 layers: Store 8 sets, recompute 11 sets during backward
- Memory reduction: 12x decrease
- With checkpointing every 12 layers: Store {glue:text}`ckpt_stored` sets, recompute {glue:text}`ckpt_recomputed` sets during backward
- Memory reduction: {glue:text}`ckpt_reduction` decrease
- Compute increase: ~33% slower training (recomputation overhead)

This is the inverse trade-off from KV caching. KV caching spends memory to save compute during inference. Gradient checkpointing spends compute to save memory during training. Both techniques recognize that memory and compute are fungible resources with different costs in different contexts.

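The storage arithmetic is easy to verify with a small counting sketch (illustrative only; it counts activations rather than implementing an actual backward pass):

```python
def checkpoint_costs(layers, interval):
    # Store activations only at checkpoint boundaries
    stored = layers // interval
    # Backward through each segment recomputes the non-checkpointed layers
    recomputed_per_segment = interval - 1
    memory_reduction = layers / stored
    return stored, recomputed_per_segment, memory_reduction

stored, recomputed, reduction = checkpoint_costs(layers=96, interval=12)
print(f"stored={stored}, recomputed per segment={recomputed}, "
      f"memory reduction={reduction:.0f}x")
# stored=8, recomputed per segment=11, memory reduction=12x
```
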
@@ -555,7 +662,6 @@ Every optimization involves trade-offs. KV caching trades memory for speed, and

For a transformer with L layers, H heads per layer, dimension D per head, and maximum sequence length S, the cache requires:

```
Memory = 2 × L × H × S × D × 4 bytes

Example (GPT-2 Small):
@@ -563,10 +669,9 @@ L = 12 layers
H = 12 heads
S = 1024 tokens
D = 64 dimensions
Memory = 2 × 12 × 12 × 1024 × 64 × 4 = 75,497,472 bytes ≈ 75 MB
```
Memory = 2 × 12 × 12 × 1024 × 64 × 4 = {glue:text}`gpt2_cache_bytes` bytes ≈ {glue:text}`gpt2_cache_mb` MB

For a model with 125 million parameters (500 MB), the cache adds 15% memory overhead. This seems significant until you consider the computational savings.
For a model with 125 million parameters (500 MB), the cache adds {glue:text}`gpt2_overhead_pct` memory overhead. This seems significant until you consider the computational savings.

Without caching, generating a sequence of length N requires computing K,V for:
- Step 1: 1 token
@@ -582,14 +687,14 @@ With caching:
- Step N: 1 token (compute and append)
- Total: N computations

For N = 100 tokens, caching provides 50x reduction in K,V computation. For N = 1000 tokens, the reduction is 500x. The speedup grows with sequence length, making the memory trade-off increasingly favorable for longer generation.
For N = 100 tokens, caching provides {glue:text}`compute_red_100` reduction in K,V computation. For N = 1000 tokens, the reduction is {glue:text}`compute_red_1000`. The speedup grows with sequence length, making the memory trade-off increasingly favorable for longer generation.

| Sequence Length | Cache Memory | Compute Reduction | Effective Speedup |
|-----------------|--------------|-------------------|-------------------|
| 10 tokens | 75 MB | 5.5x | 3-5x |
| 50 tokens | 75 MB | 25.5x | 8-12x |
| 100 tokens | 75 MB | 50.5x | 10-15x |
| 500 tokens | 75 MB | 250.5x | 12-20x |
| 10 tokens | {glue:text}`gpt2_cache_mb` MB | {glue:text}`table_red_10` | 3-5x |
| 50 tokens | {glue:text}`gpt2_cache_mb` MB | {glue:text}`table_red_50` | 8-12x |
| 100 tokens | {glue:text}`gpt2_cache_mb` MB | {glue:text}`table_red_100` | 10-15x |
| 500 tokens | {glue:text}`gpt2_cache_mb` MB | {glue:text}`table_red_500` | 12-20x |

The effective speedup is lower than the theoretical compute reduction because attention includes other operations beyond K,V projection, but the benefit is still dramatic.

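The counting behind the table can be reproduced in a few lines (a sketch, counting K,V computations only, not full attention cost):

```python
def kv_computations(num_tokens, cached):
    # With a cache each step computes K,V for one new token;
    # without it, step t recomputes K,V for all t tokens generated so far
    if cached:
        return num_tokens
    return sum(range(1, num_tokens + 1))

for n in (10, 50, 100, 500):
    without = kv_computations(n, cached=False)
    with_cache = kv_computations(n, cached=True)
    print(f"N={n:4d}: {without:7d} vs {with_cache:4d} "
          f"({without / with_cache:.1f}x reduction)")
```
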
@@ -753,7 +858,7 @@ To appreciate the production impact of KV caching, consider the economics of lan

The memory cost is modest compared to the benefit. For a GPT-2 model:
- Model parameters: 500 MB (loaded once, shared across all users)
- KV cache per user: 75 MB
- 10 concurrent users: 750 MB cache + 500 MB model = 1.25 GB total
- 10 concurrent users: {glue:text}`prod_total_cache_mb` MB cache + 500 MB model = {glue:text}`prod_total_gb` GB total
- Fits comfortably on a 16 GB GPU while delivering 10x throughput

## Check Your Understanding
@@ -769,13 +874,13 @@ A 12-layer transformer has 8 attention heads per layer, each head has 64 dimensi

Shape per cache tensor: (batch=4, heads=8, seq=1024, dim=64)

Elements per tensor: 4 × 8 × 1024 × 64 = 2,097,152
Elements per tensor: 4 × 8 × 1024 × 64 = {glue:text}`q1_per_tensor`

Each layer has 2 tensors (K and V): 2 × 2,097,152 = 4,194,304 elements per layer
Each layer has 2 tensors (K and V): 2 × {glue:text}`q1_per_tensor` = {glue:text}`q1_per_layer` elements per layer

Total across 12 layers: 12 × 4,194,304 = 50,331,648 elements
Total across 12 layers: 12 × {glue:text}`q1_per_layer` = {glue:text}`q1_total_elements` elements

Memory: 50,331,648 × 4 bytes = 201,326,592 bytes ≈ **192 MB**
Memory: {glue:text}`q1_total_elements` × 4 bytes = {glue:text}`q1_total_bytes` bytes ≈ **{glue:text}`q1_total_mb` MB**

This is why production systems carefully tune batch size and sequence length!
```
@@ -787,11 +892,11 @@ Without caching, generating 200 tokens requires how many K,V computations? With

```{admonition} Answer
:class: dropdown

**Without caching**: 1 + 2 + 3 + ... + 200 = 200 × 201 / 2 = **20,100 computations**
**Without caching**: 1 + 2 + 3 + ... + 200 = 200 × 201 / 2 = **{glue:text}`q2_without` computations**

**With caching**: 200 computations (one per token)
**With caching**: {glue:text}`q2_with` computations (one per token)

**Reduction**: 20,100 / 200 = **100.5x fewer K,V computations**
**Reduction**: {glue:text}`q2_without` / {glue:text}`q2_with` = **{glue:text}`q2_reduction` fewer K,V computations**

This is why the speedup grows with sequence length!
```
@@ -803,12 +908,12 @@ A model uses 2 GB for parameters. Adding KV cache uses 300 MB. Is this trade-off

```{admonition} Answer
:class: dropdown

**Memory overhead**: 300 MB / 2000 MB = 15% increase
**Memory overhead**: 300 MB / 2000 MB = {glue:text}`q3_overhead_pct` increase

**Speedup**: 12x faster generation

**Analysis**:
- Cost: 15% more memory
- Cost: {glue:text}`q3_overhead_pct` more memory
- Benefit: 12x more throughput (or 12x lower latency)
- Result: You can serve 12x more users with 1.15x the memory

@@ -829,11 +934,11 @@ At token position 50:
- Cache retrievals: 49 previous K,V pairs
- Total: 50 K,V pairs needed

**Cache hit rate**: 49/50 = **98%**
**Cache hit rate**: 49/50 = **{glue:text}`q4_hit_50`**

As generation continues:
- Token 100: 99/100 = 99% hit rate
- Token 500: 499/500 = 99.8% hit rate
- Token 100: 99/100 = {glue:text}`q4_hit_100` hit rate
- Token 500: 499/500 = {glue:text}`q4_hit_500` hit rate

The cache hit rate approaches 100% for long sequences, explaining why speedup increases with length!
```
@@ -847,7 +952,7 @@ Cache memory for batch_size=1 is 75 MB. What is cache memory for batch_size=8?

Cache memory scales linearly with batch size:

**batch_size=8**: 75 MB × 8 = **600 MB**
**batch_size=8**: 75 MB × 8 = **{glue:text}`q5_total_mb` MB**

This is why production systems carefully manage batch size:
- Larger batches → higher throughput (more sequences per second)
@@ -855,8 +960,8 @@ This is why production systems carefully manage batch size:

Trade-off example on 16 GB GPU:
- Model: 2 GB
- Available for cache: 14 GB
- Max batch size: 14 GB / 75 MB ≈ 186 sequences
- Available for cache: {glue:text}`q5_avail_gb` GB
- Max batch size: {glue:text}`q5_avail_gb` GB / 75 MB ≈ {glue:text}`q5_max_seqs` sequences

Production systems balance batch size against latency requirements and memory constraints.
```
@@ -900,7 +1005,7 @@ Learn to measure and compare performance systematically. You'll build benchmarki

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/18_memoization/18_memoization.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/18_memoization/memoization.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/18_memoization/18_memoization.py)** - Browse the implementation code
```

@@ -1,3 +1,9 @@

---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 19: Benchmarking

:::{admonition} Module Info
@@ -25,7 +31,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F19_benchmarking%2F19_benchmarking.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F19_benchmarking%2Fbenchmarking.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -792,9 +798,20 @@ The statistical methodology, warmup protocols, and reproducibility requirements

### Why Benchmarking Matters at Scale

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Production cost savings from latency reduction
scale_annual_cost = 50_000_000
scale_latency_reduction = 0.10
scale_savings = scale_annual_cost * scale_latency_reduction
glue("scale_savings", f"${scale_savings / 1_000_000:.0f} million")
```

Production ML systems operate at scales where small performance differences compound into massive resource consumption:

- **Cost**: A data center running 10,000 GPUs 24/7 consumes $50 million in electricity annually. Reducing latency 10% saves $5 million per year.
- **Cost**: A data center running 10,000 GPUs 24/7 consumes $50 million in electricity annually. Reducing latency 10% saves {glue:text}`scale_savings` per year.
- **User Experience**: Search engines must return results in under 200ms. A 50ms latency reduction is the difference between keeping or losing users.
- **Sustainability**: Training GPT-3 consumed 1,287 MWh of energy, equivalent to the annual energy use of 120 US homes. Optimization reduces carbon footprint.

@@ -802,6 +819,88 @@ Fair benchmarking ensures optimization efforts focus on changes that produce mea

## Check Your Understanding

```{code-cell} python3
:tags: [remove-input, remove-output]
import math
from myst_nb import glue

# Q1: Statistical Significance
# Baseline: mean=12.5ms, std=1.2ms, n=10
# Optimized: mean=11.8ms, std=1.5ms, n=10
q1_n = 10
q1_bl_mean = 12.5
q1_bl_std = 1.2
q1_bl_margin = 1.96 * (q1_bl_std / math.sqrt(q1_n))
q1_bl_ci_lo = q1_bl_mean - q1_bl_margin
q1_bl_ci_hi = q1_bl_mean + q1_bl_margin

q1_opt_mean = 11.8
q1_opt_std = 1.5
q1_opt_margin = 1.96 * (q1_opt_std / math.sqrt(q1_n))
q1_opt_ci_lo = q1_opt_mean - q1_opt_margin
q1_opt_ci_hi = q1_opt_mean + q1_opt_margin

q1_mean_diff = q1_bl_mean - q1_opt_mean

glue("q1_bl_margin", f"{q1_bl_margin:.2f}")
glue("q1_bl_ci_lo", f"{q1_bl_ci_lo:.2f}")
glue("q1_bl_ci_hi", f"{q1_bl_ci_hi:.2f}")
glue("q1_opt_margin", f"{q1_opt_margin:.2f}")
glue("q1_opt_ci_lo", f"{q1_opt_ci_lo:.2f}")
glue("q1_opt_ci_hi", f"{q1_opt_ci_hi:.2f}")
glue("q1_mean_diff", f"{q1_mean_diff:.1f}")

# Q2: Sample Size Calculation
# std=2.0ms, target margin=±0.5ms
q2_std = 2.0
q2_target = 0.5
q2_sqrt_n = 1.96 * q2_std / q2_target
q2_n = q2_sqrt_n ** 2
q2_n_ceil = math.ceil(q2_n)

glue("q2_sqrt_n", f"{q2_sqrt_n:.2f}")
glue("q2_n_raw", f"{q2_n:.1f}")
glue("q2_n_ceil", f"{q2_n_ceil}")

# Q3: Warmup Impact
# Without warmup measurements
q3_no_warmup = [15.2, 12.1, 10.8, 10.5, 10.6, 10.4]
q3_warmup = [10.5, 10.6, 10.4, 10.7, 10.5, 10.6]
q3_nw_mean = sum(q3_no_warmup) / len(q3_no_warmup)
q3_w_mean = sum(q3_warmup) / len(q3_warmup)
q3_latency_diff = q3_nw_mean - q3_w_mean
q3_latency_pct = q3_latency_diff / q3_nw_mean * 100

# Std values are given as approximate in the text (1.8 and 0.1)
q3_nw_std = 1.8
q3_w_std = 0.1
q3_var_reduction = (q3_nw_std - q3_w_std) / q3_nw_std * 100

glue("q3_nw_mean", f"{q3_nw_mean:.1f}ms")
glue("q3_w_mean", f"{q3_w_mean:.2f}ms")
glue("q3_latency_diff", f"{q3_latency_diff:.2f}ms")
glue("q3_latency_pct", f"{q3_latency_pct:.0f}%")
glue("q3_var_reduction", f"{q3_var_reduction:.0f}%")

# Q5: Measurement Overhead
# 1μs timer overhead, 50μs operation, 1000 measurements
q5_op_us = 50
q5_overhead_us = 1
q5_n_measurements = 1000
q5_total_op_us = q5_op_us * q5_n_measurements
q5_total_overhead_us = q5_overhead_us * q5_n_measurements
q5_total_op_ms = q5_total_op_us / 1000
q5_total_overhead_ms = q5_total_overhead_us / 1000
q5_total_measured_ms = q5_total_op_ms + q5_total_overhead_ms
q5_overhead_pct = (q5_total_overhead_ms / q5_total_measured_ms) * 100

glue("q5_total_op_us", f"{q5_total_op_us:,}")
glue("q5_total_op_ms", f"{q5_total_op_ms:.0f}ms")
glue("q5_total_overhead_us", f"{q5_total_overhead_us:,}")
glue("q5_total_overhead_ms", f"{q5_total_overhead_ms:.0f}ms")
glue("q5_total_measured_ms", f"{q5_total_measured_ms:.0f}ms")
glue("q5_overhead_pct", f"{q5_overhead_pct:.2f}%")
```

Test your understanding of benchmarking statistics and methodology with these quantitative questions.

**Q1: Statistical Significance**
@@ -813,13 +912,13 @@ You benchmark a baseline model and an optimized model 10 times each. Baseline: m

**Calculate 95% confidence intervals:**

Baseline: CI = mean ± 1.96 * (std / sqrt(n)) = 12.5 ± 1.96 * (1.2 / sqrt(10)) = 12.5 ± 0.74 = [11.76, 13.24]
Baseline: CI = mean ± 1.96 * (std / sqrt(n)) = 12.5 ± 1.96 * (1.2 / sqrt(10)) = 12.5 ± {glue:text}`q1_bl_margin` = [{glue:text}`q1_bl_ci_lo`, {glue:text}`q1_bl_ci_hi`]

Optimized: CI = 11.8 ± 1.96 * (1.5 / sqrt(10)) = 11.8 ± 0.93 = [10.87, 12.73]
Optimized: CI = 11.8 ± 1.96 * (1.5 / sqrt(10)) = 11.8 ± {glue:text}`q1_opt_margin` = [{glue:text}`q1_opt_ci_lo`, {glue:text}`q1_opt_ci_hi`]

**Result**: The confidence intervals OVERLAP (baseline goes as low as 11.76, optimized goes as high as 12.73). This means the difference is **NOT statistically significant** at the 95% confidence level. You cannot confidently claim the optimized model is faster.
**Result**: The confidence intervals OVERLAP (baseline goes as low as {glue:text}`q1_bl_ci_lo`, optimized goes as high as {glue:text}`q1_opt_ci_hi`). This means the difference is **NOT statistically significant** at the 95% confidence level. You cannot confidently claim the optimized model is faster.

**Lesson**: Always compute confidence intervals. A 0.7ms difference in means might seem meaningful, but with these variances and sample sizes, it could be random noise.
**Lesson**: Always compute confidence intervals. A {glue:text}`q1_mean_diff`ms difference in means might seem meaningful, but with these variances and sample sizes, it could be random noise.
```

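The overlap check above generalizes to any pair of benchmark runs. Here is a small helper (a sketch using the normal approximation; for n=10 a t-distribution critical value near 2.26 would be more conservative than 1.96):

```python
import math

def ci95(mean, std, n):
    # Normal-approximation 95% confidence interval for the mean
    margin = 1.96 * std / math.sqrt(n)
    return mean - margin, mean + margin

baseline = ci95(12.5, 1.2, 10)
optimized = ci95(11.8, 1.5, 10)
overlap = baseline[0] <= optimized[1] and optimized[0] <= baseline[1]
print(f"baseline CI:  [{baseline[0]:.2f}, {baseline[1]:.2f}]")
print(f"optimized CI: [{optimized[0]:.2f}, {optimized[1]:.2f}]")
print("not significant (intervals overlap)" if overlap else "significant")
```
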
**Q2: Sample Size Calculation**
@@ -833,9 +932,9 @@ You measure latency with standard deviation of 2.0ms. How many measurements do y

**Solve for n**: 0.5 = 1.96 * (2.0 / sqrt(n))

sqrt(n) = 1.96 * 2.0 / 0.5 = 7.84
sqrt(n) = 1.96 * 2.0 / 0.5 = {glue:text}`q2_sqrt_n`

n = 7.84² = **61.5 ≈ 62 measurements**
n = {glue:text}`q2_sqrt_n`² = **{glue:text}`q2_n_raw` ≈ {glue:text}`q2_n_ceil` measurements**

**Lesson**: Achieving tight confidence intervals requires many measurements. Halving the margin of error (from ±1.0ms to ±0.5ms) requires 4x more samples (~16 to ~62). This is why professional benchmarks run hundreds of iterations.
```
@@ -848,16 +947,16 @@ Without warmup, your measurements are: [15.2, 12.1, 10.8, 10.5, 10.6, 10.4] ms.
:class: dropdown

**Without warmup:**
- Mean = (15.2 + 12.1 + 10.8 + 10.5 + 10.6 + 10.4) / 6 = **11.6ms**
- Mean = (15.2 + 12.1 + 10.8 + 10.5 + 10.6 + 10.4) / 6 = **{glue:text}`q3_nw_mean`**
- Std = 1.8ms (high variance due to warmup effects)

**With warmup:**
- Mean = (10.5 + 10.6 + 10.4 + 10.7 + 10.5 + 10.6) / 6 = **10.55ms**
- Mean = (10.5 + 10.6 + 10.4 + 10.7 + 10.5 + 10.6) / 6 = **{glue:text}`q3_w_mean`**
- Std = 0.1ms (low variance, stable measurements)

**Impact:**
- Latency reduced: 11.6 - 10.55 = **1.05ms (9% reduction)**
- Variance reduced: 1.8 → 0.1ms = **95% reduction in noise**
- Latency reduced: {glue:text}`q3_nw_mean` - {glue:text}`q3_w_mean` = **{glue:text}`q3_latency_diff` ({glue:text}`q3_latency_pct` reduction)**
- Variance reduced: 1.8 → 0.1ms = **{glue:text}`q3_var_reduction` reduction in noise**

**Lesson**: Warmup eliminates cold-start effects and dramatically reduces measurement variance. Without warmup, you are measuring system startup behavior, not steady-state performance.
```

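In code, warmup is just a matter of running and discarding the first few iterations before recording. A minimal pattern (illustrative, with a placeholder matmul workload):

```python
import time
import numpy as np

def benchmark(fn, warmup=3, iters=10):
    for _ in range(warmup):  # run and discard cold-start iterations
        fn()
    times_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000)
    return np.mean(times_ms), np.std(times_ms)

x = np.random.rand(512, 512).astype(np.float32)
mean_ms, std_ms = benchmark(lambda: x @ x)
print(f"{mean_ms:.2f} ms ± {std_ms:.2f} ms")
```
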
@@ -888,13 +987,13 @@ Your timer has 1μs overhead per measurement. You measure a 50μs operation 1000
```{admonition} Answer
:class: dropdown

**Total true operation time**: 50μs × 1000 = 50,000μs = 50ms
**Total true operation time**: 50μs × 1000 = {glue:text}`q5_total_op_us`μs = {glue:text}`q5_total_op_ms`

**Total timer overhead**: 1μs × 1000 = 1,000μs = 1ms
**Total timer overhead**: 1μs × 1000 = {glue:text}`q5_total_overhead_us`μs = {glue:text}`q5_total_overhead_ms`

**Total measured time**: 50ms + 1ms = 51ms
**Total measured time**: {glue:text}`q5_total_op_ms` + {glue:text}`q5_total_overhead_ms` = {glue:text}`q5_total_measured_ms`

**Overhead percentage**: (1ms / 51ms) × 100% = **1.96%**
**Overhead percentage**: ({glue:text}`q5_total_overhead_ms` / {glue:text}`q5_total_measured_ms`) × 100% = **{glue:text}`q5_overhead_pct`**

**Lesson**: Timer overhead is negligible for operations longer than ~50μs, but becomes significant for microsecond-scale operations. This is why we use `time.perf_counter()` with nanosecond resolution and minimal overhead. For operations under 10μs, consider measuring batches and averaging.
```

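Batched timing, mentioned in the lesson above, looks like this in practice (a sketch with a placeholder workload): one timer read pair covers many calls, so the timer's overhead is amortized instead of being paid on every measurement.

```python
import time

def time_per_call_us(fn, batch=1000):
    # One perf_counter pair per batch amortizes timer overhead across calls
    start = time.perf_counter()
    for _ in range(batch):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / batch * 1e6

print(f"{time_per_call_us(lambda: sum(range(100))):.2f} us per call")
```
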
@@ -938,7 +1037,7 @@ Apply everything you have learned in Modules 01-19 to compete in the TorchPerf O

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/19_benchmarking/19_benchmarking.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/19_benchmarking/benchmarking.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/19_benchmarking/19_benchmarking.py)** - Browse the implementation code
```

@@ -1,3 +1,9 @@

---
file_format: mystnb
kernelspec:
  name: python3
---

# Module 20: Capstone

:::{admonition} Module Info
@@ -30,7 +36,7 @@ Listen to an AI-generated overview.

Run interactively in your browser.

<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F20_capstone%2F20_capstone.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
<a href="https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?labpath=tinytorch%2Fmodules%2F20_capstone%2Fcapstone.ipynb" target="_blank" style="display: flex; align-items: center; justify-content: center; width: 100%; height: 54px; margin-top: auto; background: #f97316; color: white; text-align: center; text-decoration: none; border-radius: 27px; font-size: 14px; box-sizing: border-box;">Open in Binder →</a>
```

```{grid-item-card} 📄 View Source
@@ -543,13 +549,28 @@ latency_ms = (time.time() - start) * 1000

A model with 10ms latency processes one input in 10 milliseconds. If a user submits a query, they wait 10ms for a response. This directly impacts user experience.

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Latency vs Throughput: derived metrics
lt_latency_ms = 10
lt_throughput = 1000 / lt_latency_ms
glue("lt_throughput", f"{lt_throughput:.0f}")

lt_batch_size = 32
lt_batch_time_ms = 50
lt_batch_throughput = lt_batch_size * 1000 / lt_batch_time_ms
glue("lt_batch_throughput", f"{lt_batch_throughput:.0f}")
```

**Throughput** measures batch capacity: how many inputs can you process per second? This matters for offline batch jobs processing millions of examples. Your implementation derives throughput from latency:

```python
throughput_samples_per_sec = 1000 / avg_latency
```

If latency is 10ms per sample, throughput is 1000ms / 10ms = 100 samples/second. But this assumes processing samples one at a time. In practice, batching increases throughput significantly while adding latency. Processing a batch of 32 samples might take 50ms total, giving 640 samples/second throughput but 50ms per-request latency.
If latency is 10ms per sample, throughput is 1000ms / 10ms = {glue:text}`lt_throughput` samples/second. But this assumes processing samples one at a time. In practice, batching increases throughput significantly while adding latency. Processing a batch of 32 samples might take 50ms total, giving {glue:text}`lt_batch_throughput` samples/second throughput but 50ms per-request latency.

The trade-off: **Batching increases throughput but hurts latency.** A production API serving individual user requests optimizes for latency. A batch processing pipeline optimizes for throughput.

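A simple fixed-plus-variable cost model captures this trade-off; the 5ms setup and 3ms per-sample figures below are assumptions chosen to roughly match the Q3 numbers later in this module, not measurements:

```python
def batch_latency_ms(batch_size, fixed_ms=5.0, per_sample_ms=3.0):
    # Cost model: one fixed setup cost (data transfer, kernel launch)
    # plus a per-sample compute cost
    return fixed_ms + per_sample_ms * batch_size

for b in (1, 8, 32):
    total = batch_latency_ms(b)
    print(f"batch={b:2d}: {total:6.1f} ms total, "
          f"{total / b:5.2f} ms/sample amortized, "
          f"{b / total * 1000:6.0f} samples/sec")
```

As batch size grows, amortized per-sample latency falls and throughput rises, while the latency any single request experiences goes up.
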
@@ -759,10 +780,35 @@ The workflow pattern: baseline → optimize → benchmark → compare → decide

### Why Benchmarking Matters at Scale

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Model serving cost calculation
scale_requests_per_day = 10_000_000
scale_latency_before_ms = 20
scale_latency_after_ms = 10
scale_saved_ms = scale_latency_before_ms - scale_latency_after_ms
scale_seconds_saved = scale_requests_per_day * scale_saved_ms / 1000
scale_days_saved = scale_seconds_saved / 86400
scale_cost_reduction_pct = (scale_saved_ms / scale_latency_before_ms) * 100

glue("scale_seconds_saved", f"{scale_seconds_saved:,.0f}")
glue("scale_days_saved", f"{scale_days_saved:.2f}")
glue("scale_cost_reduction", f"{scale_cost_reduction_pct:.0f}%")

# Training pipeline savings
scale_training_cost = 1_000_000
scale_data_loading_pct = 60
scale_pipeline_savings = scale_training_cost * scale_data_loading_pct / 100

glue("scale_pipeline_savings", f"${scale_pipeline_savings:,.0f}")
```

To appreciate why professional benchmarking matters, consider the scale of production ML systems:

- **Model serving**: A recommendation system handles 10 million requests/day. If you reduce latency from 20ms to 10ms, you save 100,000 seconds of compute daily = 1.16 days of compute per day = 42% cost reduction.
- **Training efficiency**: Training a large language model costs $1 million in GPU time. Profiling reveals 60% of time is spent in data loading. Optimizing the data pipeline saves $600,000.
- **Model serving**: A recommendation system handles 10 million requests/day. If you reduce latency from 20ms to 10ms, you save {glue:text}`scale_seconds_saved` seconds of compute daily = {glue:text}`scale_days_saved` days of compute per day = {glue:text}`scale_cost_reduction` cost reduction.
- **Training efficiency**: Training a large language model costs $1 million in GPU time. Profiling reveals 60% of time is spent in data loading. Optimizing the data pipeline saves {glue:text}`scale_pipeline_savings`.
- **Deployment constraints**: A mobile app's model must fit in 50MB. Quantization compresses a 200MB model to 50MB with 1% accuracy loss. The app ships; without benchmarking, you wouldn't know the trade-off was acceptable.

Systematic benchmarking with reproducible results isn't an academic exercise—it's how engineers justify technical decisions and demonstrate business impact.
@@ -775,33 +821,82 @@ Test yourself with these systems thinking questions about benchmarking and perfo

A model has 5 million parameters stored as FP32. After INT8 quantization, how much memory is saved?

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q1: Memory calculation (FP32 vs INT8 quantization)
# Using binary units: 1 MB = 1024^2 = 1,048,576 bytes
q1_params = 5_000_000
q1_fp32_bytes_per_param = 4
q1_int8_bytes_per_param = 1
q1_bytes_per_mb = 1024 ** 2

q1_fp32_bytes = q1_params * q1_fp32_bytes_per_param
q1_int8_bytes = q1_params * q1_int8_bytes_per_param
q1_fp32_mb = q1_fp32_bytes / q1_bytes_per_mb
q1_int8_mb = q1_int8_bytes / q1_bytes_per_mb
q1_savings_mb = q1_fp32_mb - q1_int8_mb
q1_reduction_pct = q1_savings_mb / q1_fp32_mb * 100
q1_compression = q1_fp32_mb / q1_int8_mb

glue("q1_fp32_bytes", f"{q1_fp32_bytes:,}")
glue("q1_fp32_mb", f"{q1_fp32_mb:.2f}")
glue("q1_int8_bytes", f"{q1_int8_bytes:,}")
glue("q1_int8_mb", f"{q1_int8_mb:.2f}")
glue("q1_savings_mb", f"{q1_savings_mb:.2f}")
glue("q1_reduction_pct", f"{q1_reduction_pct:.0f}%")
glue("q1_compression", f"{q1_compression:.1f}x")
```

```{admonition} Answer
:class: dropdown

FP32: 5,000,000 parameters × 4 bytes = 20,000,000 bytes = **20 MB**
FP32: 5,000,000 parameters × 4 bytes = {glue:text}`q1_fp32_bytes` bytes = **{glue:text}`q1_fp32_mb` MB**

INT8: 5,000,000 parameters × 1 byte = 5,000,000 bytes = **5 MB**
INT8: 5,000,000 parameters × 1 byte = {glue:text}`q1_int8_bytes` bytes = **{glue:text}`q1_int8_mb` MB**

Savings: 20 MB - 5 MB = **15 MB** (75% reduction)
Savings: {glue:text}`q1_fp32_mb` MB - {glue:text}`q1_int8_mb` MB = **{glue:text}`q1_savings_mb` MB** ({glue:text}`q1_reduction_pct` reduction)

Compression ratio: 20 MB / 5 MB = **4.0x**
Compression ratio: {glue:text}`q1_fp32_mb` MB / {glue:text}`q1_int8_mb` MB = **{glue:text}`q1_compression`**

This is why quantization is standard in mobile deployment—models must fit in tight memory budgets.
```

**Q2: Latency Variance Analysis**
|
||||
|
||||
Model A: 10.0ms ± 0.3ms latency. Model B: 10.0ms ± 3.0ms latency. Both have same accuracy. Which do you deploy and why?
|
||||
Model A: 10.0ms +/- 0.3ms latency. Model B: 10.0ms +/- 3.0ms latency. Both have same accuracy. Which do you deploy and why?
|
||||
|
||||
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q2: Latency variance analysis (95% confidence = +/- 2 std)
q2_mean = 10.0
q2_std_a = 0.3
q2_std_b = 3.0
q2_variance_ratio = q2_std_b / q2_std_a
q2_a_lo = q2_mean - 2 * q2_std_a
q2_a_hi = q2_mean + 2 * q2_std_a
q2_b_lo = q2_mean - 2 * q2_std_b
q2_b_hi = q2_mean + 2 * q2_std_b

glue("q2_variance_ratio", f"{q2_variance_ratio:.0f}x")
glue("q2_a_lo", f"{q2_a_lo:.1f}")
glue("q2_a_hi", f"{q2_a_hi:.1f}")
glue("q2_b_lo", f"{q2_b_lo:.1f}")
glue("q2_b_hi", f"{q2_b_hi:.1f}")
```

```{admonition} Answer
:class: dropdown

**Deploy Model A.**

Same mean latency (10.0ms) but Model A has 10x lower variance (0.3ms vs 3.0ms std).
Same mean latency (10.0ms) but Model A has {glue:text}`q2_variance_ratio` lower standard deviation (0.3ms vs 3.0ms).

Model A's latency range: ~9.4-10.6ms (95% confidence: ± 2 std)
Model B's latency range: ~4.0-16.0ms (95% confidence: ± 2 std)
Model A's latency range: ~{glue:text}`q2_a_lo`-{glue:text}`q2_a_hi`ms (95% confidence: +/- 2 std)
Model B's latency range: ~{glue:text}`q2_b_lo`-{glue:text}`q2_b_hi`ms (95% confidence: +/- 2 std)

**Why consistency matters:**
- Users prefer predictable performance over erratic speed
@@ -813,7 +908,34 @@ In production, **reliability > mean performance**. A consistently decent experie

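One way to quantify that consistency argument (an illustrative editor's sketch with simulated latencies, not module code) is to compare tail percentiles rather than means:

```python
# Illustrative only: simulate both models' latency distributions and
# compare tail percentiles, which users actually experience.
import numpy as np

rng = np.random.default_rng(0)
model_a = rng.normal(10.0, 0.3, 10_000)  # mean 10ms, std 0.3ms
model_b = rng.normal(10.0, 3.0, 10_000)  # mean 10ms, std 3.0ms
for name, samples in [("A", model_a), ("B", model_b)]:
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    print(f"Model {name}: p50={p50:.1f}ms  p95={p95:.1f}ms  p99={p99:.1f}ms")
# Equal means, but Model B's p99 lands near 17ms; that tail is what
# an SLA and end users feel.
```
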
**Q3: Batch Size Trade-off**

Measuring latency with batch_size=32 gives 100ms total. Can you claim 100ms / 32 = 3.1ms per-sample latency?
```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q3: Batch size trade-off — why amortized != actual per-sample latency
# Given: batch=32 takes 100ms total, batch=1 takes 8ms actual
# Solve system of equations:
# batch_total = fixed + batch_size * variable_per_sample
# batch1_total = fixed + 1 * variable_per_sample
# => fixed = (batch_total - batch_size * batch1) / (1 - batch_size)
q3_batch_size = 32
q3_batch_total_ms = 100
q3_batch1_actual_ms = 8.0

q3_amortized = q3_batch_total_ms / q3_batch_size
q3_fixed = (q3_batch_total_ms - q3_batch_size * q3_batch1_actual_ms) / (1 - q3_batch_size)
q3_var_total = q3_batch_total_ms - q3_fixed
q3_var_per_sample = q3_var_total / q3_batch_size
q3_batch1_check = q3_fixed + q3_var_per_sample

glue("q3_amortized", f"{q3_amortized:.1f}")
glue("q3_fixed", f"{q3_fixed:.1f}")
glue("q3_var_total", f"{q3_var_total:.1f}")
glue("q3_var_per_sample", f"{q3_var_per_sample:.1f}")
glue("q3_batch1_check", f"{q3_batch1_check:.1f}")
```

Measuring latency with batch_size=32 gives 100ms total. Can you claim 100ms / 32 = {glue:text}`q3_amortized`ms per-sample latency?

```{admonition} Answer
:class: dropdown
@@ -823,35 +945,60 @@ Measuring latency with batch_size=32 gives 100ms total. Can you claim 100ms / 32

Batching amortizes fixed overhead (data transfer, kernel launch). Per-sample latency at batch=1 is higher than batch=32 divided by 32.

Example reality:
- Batch=32: 100ms total → 3.1ms per sample (amortized)
- Batch=32: 100ms total → {glue:text}`q3_amortized`ms per sample (amortized)
- Batch=1: 8ms total → 8ms per sample (actual)

**Why the discrepancy?**
- Fixed overhead: 10ms (data transfer, setup)
- Variable cost: 90ms / 32 = 2.8ms per sample
- At batch=1: 10ms fixed + 2.8ms variable = 12.8ms
- Fixed overhead: {glue:text}`q3_fixed`ms (data transfer, setup)
- Variable cost: {glue:text}`q3_var_total`ms / 32 = {glue:text}`q3_var_per_sample`ms per sample
- At batch=1: {glue:text}`q3_fixed`ms fixed + {glue:text}`q3_var_per_sample`ms variable = {glue:text}`q3_batch1_check`ms

**Always benchmark at deployment batch size.** If production serves single requests, measure with batch=1.
```

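The same effect can be observed empirically. Below is a toy sketch (a numpy "model", not module code; real deployments add data-transfer and kernel-launch overhead, so the amortized-vs-actual gap there is larger):

```python
# Toy illustration of amortized vs. actual per-sample latency; the
# numpy "model" and all names here are hypothetical, not TinyTorch code.
import time
import numpy as np

weights = np.random.rand(512, 512)

def forward(batch):
    return batch @ weights

def batch_latency_ms(batch_size, repeats=50):
    batch = np.random.rand(batch_size, 512)
    forward(batch)  # warmup
    start = time.perf_counter()
    for _ in range(repeats):
        forward(batch)
    return (time.perf_counter() - start) / repeats * 1000.0

t32 = batch_latency_ms(32)
t1 = batch_latency_ms(1)
print(f"batch=32: {t32:.3f}ms/batch -> {t32 / 32:.3f}ms/sample (amortized)")
print(f"batch=1:  {t1:.3f}ms/batch = {t1:.3f}ms/sample (actual)")
```
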
**Q4: Speedup Calculation**

```{code-cell} python3
:tags: [remove-input, remove-output]
from myst_nb import glue

# Q4: Speedup and real-world impact
q4_baseline_ms = 20
q4_optimized_ms = 5
q4_speedup = q4_baseline_ms / q4_optimized_ms

q4_baseline_rps = 100 # requests/sec (given scenario)
q4_optimized_rps = q4_baseline_rps * q4_speedup
q4_baseline_cost = 1000 # $/month (given scenario)
q4_optimized_cost = q4_baseline_cost / q4_speedup

q4_baseline_util = 60 # % utilization (given scenario)
q4_optimized_util = q4_baseline_util / q4_speedup
q4_headroom = 100 - q4_optimized_util

glue("q4_speedup", f"{q4_speedup:.1f}x")
glue("q4_times_faster", f"{q4_speedup:.0f}")
glue("q4_optimized_rps", f"{q4_optimized_rps:.0f}")
glue("q4_optimized_cost", f"${q4_optimized_cost:.0f}")
glue("q4_headroom", f"{q4_headroom:.0f}%")
```

Baseline: 20ms latency. Optimized: 5ms latency. What is the speedup and what does it mean?

```{admonition} Answer
:class: dropdown

Speedup = baseline_latency / optimized_latency = 20ms / 5ms = **4.0x**
Speedup = baseline_latency / optimized_latency = 20ms / 5ms = **{glue:text}`q4_speedup`**

**What it means:**
- Optimized model is **4 times faster**
- Processes same input in 1/4 the time
- Can handle 4x more traffic with same hardware
- Optimized model is **{glue:text}`q4_times_faster` times faster**
- Processes same input in 1/{glue:text}`q4_times_faster` the time
- Can handle {glue:text}`q4_times_faster`x more traffic with same hardware

**Real-world impact:**
- If baseline served 100 requests/sec, optimized serves 400 requests/sec
- If baseline cost $1000/month in compute, optimized costs $250/month
- If baseline met latency SLA at 60% utilization, optimized has 85% headroom
- If baseline served 100 requests/sec, optimized serves {glue:text}`q4_optimized_rps` requests/sec
- If baseline cost $1000/month in compute, optimized costs {glue:text}`q4_optimized_cost`/month
- If baseline met latency SLA at 60% utilization, optimized has {glue:text}`q4_headroom` headroom

**Note:** Speedup alone doesn't tell the full story. Check accuracy_delta and compression_ratio to understand trade-offs.
```
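
Following that note, one might summarize an optimization as a trade-off triple rather than a lone speedup number. A hypothetical sketch (field names are illustrative, not the module's API):

```python
# Hypothetical sketch: report an optimization as a trade-off triple
# rather than a single speedup; field names are illustrative.
def tradeoff_report(base_ms, opt_ms, base_acc, opt_acc, base_mb, opt_mb):
    return {
        "speedup": round(base_ms / opt_ms, 2),
        "accuracy_delta": round(opt_acc - base_acc, 4),
        "compression_ratio": round(base_mb / opt_mb, 2),
    }

print(tradeoff_report(20.0, 5.0, 0.912, 0.905, 200.0, 50.0))
# -> {'speedup': 4.0, 'accuracy_delta': -0.007, 'compression_ratio': 4.0}
```
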
@@ -860,7 +1007,7 @@ Speedup = baseline_latency / optimized_latency = 20ms / 5ms = **4.0x**

Why does the submission schema require `accuracy` as float in [0, 1] instead of allowing any format?

```{admonition} Answer
````{admonition} Answer
:class: dropdown

**Type safety enables automation.**
@@ -889,7 +1036,7 @@ With schema:
4. **APIs** - Other tools can consume submissions without custom parsers

**Real example:** Papers with Code leaderboards require strict schemas. Thousands of submissions from different teams aggregate automatically because everyone follows the same format.
```
````

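To illustrate how a strict type turns validation into a mechanical check, here is a hypothetical validator in the spirit of that answer; the field names (`accuracy`, `latency_ms`) are illustrative, not the actual submission schema:

```python
# Hypothetical schema validator (illustrative field names, not the
# actual TinyTorch submission schema).
def validate_submission(sub: dict) -> list:
    """Return a list of schema violations (empty list = valid)."""
    errors = []
    acc = sub.get("accuracy")
    if not isinstance(acc, float):
        errors.append(f"accuracy must be float, got {type(acc).__name__}")
    elif not 0.0 <= acc <= 1.0:
        errors.append(f"accuracy must be in [0, 1], got {acc}")
    if not isinstance(sub.get("latency_ms"), (int, float)):
        errors.append("latency_ms must be numeric")
    return errors

print(validate_submission({"accuracy": 0.95, "latency_ms": 12.3}))  # []
print(validate_submission({"accuracy": "95%"}))  # two violations
```
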
## Further Reading

@@ -929,7 +1076,7 @@ You've built a complete ML framework from scratch—from basic tensors to produc

```{tip} Interactive Options

- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/20_capstone/20_capstone.ipynb)** - Run interactively in browser, no setup required
- **[Launch Binder](https://mybinder.org/v2/gh/harvard-edge/cs249r_book/main?urlpath=lab/tree/tinytorch/modules/20_capstone/capstone.ipynb)** - Run interactively in browser, no setup required
- **[View Source](https://github.com/harvard-edge/cs249r_book/blob/main/tinytorch/src/20_capstone/20_capstone.py)** - Browse the implementation code
```