feat(slides): 5-component speaker notes + upgrade 7 chapters

New speaker note standard (LINK, NARRATE, ENGAGE, WARN, FLEX) based on
8 pedagogical frameworks (Shulman PCK, Ambrose, Rosenshine, Chi ICAP,
Merrill, Bain, Wiggins UbD, Garner & Alley).

Upgraded 7 chapters: Vol1 Ch00/Ch05/Ch09/Ch13, Vol2 Ch00/Ch06/Ch12.
Updated stats row on portal landing page.
28 chapters remaining for next pass.
Vijay Janapa Reddi
2026-03-16 18:04:15 -04:00
parent b92409c521
commit 6cf5430928
8 changed files with 1701 additions and 274 deletions

View File

@@ -205,9 +205,9 @@ toc: false
<div class="stats-row">
<div class="stat"><span class="stat-num">35</span><span class="stat-lbl">Decks</span></div>
<div class="stat"><span class="stat-num">1,099</span><span class="stat-lbl">Slides</span></div>
<div class="stat"><span class="stat-num">266</span><span class="stat-lbl">SVG Figures</span></div>
<div class="stat"><span class="stat-num">2</span><span class="stat-lbl">Volumes</span></div>
<div class="stat"><span class="stat-num">~38 hrs</span><span class="stat-lbl">Teaching Time</span></div>
<div class="stat"><span class="stat-num">308</span><span class="stat-lbl">Active Learning</span></div>
<div class="stat"><span class="stat-num">1,099</span><span class="stat-lbl">Speaker Notes</span></div>
</div>
<!-- Actions -->

View File

@@ -52,8 +52,16 @@
\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: Walk through each card. ``Blue means compute---anytime you see
% blue in a diagram, think GPU ops, matrix multiplies, forward pass. Green is
% data flow---memory, caches, healthy paths. Orange is routing---schedulers,
% load balancers. Red is cost, error, or bottleneck.'' Point to each card as
% you name it.
%
% -- FLEX: [CORE] Show on Day 1 and again briefly on Day 2.
% IF SHORT: Display the slide but do not narrate---students can read the cards.
}
\small
Throughout this course, colors carry meaning:
@@ -105,9 +113,17 @@ Throughout this course, colors carry meaning:
% =============================================================================
\begin{frame}{Welcome}
\note{[2 min] Welcome students. Set the tone: this is not an ML algorithms class.
This is about the \emph{systems} that make ML work. Ask: ``How many of you have
trained a model? How many have deployed one?'' The gap is the course.}
\note{
% -- NARRATE: ``Welcome. Raise your hand if you have trained a model.'' (most
% hands go up) ``Now keep your hand up if you have deployed one to production.''
% (most hands drop) ``That gap---training vs.\ shipping---is this entire course.
% This is not an ML algorithms class. This is about the systems that make ML
% work: memory, bandwidth, power, latency, and the physics behind every design
% decision.''
%
% -- FLEX: [CORE] Sets the emotional contract for the semester.
% IF SHORT: Cut the hand-raise; just state the training-deployment gap directly.
}
\centering
\vspace{0.8cm}
@@ -129,10 +145,27 @@ trained a model? How many have deployed one?'' The gap is the course.}
% =============================================================================
\begin{frame}{The Gap Between ML Research and Production}
\note{[3 min] Most students have trained models in notebooks. Very few have
shipped one. The failure rates are staggering: 60--85\% of ML projects never
reach production. The bottleneck is not algorithms --- it is systems.
Ask: ``Why do you think most ML projects fail?''}
\note{
% -- LINK: Students just heard ``this is a systems course, not an algorithms
% course.'' This slide gives the quantitative evidence for WHY.
%
% -- NARRATE: ``60 to 85 percent of ML projects never reach production. Let
% that sink in. Look at this table---research uses a static dataset, a single
% GPU, and optimizes one metric. Production uses a shifting data stream, a
% fleet of heterogeneous hardware, and must hit accuracy AND latency AND cost
% targets simultaneously. That gap is not fixed by a better optimizer.''
%
% -- ENGAGE: ``Why do you think most ML projects fail? Write one reason.''
% Give 30 seconds. Cold-call 2 students.
% Expected: ``data quality,'' ``hardware limits.'' Surprise answer: systems.
%
% -- WARN: Students assume failures are algorithmic (``bad model''). Correct
% framing: the bottleneck is infrastructure---data pipelines, serving, monitoring.
% IF STUCK: Point to the ``90\% of time goes to data + infrastructure'' bullet.
%
% -- FLEX: [CORE] This slide motivates the entire semester.
% IF AHEAD: ``What percentage of engineering time goes to the model itself?''
}
\small
\begin{columns}[T]
@@ -173,9 +206,26 @@ Ask: ``Why do you think most ML projects fail?''}
\end{frame}
\begin{frame}{The 5\% Problem}
\note{[3 min] Sculley et al.\ 2015. The ML model code is the tiny box in the
center. Everything around it --- data pipelines, serving, monitoring, config ---
is what this course teaches. Ask: ``What is the biggest box?''}
\note{
% -- LINK: The previous slide said ``the bottleneck is systems.'' This diagram
% shows exactly what those systems look like.
%
% -- NARRATE: Point to the tiny center box: ``That is the ML model code---about
% 5 percent. Everything around it---data pipelines, feature stores, serving
% infrastructure, monitoring, configuration---is what this course teaches.
% Sculley et al.\ called this `hidden technical debt.' ''
%
% -- ENGAGE: ``Look at the diagram. What is the biggest box?'' Give 15 seconds.
% Expected answer: data collection or configuration. Both are valid---the point
% is that neither is the model.
%
% -- WARN: Students equate ``ML'' with ``the model.'' Correct framing: the
% model is 5\%; the other 95\% is systems engineering that determines whether
% the model ever reaches a user.
%
% -- FLEX: [CORE] Foundational mental model for the course.
% IF SHORT: Skip the question; just narrate the diagram for 90 seconds.
}
% --- Layout: FULL-WIDTH IMAGE ---
\centering
@@ -198,11 +248,30 @@ you choose it because of how it parallelizes on real silicon.%
}
\begin{frame}{AI Is Infrastructure}
\note{[2 min] This is the philosophical foundation of the course.
Every design decision in ML systems traces back to a physical constraint:
memory bandwidth, power budget, speed of light. If you understand the
constraints, the architecture choices become obvious.
Ask: ``Why can't we run GPT-4 on a phone?''}
\note{
% -- LINK: The Core Thesis focus slide just said ``constraints drive
% architecture.'' This slide names the four specific physical constraints.
%
% -- NARRATE: Walk through bullets top to bottom. ``Memory bandwidth limits
% how fast data reaches the processor---this is why GPT-4 inference is slow
% even on powerful GPUs. Power budget limits where a model can run---a phone
% cannot sustain 700 watts. Speed of light limits latency---a self-driving car
% cannot wait 50 ms for a cloud round trip. Thermodynamics limits compute per
% rack---you cannot cool infinite GPUs in a data center.''
% ANALOGY: ``Physics is to ML systems what gravity is to bridges. You can
% build creative bridges, but none of them ignore gravity.''
%
% -- ENGAGE: ``Why can't we run GPT-4 on a smartphone?'' Cold-call one student.
% Expected: ``not enough memory.'' Deepen: ``How much memory does it need?
% 3.6 TB at FP32. A phone has 8 GB. That is a 450x gap---physics, not software.''
%
% -- WARN: Students think hardware limitations are temporary (``next year's chip
% will fix it''). Correct framing: these are physical laws, not engineering gaps.
% Memory bandwidth grows ~20\%/yr; compute demand grows 7x faster than Moore's Law.
%
% -- FLEX: [CORE] Philosophical anchor for the course.
% IF AHEAD: ``Which constraint is hardest to overcome with engineering?''
}
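The 450x figure in the ENGAGE deepening can be sanity-checked with back-of-envelope arithmetic. A minimal sketch, using the slide's own quoted numbers (3.6 TB of weights, 8 GB of phone RAM) with decimal units:

```python
# Reproduce the "GPT-4 on a smartphone" memory gap from the ENGAGE prompt.
# Both figures are the slide's quoted values, taken at face value.
TB, GB = 1e12, 1e9          # decimal units, matching the narration

model_bytes = 3.6 * TB      # weight footprint quoted on the slide
phone_bytes = 8 * GB        # typical flagship-phone RAM

gap = model_bytes / phone_bytes
print(f"memory gap: {gap:.0f}x")  # 450x: physics, not software
```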
\small
\begin{columns}[T]
@@ -240,9 +309,26 @@ Ask: ``Why can't we run GPT-4 on a phone?''}
% =============================================================================
\begin{frame}{Three Analytical Frameworks}
\note{[3 min] These are the three recurring analytical tools for the course.
Show the overview diagram. Students will learn each in depth in Chapter 1.
Today, just plant the seed. Ask: ``What question does each framework answer?''}
\note{
% -- LINK: Students now know that physical constraints drive architecture.
% These three frameworks are the tools for reasoning about those constraints.
%
% -- NARRATE: Point to each framework in the diagram. ``D-A-M tells you WHERE
% the bottleneck is. The Iron Law tells you HOW LONG an operation takes. The
% Degradation Equation tells you WHEN a model will fail. These three tools
% recur in every single chapter.''
%
% -- ENGAGE: ``What question does each framework answer? Write one word per
% framework.'' 30 seconds. Cold-call one student.
% Expected: Where/How long/When---or close variants.
%
% -- WARN: Students will try to memorize the equations without understanding
% what question each answers. Correct framing: frameworks are diagnostic
% tools, not formulas to plug numbers into.
%
% -- FLEX: [CORE] Preview only---do not go deep. Chapter 1 covers each.
% IF SHORT: Just name the three frameworks and move on (60 seconds).
}
% --- Layout: FULL-WIDTH IMAGE ---
\centering
@@ -254,10 +340,27 @@ Today, just plant the seed. Ask: ``What question does each framework answer?''}
\end{frame}
\begin{frame}{The \DAM{} Taxonomy}
\note{[2 min] Brief intro to D-A-M. Every ML system has three interdependent
axes: Data, Algorithm, Machine. Optimizing one shifts pressure to another.
The diagnostic question: which axis is the bottleneck?
Do not go deep --- Chapter 1 covers this in full.}
\note{
% -- LINK: The previous diagram named D-A-M as one of three frameworks.
% Now we unpack it briefly.
%
% -- NARRATE: ``Every ML system sits at the intersection of three axes.
% Data: how much, how fast can we move it. Algorithm: how many operations,
% what parallelism pattern. Machine: what silicon, what memory hierarchy.
% The diagnostic question is always the same: which axis is the bottleneck?
% Optimizing one axis shifts pressure to the others---they are coupled.''
% Point to the table: ``Notice the units. These become the Iron Law variables.''
%
% -- ENGAGE: ``If you double the training dataset, which other axis feels
% the pressure?'' Expected: Machine (need more bandwidth or compute time).
%
% -- WARN: Students treat the three axes as independent knobs. Correct framing:
% they are interdependent---doubling data volume requires proportionally more
% bandwidth or longer training time.
%
% -- FLEX: [CORE] But keep it brief---Chapter 1 goes deep.
% IF SHORT: Show the slide for 60 seconds, skip the engage question.
}
\small
\begin{columns}[T]
@@ -304,9 +407,26 @@ Do not go deep --- Chapter 1 covers this in full.}
\end{frame}
\begin{frame}{The Iron Law of ML Systems}
\note{[2 min] Quick preview. Every term resolves to seconds. The slowest
term dominates end-to-end latency. Chapter 1 covers worked examples.
Ask: ``For a phone camera app, which term dominates?''}
\note{
% -- LINK: D-A-M tells you where the bottleneck is. The Iron Law quantifies
% how long each axis takes in seconds.
%
% -- NARRATE: Point to each term. ``Data term: bytes divided by bandwidth
% gives seconds. Compute term: FLOPs divided by peak rate times efficiency
% gives seconds. Overhead: orchestration tax, also seconds. You add three
% times and the slowest one dominates. This is dimensional analysis---if
% your units do not resolve to seconds, the equation is wrong.''
%
% -- ENGAGE: ``For a phone camera app classifying a photo, which term
% dominates?'' Give 20 seconds. Expected: Data term (reading the image from
% memory) or Overhead (framework launch cost). Accept either with reasoning.
%
% -- WARN: Students try to add FLOPs to bytes. Correct framing: every term
% must resolve to seconds before you can compare or add them.
%
% -- FLEX: [CORE] Preview only---worked examples come in Chapter 1.
% IF SHORT: State the equation, emphasize ``slowest term dominates,'' move on.
}
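The dimensional-analysis point in the NARRATE block (every term resolves to seconds; the slowest term dominates) can be shown as a short sketch. The workload and hardware numbers below are illustrative assumptions, not values from the deck:

```python
# Iron Law sketch: each term must resolve to seconds before adding.
# All numbers below are illustrative, not from the slides.

def iron_law_seconds(data_bytes, bandwidth_Bps, flops,
                     peak_flops, efficiency, overhead_s):
    t_data = data_bytes / bandwidth_Bps            # bytes / (bytes/s) -> s
    t_compute = flops / (peak_flops * efficiency)  # FLOPs / (FLOPs/s) -> s
    t_overhead = overhead_s                        # orchestration tax, already s
    total = t_data + t_compute + t_overhead
    # The slowest term dominates end-to-end latency.
    dominant = max([("data", t_data), ("compute", t_compute),
                    ("overhead", t_overhead)], key=lambda kv: kv[1])[0]
    return total, dominant

# Hypothetical phone-camera classification: small image, small model,
# but a fixed framework launch cost.
total, dom = iron_law_seconds(
    data_bytes=4e6, bandwidth_Bps=20e9,    # 4 MB image over 20 GB/s
    flops=2e8, peak_flops=1e12, efficiency=0.3,
    overhead_s=5e-3)                       # 5 ms launch overhead
print(f"{total * 1e3:.2f} ms, dominated by {dom}")
```

With these assumed numbers the overhead term dominates, matching one of the accepted answers to the ENGAGE question.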
\small
$$T_{\text{total}} = \underbrace{\dfrac{D_{\text{vol}}}{BW}}_{\text{Data}} +
@@ -339,9 +459,28 @@ $$T_{\text{total}} = \underbrace{\dfrac{D_{\text{vol}}}{BW}}_{\text{Data}} +
\end{frame}
\begin{frame}{The Degradation Equation}
\note{[2 min] ML systems fail silently. Accuracy degrades as the world changes
around the model. The degradation equation quantifies this.
Ask: ``How would you know your model is getting worse if no code changed?''}
\note{
% -- LINK: The Iron Law measures performance at a point in time. The
% Degradation Equation measures how performance decays over time.
%
% -- NARRATE: ``Accuracy at time t equals initial accuracy minus alpha times
% the distribution distance. Alpha is how sensitive the model is to drift.
% Delta measures how far the live data has drifted from training data.
% Look at the example: a recommendation system starts at 85\% and drops to
% 79.2\% in 6 months. No code changed. No bugs. The world changed.
% The engineering response: set a retraining trigger at a threshold.''
%
% -- ENGAGE: ``How would you know your model is getting worse if no code
% changed and no one filed a bug?'' Give 20 seconds. Cold-call one student.
% Expected: monitoring accuracy metrics over time. Deepen: ``What if you
% do not have ground truth labels in real time?''
%
% -- WARN: Students assume ``no bugs = working correctly.'' Correct framing:
% ML systems degrade through data drift even when code is untouched.
%
% -- FLEX: [CORE] Third framework preview.
% IF SHORT: Show equation, state the rec-system example, move on.
}
\small
$$A(t) = A_0 - \alpha \cdot \Delta(P_{\text{train}},\; P_{\text{live}}(t))$$
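The NARRATE example (85% dropping to 79.2% in 6 months, with a retraining trigger) can be reproduced with a minimal sketch; the sensitivity `alpha` and the linear drift trajectory are assumed values chosen to match the slide's numbers:

```python
# Degradation Equation sketch: A(t) = A0 - alpha * Delta(P_train, P_live(t)).
# alpha and the drift path are assumptions tuned to the slide's example
# (85% initial accuracy, ~79.2% at month 6). No code changed; the world did.

A0 = 0.85
alpha = 0.29                      # assumed drift sensitivity

def drift(month):                 # assumed linear drift: Delta = 0.20 at month 6
    return 0.20 * month / 6

def accuracy(month):
    return A0 - alpha * drift(month)

RETRAIN_THRESHOLD = 0.80          # engineering response: retrain below this
for m in range(7):
    a = accuracy(m)
    flag = "  <- retrain trigger" if a < RETRAIN_THRESHOLD else ""
    print(f"month {m}: {a:.3f}{flag}")
```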

View File

@@ -68,10 +68,26 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Read objectives aloud. Emphasize: this chapter is about the
\emph{computational workload} that neural networks create, not about how to
use a framework. Ask: ``How many of you have trained a model but never
thought about why training uses 4$\times$ more memory than inference?''}
\note{
% -- LINK: Prior chapters established the Iron Law and DAM taxonomy.
% This chapter reveals what neural networks actually compute inside those terms.
% -- NARRATE: Read objectives aloud, pausing on each verb.
``This chapter is about the computational workload that neural networks
create, not about how to use a framework. Every objective maps to a
measurable skill you can demonstrate.''
% -- ENGAGE: ``How many of you have trained a model but never thought
% about why training uses 4x more memory than inference?''
% Show of hands. Use the count to calibrate depth later.
% -- WARN: Students often confuse ``understanding neural networks'' with
% ``using PyTorch.'' Correct framing: this chapter is about the math and
% memory, not the API.
% -- FLEX: [CORE] Never skip objectives---they set the contract for the lecture.
% IF SHORT: Read only the bolded terms, skip the full sentence for each.
}
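The ENGAGE hook's "4x more memory" claim has a common accounting behind it. A sketch of one such accounting, assuming Adam as the optimizer and uniform FP32 precision (activations ignored for simplicity; none of this is stated in the hunk itself):

```python
# One common accounting for "training uses ~4x the memory of inference",
# assuming Adam and FP32 everywhere; activation memory is ignored here.
def training_memory_multiplier(param_bytes=4):
    inference = param_bytes          # weights only
    training = (param_bytes          # weights
                + param_bytes        # gradients
                + param_bytes        # Adam first moment (m)
                + param_bytes)       # Adam second moment (v)
    return training / inference

print(training_memory_multiplier())  # 4.0
```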
\footnotesize
\begin{enumerate}\setlength\itemsep{0pt}
@@ -88,8 +104,16 @@ thought about why training uses 4$\times$ more memory than inference?''}
\end{frame}
\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: Point to each card in turn.
``Blue means compute---any time a GPU is doing arithmetic. Green means data
or memory---bytes moving through the system. Orange means scheduling or
routing decisions. Red means cost, error, or bottleneck. These colors are
identical in every diagram across the entire course.''
% -- FLEX: [OPTIONAL] Students internalize colors through exposure, not memorization.
% IF SHORT: Show for 15 seconds and move on; the colors reinforce themselves.
}
\small
Throughout this course, colors carry meaning:
@@ -127,11 +151,30 @@ Throughout this course, colors carry meaning:
% =============================================================================
\begin{frame}{The Silicon Contract}
\note{[2 min] Bridge from prior chapters. The Iron Law (Ch1) established that
every model makes a computational bargain with hardware. This chapter reveals
what those computations actually are. The operators inside a neural network
determine memory consumption, execution time, and energy expenditure.
Ask: ``If the model code just says `multiply these matrices,' where is the bug?''}
\note{
% -- LINK: The Iron Law (Ch1) decomposed performance into D, O, and L terms.
% Students know the formula but not what fills each term. This slide connects
% the abstract equation to the concrete operators inside a neural network.
% -- NARRATE: Point to the equation on the slide.
``Every term in the Iron Law has a physical origin inside the neural network.
O comes from matrix multiplications. D comes from weight and activation
traffic. L comes from pipeline overhead. The operators you choose---and how
you arrange them---determine which term dominates.''
ANALOGY: ``The Iron Law is the utility bill; this chapter opens the meter.''
% -- ENGAGE: ``If the model code just says `multiply these matrices,' where
% is the bug?'' Cold-call one student. Expected answer: the bug is not a
% syntax error---it is a numerical instability (gradient explosion, overflow).
% -- WARN: Students expect bugs to look like Python exceptions. In neural
% networks, bugs are silent: NaN gradients, saturating activations, memory
% exhaustion. The code runs---the math fails.
% -- FLEX: [CORE] This slide sets the chapter thesis---never skip.
% IF AHEAD: ``Can you name a specific numerical instability you have seen?''
% IF SHORT: Skip the analogy, keep the equation walkthrough.
}
\small
\begin{columns}[T]
@@ -167,10 +210,28 @@ Ask: ``If the model code just says `multiply these matrices,' where is the bug?'
\end{frame}
\begin{frame}{Three Paradigms, One Digit}
\note{[3 min] This is the chapter's central comparison. Walk through the same
$28\times28$ digit across three paradigms. The 1,092$\times$ compute explosion
is the visceral number. Ask: ``Where does each paradigm sit on the Iron Law?''
Common error: students think more compute is always bad.}
\note{
% -- LINK: The Silicon Contract slide introduced the Iron Law terms.
% Now we see what happens when you move from rule-based to neural: the
% same 28x28 digit triggers 1,092x more operations.
% -- NARRATE: Point left to right across the three panels.
``Same digit, same 784 pixels. Rule-based: 100 comparisons. Classical ML:
8,000 feature extractions. Neural net: 109,184 multiply-accumulate ops.
That is a 1,092x compute explosion for the same input.''
% -- ENGAGE: ``Where does each paradigm sit on the Iron Law? Which term
% dominates for each?'' Give 30 seconds. Expected: rule-based is L-dominated,
% classical ML is balanced, neural net is O-dominated.
% -- WARN: Students assume more compute is always bad. Correct framing:
% more compute buys representation power---the question is whether the
% systems cost is justified by the accuracy gain.
% -- FLEX: [CORE] The 1,092x number is the chapter's anchor.
% IF AHEAD: ``At what point does the accuracy gain stop justifying the cost?''
% IF SHORT: Just emphasize the 1,092x ratio and move on.
}
% --- Full-width image ---
\centering
@@ -182,9 +243,27 @@ Common error: students think more compute is always bad.}
\end{frame}
\begin{frame}{The Compute Explosion in Numbers}
\note{[2 min] Quantitative backing for the diagram. Walk through the table.
Key insight: memory also jumps---from fitting in L1 cache to exceeding it.
If short: just emphasize the 1,092$\times$ ratio and the cache threshold.}
\note{
% -- LINK: The three-paradigms diagram showed the qualitative shift.
% This table adds the quantitative evidence students need to reason precisely.
% -- NARRATE: Walk down each row of the table.
``Rule-based: 100 ops, 784 bytes---fits in a register file. Classical ML:
8,000 ops, 2 KB---fits in L1 cache. Neural net: 109,184 MACs, 427 KB---
blows past L1 (typically 64 KB). The moment you cross the cache boundary,
every inference forces memory traffic.''
% -- ENGAGE: ``Which jump matters more for systems design: the 1,092x
% compute increase or the 546x memory increase?'' Pair discussion, 30 sec.
% Expected: the memory jump, because it changes the bottleneck regime.
% -- WARN: Students fixate on FLOP counts. The cache threshold crossing
% (784 B to 427 KB) is the more consequential systems event---it changes
% whether the workload is compute-bound or memory-bound.
% -- FLEX: [OPTIONAL] The table reinforces the diagram.
% IF SHORT: Point to the 1,092x and 546x numbers, skip row-by-row walkthrough.
}
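The deck's anchor number, 109,184 MACs, follows from the layer shapes the notes mention (784 inputs, a 128-neuron layer doing 100,352 MACs). A sketch assuming the implied 784 -> 128 -> 64 -> 10 MLP topology (the last two layer sizes are inferred, not stated in this hunk):

```python
# Reproduce the chapter's 109,184 MAC count and the 1,092x ratio.
# Topology 784 -> 128 -> 64 -> 10 is inferred from the quoted numbers.
layers = [(784, 128), (128, 64), (64, 10)]   # (inputs, neurons) per layer
macs = sum(n_in * n_out for n_in, n_out in layers)
print(macs)                                  # 109184
print(round(macs / 100))                     # ~1092x vs. rule-based (100 ops)
```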
\small
\renewcommand{\arraystretch}{1.2}
@@ -209,9 +288,25 @@ If short: just emphasize the 1,092$\times$ ratio and the cache threshold.}
% --- ACTIVE LEARNING 1: Predict ---
\begin{frame}{Predict: What Is a Neuron Computing?}
\note{[2 min] Prediction exercise before revealing the neuron. Give students
60 seconds. Do NOT reveal yet. This primes the MAC concept.
Ask 2--3 students: ``What mathematical operation does a neuron perform?''}
\note{
% -- LINK: Students just saw 109,184 MACs but do not yet know what a MAC is.
% This prediction primes them to discover the neuron equation themselves.
% -- NARRATE: Read the prompt aloud, then go silent for 60 seconds.
``784 inputs, 128 neurons, 100,352 multiply-accumulate operations.
Write one equation that explains what each neuron computes.''
% -- ENGAGE: Think-Write-Share. 60 seconds writing, then turn to a neighbor.
% Cold-call 2--3 students. Expected answer: weighted sum plus bias, then
% activation. Accept partial answers---the full equation comes next slide.
% -- WARN: Some students will write softmax or loss---those are network-level
% ops, not neuron-level. Redirect: ``What does a single neuron do to its inputs?''
% -- FLEX: [CORE] Prediction before reveal is the highest-leverage active
% learning moment. Never skip.
% IF SHORT: Reduce to 30 seconds writing, skip neighbor comparison.
}
\centering
\vspace{0.8cm}
@@ -235,11 +330,29 @@ needs 100,352 multiply-accumulate operations.\\[0.2cm]
% =============================================================================
\begin{frame}{Anatomy of a Neuron}
\note{[3 min] Reveal after prediction. The neuron computes a weighted sum
plus bias, then applies a nonlinear activation. The MAC is the atomic
operation. N inputs $\to$ N MACs. Emphasize: this is NOT a biological
neuron---it is a computational primitive.
Ask: ``How many memory accesses does one neuron need?''}
\note{
% -- LINK: Students just predicted the neuron equation. Now reveal and
% validate their answers against the actual formula.
% -- NARRATE: Point to the diagram left-to-right.
``Each input $x_i$ is multiplied by a weight $w_i$, all products are summed,
a bias $b$ is added, then a nonlinear activation $f$ is applied. That is one
neuron: $N$ multiply-accumulate operations. A layer of $M$ neurons does $M \times N$
MACs---one matrix multiplication.''
ANALOGY: ``A neuron is a dot product with a switch on the end.''
% -- ENGAGE: ``How many memory accesses does one neuron with 784 inputs
% need?'' Expected: 784 weights + 784 inputs + 1 bias = 1,569 reads minimum,
% plus 1 write for the output. Memory traffic dominates for small neurons.
% -- WARN: Students confuse biological neurons with computational neurons.
% Correct framing: this is a multiply-accumulate primitive, not a model of
% biology. The name is historical; the operation is linear algebra.
% -- FLEX: [CORE] The neuron equation is foundational for every later slide.
% IF AHEAD: ``What happens if we remove the activation function f?''
% (Answer: the entire network collapses to a single linear transformation.)
}
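The ENGAGE answer's memory-access count (1,569 reads for a 784-input neuron) can be sketched directly, which also shows why the WARN about memory traffic holds for small neurons:

```python
# Memory traffic for one neuron, per the ENGAGE answer:
# N weight reads + N input reads + 1 bias read, plus 1 output write.
def neuron_traffic(n_inputs):
    reads = n_inputs + n_inputs + 1   # weights + inputs + bias
    writes = 1                        # the activation output
    macs = n_inputs                   # one MAC per input
    return reads, writes, macs

reads, writes, macs = neuron_traffic(784)
print(reads, writes, macs)            # 1569 1 784
# Roughly 0.5 MACs per memory access: traffic, not arithmetic, dominates.
print(f"{macs / (reads + writes):.2f} MACs/access")
```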
% --- Full-width image ---
\centering

View File

@@ -68,9 +68,33 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Read through objectives. Emphasize that data selection is the
highest-leverage optimization in the D-A-M stack. Ask: ``How many of you have
ever questioned whether all your training data is actually useful?''}
\note{
% -- LINK: Learning objectives frame opens the lecture
This is the roadmap slide. Students arrive from Ch.\ 8 (Training) knowing how to
train models; now they learn that \emph{what} you train on matters more than
\emph{how long} you train.
% -- NARRATE: Walk through objectives with emphasis
Read each objective aloud. Pause on objective 1: ``Data selection is the
highest-leverage optimization in the entire D-A-M stack --- it reduces the
numerator \emph{before} anything else touches it.'' Emphasize that every
subsequent objective builds toward the Selection Inequality (objective 5).
% -- ENGAGE: Opening question to surface assumptions
Ask: ``How many of you have ever questioned whether all your training data
is actually useful?'' Follow up: ``What fraction would you guess is redundant?''
[Expected: most guess 10--20\%; the real answer is 50--90\%.]
% -- WARN: Students underestimate data waste
Students arrive believing ``more data = better model'' because scaling-law
papers dominate the discourse. This lecture systematically dismantles that
assumption with quantitative evidence.
% -- FLEX: [CORE] --- never skip
[CORE] Objectives frame sets the contract for the entire lecture.
IF AHEAD: Ask students to rank which objective they find most surprising.
IF SHORT: Read objectives quickly, spend time on the opening question.
}
\small
\begin{enumerate}
@@ -86,8 +110,21 @@ ever questioned whether all your training data is actually useful?''}
\end{frame}
\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- LINK: Follows learning objectives; sets visual conventions before content
Students just saw what they will learn; this slide equips them to read every
diagram that follows.
% -- NARRATE: Walk through each color with a concrete example
Point to each card: ``Blue means compute --- anytime you see blue, think
GPU cycles. Green means data flow or memory. Orange is routing or scheduling.
Red flags cost, error, or a bottleneck. These colors are identical across
every slide and every SVG in this course.''
% -- FLEX: [CORE] --- first time seeing the color system
[CORE] Essential for first lecture where students encounter the color system.
IF SHORT: Spend 30 seconds; students will internalize through repeated exposure.
}
\small
Throughout this course, colors carry meaning:
@@ -125,10 +162,36 @@ Throughout this course, colors carry meaning:
% =============================================================================
\begin{frame}{The Data Wall}
\note{[3 min] Open with the key tension. Compute grows 10x/3yr while quality
data grows 2x/5yr. The internet has already been scraped. This asymmetry
inverts the optimization priority. Ask: ``If you had unlimited GPUs but
limited data, what would you optimize?''}
\note{
% -- LINK: First content slide after objectives
Students just heard that data selection is the highest-leverage optimization.
This slide provides the \emph{why}: a physical asymmetry between compute
growth and data growth.
% -- NARRATE: Build the tension with the table
Point to the table row by row: ``Compute: 10x every 3 years --- Moore's Law
on steroids. Training data: 2x every 5 years --- we have already scraped
the internet. This asymmetry is the Data Wall.'' Tap the red callout card:
``The field has flipped from data-poor/compute-poor to compute-rich/data-poor.''
% -- ENGAGE: Falsifiable question
Ask: ``If you had unlimited GPUs but limited high-quality data, what would
you optimize first?'' Cold-call one student.
[Expected: most say ``get more data'' --- correct answer is ``get more
\emph{value} from existing data.'']
% -- WARN: Students conflate data quantity with data quality
Common error: students assume more data always helps because scaling-law
papers show log-linear improvement. Correct framing: scaling laws assume
\emph{unique, high-quality} tokens --- duplicates and noise yield diminishing
returns far earlier.
% -- FLEX: [CORE] --- motivates the entire chapter
[CORE] This is the chapter thesis slide.
IF AHEAD: ``What happens when synthetic data grows unbounded but
quality-limited?''
IF SHORT: Skip the question, let the table speak for itself.
}
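The Data Wall asymmetry in the NARRATE table (compute 10x every 3 years, quality data 2x every 5 years) compounds quickly. A sketch projecting the ratio of the two growth curves, using only the slide's stated rates:

```python
# Data Wall sketch: compute grows 10x / 3 yr, quality data 2x / 5 yr
# (the slide's figures). The gap is the ratio of the two curves.
def growth_gap(years):
    compute = 10 ** (years / 3)   # 10x every 3 years
    data = 2 ** (years / 5)       # 2x every 5 years
    return compute / data

for t in (3, 5, 10):
    print(f"after {t} yr: compute/data gap = {growth_gap(t):.1f}x")
```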
\footnotesize
\begin{columns}[T]
@@ -170,9 +233,34 @@ limited data, what would you optimize?''}
\end{frame}
\begin{frame}{What Is Data Selection?}
\note{[2 min] Formal definition. Emphasize the distinction from data
engineering: quality (is it correct?) vs.\ value (is it worth the compute?).
Common error: students think data selection = data cleaning.}
\note{
% -- LINK: The Data Wall motivates a formal response
Students just saw compute outpacing data supply. This slide names the
discipline that responds: data selection, distinct from data engineering
they learned in Ch.\ 4.
% -- NARRATE: Read the definition, then contrast with the table
Read the crimson card aloud slowly. Then point to the comparison table:
``Ch.\ 4 asked `is the data correct?' Ch.\ 9 asks `is correct data
worth the compute?' A perfectly clean dataset can still be 90\% redundant.''
Pause on the insight card: ``10x low-quality data $<$ 1.1x carefully selected
high-quality data.''
% -- ENGAGE: Falsifiable distinction
Ask: ``Give me one example where data engineering fixes the problem and one
where only data selection helps.'' [Expected: dedup of corrupted images =
engineering; removing easy samples near cluster centers = selection.]
% -- WARN: Students conflate selection with cleaning
Common error: students hear ``data selection'' and think ``data cleaning.''
Correct framing: cleaning fixes errors; selection removes \emph{correct
but uninformative} samples. Both are necessary; neither subsumes the other.
% -- FLEX: [CORE] --- foundational definition
[CORE] The ICR definition here is referenced throughout the rest of the deck.
IF AHEAD: ``Can a sample be high-quality but low-ICR? Give an example.''
IF SHORT: Skip the table, keep the definition card and the insight.
}
\small
\begin{mlsyscard}{crimson}
@@ -199,10 +287,38 @@ Common error: students think data selection = data cleaning.}
\end{frame}
\begin{frame}{Data Selection and the Iron Law}
\note{[3 min] Connect to the Iron Law from Ch.\ 1. Data selection is the only
technique that reduces the number of passes through the entire equation.
Model compression reduces O per pass; hardware increases R. Data selection
reduces the pass count itself. 2x * 2x * 2x = 8x, not 6x.}
\note{
% -- LINK: From definition to mechanism via the Iron Law
Students just defined data selection and ICR. This slide connects data
selection to the Iron Law from Ch.\ 1, showing \emph{where} in the
equation it acts.
% -- NARRATE: Walk through the D-A-M diagram
Point to the diagram: ``Data selection reduces the total number of passes
through the \emph{entire} equation. Model compression (Ch.\ 10) reduces
O per pass. Hardware (Ch.\ 11) increases R. But data selection reduces
the pass count itself --- it is the only technique that shrinks the
workload before the other two even see it.''
ANALOGY: ``Think of a factory: compression makes each widget faster to
build, hardware buys faster machines, but data selection throws away
widgets nobody ordered.''
% -- ENGAGE: Multiplicative vs.\ additive
Before showing the concept card, ask: ``If each technique gives 2x, is
the combined gain 6x or 8x?'' Give 10 seconds.
[Expected: many say 6x (additive). Correct: 8x (multiplicative).]
% -- WARN: Additive thinking is the default
Students instinctively add speedups (2+2+2=6) instead of multiplying
(2*2*2=8). Correct framing: the three optimizations operate on
\emph{different terms} of the same equation, so they compound.
% -- FLEX: [CORE] --- the D-A-M multiplicative argument
[CORE] This multiplicative insight is revisited in Key Takeaways.
IF AHEAD: ``What happens if data selection gives 10x but compression
only 1.2x? Where should the team invest next?''
IF SHORT: Show diagram, state the 8x result, move on.
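The 8x claim can be written out as a one-line check against the Iron Law form. This is a sketch only: $P$ (pass count) is a label introduced here for illustration; the note itself only names $O$ and $R$, and the 2x factors are illustrative, not measured.

```latex
% Sketch: P = pass count (label introduced here), O = work per pass,
% R = hardware rate --- O and R as in the note above; 2x factors illustrative.
\[
  T \;\propto\; \frac{P \cdot O}{R}
  \quad\Longrightarrow\quad
  T' \;=\; \frac{(P/2)\,(O/2)}{2R} \;=\; \frac{T}{8},
  \qquad \text{not } \frac{T}{6}.
\]
```

The point of writing it this way: each technique halves (or doubles) a different factor of the same product, so the gains multiply rather than add.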
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -216,9 +332,34 @@ reduces the pass count itself. 2x * 2x * 2x = 8x, not 6x.}
% --- ACTIVE LEARNING 1: Predict ---
\begin{frame}{Predict: Where Does the Waste Live?}
\note{[2 min] Prediction exercise. Give students 60 seconds. The answer will
be revealed with the ICR curve. Most will say ``noisy samples'' --- the
real answer includes redundant easy samples far from the decision boundary.}
\note{
% -- LINK: From the Iron Law connection to hands-on reasoning
Students just saw that data selection reduces total passes. Now they must
decide \emph{which} samples to cut --- before seeing the ICR framework.
% -- NARRATE: Run the Think-Write-Share protocol
Say: ``You have 1 million samples and can keep only 10\%. Write down your
strategy --- which samples do you throw away and why?'' Give 60 seconds
of silent writing, then 30 seconds of neighbor discussion. Do NOT reveal
the ICR curve yet.
% -- ENGAGE: The prediction itself is the engagement
This is the active learning moment. Walk the room during writing time.
Listen for common strategies: ``remove noisy samples,'' ``random subset,''
``remove outliers.'' The answer (revealed next slide): remove redundant
easy samples deep within class clusters, not just noisy ones.
% -- WARN: Students fixate on noise, ignore redundancy
Most students say ``throw away noisy samples.'' The deeper insight is
that \emph{clean, easy} samples far from the decision boundary are the
biggest source of wasted compute --- they contribute near-zero gradient.
% -- FLEX: [CORE] --- first active learning moment
[CORE] This prediction primes the ICR curve reveal on the next slide.
IF AHEAD: Ask a follow-up: ``Would your strategy change if you could
keep 50\% instead of 10\%?''
IF SHORT: Reduce writing time to 30 seconds, skip neighbor discussion.
}
\centering
\vspace{0.8cm}


@@ -67,9 +67,30 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Read objectives aloud. Emphasize the inversion theme: everything
students learned about training optimization is about to be flipped.
Ask: ``How many of you have deployed a model to production?''}
\note{
% -- LINK: What prior concept connects to this slide
Students spent 12 chapters optimizing throughput. This slide frames the
inversion: every training priority is about to flip.
% -- NARRATE: What to SAY while showing this slide
Read each objective aloud, pausing on ``latency budget'' and ``queuing
theory'' --- these are the new quantitative anchors replacing samples/hour.
ANALOGY: ``Training is a factory running 24/7. Serving is an ER --- every
patient has a deadline.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``How many of you have deployed a model to production? What was your
biggest surprise?'' Cold-call one responder.
% -- WARN: What students will get wrong on THIS topic
Common error: students assume serving is just calling model.predict().
Correct framing: serving is a six-stage pipeline where the model is one
stage consuming less than 50\% of the budget.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask students to predict which objective will be hardest.
IF SHORT: Read objectives without discussion; move to content.
}
\footnotesize
\begin{enumerate}\setlength\itemsep{0pt}
@@ -85,8 +106,16 @@ Ask: ``How many of you have deployed a model to production?''}
\end{frame}
\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: What to SAY while showing this slide
Point to each card: ``Blue is compute --- GPU forward passes. Green is data
flow and memory. Orange is routing and scheduling. Red flags bottlenecks
and cost.'' In serving diagrams, you will see orange load balancers feeding
blue inference runners, with red marking decode bottlenecks.
% -- FLEX: [OPTIONAL] Skip if students already know the palette from earlier chapters.
IF SHORT: Say ``same colors as always'' and advance.
}
\small
Throughout this course, colors carry meaning:
@@ -124,11 +153,34 @@ Throughout this course, colors carry meaning:
% =============================================================================
\begin{frame}{Why Serving Is Different}
\note{[3 min] Core thesis of the chapter. Training maximizes throughput; serving
minimizes latency. Same hardware, opposite priorities. Walk through the DAM
inversion: Data goes from Volume to Freshness, Algorithm from Mutable to
Frozen, Machine from Saturation to Headroom.
Ask: ``If training saturates GPUs at 95\%, why would serving aim for 50\%?''}
\note{
% -- LINK: What prior concept connects to this slide
In every prior chapter, success meant saturating the GPU. This slide
reveals why that strategy fails in production serving.
% -- NARRATE: What to SAY while showing this slide
Point to the diagram: ``Left side is training --- maximize throughput,
saturate hardware. Right side is serving --- minimize latency, maintain
headroom.'' Walk through the DAM inversion: Data shifts from Volume to
Freshness, Algorithm from Mutable to Frozen, Machine from Saturation to
Headroom.
ANALOGY: ``Training is a freight train --- pack it full. Serving is an
ambulance --- it must always be ready to go.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``If training saturates GPUs at 95\%, why would serving aim for
50\%?'' Give 15 seconds, then cold-call. [Expected: queuing theory ---
high utilization causes latency spikes.]
% -- WARN: What students will get wrong on THIS topic
Common error: students think serving is just training with batch size 1.
Correct framing: serving inverts the optimization objective itself ---
latency replaces throughput as the primary metric.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: ``What happens to cost when you run GPUs at 50\% instead of 95\%?''
IF SHORT: Show diagram, state the inversion, skip the DAM walkthrough.
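The queuing-theory answer hinted at above can be made concrete with the simplest model. This is an illustrative M/M/1 sketch (an assumption --- the deck only names queuing theory), with $\rho$ the utilization and $\mu$ the service rate:

```latex
% Illustrative M/M/1 sketch (assumed model): mean time in system blows up
% as utilization rho -> 1, which is why serving holds ~50% headroom.
\[
  W \;=\; \frac{1/\mu}{1-\rho}
  \qquad\Longrightarrow\qquad
  \frac{W(\rho = 0.95)}{W(\rho = 0.50)}
  \;=\; \frac{1/0.05}{1/0.50}
  \;=\; 10.
\]
```

Under this model, running at 95\% instead of 50\% utilization makes the mean time in system ten times worse --- the quantitative version of ``the ambulance must always be ready.''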
}
% --- Full-width diagram ---
\centering
@@ -140,9 +192,34 @@ Ask: ``If training saturates GPUs at 95\%, why would serving aim for 50\%?''}
\end{frame}
\begin{frame}{The D\raisebox{0.04em}{\tiny$\bullet$}A\raisebox{0.04em}{\tiny$\bullet$}M Inversion}
\note{[2 min] Formalize the inversion along DAM axes. Students already know DAM
from Ch1; here we show how every axis flips. The Iron Law shifts from the
compute term dominating (training) to the latency term dominating (serving).}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide showed the inversion visually. This slide formalizes
it along the three DAM axes students learned in Ch1.
% -- NARRATE: What to SAY while showing this slide
Walk down the table row by row: ``Data: training ingests billions of
samples; serving handles one request at a time --- freshness replaces
volume. Algorithm: training runs backprop; serving is forward-only ---
no optimizer state needed. Machine: training saturates at 95\%; serving
holds headroom at 40--60\% to absorb traffic spikes.'' End on the Iron
Law row: ``The dominant term flips from compute to latency overhead.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Before showing the bottom card, ask: ``If both training and serving use
the same GPU, why does the Iron Law's dominant term change?''
[Expected: serving processes one request at low arithmetic intensity,
making the latency/overhead term dominate.]
% -- WARN: What students will get wrong on THIS topic
Common error: students think removing backprop makes serving trivially
easy. Correct framing: removing backprop frees memory but exposes the
latency term that training's large batches amortized away.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: ``What happens to the Machine row during a traffic spike?''
IF SHORT: Cover Data and Machine rows; skip Algorithm row.
}
\scriptsize
\renewcommand{\arraystretch}{1.1}
@@ -167,10 +244,33 @@ compute term dominating (training) to the latency term dominating (serving).}
\end{frame}
\begin{frame}{Static vs.\ Dynamic Inference}
\note{[2 min] First architectural decision: when to compute predictions.
Static = pre-compute overnight (photo classification). Dynamic = on-demand
(content moderation). Most production systems use a hybrid.
If short on time: cover the table quickly and move on.}
\note{
% -- LINK: What prior concept connects to this slide
The DAM inversion showed that serving prioritizes freshness over volume.
This slide presents the first design decision that follows: when to
compute predictions.
% -- NARRATE: What to SAY while showing this slide
Point to the green card: ``Static inference pre-computes overnight ---
10,000 photos times 5 ms equals 50 seconds total. Zero runtime latency,
but it cannot handle novel inputs.'' Then the red card: ``Dynamic
inference computes on demand under a 100 ms budget. Flexible but
expensive.'' Finish with the hybrid insight at the bottom.
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``A search engine autocomplete --- static or dynamic?''
[Expected: hybrid --- common queries are cached, novel queries computed
on demand.]
% -- WARN: What students will get wrong on THIS topic
Common error: students dismiss static inference as outdated. Correct
framing: recommendation systems pre-compute candidate sets for millions
of users nightly; only the final ranking is dynamic.
% -- FLEX: [OPTIONAL] This slide provides context but is not load-bearing.
IF SHORT: Cover the two cards quickly and move on to Server Anatomy.
IF AHEAD: ``What determines the boundary between cached and dynamic?''
}
\footnotesize
\begin{columns}[T]


@@ -71,11 +71,38 @@
\section{Welcome}
\begin{frame}{Welcome to Volume II}
\note{[3 min] Set the tone: this is the advanced course. Students already know
how one machine works; now they learn how thousands coordinate. The metaphor:
``The fleet is the computer'' --- like ``The network is the computer'' (Sun
Microsystems), but for ML clusters. Ask: ``How many of you have SSH'd into a
multi-GPU cluster?''}
\note{
% -- LINK: What prior concept connects to this slide
Volume I taught the single-machine mental model: one node, 1--8 GPUs,
shared memory, PCIe/NVLink. This slide establishes that everything
students learned still applies --- but a new scale axis changes the rules.
% -- NARRATE: What to SAY while showing this slide
Point to the comparison table on the right: ``Every row flips when you
cross the node boundary. The bus becomes a network. Shared memory becomes
message passing. Rare failures become daily certainty.'' Use the tagline:
``The fleet is the computer'' --- echoing Sun Microsystems' ``The network
is the computer,'' but applied to ML clusters.
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``How many of you have SSH'd into a multi-GPU cluster?''
Hands up. Then: ``How many have debugged a training job that stalled
because one of 1,000 GPUs went silent?'' The gap between those two
counts is what this course fills.
% -- WARN: What students will get wrong on THIS topic
Common error: students assume ``distributed = just add more GPUs.''
Correct framing: crossing a node boundary changes failure modes,
communication patterns, and programming models qualitatively.
IF STUCK: Ask them to compare restarting a crashed browser tab vs.\
restarting one node in a 1,000-node synchronized training job.
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] This is the opening frame --- it sets the tone for the entire course.
IF AHEAD: Ask what specific Vol I concept they found most surprising;
connect it to the fleet version.
IF SHORT: Skip the hand-raise, go straight to the table walkthrough.
}
\small
\begin{columns}[T]
@@ -122,10 +149,38 @@ multi-GPU cluster?''}
% =============================================================================
\begin{frame}{From Single Node to Fleet}
\note{[3 min] The transition slide. Walk through left vs.\ right.
Key insight: at fleet scale, the network replaces the memory bus as the
critical interconnect. Latency goes from nanoseconds to milliseconds.
Ask: ``What happens to your training job when one of 10,000 GPUs dies?''}
\note{
% -- LINK: What prior concept connects to this slide
The Welcome slide introduced the fleet concept verbally. This diagram
makes the transition visual --- left side is Vol I territory, right side
is Vol II.
% -- NARRATE: What to SAY while showing this slide
Walk left-to-right through the diagram: ``On the left, one node ---
everything connected by NVLink at 900 GB/s, failures are rare, latency
is nanoseconds. Cross the dotted line to the right: the bus becomes
InfiniBand at 50 GB/s, latency jumps to microseconds, and with 10,000
GPUs, one fails every few hours. The network replaces the memory bus as
the critical interconnect.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``What happens to your training job when one of 10,000 GPUs dies?''
Give 15 seconds of think time. Expected answer: the entire job stalls
or crashes --- which is exactly why fault tolerance is first-class.
% -- WARN: What students will get wrong on THIS topic
Students will underestimate the latency jump: nanoseconds to microseconds
sounds small, but it is a 1,000x increase. At 350 GB of gradients per
step, that 1,000x turns into minutes of synchronization overhead per hour.
IF STUCK: Compare it to a highway going from 300 mph to 0.3 mph at a
toll booth --- the booth is the node boundary.
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] This diagram anchors the entire course narrative.
IF AHEAD: ``At what fleet size does the probability of zero failures
during a 1-hour training window drop below 50\%?''
IF SHORT: Point to the diagram, read the insight callout, move on.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -137,10 +192,41 @@ Ask: ``What happens to your training job when one of 10,000 GPUs dies?''}
\end{frame}
\begin{frame}{Why Scale Changes Everything}
\note{[3 min] Three fundamental changes. These are NOT just ``more of the same.''
Each one requires new engineering disciplines that do not exist in single-node ML.
Common error: students think distributed = ``just add more GPUs.''
Ask: ``Which of these three surprises you most?''}
\note{
% -- LINK: What prior concept connects to this slide
The previous diagram showed the physical transition. This slide names
the three qualitative changes that make fleet engineering a different
discipline from single-node optimization.
% -- NARRATE: What to SAY while showing this slide
Point to each card in order. ``First, communication dominates ---
at 10,000 GPUs, AllReduce takes longer than the forward pass. Second,
failure is routine --- a 100,000 h MTBF spread across 10,000 GPUs means
one failure every 10 hours. Third, emergent behavior --- stragglers, hot spots, and

cascading failures that no single component predicts.''
ANALOGY: ``A single car can break down. A fleet of 10,000 taxis has
at least one broken down at any moment --- and every taxi must wait for
every other taxi before the next fare.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Which of these three surprises you most?'' Cold-call one student.
Most will say failure frequency --- use that to preview the reliability
math coming later.
% -- WARN: What students will get wrong on THIS topic
Students will treat these as independent problems. In reality they
interact: a straggler (emergent behavior) that triggers a timeout
(failure) during an AllReduce (communication) cascades across all three.
IF STUCK: Walk through a concrete cascade: slow GPU triggers BSP
barrier timeout, which triggers checkpoint, which stalls all 10,000 GPUs.
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] These three categories structure the entire course.
IF AHEAD: ``Can you think of a fourth qualitative change at scale?''
(Answer: cost --- 1\% inefficiency at 10,000 GPUs is millions of dollars.)
IF SHORT: Read the three card titles and the bottom-line callout, skip
the analogy.
}
\small
\begin{columns}[T]
@@ -175,11 +261,41 @@ Ask: ``Which of these three surprises you most?''}
% =============================================================================
\begin{frame}{The \Cthree{} Taxonomy: Compute, Communication, Coordination}
\note{[3 min] Introduce the diagnostic framework for Vol 2. Every performance
problem in a fleet can be traced to one of these three axes. This replaces
the single-node D-A-M taxonomy.
Ask: ``If training stalls for 30 seconds every hour, which C is the culprit?''
(Answer: Coordination --- likely checkpointing or straggler mitigation.)}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide named three qualitative changes at scale. The C3
taxonomy formalizes these into a diagnostic framework --- it replaces
the single-node D-A-M taxonomy with a fleet-scale lens.
% -- NARRATE: What to SAY while showing this slide
Point to the diagram: ``Every performance problem in a fleet traces to
one of these three axes. Compute: are the GPUs doing useful math?
Communication: how fast can gradients and activations move? Coordination:
who decides what runs where, when to checkpoint, how to handle stragglers?''
Draw the parallel explicitly: ``Vol I asked `is this workload Data-bound,
Algorithm-bound, or Machine-bound?' Vol II asks `is the fleet bottlenecked
on Compute, Communication, or Coordination?'''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``If training stalls for 30 seconds every hour, which C is the
culprit?'' Give 15 seconds. Cold-call. Expected answer: Coordination ---
likely synchronous checkpointing or straggler mitigation, not raw
compute or bandwidth.
% -- WARN: What students will get wrong on THIS topic
Students will conflate Communication and Coordination. Communication is
moving bytes (AllReduce, gradient sync). Coordination is making decisions
(scheduling, checkpointing, barrier management). A straggler that slows
AllReduce is a Coordination problem manifesting through Communication.
IF STUCK: ``Communication is the pipe. Coordination is the traffic cop.''
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] C3 is the diagnostic backbone of the entire volume.
IF AHEAD: ``Can a problem be bottlenecked on two C's simultaneously?
Give an example.'' (Yes: gradient compression reduces Communication but
adds Compute overhead for encode/decode.)
IF SHORT: Show diagram, ask the 30-second stall question, move on.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -191,9 +307,37 @@ Ask: ``If training stalls for 30 seconds every hour, which C is the culprit?''
\end{frame}
\begin{frame}{\Cthree{} in Practice}
\note{[2 min] Concrete examples mapping real problems to the C3 axes.
Walk through the table row by row. The point: every fleet problem has
a C3 diagnosis that guides the engineering response.}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide defined C3 abstractly. This table maps real fleet
symptoms to C3 axes, showing the framework in diagnostic action.
% -- NARRATE: What to SAY while showing this slide
Walk through the table row by row. ``Low GPU utilization? That is
Compute --- memory-bound kernels, fix with operator fusion. Throughput
plateau? Communication --- AllReduce saturated, fix with gradient
compression. Periodic 30-second stalls? Coordination --- synchronous
checkpointing, fix with async checkpointing.'' Emphasize the pattern:
diagnosis precedes optimization.
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Cover the ``Response'' column with your hand. Point to ``Throughput
variance'' and ask: ``Which C axis and what would you do?'' Give 20
seconds. Expected: Coordination (straggler nodes), response is straggler
mitigation or redundant computation.
% -- WARN: What students will get wrong on THIS topic
Students will jump to solutions before diagnosing. ``Just buy faster
GPUs'' is the default instinct. The table shows that 4 of 6 symptoms
are NOT solved by faster hardware --- they require algorithmic or
systems-level interventions.
IF STUCK: Ask which rows would NOT be helped by doubling GPU TFLOPS.
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] The C3 concept was introduced on the previous slide.
IF SHORT: Show the table briefly, read the insight callout, move on.
IF AHEAD: Ask students to propose a 7th row with a novel symptom.
}
\scriptsize
\renewcommand{\arraystretch}{1.15}
@@ -217,9 +361,27 @@ a C3 diagnosis that guides the engineering response.}
% --- ACTIVE LEARNING 1: Predict ---
\begin{frame}{Predict: Where Is the Bottleneck?}
\note{[2 min] Prediction exercise. Give students 60 seconds. Do NOT reveal
the answer yet. The point: build intuition for C3 diagnosis before the
course teaches the formal tools. Ask 2-3 students to share.}
\note{
% -- NARRATE: What to SAY while showing this slide
Read the scenario aloud. ``You are training a 70B model across 4,096
GPUs. Throughput is only 40\% of theoretical peak. Which C3 axis is
most likely the bottleneck?'' Emphasize: 40\% means 60\% of silicon
is doing nothing useful.
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Give 60 seconds for individual writing. Then 30 seconds of pair
discussion. Cold-call 2--3 students. Expected answer: Communication ---
at 4,096 GPUs, AllReduce overhead for 70B parameters (140 GB gradients)
is massive. Some students will say Compute (kernel inefficiency) ---
acknowledge it but redirect: at 40\% peak with 4,096 GPUs, the
Communication term is almost certainly dominant.
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] This is the first active learning moment --- establishes the
predict-before-reveal pattern for the entire course.
IF SHORT: Reduce to 30 seconds writing, skip pair discussion, cold-call
one student.
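A back-of-envelope check of the expected answer. Hedged: the fp16 gradient size (2 bytes/param, matching the 140 GB figure above), the 50 GB/s per-GPU bandwidth (the InfiniBand number quoted earlier in the deck), and the function name `ring_allreduce_seconds` are all illustrative assumptions, not measured values.

```python
# Back-of-envelope cost of one full-gradient ring AllReduce for the
# 70B-parameter / 4,096-GPU scenario in the note above.
# Assumptions (illustrative): fp16 gradients (2 bytes/param),
# 50 GB/s of per-GPU inter-node bandwidth.

def ring_allreduce_seconds(params, bytes_per_param, gpus, bw_bytes_per_s):
    """Time for a bandwidth-optimal ring AllReduce of the full gradient."""
    grad_bytes = params * bytes_per_param             # 70e9 * 2 = 140 GB
    # Each GPU sends and receives 2*(N-1)/N of the buffer in a ring.
    traffic_per_gpu = 2 * (gpus - 1) / gpus * grad_bytes
    return traffic_per_gpu / bw_bytes_per_s

print(f"{ring_allreduce_seconds(70e9, 2, 4096, 50e9):.1f} s")  # ~5.6 s
```

Several seconds of pure communication per step, before any overlap with compute --- which is why Communication is the expected diagnosis at this scale.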
}
\centering
\vspace{0.8cm}
@@ -243,12 +405,42 @@ Write your answer and one reason why. \textcolor{midgray}{(60 seconds)}}
% =============================================================================
\begin{frame}{The Fleet Stack: Four Layers}
\note{[3 min] The organizing framework for the entire course. Walk through
bottom to top: infrastructure provides the physical substrate, distributed ML
adds the training algorithms, deployment puts models into production,
governance ensures responsible operation.
Key insight: you cannot skip layers. A governance failure is ultimately
an infrastructure failure.}
\note{
% -- LINK: What prior concept connects to this slide
C3 diagnoses WHERE the bottleneck is. The Fleet Stack organizes HOW the
course addresses each layer of the system --- from physical silicon to
societal governance.
% -- NARRATE: What to SAY while showing this slide
Walk bottom-to-top through the diagram: ``Infrastructure provides the
physical substrate --- compute, network, storage. Distributed ML adds
the training algorithms --- data, tensor, pipeline parallelism.
Deployment puts models into production --- inference optimization,
scheduling, SLAs. Governance ensures responsible operation --- security,
sustainability, fairness.'' Pause on the insight: ``You cannot skip
layers. A governance failure is ultimately an infrastructure failure ---
if the cluster cannot audit which data trained which model, no amount
of policy fixes that.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Which layer do most ML courses stop at?'' Expected answer:
Layer 2 (Distributed ML). ``This course covers all four --- production
ML fails when any layer is neglected.''
% -- WARN: What students will get wrong on THIS topic
Students will treat layers as independent. In reality, decisions at the
bottom constrain possibilities at the top: a network topology that
cannot isolate tenants makes multi-tenant governance impossible.
IF STUCK: Give the concrete example: ``If your IB fabric has no SR-IOV,
you cannot do secure multi-tenant serving.''
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] The Fleet Stack is the course roadmap --- every chapter maps
to a layer.
IF AHEAD: ``Which layer do you think has the highest dollar-cost of
getting wrong?''
IF SHORT: Name the four layers, read the insight callout, move on.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -260,9 +452,38 @@ an infrastructure failure.}
\end{frame}
\begin{frame}{Fleet Stack: What Each Layer Teaches}
\note{[2 min] Quick overview of what students will learn in each layer.
This is the ``what's in it for me'' slide. Emphasize that the course
covers the full stack, not just distributed training.}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide showed the Fleet Stack as a diagram. This table
translates it into concrete skills students will develop in each layer.
% -- NARRATE: What to SAY while showing this slide
Read each row as a promise: ``In the Infrastructure layer, you will
learn to reason about cluster hardware --- GPU selection, InfiniBand
topology, storage hierarchy. In Distributed ML, you will design parallel
training --- DP, TP, PP, collective ops, fault-tolerant training.''
Emphasize the crimson card: ``Most courses stop at Layer 2. This course
covers all four.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Which layer interests you most and why?'' Quick show of hands
for each layer. Use the distribution to preview which chapters will
resonate most.
% -- WARN: What students will get wrong on THIS topic
Students will discount Governance as ``not technical.'' Counter with:
the EU AI Act requires auditable training provenance --- that is a
distributed systems problem (data lineage across a 10,000-GPU fleet),
not a policy problem.
IF STUCK: ``If you cannot prove which data trained your model, you
cannot deploy in Europe. That is infrastructure.''
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] The Fleet Stack was covered on the previous slide.
IF SHORT: Skip this slide entirely --- the diagram carries the message.
IF AHEAD: Ask which layer they think is most under-invested at real
companies.
}
\scriptsize
\renewcommand{\arraystretch}{1.15}
@@ -289,10 +510,39 @@ covers the full stack, not just distributed training.}
% =============================================================================
\begin{frame}{17 Chapters in Four Parts}
\note{[3 min] The map of the semester. Walk through each part briefly.
Emphasize that C3 threads through all four parts. Point out the chapter
numbers so students can look ahead. If short: just name the four parts
and move on.}
\note{
% -- LINK: What prior concept connects to this slide
The Fleet Stack named four layers. This roadmap maps those layers to
17 specific chapters across the semester, showing the learning arc.
% -- NARRATE: What to SAY while showing this slide
Point to each part in the diagram: ``Part I is Infrastructure ---
compute, network, storage. Part II is Distributed ML --- parallelism
strategies, collective communication, fault tolerance. Part III is
Deployment --- inference, scheduling, serving. Part IV is Governance ---
security, sustainability, responsible AI.'' Highlight the color coding:
``Notice how C3 threads through all four parts --- Compute constraints
dominate Parts I and II, Communication in Parts II and III, Coordination
in Parts III and IV.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Looking at this map, which chapter title are you most curious
about?'' Quick poll. This surfaces student interests early.
% -- WARN: What students will get wrong on THIS topic
Students will assume the parts are independent. In reality, Part III
(Deployment) depends heavily on Part I (Infrastructure) decisions ---
you cannot optimize inference serving without understanding the memory
hierarchy from Chapter 2.
IF STUCK: ``Think of it as a building: you cannot furnish the penthouse
before pouring the foundation.''
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] This is a reference slide that students will revisit.
IF SHORT: Show the diagram, name the four parts, move on.
IF AHEAD: Ask students to predict which part will be most relevant
to their career goals.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -306,10 +556,40 @@ and move on.}
\end{frame}
\begin{frame}{The Numbers That Define Fleet Scale}
\note{[3 min] Let the numbers speak. Pause on each one. The failure rate
calculation is the most surprising: 10,000 GPUs with 100,000h MTBF means
one failure every 10 hours. Ask: ``How does this change how you think about
writing training code?''}
\note{
% -- LINK: What prior concept connects to this slide
The roadmap showed the course structure. This slide grounds it in
physical reality --- the numbers that make fleet engineering different
from single-node work.
% -- NARRATE: What to SAY while showing this slide
Let the numbers speak. Pause on each one: ``10,000 GPUs. 100 MW of
power --- that is a small city. 350 GB of gradients synchronized every
few seconds. And the one that changes everything: 10,000 GPUs with
100,000h MTBF means one failure every 10 hours. Not one failure per
year. Every 10 hours.''
ANALOGY: ``Imagine a 10,000-person orchestra where one musician
collapses every 10 hours, and the entire orchestra must stop and restart
from a checkpoint.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``How does this failure rate change how you think about writing
training code?'' Give 20 seconds. Expected insight: checkpointing
becomes the most critical code path, not the model architecture.
% -- WARN: What students will get wrong on THIS topic
Students will try to prevent failures rather than engineer for
resilience. At 10,000 GPUs, prevention is impossible --- the math
guarantees failures. The correct framing is: minimize recovery time,
not failure probability.
IF STUCK: ``You cannot prevent rain. You build a roof.''
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] These numbers recur throughout the entire course as anchors.
IF AHEAD: ``Calculate the fleet MTBF for 25,000 GPUs with 100,000h
per-GPU MTBF.'' (Answer: 4 hours.)
IF SHORT: Highlight the failure rate number and the bottom-line callout.
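The failure-rate arithmetic above (and the IF AHEAD exercise) follows from one line, assuming independent failures --- an assumption worth stating aloud:

```latex
% Assumes independent failures; MTBF figures are the ones quoted in the note.
\[
  \mathrm{MTBF}_{\mathrm{fleet}} \;=\; \frac{\mathrm{MTBF}_{\mathrm{GPU}}}{N}
  \;=\; \frac{100{,}000\ \mathrm{h}}{10{,}000} \;=\; 10\ \mathrm{h},
  \qquad
  N = 25{,}000 \;\Rightarrow\; 4\ \mathrm{h}.
\]
```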
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -322,7 +602,17 @@ writing training code?''}
% --- ACTIVE LEARNING: Micro-Retrieval Cue ---
\begin{frame}{Quick Check}
\note{[1 min] Answer: ~once per hour (10000/8760).}
\note{
% -- NARRATE: What to SAY while showing this slide
Read the question aloud: ``If one GPU fails once per year, how often
does a 10,000-GPU cluster fail?'' Pause 15 seconds. Cold-call.
Answer: approximately once per hour (10,000 failures/year divided by
8,760 hours/year is about 1.14 per hour).
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] This cements the failure-rate intuition before moving on.
IF SHORT: Ask, pause 10 seconds, give the answer, move on.
}
\centering
\vspace{1.0cm}
@@ -343,9 +633,37 @@ writing training code?''}
% =============================================================================
\begin{frame}{Learning Outcomes}
\note{[2 min] Read through outcomes. These map to assessable skills.
Emphasize that every outcome is \emph{quantitative} --- ``design'' means
calculate, not just describe.}
\note{
% -- LINK: What prior concept connects to this slide
The roadmap and scale numbers established what the course covers and
why. This slide translates that into measurable skills students will
demonstrate.
% -- NARRATE: What to SAY while showing this slide
Read through outcomes emphasizing the verbs: ``Design --- not describe,
design. Calculate --- not estimate, calculate. Every outcome is
quantitative. When we say `design distributed training pipelines,' we
mean you will specify TP degree, PP stages, and DP replicas for a given
model and cluster, then calculate the expected MFU.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Which of these seven outcomes do you currently feel least
prepared for?'' Quick show of hands per outcome. Use the distribution
to calibrate pacing for early chapters.
% -- WARN: What students will get wrong on THIS topic
Students will underestimate outcome 7 (governance). It requires the
same quantitative rigor as the others --- calculating carbon footprint
per training run, auditing data provenance across a distributed pipeline.
IF STUCK: ``Governance is not an essay. It is a systems design problem
with measurable constraints.''
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] Outcomes are a reference --- students will revisit on the
syllabus.
IF SHORT: Skim the list, emphasize the quantitative verbs, move on.
IF AHEAD: Ask students to rank the outcomes by difficulty.
}
\footnotesize
By the end of this course, you will be able to:
@@ -364,8 +682,19 @@ By the end of this course, you will be able to:
\end{frame}
\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: What to SAY while showing this slide
Point to each card: ``Blue means compute or processing --- GPU ops,
forward/backward pass. Green means data flow or healthy paths. Orange
means routing or scheduling. Red means error, cost, or bottleneck.
These colors are consistent across every diagram and slide in the
course. When you see red in a figure, something is wrong or expensive.''
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] Reference slide for the color system.
IF SHORT: Skip entirely --- students will absorb the colors through
exposure.
}
\small
Throughout this course, colors carry meaning:
@@ -400,10 +729,43 @@ Throughout this course, colors carry meaning:
\begin{frame}{A Taste of What's Coming}
\note{[3 min] The hook. Make it visceral. Walk through the scenario step by step.
The point: every one of these failure modes is a chapter in the course.
Ask: ``Which of these problems would you know how to solve today?''
(Expected answer: none of them --- that's why they're taking this course.)}
\note{
% -- LINK: What prior concept connects to this slide
The learning outcomes listed abstract skills. This scenario makes them
visceral --- a concrete frontier training run where every failure mode
maps to a course chapter.
% -- NARRATE: What to SAY while showing this slide
Walk through the scenario step by step: ``25,000 GPUs. A GPU dies every
4 hours. AllReduce across 25,000 GPUs takes 500 ms per step. A rack
switch fails and 128 GPUs go dark. Gradient staleness causes loss
spikes. Power budget limits utilization to 80\%.'' Then map each
problem to a chapter on the right: ``Fault-tolerant checkpointing ---
Chapter 7. Collective communication optimization --- Chapter 6. Network
fabric redundancy --- Chapter 3. Async training --- Chapter 5.
Sustainability and power budgets --- Chapter 15.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Which of these problems would you know how to solve today?''
Expected answer: none of them --- and that gap is exactly why they are
taking this course. If a student claims to know one, press for
specifics.
% -- WARN: What students will get wrong on THIS topic
Students will try to solve each problem in isolation. The real challenge
is that these problems interact: a GPU failure triggers checkpointing,
which saturates the network, which causes gradient staleness, which
spikes loss. The system view matters more than any individual fix.
IF STUCK: ``These are not five separate problems. They are one system
with five failure modes.''
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] This is the motivational hook that justifies the entire course.
IF AHEAD: ``What is the dollar cost of 4 hours of 25,000 idle H100s
at \$3/GPU-hour?'' (Answer: \$300K wasted per failure event.)
IF SHORT: Read the left column only, point to the right column as
``what this course teaches,'' move on.
}
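The ``IF AHEAD'' cost question above is a single multiplication; a quick sketch using the slide's illustrative \$3/GPU-hour rate:

```python
# 4 idle hours across 25,000 H100s at an assumed $3/GPU-hour rate
idle_hours = 4
n_gpus = 25_000
usd_per_gpu_hour = 3

wasted_usd = idle_hours * n_gpus * usd_per_gpu_hour
print(wasted_usd)  # 300000 -- the $300K per failure event in the note
```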
\footnotesize
\textbf{Scenario: Training a Frontier Model Across 25,000 GPUs}
@@ -441,10 +803,40 @@ Ask: ``Which of these problems would you know how to solve today?''
% --- ACTIVE LEARNING 2: Discussion ---
\begin{frame}{Discussion: What Breaks First?}
\note{[3 min] Turn-and-talk. Students discuss in pairs for 90 seconds.
Cold-call 2-3 pairs. No single right answer --- the point is that
``more GPUs'' is never the full answer. Common student answer: ``the network''
--- press them on whether that's communication or coordination.}
\note{
% -- LINK: What prior concept connects to this slide
The scenario slide listed five failure modes. This discussion forces
students to reason about which one strikes first --- building intuition
for failure ordering at scale.
% -- NARRATE: What to SAY while showing this slide
Read the scenario: ``Your startup raised \$100M. You buy 10,000 H100
GPUs, connect them with InfiniBand. What breaks first?'' Point to the
five options along the bottom.
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Turn-and-talk for 90 seconds. Then cold-call 2--3 pairs. There is no
single right answer --- the point is that ``more GPUs'' is never the
full answer. Common student answer: ``the network'' --- press them on
whether that is Communication (bandwidth saturation) or Coordination
(scheduling, straggler management). Some will say ``your budget'' ---
that is a valid and insightful answer worth exploring.
% -- WARN: What students will get wrong on THIS topic
Students will focus on GPU hardware failures. In practice, the software
stack (NCCL hangs, CUDA OOM, driver crashes) breaks far more often than
physical hardware. Meta's Grand Teton paper reports that software issues
cause more downtime than hardware failures.
IF STUCK: ``Think about what you have to set up BEFORE the GPUs start
computing.''
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] This is the second active learning moment and the most
interactive slide in the deck.
IF SHORT: Reduce to 60-second pairs, cold-call one pair.
IF AHEAD: After discussion, ask: ``What would you monitor to detect
the failure before it happens?''
}
\centering
\vspace{0.8cm}
@@ -471,10 +863,40 @@ You buy 10,000 H100 GPUs and connect them with InfiniBand.\\[0.3cm]
% =============================================================================
\begin{frame}{Prerequisites}
\note{[2 min] Set expectations clearly. Students need Vol 1 or equivalent.
Distributed systems basics (consensus, message passing) are helpful but
will be introduced as needed. Programming: PyTorch and basic Linux/cluster
experience expected.}
\note{
% -- LINK: What prior concept connects to this slide
The course content and motivation are established. Now students need to
know: am I prepared for this?
% -- NARRATE: What to SAY while showing this slide
Walk through the Required column: ``Vol I or equivalent --- you must
understand single-GPU training, the memory hierarchy, and basic
profiling. PyTorch proficiency --- you will write torch.distributed
code from week 2. Linux and SSH --- you will be running jobs on
multi-node clusters.'' Then the Helpful column: ``Distributed systems
concepts like consensus and RPC will be introduced as needed. If you
know Slurm or Kubernetes, you will have a head start on the scheduling
chapters, but it is not required.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Raise your hand if you are comfortable explaining what AllReduce
does.'' The fraction of hands up calibrates how much Ch.~5--6 review
is needed.
% -- WARN: What students will get wrong on THIS topic
Students without cluster experience will feel behind. Reassure them:
the course introduces cluster concepts from scratch. The real
prerequisite is single-GPU fluency, not distributed systems expertise.
IF STUCK: ``If you can train a ResNet on one GPU and profile where time
goes, you are ready.''
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] Logistical slide.
IF SHORT: Name the top 3 required skills, mention the ``helpful but not
required'' list exists, move on.
IF AHEAD: Ask a student with cluster experience to share one surprising
lesson from their first multi-node job.
}
\small
\begin{columns}[T]
@@ -505,11 +927,40 @@ experience expected.}
\end{frame}
\begin{frame}{Relationship to Volume I}
\note{[2 min] Address the key question: ``Is this harder than Vol I?''
Answer: not harder --- wider. The distinction is SCOPE, not DEPTH.
Vol I went deep on one machine; Vol II goes wide across the fleet.
Both are equally rigorous. Vol II re-derives key frameworks (like the
Iron Law) at fleet scale.}
\note{
% -- LINK: What prior concept connects to this slide
Prerequisites established what students need. This slide addresses the
elephant in the room: ``Is this harder than Vol I?''
% -- NARRATE: What to SAY while showing this slide
Point to the two cards side by side: ``Vol I: one machine, 1--8 GPUs,
shared memory, DataParallel, the Iron Law. Vol II: 10,000+ GPUs,
InfiniBand, message passing, torch.distributed, C3.'' Then the key
message: ``The distinction is SCOPE, not DEPTH. Vol I went deep on one
machine. Vol II goes wide across the fleet. Both are equally rigorous.
We re-derive key frameworks like the Iron Law at fleet scale, adding
communication and coordination terms.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``What Vol I concept do you think will change most at fleet scale?''
Cold-call. Any answer works --- use it to preview how that concept
evolves. (e.g., ``the memory wall'' becomes the ``network wall.'')
% -- WARN: What students will get wrong on THIS topic
Students will assume Vol II content is ``harder.'' It is not --- it is
different. A student who struggled with cache hierarchies in Vol I may
excel at distributed fault tolerance in Vol II. The thinking patterns
are complementary, not hierarchical.
IF STUCK: ``Think of it as learning to fly after learning to drive.
Different skills, not harder skills.''
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] Context-setting slide.
IF SHORT: Read the crimson card at the bottom, skip the detailed
comparison.
IF AHEAD: Ask: ``Which Vol I equation do you think we will generalize
first?'' (Answer: the Iron Law, in Chapter 1.)
}
\small
\begin{columns}[T]
@@ -559,9 +1010,17 @@ Iron Law) at fleet scale.}
% --- MUDDIEST POINT ---
\begin{frame}{Muddiest Point}
\note{[2 min] Quick anonymous poll. Students write on a slip of paper or submit
digitally. Collect and scan for patterns. Address the top 2--3 confusions in the
next lecture's opening. This closes the feedback loop.}
\note{
% -- NARRATE: What to SAY while showing this slide
``Before we close, I want to know what confused you most. Write one
sentence --- the concept you found muddiest. Anonymous. Submit before
you leave --- slip of paper or digital form.'' Scan responses after
class and address the top 2--3 confusions at the start of next lecture.
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] Closes the feedback loop --- essential for adaptive teaching.
IF SHORT: Reduce to ``write one word on a slip of paper.''
}
\centering
\vspace{1.0cm}
@@ -576,8 +1035,18 @@ next lecture's opening. This closes the feedback loop.}
\end{frame}
\begin{frame}{What Were the Key Ideas?}
\note{[2 min] Retrieval practice. Students write 90 seconds, no notes.
Do NOT show next slide yet. Walk around the room.}
\note{
% -- NARRATE: What to SAY while showing this slide
``Close your notes. 90 seconds. Write down the 3 most important ideas
from today. No peeking.'' Walk around the room while students write.
Do NOT show the next slide yet --- the struggle to recall is where
learning happens.
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] Retrieval practice is the highest-impact learning technique
(Rosenshine). Never skip.
IF SHORT: Reduce to 60 seconds.
}
\centering
\vspace{1.5cm}
@@ -592,9 +1061,24 @@ Do NOT show next slide yet. Walk around the room.}
\end{frame}
\begin{frame}{Key Takeaways}
\note{[2 min] Reveal. Walk through each bullet. Emphasize that every concept
here will recur throughout the course. The C3 framework is the lens; the
Fleet Stack is the map.}
\note{
% -- LINK: What prior concept connects to this slide
Students just attempted recall. This slide reveals the answers and
fills gaps.
% -- NARRATE: What to SAY while showing this slide
Walk through each bullet, pausing on the quantitative anchors: ``The
fleet is the computer --- thousands of accelerators as one unit. Scale
is qualitative --- not just more, but fundamentally different. C3
framework --- every bottleneck maps to Compute, Communication, or
Coordination. Fleet Stack --- four layers from infrastructure to
governance. Failure math --- one failure every 10 hours at 10,000 GPUs.
Scope not depth --- Vol II is wider, not harder.''
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] Consolidation slide.
IF SHORT: Read just the first three bullets.
}
\scriptsize
\begin{itemize}\setlength\itemsep{0pt}
@@ -610,7 +1094,17 @@ Fleet Stack is the map.}
\end{frame}
\begin{frame}{References}
\note{[1 min] Point students to foundational readings for the course.}
\note{
% -- NARRATE: What to SAY while showing this slide
``These five references are the foundational readings for the course.
Verbraeken is the survey that maps the landscape. Narayanan and Jiang
show how frontier labs actually train at scale. Dean is the historical
origin. Patterson and Hennessy is the pedagogical model we follow.''
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] Reference slide.
IF SHORT: Skip verbal walkthrough --- students can read it.
}
\small
\mlsysref{Verbraeken+20}{Verbraeken et al. ``A Survey on Distributed ML.'' ACM Computing Surveys, 2020.}
@@ -622,10 +1116,24 @@ Fleet Stack is the map.}
\end{frame}
\begin{frame}{Next Lecture: The Distributed Landscape}
\note{[1 min] Forward hook. The next lecture introduces the fleet as a system:
what hardware is in a modern GPU cluster, how nodes are connected, and why
the topology matters. The question ``how do 10,000 GPUs talk to each other?''
is the central puzzle.}
\note{
% -- LINK: What prior concept connects to this slide
Today established why scale matters and introduced C3 and the Fleet
Stack. The next lecture dives into the first concrete question: what
is actually inside a modern GPU cluster?
% -- NARRATE: What to SAY while showing this slide
``The fleet is the computer. But what hardware makes up this computer?
Next lecture: Chapter 1 introduces the fleet as a system --- what
accelerators, what interconnects, what topology, and why the answer to
`how do 10,000 GPUs talk to each other?' is the central puzzle of
fleet engineering.'' Point to the three columns: Compute, Communication,
Coordination --- the C3 lens applied to hardware.
% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] Forward hook that creates anticipation.
IF SHORT: Read the central question and move on.
}
\small
\centering
@@ -664,7 +1172,14 @@ is the central puzzle.}
\appendix
\begin{frame}{Backup: Extended Reference}
\note{Backup slide with additional reference material for this chapter.}
\note{
% -- NARRATE: What to SAY while showing this slide
Backup reference slide. Only show if students ask for additional
resources or problem set reference material.
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] Backup slide --- do not present unless needed.
}
\footnotesize
This slide provides extended reference material for students who want to go deeper.
@@ -679,7 +1194,14 @@ textbook's summary tables. Use them as a quick reference during problem sets.
\end{frame}
\begin{frame}{Backup: Further Reading}
\note{Backup slide. Point students to additional resources beyond the references slide.}
\note{
% -- NARRATE: What to SAY while showing this slide
Backup slide pointing to additional resources. Only present if a
student asks ``where can I learn more before next class?''
% -- FLEX: [OPTIONAL] + contingency
[OPTIONAL] Backup slide --- do not present unless needed.
}
\footnotesize
\textbf{For deeper exploration:}


@@ -68,10 +68,23 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Walk through objectives. Emphasize that this chapter bridges
the physical network (Ch5) with the algorithms that run on it. Every concept
reduces to the alpha-beta model. Ask: ``How long does it take to send 140 GB
across a datacenter?''}
\note{
% -- LINK: What prior concept connects to this slide
Ch5 established the physical network --- NVLink, InfiniBand, fat-tree topologies. This chapter asks: what traffic patterns actually flow over those wires during training?
% -- NARRATE: What to SAY while showing this slide
Read each objective aloud, pausing on ``alpha-beta model'' and ``gradient compression.'' Emphasize that every concept in this chapter reduces to one question: how long does it take to move N bytes across P GPUs?
ANALOGY: ``Ch5 built the highway system; today we study the traffic patterns.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``How long does it take to send 140 GB of gradients across a datacenter?'' Accept guesses --- we will calculate the exact answer shortly.
% -- WARN: What students will get wrong on THIS topic
Students often think communication is a minor overhead. Set up the surprise: at scale, GPUs spend most of their time waiting, not computing.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Read only objectives 1, 2, and 6 aloud; let students read the rest.
}
\small
\begin{enumerate}
@@ -86,8 +99,13 @@ across a datacenter?''}
\end{frame}
\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: What to SAY while showing this slide
Point to each card: ``Blue is compute --- GPU ops, forward and backward passes. Green is data flow and healthy paths. Orange is routing and scheduling. Red is error, cost, or bottleneck.'' These colors are consistent across every diagram in this course.
% -- FLEX: [OPTIONAL] Skip if students have seen this in a prior lecture.
IF SHORT: Say ``same color system as last lecture'' and move on in 15 seconds.
}
\small
Throughout this course, colors carry meaning:
@@ -126,9 +144,23 @@ Throughout this course, colors carry meaning:
% =============================================================================
\begin{frame}{The Communication Bottleneck}
\note{[3 min] Open with the visceral fact: at scale, GPUs spend most of their
time waiting for data, not computing. The 70B model example grounds this.
Ask: ``If compute is cheap but communication is expensive, what should we optimize?''}
\note{
% -- LINK: What prior concept connects to this slide
Ch5 showed that InfiniBand NDR delivers 50 GB/s per port. This slide reveals what happens when you actually need to move hundreds of gigabytes of gradients across that fabric.
% -- NARRATE: What to SAY while showing this slide
Point to the left column: ``Adding GPUs is easy --- compute scales linearly. But coordination scales quadratically or worse.'' Walk through the 70B example: 70 billion params times 4 bytes = 280 GB of gradients per step. Point to the red card: ``Ring AllReduce across 64 GPUs at 50 GB/s costs 11.2 seconds of pure communication per training step.''
ANALOGY: ``Imagine 64 people in a room each holding a 4 GB file. Everyone needs a copy of the merged result. That is AllReduce.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``If compute is cheap but communication is expensive, which term in the Iron Law should we optimize?'' Expected answer: the data movement term.
% -- WARN: What students will get wrong on THIS topic
Students assume adding more GPUs always speeds up training. At 64+ GPUs, communication can consume 40--70\% of the step, making additional GPUs counterproductive without communication optimization.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask ``At what GPU count does communication exceed 50\% of the step?''
}
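The 70B example in the note reduces to a two-line calculation. A hedged sketch with the slide's illustrative numbers; note the slide uses the large-N limit of the ring cost, and real training steps overlap some of this communication with compute:

```python
params = 70e9          # 70B parameters
bytes_per_param = 4    # fp32 gradients
beta = 50e9            # bytes/s per GPU (IB NDR, the slide's figure)

grad_bytes = params * bytes_per_param  # 280 GB of gradients per step
# Ring AllReduce moves 2*(N-1)/N * n bytes per GPU; for large N this
# approaches 2*n, which is the approximation the slide uses.
t_comm = 2 * grad_bytes / beta
print(t_comm)  # 11.2 seconds of communication per step
```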
\footnotesize
\begin{columns}[T]
@@ -156,9 +188,22 @@ Ask: ``If compute is cheap but communication is expensive, what should we optimi
\end{frame}
\begin{frame}{The Compute-Communication Timeline}
\note{[3 min] Walk through the stacked bars. Key transition: from NVLink-dominated
(8 GPUs, 25\% comm) to InfiniBand-limited (4096 GPUs, 65\% comm + 15\% sync).
Ask: ``At what point does buying more GPUs stop helping?''}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide stated 40--70\% overhead abstractly. This diagram shows the concrete breakdown as GPU count grows from 8 to 4,096.
% -- NARRATE: What to SAY while showing this slide
Point to the stacked bars left-to-right: ``At 8 GPUs, NVLink keeps communication to about 25\% of the step. At 256 GPUs, InfiniBand dominates and communication rises to 50\%. At 4,096 GPUs, communication plus sync consume 80\% of the training step.'' Trace the inflection point where the blue bar shrinks relative to the red and orange bars.
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``At what point does buying more GPUs stop helping?'' Expected answer: when communication exceeds the compute time gained by adding GPUs --- roughly the 512--1024 GPU range without hierarchical AllReduce.
% -- WARN: What students will get wrong on THIS topic
Students read ``80\% communication'' and think the hardware is slow. The hardware is near wire-speed --- the problem is algorithmic (flat Ring over too many hops).
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Point to the 4,096-GPU bar and state the 80\% figure; skip the intermediate data points.
}
% --- Layout: FULL-WIDTH IMAGE + annotation ---
\centering
@@ -170,10 +215,23 @@ Ask: ``At what point does buying more GPUs stop helping?''}
\end{frame}
\begin{frame}{The Physics of Data Movement}
\note{[2 min] Three physical constraints: speed of light (latency floor),
bandwidth-distance product, and energy per bit. Emphasize that these are
not software problems --- they are physics constraints.
Common error: students think faster NICs solve everything.}
\note{
% -- LINK: What prior concept connects to this slide
The timeline showed communication growing with scale. This slide explains the three physical constraints that make communication fundamentally expensive, regardless of the algorithm.
% -- NARRATE: What to SAY while showing this slide
Walk through the table row by row. ``Latency: light in fiber travels at 5 microseconds per kilometer. A 500-meter datacenter round-trip costs thousands of GPU cycles.'' ``Bandwidth: PAM4 signaling limits copper to about 2 meters at NDR speeds --- that is why optical cables cost more.'' ``Energy: moving a bit over InfiniBand costs 20--50 picojoules, 40--100 times more than an SRAM access.'' Then highlight RDMA: ``GPUDirect RDMA bypasses the kernel, cutting latency from 10--20 microseconds to 1--3 microseconds.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Which of these three constraints is a software problem that we can engineer away?'' Expected answer: none --- all three are physics. Software can only minimize exposure, not eliminate the constraints.
% -- WARN: What students will get wrong on THIS topic
Students think faster NICs solve everything. Even with 800G networking, latency is bounded by speed of light and energy per bit scales with distance. The constraints are multiplicative, not additive.
% -- FLEX: [OPTIONAL] Can be compressed to 1 minute.
IF SHORT: State ``three physics constraints --- latency, bandwidth, energy --- none are software problems'' and move on.
IF AHEAD: Discuss the energy implications at 10K GPUs where communication power approaches compute power.
}
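The latency-floor arithmetic in the NARRATE section checks out directly. The 1.5 GHz clock below is an assumption for illustration --- the slide says only ``thousands of GPU cycles'':

```python
us_per_km = 5.0            # light in fiber: ~5 microseconds per km
round_trip_km = 2 * 0.5    # 500 m each way across the datacenter

latency_s = us_per_km * 1e-6 * round_trip_km  # 5 microseconds
gpu_clock_hz = 1.5e9                          # assumed 1.5 GHz clock
cycles_lost = latency_s * gpu_clock_hz
print(round(cycles_lost))  # 7500 -- "thousands of GPU cycles"
```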
\footnotesize
\textbf{Three constraints interact multiplicatively:}
@@ -197,10 +255,22 @@ Common error: students think faster NICs solve everything.}
% --- ACTIVE LEARNING 1: Predict ---
\begin{frame}{Predict: What Determines Communication Cost?}
\note{[2 min] Prediction exercise before revealing the alpha-beta model.
Give students 60 seconds. Do NOT reveal the answer yet.
Ask 2--3 students to share. Most will say ``bandwidth'' --- set up the reveal
that latency matters too.}
\note{
% -- LINK: What prior concept connects to this slide
Students just learned about the three physics constraints. This prediction exercise primes them for the alpha-beta model by asking them to reason about message size before seeing the formula.
% -- NARRATE: What to SAY while showing this slide
Read the prompt aloud. ``A 4 KB message and a 140 GB message, same 50 GB/s network. Which takes longer relative to its theoretical minimum?'' Emphasize ``relative'' --- the 140 GB message takes longer in absolute time, but the question is about overhead ratio.
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Give 60 seconds to write. Do NOT reveal the answer. Ask 2--3 students to share. Most will say ``the small message'' --- this is correct. The 4 KB message is dominated by startup latency (alpha), making its actual time orders of magnitude above the bandwidth limit. The 140 GB message is almost entirely bandwidth-bound, achieving near-theoretical throughput.
% -- WARN: What students will get wrong on THIS topic
Some students will say ``the large message'' because it takes longer in absolute time. Redirect: the question is about relative overhead, not absolute time. This distinction is exactly what the alpha-beta model formalizes.
% -- FLEX: [CORE] This slide is essential --- it primes the alpha-beta model.
IF SHORT: Reduce to 30 seconds of think time and skip pair sharing.
}
\centering
\vspace{1.0cm}
@@ -225,10 +295,23 @@ $T(n) = \underbrace{\alpha}_{\text{Startup latency}} + \underbrace{\dfrac{n}{\be
}
\begin{frame}{Two Regimes, One Crossover}
\note{[3 min] Walk through the alpha-beta model. The critical message size
n* = alpha * beta separates two regimes. For IB NDR: n* = 100 KB.
MoE tokens (4 KB) are latency-bound; LLM gradients (140 GB) are bandwidth-bound.
Ask: ``Which optimization helps MoE but not LLMs?''}
\note{
% -- LINK: What prior concept connects to this slide
The prediction exercise showed that small and large messages behave very differently. The alpha-beta model formalizes this into two regimes separated by the critical message size n*.
% -- NARRATE: What to SAY while showing this slide
Point to the diagram: ``Left of n-star, latency dominates --- the message is so small that startup overhead dwarfs transfer time. Right of n-star, bandwidth dominates --- the message is large enough that transfer time dwarfs startup.'' State the crossover: ``For IB NDR, n-star is about 100 KB. MoE tokens at 4 KB are deeply latency-bound. LLM gradients at 140 GB are deeply bandwidth-bound.''
ANALOGY: ``Alpha is the cost of picking up the phone. Beta is how fast you can talk. A one-word message is dominated by dialing time. A novel is dominated by reading speed.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Which optimization helps MoE routing tokens but not LLM gradients?'' Expected answer: latency reduction (RDMA, kernel bypass, topology optimization). Bandwidth compression helps LLMs but not MoE.
% -- WARN: What students will get wrong on THIS topic
Students conflate ``latency'' with ``slowness.'' Clarify: a latency-bound message is not slow in absolute terms --- it just cannot be sped up by increasing bandwidth. The optimization target is different for each regime.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask students to calculate n* for PCIe Gen5 (alpha=5us, beta=64 GB/s) and compare with IB NDR.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -256,9 +339,22 @@ Ask: ``Which optimization helps MoE but not LLMs?''}
\end{frame}
\begin{frame}{Your Turn: Critical Message Size}
\note{[3 min] Give students 90 seconds. Answer: n* = 2e-6 * 50e9 = 100 KB.
A 140 GB gradient is 1.4 million times above n*, so bandwidth optimization
dominates. A 4 KB MoE token is 25x below n*, so latency optimization dominates.}
\note{
% -- LINK: What prior concept connects to this slide
Students just learned the alpha-beta model conceptually. This exercise makes them apply it quantitatively for the first time.
% -- NARRATE: What to SAY while showing this slide
Read the problem aloud. ``InfiniBand NDR 400G: alpha is 2 microseconds, beta is 50 GB/s.'' Give 90 seconds. After the pause, walk through: ``n-star equals alpha times beta equals 2 times 10 to the minus 6 times 50 times 10 to the 9 equals 100,000 bytes equals 100 KB.'' Then: ``A 140 GB gradient is 1.4 million times above n-star --- bandwidth optimization dominates. A 4 KB MoE token is 25 times below n-star --- latency optimization dominates.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Students calculate n* and classify two workloads. After solving, ask neighbors to compare. Cold-call one pair to present.
% -- WARN: What students will get wrong on THIS topic
Unit errors: students will forget to convert microseconds to seconds or gigabytes to bytes. Emphasize writing out the full exponents: 2e-6 times 50e9.
% -- FLEX: [CORE] This slide is essential --- first quantitative application of alpha-beta.
IF SHORT: Show the solution immediately (skip the 90-second work period) and narrate through it.
}
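The solution narrated above is two multiplications; a minimal Python sketch with the slide's IB NDR parameters, including the two workload classifications:

```python
alpha = 2e-6    # startup latency: 2 microseconds, in seconds
beta = 50e9     # bandwidth: 50 GB/s, in bytes per second

n_star = alpha * beta   # critical message size in bytes
print(n_star / 1e3)     # ~100 KB crossover

print(140e9 / n_star)   # ~1.4e6x above n*: gradient is bandwidth-bound
print(n_star / 4e3)     # ~25x below n*: MoE token is latency-bound
```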
\small
\begin{columns}[T]
@@ -300,9 +396,22 @@ dominates. A 4 KB MoE token is 25x below n*, so latency optimization dominates.}
% =============================================================================
\begin{frame}{The Six Core Primitives}
\note{[3 min] Walk through all six. Key insight: AllReduce = ReduceScatter + AllGather.
FSDP exploits this decomposition. AllToAll is the hardest to scale (O(N\^2) connections).
Ask: ``Why can't we use AllReduce for MoE?''}
\note{
% -- LINK: What prior concept connects to this slide
The alpha-beta model tells us how expensive a single message is. But distributed training does not send single messages --- it uses collective operations involving all GPUs simultaneously. This slide catalogs the six primitives.
% -- NARRATE: What to SAY while showing this slide
Walk through the diagram left-to-right: ``Broadcast: one-to-all. Reduce: all-to-one with aggregation. AllReduce: everyone gets the aggregated result --- this is the workhorse of data parallelism. AllGather: everyone gets everyone's shard. ReduceScatter: reduce then distribute shards. AllToAll: everyone sends a unique piece to everyone else --- the hardest to scale.'' Then state the key decomposition: ``AllReduce equals ReduceScatter plus AllGather. FSDP exploits this by splitting the two phases in time.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Why can't we use AllReduce for Mixture-of-Experts routing?'' Expected answer: MoE needs to send different tokens to different experts --- that is a personalized exchange (AllToAll), not a global aggregation (AllReduce).
% -- WARN: What students will get wrong on THIS topic
Students assume AllReduce is always the right choice because it is the most discussed. MoE and RecSys require AllToAll, which has fundamentally different scaling properties ($O(N^2)$ connections).
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask students to sketch how FSDP uses ReduceScatter during backward and AllGather during forward.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -314,10 +423,22 @@ Ask: ``Why can't we use AllReduce for MoE?''}
\end{frame}
\begin{frame}{Matching Primitives to Parallelism}
\note{[3 min] This is the ``travel manifest'' --- the parallelism strategy
determines the communication pattern. Data parallelism = AllReduce (bandwidth-bound).
MoE = AllToAll (latency + contention). Wrong primitive = wrong scaling ceiling.
Common error: using AllReduce for everything.}
\note{
% -- LINK: What prior concept connects to this slide
The six primitives are tools. This slide is the ``travel manifest'' that maps each parallelism strategy to its required primitive.
% -- NARRATE: What to SAY while showing this slide
Walk through the table: ``Data parallel uses AllReduce --- bandwidth-bound because gradients are large. Tensor parallel also uses AllReduce but is latency-bound because it happens within each layer, requiring NVLink speeds. Pipeline parallel uses point-to-point --- latency-bound because stages are sequential. MoE uses AllToAll --- the hardest, because it creates O(N-squared) logical connections.'' Point to the red card: ``AllToAll hits a communication wall much earlier than AllReduce because contention grows quadratically.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``At 1024 GPUs, which parallelism strategy hits the communication wall first: data parallel or expert parallel?'' Expected answer: expert parallel, because AllToAll creates $O(N^2)$ connections while AllReduce is $O(1)$ per-node bandwidth.
% -- WARN: What students will get wrong on THIS topic
Students think using AllReduce for everything is safe. For MoE and RecSys workloads, AllReduce cannot express the required communication pattern --- using it would require redundant computation or incorrect results.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Focus on data parallel (AllReduce) and MoE (AllToAll) rows; skip the middle rows.
}
\footnotesize
\renewcommand{\arraystretch}{1.15}
@@ -346,10 +467,22 @@ Common error: using AllReduce for everything.}
% =============================================================================
\begin{frame}{Ring AllReduce: Bandwidth-Optimal}
\note{[3 min] Walk through the two phases: Scatter-Reduce + AllGather.
Key property: every link active every step. Bandwidth-optimal but O(N) latency.
For 10,000 nodes, 20,000 sequential hops is devastating.
Ask: ``What breaks when N = 10,000?''}
\note{
% -- LINK: What prior concept connects to this slide
The primitives table showed AllReduce is the workhorse for data-parallel and tensor-parallel training. This slide examines the simplest bandwidth-optimal implementation: the Ring algorithm.
% -- NARRATE: What to SAY while showing this slide
Point to the diagram: ``Phase 1, Scatter-Reduce: N minus 1 steps where each GPU sends one chunk clockwise and accumulates partial sums. Phase 2, AllGather: N minus 1 more steps where the completed sums circulate to all nodes.'' Write the formula: ``Total time is 2(N-1) alpha plus 2 times (N-1)/N times M over beta.'' Then highlight: ``The bandwidth term approaches 2M/beta as N grows --- optimal! But the latency term is O(N) --- 2 times (N-1) startup delays. For 10,000 nodes, that is 20,000 sequential hops.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``What breaks when N equals 10,000?'' Expected answer: the O(N) latency term. At alpha=2 microseconds and N=10,000, latency alone costs 40 milliseconds --- before any data moves.
% -- WARN: What students will get wrong on THIS topic
Students see ``bandwidth-optimal'' and assume Ring is always the best choice. It is optimal only in the bandwidth term; the O(N) latency term makes it catastrophic for small messages or large clusters.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask students to calculate the Ring latency overhead for N=10,000 at alpha=2 microseconds.
}
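The timing formula in the NARRATE cue can be sketched as a small calculator. This uses the alpha-beta model only; alpha = 2 microseconds and beta = 50 GB/s are the illustrative values from the slide:

```python
def ring_allreduce_time(n_gpus, msg_bytes, alpha_s, beta_bytes_per_s):
    """Ring AllReduce under the alpha-beta model:
    T = 2(N-1)*alpha + 2*((N-1)/N)*M/beta."""
    latency = 2 * (n_gpus - 1) * alpha_s
    bandwidth = 2 * ((n_gpus - 1) / n_gpus) * msg_bytes / beta_bytes_per_s
    return latency, bandwidth

# The ENGAGE answer: at N = 10,000 and alpha = 2 us, the O(N) latency
# term alone is ~40 ms, before a single gradient byte has moved.
lat, bw = ring_allreduce_time(10_000, 1_000_000_000, 2e-6, 50e9)
```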
\small
\begin{columns}[T]
@@ -379,10 +512,23 @@ Ask: ``What breaks when N = 10,000?''}
\end{frame}
\begin{frame}{Algorithm Comparison}
\note{[3 min] Walk through the four algorithms. Ring = BW-optimal but O(N) latency.
Tree = O(log N) latency but log N BW penalty. Butterfly = best of both but needs N=2\^k.
Double Binary Tree = NCCL default. Crossover formula determines the winner.
If short: focus on Ring vs Tree.}
\note{
% -- LINK: What prior concept connects to this slide
Ring AllReduce is bandwidth-optimal but has O(N) latency. This slide introduces three alternatives that trade bandwidth efficiency for lower latency, and identifies when each wins.
% -- NARRATE: What to SAY while showing this slide
Walk through the table: ``Ring: O(N) latency, bandwidth-optimal --- best for large gradients on small-to-medium clusters. Tree: O(log N) latency, but log N bandwidth penalty --- best for small messages. Butterfly: best of both but requires N to be a power of 2. Double Binary Tree: NCCL's default, near-optimal in both.'' Then point to the crossover formula: ``M-crossover equals N times alpha times beta. Below this message size, Tree wins. Above it, Ring wins. For 64 GPUs on IB NDR, the crossover is about 6.4 MB.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``For a 1 MB AllReduce across 64 GPUs, which algorithm wins?'' Expected answer: Tree, because 1 MB is below the 6.4 MB crossover.
% -- WARN: What students will get wrong on THIS topic
Students memorize ``Ring is optimal'' without qualifying it. Ring is bandwidth-optimal but latency-poor. The crossover formula quantifies exactly when Ring loses to Tree.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Focus on Ring vs Tree only. Skip Butterfly and Double Tree rows.
IF AHEAD: Ask students to calculate the crossover for 256 GPUs.
}
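The crossover rule in the NARRATE cue can be checked numerically. A sketch assuming alpha = 2 microseconds and beta = 50 GB/s (IB NDR ballpark), as on the slide:

```python
def crossover_bytes(n_gpus, alpha_s, beta_bytes_per_s):
    """M* = N * alpha * beta.
    Below M*, Tree's O(log N) latency wins; above it, Ring's bandwidth wins."""
    return n_gpus * alpha_s * beta_bytes_per_s

m64 = crossover_bytes(64, 2e-6, 50e9)    # ~6.4e6 bytes: the 6.4 MB on the slide
m256 = crossover_bytes(256, 2e-6, 50e9)  # ~25.6e6 bytes: the IF-AHEAD answer
```

This also resolves the ENGAGE question: a 1 MB AllReduce across 64 GPUs sits below the 6.4 MB crossover, hence Tree wins.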
\footnotesize
\renewcommand{\arraystretch}{1.15}
@@ -411,10 +557,22 @@ For 64 GPUs on IB NDR: $M_{\text{crossover}} \approx 64 \times 2\ \mu\text{s} \t
% =============================================================================
\begin{frame}{The Bandwidth Hierarchy}
\note{[3 min] Real clusters are NOT flat. NVLink is 18x faster than InfiniBand.
A flat Ring wastes NVLink by routing data over IB when NVLink suffices.
The 3-phase hierarchical approach confines expensive IB traffic.
Ask: ``How much does inter-node traffic drop with 8 GPUs per node?''}
\note{
% -- LINK: What prior concept connects to this slide
The algorithm comparison assumed a flat network where every link has the same bandwidth. Real clusters have a hierarchy: NVLink at 900 GB/s within a node, InfiniBand at 50 GB/s between nodes. Ignoring this hierarchy wastes NVLink bandwidth.
% -- NARRATE: What to SAY while showing this slide
Point to the diagram phases: ``Phase 1: ReduceScatter within each node using NVLink at 900 GB/s. Phase 2: AllReduce across nodes using InfiniBand at 50 GB/s --- but now each node sends only 1/G of the data, where G is GPUs per node. Phase 3: AllGather within each node using NVLink again.'' Emphasize: ``The expensive IB traffic is confined to 1/G of the data. With 8 GPUs per node, inter-node traffic drops 8 times.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``How much does inter-node traffic drop with 8 GPUs per node?'' Expected answer: 8 times, because each node reduces locally before sending across the network.
% -- WARN: What students will get wrong on THIS topic
Students think hierarchical AllReduce is an optimization trick. It is actually the default in NCCL --- flat Ring across nodes is the anti-pattern. The hierarchy is not optional; it matches the physical bandwidth tiers.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: State the three phases and the 1/G reduction; skip the detailed bandwidth numbers.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -440,9 +598,22 @@ Ask: ``How much does inter-node traffic drop with 8 GPUs per node?''}
\end{frame}
\begin{frame}{Hierarchical AllReduce: Worked Example}
\note{[3 min] Walk through the numbers: flat = 40 ms, hierarchical = 7 ms.
5.7x speedup from respecting the bandwidth hierarchy. The key: inter-node
traffic drops by 8x (GPUs per node). This is why NCCL defaults to hierarchical.}
\note{
% -- LINK: What prior concept connects to this slide
The hierarchy concept was introduced abstractly. This worked example puts concrete numbers to each phase and shows a 5.7x speedup.
% -- NARRATE: What to SAY while showing this slide
Walk through the table: ``Flat Ring sends 2 GB over InfiniBand at 50 GB/s: 40 ms. Hierarchical: Phase 1, ReduceScatter sends 875 MB over NVLink at 900 GB/s: about 1 ms. Phase 2, inter-node AllReduce sends only 125 MB over IB: about 5 ms. Phase 3, AllGather sends 875 MB over NVLink: about 1 ms. Total: 7 ms. That is a 5.7x speedup just from respecting the bandwidth hierarchy.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Where did the 125 MB come from?'' Expected answer: 1 GB divided by 8 GPUs per node. Each node reduces locally first, so only the reduced shard crosses the network.
% -- WARN: What students will get wrong on THIS topic
Students forget that the inter-node data volume shrinks by 1/G. They apply the original 1 GB to the IB bandwidth and get the wrong answer. Emphasize: local reduction is the key step.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask ``What happens with 16 GPUs per node instead of 8?'' (Answer: inter-node traffic drops to 62.5 MB, further speedup.)
}
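The three phases in the NARRATE cue reduce to a back-of-the-envelope model. The formulas below are simplifications that keep only the bandwidth terms, so they land near, not exactly on, the slide's 7 ms and 5.7x figures:

```python
def flat_ring_time(msg_bytes, beta_inter):
    """Flat Ring across nodes: ~2M bytes traverse the slow inter-node links."""
    return 2 * msg_bytes / beta_inter

def hierarchical_time(msg_bytes, gpus_per_node, beta_intra, beta_inter):
    """Three-phase hierarchical AllReduce, bandwidth terms only."""
    g = gpus_per_node
    intra = (g - 1) / g * msg_bytes / beta_intra   # ReduceScatter / AllGather
    inter = 2 * (msg_bytes / g) / beta_inter       # only 1/G crosses the network
    return intra + inter + intra

flat = flat_ring_time(1e9, 50e9)               # ~40 ms
hier = hierarchical_time(1e9, 8, 900e9, 50e9)  # ~7 ms
speedup = flat / hier                          # ~5.7x
```

For the IF-AHEAD variant with 16 GPUs per node, the same model gives roughly 4.6 ms, since the inter-node shard halves to 62.5 MB.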
\footnotesize
\textbf{1 GB gradient, 8 nodes $\times$ 8 GPUs (64 total)}
@@ -472,7 +643,13 @@ traffic drops by 8x (GPUs per node). This is why NCCL defaults to hierarchical.}
% --- ACTIVE LEARNING: Micro-Retrieval Cue ---
\begin{frame}{Quick Check}
\note{[1 min] Answer: confines expensive IB traffic to 1/G of the data.}
\note{
% -- NARRATE: What to SAY while showing this slide
Read the question. Give 15 seconds of silence. Then cold-call. Expected answer: hierarchical AllReduce confines expensive IB traffic to 1/G of the data by reducing locally first within each NVLink domain.
% -- FLEX: [CORE] Quick retrieval cue --- takes only 1 minute.
IF SHORT: Ask the question aloud and answer it yourself in 20 seconds.
}
\centering
\vspace{1.0cm}
@@ -490,10 +667,22 @@ traffic drops by 8x (GPUs per node). This is why NCCL defaults to hierarchical.}
% --- ACTIVE LEARNING 2: Discussion ---
\begin{frame}{Discussion: AllReduce vs.\ AllToAll Scaling}
\note{[3 min] Turn-and-talk. AllReduce scales gracefully (O(1) per-node BW).
AllToAll creates O(N\^2) connections --- network contention is the wall.
This is why MoE hits limits earlier than dense LLMs.
Cold-call 2--3 pairs.}
\note{
% -- LINK: What prior concept connects to this slide
Students learned that AllReduce is bandwidth-bound and AllToAll creates $O(N^2)$ connections. This discussion forces them to reason about which scaling limit is hit first.
% -- NARRATE: What to SAY while showing this slide
Read the prompt. Set the timer for 90 seconds. Walk around the room listening to pairs. After time, cold-call 2--3 pairs.
% -- ENGAGE: Specific question, prediction, or task for THIS slide
The MoE model hits the communication wall first because AllToAll creates $O(N^2)$ logical connections, causing network contention to grow quadratically. AllReduce maintains $O(1)$ per-node bandwidth regardless of cluster size. At 512 GPUs, AllToAll contention becomes the dominant bottleneck while AllReduce remains manageable.
% -- WARN: What students will get wrong on THIS topic
Some students will say ``both are equally hard'' because they have the same number of GPUs. Redirect: it is not the hardware that differs --- it is the communication pattern. AllReduce is a structured reduction; AllToAll is a full permutation.
% -- FLEX: [CORE] This slide is essential --- builds intuition for MoE scaling limits.
IF SHORT: Do a show-of-hands poll (AllReduce vs AllToAll) instead of pair discussion.
}
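The expected discussion outcome can be backed with two one-line counts. A sketch; ``flows'' here means logical pairwise exchanges, not physical links:

```python
def alltoall_flows(n):
    """AllToAll: every rank sends a distinct shard to every other rank."""
    return n * (n - 1)

def ring_allreduce_per_gpu_bytes(n, msg_bytes):
    """Ring AllReduce: per-GPU traffic approaches 2M, independent of N."""
    return 2 * (n - 1) / n * msg_bytes

flows = alltoall_flows(512)                       # 261,632 pairwise flows
per_gpu = ring_allreduce_per_gpu_bytes(512, 1e9)  # just under 2 GB, flat in N
```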
\centering
\vspace{0.8cm}
@@ -514,10 +703,22 @@ and a MoE model using AllToAll on the same 512-GPU cluster.\\[0.3cm]
% =============================================================================
\begin{frame}{Gradient Compression Techniques}
\note{[3 min] When even the fastest wires aren't enough, send fewer bits.
Walk through the compression spectrum: FP16 (2x), INT8 (4x), Top-K (100x),
1-bit (32x). Key: always use Error Feedback beyond FP16.
Ask: ``What happens if you discard 99\% of gradients without error feedback?''}
\note{
% -- LINK: What prior concept connects to this slide
Hierarchical AllReduce minimizes wasted bandwidth. But when even the fastest wires are not enough, the next strategy is to send fewer bits per gradient element.
% -- NARRATE: What to SAY while showing this slide
Walk through the diagram left-to-right: ``FP16 gives 2x compression with almost no quality loss --- this is the baseline. INT8 gives 4x. Top-K sparsification sends only the largest 1\% of gradients for 100x compression. 1-bit quantization sends only the sign of each gradient for 32x compression.'' Then point to the rule: ``Beyond FP16, always use Error Feedback. Without it, small gradients below the threshold are permanently lost, causing 1--3\% accuracy degradation.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``What happens if you discard 99\% of gradients without error feedback?'' Expected answer: small but persistent gradients are permanently lost, causing the model to converge to a worse optimum (1--3\% accuracy loss).
% -- WARN: What students will get wrong on THIS topic
Students assume ``99\% compression with only 1\% accuracy loss'' is free. The accuracy loss compounds over training --- 1\% on a benchmark can mean significantly worse real-world performance. Error Feedback is the mechanism that makes aggressive compression safe.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Discuss how PowerSGD achieves better compression ratios than Top-K by projecting gradients into a low-rank subspace.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -529,10 +730,23 @@ Ask: ``What happens if you discard 99\% of gradients without error feedback?''}
\end{frame}
\begin{frame}{Error Feedback: No Information Lost}
\note{[3 min] Walk through the error feedback mechanism step by step.
Without EF: small gradients below threshold are permanently lost.
With EF: residuals accumulate until they cross the threshold.
Sum(transmitted) + error = true gradient. This is the key mathematical guarantee.}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide stated the Error Feedback rule. This slide proves why it works by walking through the mathematical guarantee step by step.
% -- NARRATE: What to SAY while showing this slide
Point to the equation: ``Error feedback stores the residual: what we wanted to send minus what we actually sent.'' Walk through the table row by row: ``Step 1: gradient 0.4, error 0, sum 0.4, below threshold, send 0, new error 0.4. Step 2: gradient 0.3 plus error 0.4 equals 0.7, above threshold, send 1, new error -0.3.'' Continue through all 5 steps. Then: ``After 5 steps, we sent 2 and have error -0.4. The true sum of all gradients is 1.6. And 2 plus -0.4 equals 1.6 --- nothing was lost, just delayed.''
ANALOGY: ``Error feedback is like a jar where you save your loose change. Each day you might not have enough for a coffee, but the jar accumulates until you do.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Without error feedback, how much of the 1.6 total gradient would have been transmitted?'' Expected answer: none of it --- every individual gradient falls below the send threshold on its own, so naive compression transmits nothing.
% -- WARN: What students will get wrong on THIS topic
Students think error feedback is approximate. It is mathematically exact: sum of transmitted values plus the final error always equals the true gradient sum. The information is delayed, not lost.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Skip rows 3--4 in the table walkthrough; show rows 1, 2, and 5 to demonstrate the pattern.
}
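The mechanism in the NARRATE cue can be executed directly. The first two gradients (0.4, 0.3) and the final tally (sent 2, error -0.4, true sum 1.6) come from the slide; the last three gradient values are assumed here to reproduce that tally, and nearest-integer rounding stands in for the quantizer:

```python
def compress_with_error_feedback(grads, quantize=round):
    """Transmit quantize(grad + carried_error); carry the residual forward.
    Invariant: sum(sent) + final_error == sum(grads) (delayed, never lost)."""
    error, sent = 0.0, []
    for g in grads:
        want = g + error       # what we would like to transmit this step
        out = quantize(want)   # what we actually transmit
        sent.append(out)
        error = want - out     # residual carried to the next step
    return sent, error

grads = [0.4, 0.3, 0.5, 0.2, 0.2]   # last three values assumed
sent, err = compress_with_error_feedback(grads)
# sent == [0, 1, 0, 0, 1]: total transmitted 2, residual ~ -0.4,
# and 2 + (-0.4) == 1.6 == sum(grads): nothing lost, just delayed
```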
\small
\begin{columns}[T]
@@ -583,11 +797,23 @@ Sum(transmitted) + error = true gradient. This is the key mathematical guarantee
% =============================================================================
\begin{frame}{Communication-Computation Overlap}
\note{[3 min] The final optimization: hide communication behind computation.
Layer-by-layer overlap launches AllReduce for completed layers while
earlier layers still compute. Bucket fusion amortizes alpha overhead.
Walk through the pipelined timeline vs sequential.
Ask: ``When does overlap fail?''}
\note{
% -- LINK: What prior concept connects to this slide
Hierarchical AllReduce and gradient compression reduce communication time. Overlap is the final strategy: hide whatever communication remains behind computation.
% -- NARRATE: What to SAY while showing this slide
Point to the diagram: ``The sequential timeline shows backward pass completing fully, then AllReduce starting. The pipelined timeline interleaves them: as layer N finishes its backward pass, its AllReduce launches immediately while layer N-1 continues computing.'' Point to the bucket fusion detail: ``Bucket fusion groups small per-layer AllReduces into larger chunks to amortize the alpha overhead --- typically 25--100 MB buckets.''
ANALOGY: ``Overlap is like washing dishes while the next pot boils. You are doing two things in parallel using different resources.''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``When does overlap fail?'' Expected answer: when AllReduce per layer takes longer than the backward pass per layer --- the communication is ``exposed'' and cannot be hidden.
% -- WARN: What students will get wrong on THIS topic
Students assume overlap hides all communication. It only hides communication that fits within the computation window. If AllReduce per layer exceeds backward per layer, the excess is exposed and adds to total time.
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask students to derive the condition for full overlap: $T_{\text{bwd}}$ per layer must exceed $T_{\text{AR}}$ per layer.
}
% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -599,9 +825,22 @@ Ask: ``When does overlap fail?''}
\end{frame}
\begin{frame}{Overlap Budget: Worked Example}
\note{[2 min] 32-layer 7B model: without overlap = 1325 ms, with overlap = 360 ms.
73\% savings. The remaining exposed comm comes from AllReduce being slower than
per-layer backward. To eliminate: increase batch size or reduce AllReduce time.}
\note{
% -- LINK: What prior concept connects to this slide
The overlap concept was introduced qualitatively. This slide quantifies it for a 32-layer 7B model to show 73\% savings.
% -- NARRATE: What to SAY while showing this slide
Walk through the table: ``Backward per layer is 15 ms. AllReduce per layer is 26 ms for 880 MB of gradients in 100 MB buckets. Without overlap: 480 ms of backward plus 832 ms of AllReduce, roughly 1.3 seconds total. With overlap: the backward pass starts, and each layer's AllReduce launches immediately. But 26 ms exceeds 15 ms, so 11 ms per layer is exposed. Total exposed: about 360 ms --- a 73\% savings.'' Then: ``The remaining exposed communication can be reduced by increasing batch size (longer backward) or using faster networking (shorter AllReduce).''
% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``What two knobs reduce the 11 ms exposed gap per layer?'' Expected answer: increase batch size (makes backward longer) or reduce AllReduce time (faster network, compression, hierarchical).
% -- WARN: What students will get wrong on THIS topic
Students read ``73\% savings'' and think the problem is solved. The remaining 360 ms is still a significant cost at scale. The full optimization stack (hierarchical + compression + overlap) reduces effective overhead to 5--15\%, not zero.
% -- FLEX: [OPTIONAL] Can be compressed if running behind schedule.
IF SHORT: State the 73\% savings result and the condition $T_{\text{bwd}} > T_{\text{AR}}$ per layer; skip the detailed numbers.
}
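The worked example reduces to a per-layer max() model. This sketch ignores fixed per-step overheads, so its totals are close to, but not exactly, the slide's figures:

```python
def overlap_budget(n_layers, t_bwd_layer_s, t_ar_layer_s):
    """Per-layer overlap model: only the excess of AllReduce over backward
    is exposed; everything else hides behind computation."""
    no_overlap = n_layers * (t_bwd_layer_s + t_ar_layer_s)
    exposed = n_layers * max(0.0, t_ar_layer_s - t_bwd_layer_s)
    with_overlap = n_layers * t_bwd_layer_s + exposed
    return no_overlap, with_overlap, exposed

no_ov, with_ov, exposed = overlap_budget(32, 0.015, 0.026)
# no_ov ~ 1.31 s; exposed ~ 0.35 s (the ~360 ms of exposed AllReduce):
# each layer leaks 26 - 15 = 11 ms that the backward pass cannot hide
```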
\footnotesize
\textbf{32-layer transformer, 7B parameters, 64 GPUs}
@@ -630,7 +869,16 @@ per-layer backward. To eliminate: increase batch size or reduce AllReduce time.}
% =============================================================================
\begin{frame}{Fallacies}
\note{[2 min] Four common misconceptions with quantitative evidence.}
\note{
% -- LINK: What prior concept connects to this slide
Students have seen the full optimization stack. These fallacies test whether they internalized the key distinctions: latency vs bandwidth, Ring vs Tree, sync vs async, flat vs hierarchical.
% -- NARRATE: What to SAY while showing this slide
Read each fallacy and its rebuttal. For the first: ``Bandwidth is not the only metric --- for 4 KB MoE tokens, latency dominates and 400G networking gives zero benefit.'' For the second: ``Ring pays O(N) latency --- for small messages across 64 GPUs, Tree wins.'' For the third: ``The LogP overhead o is non-overlappable --- if GPU compute is less than o, the GPU still stalls.'' For the fourth: ``Hierarchical achieves 5--6x speedup by cutting inter-node traffic 8x.''
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Cover fallacies 1 and 4 only; skip 2 and 3.
}
\footnotesize
\textbf{Fallacy:} \textit{Bandwidth is the only metric that matters.}\\
@@ -651,7 +899,16 @@ Hierarchical achieves 5--6$\times$ speedup on 8-node clusters by cutting inter-n
\end{frame}
\begin{frame}{Pitfalls}
\note{[2 min] Three operational pitfalls.}
\note{
% -- LINK: What prior concept connects to this slide
Fallacies addressed conceptual errors. Pitfalls address operational mistakes that teams make when deploying collective communication in production.
% -- NARRATE: What to SAY while showing this slide
Read each pitfall. For the first: ``MoE and DLRM need AllToAll, which creates $O(N^2)$ connections and hits contention at smaller cluster sizes than AllReduce.'' For the second: ``Without error feedback, Top-K permanently discards small gradients, causing 1--3\% accuracy loss.'' For the third: ``nccl-tests reports theoretical peak bandwidth, but real training sees 50--60\% if ranks are topology-misaligned --- for example, a tensor-parallel group spanning two nodes instead of staying within NVLink.'' For the fourth: ``At 10K nodes, a 1-in-$10^{15}$ bit-flip rate means multiple corruptions per day. These appear as unexplained NaN gradients.''
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Cover pitfalls 1 and 2 only.
}
\footnotesize
\textbf{Pitfall:} \textit{Assuming AllReduce works for everything.}\\
@@ -675,9 +932,13 @@ At 10K nodes, ``rare'' bit flips ($1$ in $10^{15}$) happen multiple times per da
% --- MUDDIEST POINT ---
\begin{frame}{Muddiest Point}
\note{[2 min] Quick anonymous poll. Students write on a slip of paper or submit
digitally. Collect and scan for patterns. Address the top 2--3 confusions in the
next lecture's opening. This closes the feedback loop.}
\note{
% -- NARRATE: What to SAY while showing this slide
Say: ``Before we close, write down the one concept from today that you found most confusing. This is anonymous --- one sentence, submit before you leave. I will address the top two or three confusions at the start of next lecture.''
% -- FLEX: [CORE] Always include --- closes the feedback loop.
IF SHORT: Reduce to 30 seconds. Students can submit digitally after class.
}
\centering
\vspace{1.0cm}
@@ -692,8 +953,13 @@ next lecture's opening. This closes the feedback loop.}
\end{frame}
\begin{frame}{What Were the Key Ideas?}
\note{[2 min] Retrieval practice. Students write 90 seconds, no notes.
Do NOT show next slide yet. Walk around the room.}
\note{
% -- NARRATE: What to SAY while showing this slide
Say: ``Close your notes. No screens. Write down the four most important concepts from today's lecture. You have 90 seconds.'' Walk around the room to observe. Do NOT show the next slide yet --- the retrieval effort is the learning event.
% -- FLEX: [CORE] Always include --- retrieval practice is the highest-impact learning activity.
IF SHORT: Reduce to 60 seconds but do not skip entirely.
}
\centering
\vspace{1.5cm}
@@ -708,8 +974,16 @@ Do NOT show next slide yet. Walk around the room.}
\end{frame}
\begin{frame}{Key Takeaways}
\note{[2 min] Reveal. Walk through each bullet. Emphasize quantitative anchors:
n* = 100 KB, 11s AllReduce, 5.7x hierarchical speedup, 73\% overlap savings.}
\note{
% -- LINK: What prior concept connects to this slide
Students just attempted retrieval. This slide reveals the answers so they can compare and fill gaps.
% -- NARRATE: What to SAY while showing this slide
Walk through each bullet, pausing on the quantitative anchors: ``n-star equals 100 KB --- that separates latency-bound from bandwidth-bound. 11 seconds of AllReduce for a 70B model. 5.7x speedup from hierarchical AllReduce. 73\% overlap savings from layer pipelining. The full stack reduces overhead from 50--80\% to 5--15\%.''
% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Read only bullets 1, 3, and 7.
}
\scriptsize
\begin{itemize}\setlength\itemsep{0pt}
@@ -725,7 +999,12 @@ n* = 100 KB, 11s AllReduce, 5.7x hierarchical speedup, 73\% overlap savings.}
\end{frame}
\begin{frame}{References}
\note{[1 min] Point students to canonical papers.}
\note{
% -- NARRATE: What to SAY while showing this slide
Point students to the Patarasuk paper for Ring AllReduce theory and the Gibiansky blog post for intuitive explanation. Sergeev for Horovod. Tang for 1-bit Adam. Stich for error feedback theory. Rajbhandari for ZeRO.
% -- FLEX: [OPTIONAL] Can be skipped in lecture; students read on their own.
}
\small
\mlsysref{Patarasuk+09}{Patarasuk \& Yuan. ``Bandwidth Optimal All-Reduce Algorithms.'' 2009.}
@@ -738,9 +1017,15 @@ n* = 100 KB, 11s AllReduce, 5.7x hierarchical speedup, 73\% overlap savings.}
\end{frame}
\begin{frame}{Next Lecture: Fault Tolerance}
\note{[1 min] Forward hook. The fleet has its traffic patterns, but the roads
are crumbling. GPUs overheat, networks drop packets, nodes fail mid-training.
How do we maintain the illusion of a perfect supercomputer on imperfect hardware?}
\note{
% -- LINK: What prior concept connects to this slide
This chapter built the communication patterns for distributed training. The next chapter asks: what happens when the infrastructure breaks?
% -- NARRATE: What to SAY while showing this slide
Say: ``The fleet has its traffic patterns, but the roads are crumbling. GPUs overheat, networks drop packets, nodes fail mid-training. At 10,000 GPUs, failure is not exceptional --- it is the steady state. Next lecture: how do we maintain the illusion of a perfect supercomputer on imperfect hardware?''
% -- FLEX: [CORE] Always include --- forward hooks maintain narrative continuity.
}
\footnotesize
\begin{columns}[c]
@@ -779,7 +1064,10 @@ How do we maintain the illusion of a perfect supercomputer on imperfect hardware
\appendix
\begin{frame}{Backup: Extended Reference}
\note{Backup slide with additional reference material for this chapter.}
\note{
% -- NARRATE: Backup slide with additional reference material.
% -- FLEX: [OPTIONAL] Use only if students request deeper material.
}
\footnotesize
This slide provides extended reference material for students who want to go deeper.
@@ -794,7 +1082,10 @@ textbook's summary tables. Use them as a quick reference during problem sets.
\end{frame}
\begin{frame}{Backup: Further Reading}
\note{Backup slide. Point students to additional resources beyond the references slide.}
\note{
% -- NARRATE: Backup slide pointing to additional resources.
% -- FLEX: [OPTIONAL] Use only if students request further reading.
}
\footnotesize
\textbf{For deeper exploration:}

View File

@@ -68,9 +68,29 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Walk through objectives. Emphasize that this chapter is about
the management layer of the fleet. Ask: ``How many of you have deployed
more than one model to production?''}
\note{
% -- LINK: Connect to prior chapters
Students built serving infrastructure in Part III. This chapter asks:
what happens when you manage not one model, but a hundred?
% -- NARRATE: What to SAY
Read each objective aloud, pausing on ``platform ROI'' and ``TCO framework.''
These two anchor the quantitative reasoning for the entire chapter.
% -- ENGAGE: Specific question
Ask: ``How many of you have deployed more than one model to production?
At what count did ad hoc practices start breaking?''
Give 10 seconds for a show of hands.
% -- WARN: Specific misconception
Students assume operations scale linearly with model count.
Correct: dependencies grow as $O(N^2)$, alerts as $O(N \times M)$.
% -- FLEX: [CORE]
[CORE] Never skip --- objectives frame the entire lecture.
IF AHEAD: Ask students to rank which objective they most want to master.
IF SHORT: Read objectives without elaboration, move on.
}
\small
\begin{enumerate}
@@ -86,8 +106,16 @@ more than one model to production?''}
\end{frame}
\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: What to SAY
Point to each card: ``Blue = compute, green = data, orange = routing,
red = error. These are consistent across every diagram in this course.
When you see red in a pipeline diagram, something is bottlenecked.''
% -- FLEX: [OPTIONAL]
[OPTIONAL] Skip if students have seen this in a previous chapter deck.
IF SHORT: Say ``same color system as last lecture'' and advance.
}
\small
Throughout this course, colors carry meaning:
@@ -126,9 +154,30 @@ Throughout this course, colors carry meaning:
% =============================================================================
\begin{frame}{The N-Models Problem}
\note{[3 min] Core insight: managing 100 models is not 100$\times$ the work.
Dependencies grow quadratically. Ask: ``At your organization, how many
models share the same data sources?'' Common error: assuming linear scaling.}
\note{
% -- LINK: Connect to prior concept
Students just saw the learning objectives listing platform ROI and dependency
management. This slide makes the problem visceral: why platforms exist.
% -- NARRATE: What to SAY
Point to the diagram: ``At 10 models, a few shared data sources. At 50,
the dependency graph is a hairball. At 100, a single upstream change
cascades unpredictably.'' Trace the quadratic growth curve.
% -- ENGAGE: Specific question
Ask: ``At your organization, how many models share the same data sources?''
Expected answer: most students underestimate --- typical is 5--10 shared sources.
% -- WARN: Specific misconception
Students assume managing 100 models is 100x the work of managing one.
Correct: dependencies grow as $O(N^2)$, so 100 models is closer to 10,000x
the coordination complexity.
% -- FLEX: [CORE]
[CORE] This motivates the entire chapter.
IF AHEAD: Ask ``At what N does your dependency graph become unmanageable?''
IF SHORT: Show diagram, state $O(N^2)$, move on.
}
% --- Full-width diagram ---
\centering
@@ -140,9 +189,30 @@ models share the same data sources?'' Common error: assuming linear scaling.}
\end{frame}
\begin{frame}{Operational Complexity Growth}
\note{[2 min] Walk through the table. Emphasize that monitoring becomes
unmanageable and debugging requires distributed tracing. If short on time,
focus on the deployment coordination column.}
\note{
% -- LINK: Connect to prior concept
The N-models diagram showed the complexity curve. This table puts
concrete operational labels on each step of that curve.
% -- NARRATE: What to SAY
Walk column by column: ``At 1 model, monitoring is a single dashboard.
At 100, you have 100 dashboards nobody reads. Debugging shifts from
local to distributed tracing --- a qualitatively different skill.''
% -- ENGAGE: Specific question
Ask: ``Which column transitions from manageable to unmanageable first?''
Expected answer: monitoring (it is the first to break because alert
volume grows as N times M metrics).
% -- WARN: Specific misconception
Students focus on deployment coordination but miss that monitoring
breaks first. Alert fatigue precedes deployment chaos.
% -- FLEX: [OPTIONAL]
[OPTIONAL] This table reinforces the N-models diagram.
IF SHORT: Point to the monitoring row, state the key insight, advance.
IF AHEAD: Ask students to fill in a ``1000 models'' column mentally.
}
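The alert-fatigue claim (alert volume grows as N times M) can be made concrete with a quick board calculation; the metric count and per-metric alert rate below are invented for illustration, not taken from the chapter:

```latex
% A = daily alerts, N = models, M = metrics per model,
% r = daily alert rate per metric (hypothetical values).
\[
  A = N \cdot M \cdot r,
  \qquad
  \text{e.g. } N = 100,\; M = 20,\; r = 0.05
  \;\Rightarrow\; A = 100 \text{ alerts/day}.
\]
```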
\footnotesize
\renewcommand{\arraystretch}{1.15}
@@ -165,9 +235,30 @@ focus on the deployment coordination column.}
\end{frame}
\begin{frame}{Quantifying Platform ROI}
\note{[3 min] Walk through the ROI equation. The key insight: platforms
exhibit a scaling threshold. At 20 models, a \$2M platform breaks even.
Ask: ``At what model count does your organization need a platform team?''}
\note{
% -- LINK: Connect to prior concept
The complexity table showed operations becoming unmanageable. This slide
answers: what is the economic case for investing in a platform?
% -- NARRATE: What to SAY
Point to the equation: ``N is model count, T-saved is hours per model,
C-eng is engineer cost. The numerator grows linearly with N; the
denominator is fixed.'' Walk through the worked example: ``50 models,
40 hours each, \$150/hr = \$3.6M before, \$1.6M after. 56\% savings.''
% -- ENGAGE: Specific question
Ask: ``At what model count does your organization need a platform team?''
Give 10 seconds. Expected: most say 50--100; the surprise is 20--50.
% -- WARN: Specific misconception
Students think platform ROI is linear. Correct: it is superlinear because
shared infrastructure amortizes over all models simultaneously.
% -- FLEX: [CORE]
[CORE] The ROI equation is the quantitative anchor for the chapter.
IF AHEAD: Ask ``What if T-saved is only 10 hours instead of 40?''
IF SHORT: Show equation, state the 56\% number, skip the worked example detail.
}
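The narration names the pieces of the ROI equation (N, T-saved, C-eng) without writing it out; one plausible symbolic form, including the break-even rearrangement the ENGAGE question points at, is sketched below (the symbol names are assumptions, not necessarily the chapter's notation):

```latex
% ROI > 1 when hours saved across N models outweigh the platform cost;
% N* is the break-even model count the ENGAGE question asks about.
\[
  \mathrm{ROI}(N)
  = \frac{N \cdot T_{\text{saved}} \cdot C_{\text{eng}}}{C_{\text{platform}}},
  \qquad
  N^{*} = \frac{C_{\text{platform}}}{T_{\text{saved}} \cdot C_{\text{eng}}}.
\]
```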
\small
\begin{columns}[T]
@@ -196,7 +287,16 @@ Ask: ``At what model count does your organization need a platform team?''}
% --- ACTIVE LEARNING: Micro-Retrieval Cue ---
\begin{frame}{Quick Check}
\note{[1 min] Answer: 20--50 models.}
\note{
% -- NARRATE: What to SAY
Pause 15 seconds. Then cold-call. Answer: 20--50 models.
Say: ``Most students guess 50--100 or higher. The surprise is that the
threshold is much lower because shared infrastructure amortizes across all models.''
% -- FLEX: [OPTIONAL]
[OPTIONAL] Micro-retrieval cue reinforcing the ROI slide.
IF SHORT: Skip entirely --- the predict exercise covers this ground.
}
\centering
\vspace{1.0cm}
@@ -213,9 +313,30 @@ Ask: ``At what model count does your organization need a platform team?''}
\begin{frame}{MLOps Maturity Hierarchy}
\note{[2 min] Four levels from manual to enterprise. Most organizations
are at Level 1. The jump from Level 1 to Level 2 provides superlinear
returns. If short: just show the staircase and describe the transition.}
\note{
% -- LINK: Connect to prior concept
The ROI equation showed that platforms pay for themselves. This slide
asks: what does the maturity journey look like?
% -- NARRATE: What to SAY
Point to each level: ``L0 = scripts on laptops. L1 = per-model CI/CD,
the most common state. L2 = shared platform, where the superlinear
returns kick in. L3 = enterprise governance across the org.'' Emphasize
the L1-to-L2 transition: ``This is where most organizations stall.''
% -- ENGAGE: Specific question
Ask: ``Where is your organization on this staircase?''
Show of hands for each level. Most will cluster at L0--L1.
% -- WARN: Specific misconception
Students think the jump from L0 to L1 is the hard part. Correct:
L0-to-L1 is just adding CI/CD. The L1-to-L2 jump requires
organizational change --- shared infrastructure, common APIs, platform team.
% -- FLEX: [OPTIONAL]
[OPTIONAL] Reinforces the platform investment argument.
IF SHORT: Show staircase, name the four levels, emphasize L1-to-L2, advance.
}
% --- Full-width diagram ---
\centering