mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-10 15:49:25 -05:00
feat(slides): 5-component speaker notes + upgrade 7 chapters
New speaker note standard (LINK, NARRATE, ENGAGE, WARN, FLEX) based on 8 pedagogical frameworks (Shulman PCK, Ambrose, Rosenshine, Chi ICAP, Merrill, Bain, Wiggins UbD, Garner & Alley). Upgraded 7 chapters: Vol1 Ch00/Ch05/Ch09/Ch13, Vol2 Ch00/Ch06/Ch12. Updated stats row on portal landing page. 28 chapters remaining for next pass.
@@ -205,9 +205,9 @@ toc: false
<div class="stats-row">
<div class="stat"><span class="stat-num">35</span><span class="stat-lbl">Decks</span></div>
<div class="stat"><span class="stat-num">1,099</span><span class="stat-lbl">Slides</span></div>
<div class="stat"><span class="stat-num">266</span><span class="stat-lbl">SVG Figures</span></div>
<div class="stat"><span class="stat-num">2</span><span class="stat-lbl">Volumes</span></div>
<div class="stat"><span class="stat-num">~38 hrs</span><span class="stat-lbl">Teaching Time</span></div>
<div class="stat"><span class="stat-num">308</span><span class="stat-lbl">Active Learning</span></div>
<div class="stat"><span class="stat-num">1,099</span><span class="stat-lbl">Speaker Notes</span></div>
</div>

<!-- Actions -->
@@ -52,8 +52,16 @@
\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: Walk through each card. ``Blue means compute---anytime you see
% blue in a diagram, think GPU ops, matrix multiplies, forward pass. Green is
% data flow---memory, caches, healthy paths. Orange is routing---schedulers,
% load balancers. Red is cost, error, or bottleneck.'' Point to each card as
% you name it.
%
% -- FLEX: [CORE] Show on Day 1 and again briefly on Day 2.
% IF SHORT: Display the slide but do not narrate---students can read the cards.
}

\small
Throughout this course, colors carry meaning:
@@ -105,9 +113,17 @@ Throughout this course, colors carry meaning:
% =============================================================================

\begin{frame}{Welcome}
\note{[2 min] Welcome students. Set the tone: this is not an ML algorithms class.
This is about the \emph{systems} that make ML work. Ask: ``How many of you have
trained a model? How many have deployed one?'' The gap is the course.}
\note{
% -- NARRATE: ``Welcome. Raise your hand if you have trained a model.'' (most
% hands go up) ``Now keep your hand up if you have deployed one to production.''
% (most hands drop) ``That gap---training vs.\ shipping---is this entire course.
% This is not an ML algorithms class. This is about the systems that make ML
% work: memory, bandwidth, power, latency, and the physics behind every design
% decision.''
%
% -- FLEX: [CORE] Sets the emotional contract for the semester.
% IF SHORT: Cut the hand-raise; just state the training-deployment gap directly.
}

\centering
\vspace{0.8cm}
@@ -129,10 +145,27 @@ trained a model? How many have deployed one?'' The gap is the course.}
% =============================================================================

\begin{frame}{The Gap Between ML Research and Production}
\note{[3 min] Most students have trained models in notebooks. Very few have
shipped one. The failure rates are staggering: 60--85\% of ML projects never
reach production. The bottleneck is not algorithms --- it is systems.
Ask: ``Why do you think most ML projects fail?''}
\note{
% -- LINK: Students just heard ``this is a systems course, not an algorithms
% course.'' This slide gives the quantitative evidence for WHY.
%
% -- NARRATE: ``60 to 85 percent of ML projects never reach production. Let
% that sink in. Look at this table---research uses a static dataset, a single
% GPU, and optimizes one metric. Production uses a shifting data stream, a
% fleet of heterogeneous hardware, and must hit accuracy AND latency AND cost
% targets simultaneously. That gap is not fixed by a better optimizer.''
%
% -- ENGAGE: ``Why do you think most ML projects fail? Write one reason.''
% Give 30 seconds. Cold-call 2 students.
% Expected: ``data quality,'' ``hardware limits.'' Surprise answer: systems.
%
% -- WARN: Students assume failures are algorithmic (``bad model''). Correct
% framing: the bottleneck is infrastructure---data pipelines, serving, monitoring.
% IF STUCK: Point to the ``90\% of time goes to data + infrastructure'' bullet.
%
% -- FLEX: [CORE] This slide motivates the entire semester.
% IF AHEAD: ``What percentage of engineering time goes to the model itself?''
}

\small
\begin{columns}[T]
@@ -173,9 +206,26 @@ Ask: ``Why do you think most ML projects fail?''}
\end{frame}

\begin{frame}{The 5\% Problem}
\note{[3 min] Sculley et al.\ 2015. The ML model code is the tiny box in the
center. Everything around it --- data pipelines, serving, monitoring, config ---
is what this course teaches. Ask: ``What is the biggest box?''}
\note{
% -- LINK: The previous slide said ``the bottleneck is systems.'' This diagram
% shows exactly what those systems look like.
%
% -- NARRATE: Point to the tiny center box: ``That is the ML model code---about
% 5 percent. Everything around it---data pipelines, feature stores, serving
% infrastructure, monitoring, configuration---is what this course teaches.
% Sculley et al.\ called this `hidden technical debt.' ''
%
% -- ENGAGE: ``Look at the diagram. What is the biggest box?'' Give 15 seconds.
% Expected answer: data collection or configuration. Both are valid---the point
% is that neither is the model.
%
% -- WARN: Students equate ``ML'' with ``the model.'' Correct framing: the
% model is 5\%; the other 95\% is systems engineering that determines whether
% the model ever reaches a user.
%
% -- FLEX: [CORE] Foundational mental model for the course.
% IF SHORT: Skip the question; just narrate the diagram for 90 seconds.
}

% --- Layout: FULL-WIDTH IMAGE ---
\centering
@@ -198,11 +248,30 @@ you choose it because of how it parallelizes on real silicon.%
}

\begin{frame}{AI Is Infrastructure}
\note{[2 min] This is the philosophical foundation of the course.
Every design decision in ML systems traces back to a physical constraint:
memory bandwidth, power budget, speed of light. If you understand the
constraints, the architecture choices become obvious.
Ask: ``Why can't we run GPT-4 on a phone?''}
\note{
% -- LINK: The Core Thesis focus slide just said ``constraints drive
% architecture.'' This slide names the four specific physical constraints.
%
% -- NARRATE: Walk through bullets top to bottom. ``Memory bandwidth limits
% how fast data reaches the processor---this is why GPT-4 inference is slow
% even on powerful GPUs. Power budget limits where a model can run---a phone
% cannot sustain 700 watts. Speed of light limits latency---a self-driving car
% cannot wait 50 ms for a cloud round trip. Thermodynamics limits compute per
% rack---you cannot cool infinite GPUs in a data center.''
% ANALOGY: ``Physics is to ML systems what gravity is to bridges. You can
% build creative bridges, but none of them ignore gravity.''
%
% -- ENGAGE: ``Why can't we run GPT-4 on a smartphone?'' Cold-call one student.
% Expected: ``not enough memory.'' Deepen: ``How much memory does it need?
% 3.6 TB at FP32. A phone has 8 GB. That is a 450x gap---physics, not software.''
%
% -- WARN: Students think hardware limitations are temporary (``next year's chip
% will fix it''). Correct framing: these are physical laws, not engineering gaps.
% Memory bandwidth grows ~20\%/yr; compute demand grows 7x faster than Moore's Law.
%
% -- FLEX: [CORE] Philosophical anchor for the course.
% IF AHEAD: ``Which constraint is hardest to overcome with engineering?''
}
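If a student asks where the 450x comes from, the arithmetic is one line (a sketch using only the figures quoted in the ENGAGE section above):

\[
\frac{3.6\ \text{TB}}{8\ \text{GB}} \;=\; \frac{3{,}600\ \text{GB}}{8\ \text{GB}} \;=\; 450\times
\]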

\small
\begin{columns}[T]
@@ -240,9 +309,26 @@ Ask: ``Why can't we run GPT-4 on a phone?''}
% =============================================================================

\begin{frame}{Three Analytical Frameworks}
\note{[3 min] These are the three recurring analytical tools for the course.
Show the overview diagram. Students will learn each in depth in Chapter 1.
Today, just plant the seed. Ask: ``What question does each framework answer?''}
\note{
% -- LINK: Students now know that physical constraints drive architecture.
% These three frameworks are the tools for reasoning about those constraints.
%
% -- NARRATE: Point to each framework in the diagram. ``D-A-M tells you WHERE
% the bottleneck is. The Iron Law tells you HOW LONG an operation takes. The
% Degradation Equation tells you WHEN a model will fail. These three tools
% recur in every single chapter.''
%
% -- ENGAGE: ``What question does each framework answer? Write one word per
% framework.'' 30 seconds. Cold-call one student.
% Expected: Where/How long/When---or close variants.
%
% -- WARN: Students will try to memorize the equations without understanding
% what question each answers. Correct framing: frameworks are diagnostic
% tools, not formulas to plug numbers into.
%
% -- FLEX: [CORE] Preview only---do not go deep. Chapter 1 covers each.
% IF SHORT: Just name the three frameworks and move on (60 seconds).
}

% --- Layout: FULL-WIDTH IMAGE ---
\centering
@@ -254,10 +340,27 @@ Today, just plant the seed. Ask: ``What question does each framework answer?''}
\end{frame}

\begin{frame}{The \DAM{} Taxonomy}
\note{[2 min] Brief intro to D-A-M. Every ML system has three interdependent
axes: Data, Algorithm, Machine. Optimizing one shifts pressure to another.
The diagnostic question: which axis is the bottleneck?
Do not go deep --- Chapter 1 covers this in full.}
\note{
% -- LINK: The previous diagram named D-A-M as one of three frameworks.
% Now we unpack it briefly.
%
% -- NARRATE: ``Every ML system sits at the intersection of three axes.
% Data: how much, how fast can we move it. Algorithm: how many operations,
% what parallelism pattern. Machine: what silicon, what memory hierarchy.
% The diagnostic question is always the same: which axis is the bottleneck?
% Optimizing one axis shifts pressure to the others---they are coupled.''
% Point to the table: ``Notice the units. These become the Iron Law variables.''
%
% -- ENGAGE: ``If you double the training dataset, which other axis feels
% the pressure?'' Expected: Machine (need more bandwidth or compute time).
%
% -- WARN: Students treat the three axes as independent knobs. Correct framing:
% they are interdependent---doubling data volume requires proportionally more
% bandwidth or longer training time.
%
% -- FLEX: [CORE] But keep it brief---Chapter 1 goes deep.
% IF SHORT: Show the slide for 60 seconds, skip the engage question.
}

\small
\begin{columns}[T]
@@ -304,9 +407,26 @@ Do not go deep --- Chapter 1 covers this in full.}
\end{frame}

\begin{frame}{The Iron Law of ML Systems}
\note{[2 min] Quick preview. Every term resolves to seconds. The slowest
term dominates end-to-end latency. Chapter 1 covers worked examples.
Ask: ``For a phone camera app, which term dominates?''}
\note{
% -- LINK: D-A-M tells you where the bottleneck is. The Iron Law quantifies
% how long each axis takes in seconds.
%
% -- NARRATE: Point to each term. ``Data term: bytes divided by bandwidth
% gives seconds. Compute term: FLOPs divided by peak rate times efficiency
% gives seconds. Overhead: orchestration tax, also seconds. You add three
% times and the slowest one dominates. This is dimensional analysis---if
% your units do not resolve to seconds, the equation is wrong.''
%
% -- ENGAGE: ``For a phone camera app classifying a photo, which term
% dominates?'' Give 20 seconds. Expected: Data term (reading the image from
% memory) or Overhead (framework launch cost). Accept either with reasoning.
%
% -- WARN: Students try to add FLOPs to bytes. Correct framing: every term
% must resolve to seconds before you can compare or add them.
%
% -- FLEX: [CORE] Preview only---worked examples come in Chapter 1.
% IF SHORT: State the equation, emphasize ``slowest term dominates,'' move on.
}
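A quick plug-in to have ready at the board for the equation below; the numbers are illustrative, not from the deck (100 MB of input over 10 GB/s, 5 GFLOPs on a 10 TFLOP/s accelerator at 30\% achieved efficiency, 5 ms of framework overhead):

\[
T_{\text{total}} \approx
\underbrace{\frac{0.1\ \text{GB}}{10\ \text{GB/s}}}_{10\ \text{ms}}
+ \underbrace{\frac{5\times10^{9}\ \text{FLOPs}}{10^{13}\ \text{FLOP/s}\times 0.3}}_{\approx 1.7\ \text{ms}}
+ \underbrace{5\ \text{ms}}_{\text{overhead}}
\approx 16.7\ \text{ms}
\]

Every term resolves to seconds, and the data term dominates here, which is the point of the slide.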

\small
$$T_{\text{total}} = \underbrace{\dfrac{D_{\text{vol}}}{BW}}_{\text{Data}} +
@@ -339,9 +459,28 @@ $$T_{\text{total}} = \underbrace{\dfrac{D_{\text{vol}}}{BW}}_{\text{Data}} +
\end{frame}

\begin{frame}{The Degradation Equation}
\note{[2 min] ML systems fail silently. Accuracy degrades as the world changes
around the model. The degradation equation quantifies this.
Ask: ``How would you know your model is getting worse if no code changed?''}
\note{
% -- LINK: The Iron Law measures performance at a point in time. The
% Degradation Equation measures how performance decays over time.
%
% -- NARRATE: ``Accuracy at time t equals initial accuracy minus alpha times
% the distribution distance. Alpha is how sensitive the model is to drift.
% Delta measures how far the live data has drifted from training data.
% Look at the example: a recommendation system starts at 85\% and drops to
% 79.2\% in 6 months. No code changed. No bugs. The world changed.
% The engineering response: set a retraining trigger at a threshold.''
%
% -- ENGAGE: ``How would you know your model is getting worse if no code
% changed and no one filed a bug?'' Give 20 seconds. Cold-call one student.
% Expected: monitoring accuracy metrics over time. Deepen: ``What if you
% do not have ground truth labels in real time?''
%
% -- WARN: Students assume ``no bugs = working correctly.'' Correct framing:
% ML systems degrade through data drift even when code is untouched.
%
% -- FLEX: [CORE] Third framework preview.
% IF SHORT: Show equation, state the rec-system example, move on.
}

\small
$$A(t) = A_0 - \alpha \cdot \Delta(P_{\text{train}},\; P_{\text{live}}(t))$$
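Plugging the recommendation-system example from the note into the equation; the split of the 5.8-point drop between $\alpha$ and $\Delta$ is illustrative:

\[
A(6\ \text{months}) = 0.85 - \underbrace{0.29}_{\alpha}\cdot\underbrace{0.20}_{\Delta} = 0.85 - 0.058 = 0.792
\]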
@@ -68,10 +68,26 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Read objectives aloud. Emphasize: this chapter is about the
\emph{computational workload} that neural networks create, not about how to
use a framework. Ask: ``How many of you have trained a model but never
thought about why training uses 4$\times$ more memory than inference?''}
\note{
% -- LINK: Prior chapters established the Iron Law and DAM taxonomy.
% This chapter reveals what neural networks actually compute inside those terms.

% -- NARRATE: Read objectives aloud, pausing on each verb.
``This chapter is about the computational workload that neural networks
create, not about how to use a framework. Every objective maps to a
measurable skill you can demonstrate.''

% -- ENGAGE: ``How many of you have trained a model but never thought
% about why training uses 4x more memory than inference?''
% Show of hands. Use the count to calibrate depth later.

% -- WARN: Students often confuse ``understanding neural networks'' with
% ``using PyTorch.'' Correct framing: this chapter is about the math and
% memory, not the API.

% -- FLEX: [CORE] Never skip objectives---they set the contract for the lecture.
% IF SHORT: Read only the bolded terms, skip the full sentence for each.
}
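For the ENGAGE question, a back-of-envelope answer (a sketch assuming plain Adam, everything in FP32, and ignoring activations): inference holds only the weights $W$, while training holds weights, gradients, and two optimizer moments:

\[
\frac{M_{\text{train}}}{M_{\text{infer}}} \approx \frac{W + W_{\text{grad}} + 2\,W_{\text{Adam}}}{W} = 4\times
\]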

\footnotesize
\begin{enumerate}\setlength\itemsep{0pt}
@@ -88,8 +104,16 @@ thought about why training uses 4$\times$ more memory than inference?''}
\end{frame}

\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: Point to each card in turn.
``Blue means compute---any time a GPU is doing arithmetic. Green means data
or memory---bytes moving through the system. Orange means scheduling or
routing decisions. Red means cost, error, or bottleneck. These colors are
identical in every diagram across the entire course.''

% -- FLEX: [OPTIONAL] Students internalize colors through exposure, not memorization.
% IF SHORT: Show for 15 seconds and move on; the colors reinforce themselves.
}

\small
Throughout this course, colors carry meaning:
@@ -127,11 +151,30 @@ Throughout this course, colors carry meaning:
% =============================================================================

\begin{frame}{The Silicon Contract}
\note{[2 min] Bridge from prior chapters. The Iron Law (Ch1) established that
every model makes a computational bargain with hardware. This chapter reveals
what those computations actually are. The operators inside a neural network
determine memory consumption, execution time, and energy expenditure.
Ask: ``If the model code just says `multiply these matrices,' where is the bug?''}
\note{
% -- LINK: The Iron Law (Ch1) decomposed performance into D, O, and L terms.
% Students know the formula but not what fills each term. This slide connects
% the abstract equation to the concrete operators inside a neural network.

% -- NARRATE: Point to the equation on the slide.
``Every term in the Iron Law has a physical origin inside the neural network.
O comes from matrix multiplications. D comes from weight and activation
traffic. L comes from pipeline overhead. The operators you choose---and how
you arrange them---determine which term dominates.''
ANALOGY: ``The Iron Law is the utility bill; this chapter opens the meter.''

% -- ENGAGE: ``If the model code just says `multiply these matrices,' where
% is the bug?'' Cold-call one student. Expected answer: the bug is not a
% syntax error---it is a numerical instability (gradient explosion, overflow).

% -- WARN: Students expect bugs to look like Python exceptions. In neural
% networks, bugs are silent: NaN gradients, saturating activations, memory
% exhaustion. The code runs---the math fails.

% -- FLEX: [CORE] This slide sets the chapter thesis---never skip.
% IF AHEAD: ``Can you name a specific numerical instability you have seen?''
% IF SHORT: Skip the analogy, keep the equation walkthrough.
}

\small
\begin{columns}[T]
@@ -167,10 +210,28 @@ Ask: ``If the model code just says `multiply these matrices,' where is the bug?'
\end{frame}

\begin{frame}{Three Paradigms, One Digit}
\note{[3 min] This is the chapter's central comparison. Walk through the same
$28\times28$ digit across three paradigms. The 1,092$\times$ compute explosion
is the visceral number. Ask: ``Where does each paradigm sit on the Iron Law?''
Common error: students think more compute is always bad.}
\note{
% -- LINK: The Silicon Contract slide introduced the Iron Law terms.
% Now we see what happens when you move from rule-based to neural: the
% same 28x28 digit triggers 1,092x more operations.

% -- NARRATE: Point left to right across the three panels.
``Same digit, same 784 pixels. Rule-based: 100 comparisons. Classical ML:
8,000 feature extractions. Neural net: 109,184 multiply-accumulate ops.
That is a 1,092x compute explosion for the same input.''

% -- ENGAGE: ``Where does each paradigm sit on the Iron Law? Which term
% dominates for each?'' Give 30 seconds. Expected: rule-based is L-dominated,
% classical ML is balanced, neural net is O-dominated.

% -- WARN: Students assume more compute is always bad. Correct framing:
% more compute buys representation power---the question is whether the
% systems cost is justified by the accuracy gain.

% -- FLEX: [CORE] The 1,092x number is the chapter's anchor.
% IF AHEAD: ``At what point does the accuracy gain stop justifying the cost?''
% IF SHORT: Just emphasize the 1,092x ratio and move on.
}

% --- Full-width image ---
\centering
@@ -182,9 +243,27 @@ Common error: students think more compute is always bad.}
\end{frame}

\begin{frame}{The Compute Explosion in Numbers}
\note{[2 min] Quantitative backing for the diagram. Walk through the table.
Key insight: memory also jumps---from fitting in L1 cache to exceeding it.
If short: just emphasize the 1,092$\times$ ratio and the cache threshold.}
\note{
% -- LINK: The three-paradigms diagram showed the qualitative shift.
% This table adds the quantitative evidence students need to reason precisely.

% -- NARRATE: Walk down each row of the table.
``Rule-based: 100 ops, 784 bytes---fits in a register file. Classical ML:
8,000 ops, 2 KB---fits in L1 cache. Neural net: 109,184 MACs, 427 KB---
blows past L1 (typically 64 KB). The moment you cross the cache boundary,
every inference forces memory traffic.''

% -- ENGAGE: ``Which jump matters more for systems design: the 1,092x
% compute increase or the 546x memory increase?'' Pair discussion, 30 sec.
% Expected: the memory jump, because it changes the bottleneck regime.

% -- WARN: Students fixate on FLOP counts. The cache threshold crossing
% (784 B to 427 KB) is the more consequential systems event---it changes
% whether the workload is compute-bound or memory-bound.

% -- FLEX: [OPTIONAL] The table reinforces the diagram.
% IF SHORT: Point to the 1,092x and 546x numbers, skip row-by-row walkthrough.
}
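The two ratios quoted above follow directly from the table's numbers; worth writing out if students ask where they come from:

\[
\frac{109{,}184\ \text{MACs}}{100\ \text{ops}} \approx 1{,}092\times,
\qquad
\frac{427\ \text{KB}}{784\ \text{B}} \approx 546\times
\]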

\small
\renewcommand{\arraystretch}{1.2}
@@ -209,9 +288,25 @@ If short: just emphasize the 1,092$\times$ ratio and the cache threshold.}

% --- ACTIVE LEARNING 1: Predict ---
\begin{frame}{Predict: What Is a Neuron Computing?}
\note{[2 min] Prediction exercise before revealing the neuron. Give students
60 seconds. Do NOT reveal yet. This primes the MAC concept.
Ask 2--3 students: ``What mathematical operation does a neuron perform?''}
\note{
% -- LINK: Students just saw 109,184 MACs but do not yet know what a MAC is.
% This prediction primes them to discover the neuron equation themselves.

% -- NARRATE: Read the prompt aloud, then go silent for 60 seconds.
``784 inputs, 128 neurons, 100,352 multiply-accumulate operations.
Write one equation that explains what each neuron computes.''

% -- ENGAGE: Think-Write-Share. 60 seconds writing, then turn to a neighbor.
% Cold-call 2--3 students. Expected answer: weighted sum plus bias, then
% activation. Accept partial answers---the full equation comes next slide.

% -- WARN: Some students will write softmax or loss---those are network-level
% ops, not neuron-level. Redirect: ``What does a single neuron do to its inputs?''

% -- FLEX: [CORE] Prediction before reveal is the highest-leverage active
% learning moment. Never skip.
% IF SHORT: Reduce to 30 seconds writing, skip neighbor comparison.
}

\centering
\vspace{0.8cm}
@@ -235,11 +330,29 @@ needs 100,352 multiply-accumulate operations.\\[0.2cm]
% =============================================================================

\begin{frame}{Anatomy of a Neuron}
\note{[3 min] Reveal after prediction. The neuron computes a weighted sum
plus bias, then applies a nonlinear activation. The MAC is the atomic
operation. N inputs $\to$ N MACs. Emphasize: this is NOT a biological
neuron---it is a computational primitive.
Ask: ``How many memory accesses does one neuron need?''}
\note{
% -- LINK: Students just predicted the neuron equation. Now reveal and
% validate their answers against the actual formula.

% -- NARRATE: Point to the diagram left-to-right.
``Each input x_i is multiplied by a weight w_i, all products are summed,
a bias b is added, then a nonlinear activation f is applied. That is one
neuron: N multiply-accumulate operations. A layer of M neurons does M*N
MACs---one matrix multiplication.''
ANALOGY: ``A neuron is a dot product with a switch on the end.''

% -- ENGAGE: ``How many memory accesses does one neuron with 784 inputs
% need?'' Expected: 784 weights + 784 inputs + 1 bias = 1,569 reads minimum,
% plus 1 write for the output. Memory traffic dominates for small neurons.

% -- WARN: Students confuse biological neurons with computational neurons.
% Correct framing: this is a multiply-accumulate primitive, not a model of
% biology. The name is historical; the operation is linear algebra.

% -- FLEX: [CORE] The neuron equation is foundational for every later slide.
% IF AHEAD: ``What happens if we remove the activation function f?''
% (Answer: the entire network collapses to a single linear transformation.)
}
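The counts behind the ENGAGE answer, a sketch assuming one read per operand and no reuse:

\[
\text{MACs per neuron} = N = 784,\qquad
\text{reads} \ge \underbrace{784}_{w_i} + \underbrace{784}_{x_i} + \underbrace{1}_{b} = 1{,}569,\qquad
\text{writes} = 1
\]
\[
\text{MACs per layer} = M \times N = 128 \times 784 = 100{,}352
\]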

% --- Full-width image ---
\centering
@@ -68,9 +68,33 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Read through objectives. Emphasize that data selection is the
highest-leverage optimization in the D-A-M stack. Ask: ``How many of you have
ever questioned whether all your training data is actually useful?''}
\note{
% -- LINK: Learning objectives frame opens the lecture
This is the roadmap slide. Students arrive from Ch.\ 8 (Training) knowing how to
train models; now they learn that \emph{what} you train on matters more than
\emph{how long} you train.

% -- NARRATE: Walk through objectives with emphasis
Read each objective aloud. Pause on objective 1: ``Data selection is the
highest-leverage optimization in the entire D-A-M stack --- it reduces the
numerator \emph{before} anything else touches it.'' Emphasize that every
subsequent objective builds toward the Selection Inequality (objective 5).

% -- ENGAGE: Opening question to surface assumptions
Ask: ``How many of you have ever questioned whether all your training data
is actually useful?'' Follow up: ``What fraction would you guess is redundant?''
[Expected: most guess 10--20\%; the real answer is 50--90\%.]

% -- WARN: Students underestimate data waste
Students arrive believing ``more data = better model'' because scaling-law
papers dominate the discourse. This lecture systematically dismantles that
assumption with quantitative evidence.

% -- FLEX: [CORE] --- never skip
[CORE] Objectives frame sets the contract for the entire lecture.
IF AHEAD: Ask students to rank which objective they find most surprising.
IF SHORT: Read objectives quickly, spend time on the opening question.
}

\small
\begin{enumerate}
@@ -86,8 +110,21 @@ ever questioned whether all your training data is actually useful?''}
\end{frame}

\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- LINK: Follows learning objectives; sets visual conventions before content
Students just saw what they will learn; this slide equips them to read every
diagram that follows.

% -- NARRATE: Walk through each color with a concrete example
Point to each card: ``Blue means compute --- anytime you see blue, think
GPU cycles. Green means data flow or memory. Orange is routing or scheduling.
Red flags cost, error, or a bottleneck. These colors are identical across
every slide and every SVG in this course.''

% -- FLEX: [CORE] --- first time seeing the color system
[CORE] Essential for first lecture where students encounter the color system.
IF SHORT: Spend 30 seconds; students will internalize through repeated exposure.
}

\small
Throughout this course, colors carry meaning:
@@ -125,10 +162,36 @@ Throughout this course, colors carry meaning:
% =============================================================================

\begin{frame}{The Data Wall}
\note{[3 min] Open with the key tension. Compute grows 10x/3yr while quality
data grows 2x/5yr. The internet has already been scraped. This asymmetry
inverts the optimization priority. Ask: ``If you had unlimited GPUs but
limited data, what would you optimize?''}
\note{
% -- LINK: First content slide after objectives
Students just heard that data selection is the highest-leverage optimization.
This slide provides the \emph{why}: a physical asymmetry between compute
growth and data growth.

% -- NARRATE: Build the tension with the table
Point to the table row by row: ``Compute: 10x every 3 years --- Moore's Law
on steroids. Training data: 2x every 5 years --- we have already scraped
the internet. This asymmetry is the Data Wall.'' Tap the red callout card:
``The field has flipped from data-poor/compute-poor to compute-rich/data-poor.''

% -- ENGAGE: Falsifiable question
Ask: ``If you had unlimited GPUs but limited high-quality data, what would
you optimize first?'' Cold-call one student.
[Expected: most say ``get more data'' --- correct answer is ``get more
\emph{value} from existing data.'']

% -- WARN: Students conflate data quantity with data quality
Common error: students assume more data always helps because scaling-law
papers show log-linear improvement. Correct framing: scaling laws assume
\emph{unique, high-quality} tokens --- duplicates and noise yield diminishing
returns far earlier.

% -- FLEX: [CORE] --- motivates the entire chapter
[CORE] This is the chapter thesis slide.
IF AHEAD: ``What happens when synthetic data grows unbounded but
quality-limited?''
IF SHORT: Skip the question, let the table speak for itself.
}
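If useful at the board, the asymmetry compounds quickly; extrapolating the two growth rates in the table over 15 years (an illustrative extrapolation, not a figure from the deck):

\[
\text{compute: } 10^{15/3} = 10^{5} = 100{,}000\times,
\qquad
\text{quality data: } 2^{15/5} = 2^{3} = 8\times
\]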

\footnotesize
\begin{columns}[T]
@@ -170,9 +233,34 @@ limited data, what would you optimize?''}
\end{frame}

\begin{frame}{What Is Data Selection?}
\note{[2 min] Formal definition. Emphasize the distinction from data
engineering: quality (is it correct?) vs.\ value (is it worth the compute?).
Common error: students think data selection = data cleaning.}
\note{
% -- LINK: The Data Wall motivates a formal response
Students just saw compute outpacing data supply. This slide names the
discipline that responds: data selection, distinct from data engineering
they learned in Ch.\ 4.

% -- NARRATE: Read the definition, then contrast with the table
Read the crimson card aloud slowly. Then point to the comparison table:
``Ch.\ 4 asked `is the data correct?' Ch.\ 9 asks `is correct data
worth the compute?' A perfectly clean dataset can still be 90\% redundant.''
Pause on the insight card: ``10x low-quality data < 1.1x carefully selected
high-quality data.''

% -- ENGAGE: Falsifiable distinction
Ask: ``Give me one example where data engineering fixes the problem and one
where only data selection helps.'' [Expected: dedup of corrupted images =
engineering; removing easy samples near cluster centers = selection.]

% -- WARN: Students conflate selection with cleaning
Common error: students hear ``data selection'' and think ``data cleaning.''
Correct framing: cleaning fixes errors; selection removes \emph{correct
but uninformative} samples. Both are necessary; neither subsumes the other.

% -- FLEX: [CORE] --- foundational definition
[CORE] The ICR definition here is referenced throughout the rest of the deck.
IF AHEAD: ``Can a sample be high-quality but low-ICR? Give an example.''
IF SHORT: Skip the table, keep the definition card and the insight.
}

\small
\begin{mlsyscard}{crimson}
@@ -199,10 +287,38 @@ Common error: students think data selection = data cleaning.}
\end{frame}

\begin{frame}{Data Selection and the Iron Law}
\note{[3 min] Connect to the Iron Law from Ch.\ 1. Data selection is the only
technique that reduces the number of passes through the entire equation.
Model compression reduces O per pass; hardware increases R. Data selection
reduces the pass count itself. 2x * 2x * 2x = 8x, not 6x.}
\note{
% -- LINK: From definition to mechanism via the Iron Law
Students just defined data selection and ICR. This slide connects data
selection to the Iron Law from Ch.\ 1, showing \emph{where} in the
equation it acts.

% -- NARRATE: Walk through the D-A-M diagram
Point to the diagram: ``Data selection reduces the total number of passes
through the \emph{entire} equation. Model compression (Ch.\ 10) reduces
O per pass. Hardware (Ch.\ 11) increases R. But data selection reduces
the pass count itself --- it is the only technique that shrinks the
workload before the other two even see it.''
ANALOGY: ``Think of a factory: compression makes each widget faster to
build, hardware buys faster machines, but data selection throws away
widgets nobody ordered.''

% -- ENGAGE: Multiplicative vs.\ additive
Before showing the concept card, ask: ``If each technique gives 2x, is
the combined gain 6x or 8x?'' Give 10 seconds.
[Expected: many say 6x (additive). Correct: 8x (multiplicative).]

% -- WARN: Additive thinking is the default
Students instinctively add speedups (2+2+2=6) instead of multiplying
(2*2*2=8). Correct framing: the three optimizations operate on
\emph{different terms} of the same equation, so they compound.

% -- FLEX: [CORE] --- the D-A-M multiplicative argument
[CORE] This multiplicative insight is revisited in Key Takeaways.
IF AHEAD: ``What happens if data selection gives 10x but compression
only 1.2x? Where should the team invest next?''
IF SHORT: Show diagram, state the 8x result, move on.
}
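The 8x-vs-6x point in one line of algebra (generic per-technique speedups $s_{\text{data}}$, $s_{\text{compress}}$, $s_{\text{hw}}$ acting on different factors of the same runtime):

\[
T' = \frac{T}{s_{\text{data}}\; s_{\text{compress}}\; s_{\text{hw}}}
\quad\Rightarrow\quad
2 \times 2 \times 2 = 8\times
\quad (\text{not } 2 + 2 + 2 = 6\times)
\]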

% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -216,9 +332,34 @@ reduces the pass count itself. 2x * 2x * 2x = 8x, not 6x.}

% --- ACTIVE LEARNING 1: Predict ---
\begin{frame}{Predict: Where Does the Waste Live?}
\note{[2 min] Prediction exercise. Give students 60 seconds. The answer will
be revealed with the ICR curve. Most will say ``noisy samples'' --- the
real answer includes redundant easy samples far from the decision boundary.}
\note{
% -- LINK: From the Iron Law connection to hands-on reasoning
Students just saw that data selection reduces total passes. Now they must
decide \emph{which} samples to cut --- before seeing the ICR framework.

% -- NARRATE: Run the Think-Write-Share protocol
Say: ``You have 1 million samples and can keep only 10\%. Write down your
strategy --- which samples do you throw away and why?'' Give 60 seconds
of silent writing, then 30 seconds of neighbor discussion. Do NOT reveal
the ICR curve yet.

% -- ENGAGE: The prediction itself is the engagement
This is the active learning moment. Walk the room during writing time.
Listen for common strategies: ``remove noisy samples,'' ``random subset,''
``remove outliers.'' The answer (revealed next slide): remove redundant
easy samples deep within class clusters, not just noisy ones.

% -- WARN: Students fixate on noise, ignore redundancy
Most students say ``throw away noisy samples.'' The deeper insight is
that \emph{clean, easy} samples far from the decision boundary are the
biggest source of wasted compute --- they contribute near-zero gradient.

% -- FLEX: [CORE] --- first active learning moment
[CORE] This prediction primes the ICR curve reveal on the next slide.
IF AHEAD: Ask a follow-up: ``Would your strategy change if you could
keep 50\% instead of 10\%?''
IF SHORT: Reduce writing time to 30 seconds, skip neighbor discussion.
}

\centering
\vspace{0.8cm}
@@ -67,9 +67,30 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Read objectives aloud. Emphasize the inversion theme: everything
students learned about training optimization is about to be flipped.
Ask: ``How many of you have deployed a model to production?''}
\note{
% -- LINK: What prior concept connects to this slide
Students spent 12 chapters optimizing throughput. This slide frames the
inversion: every training priority is about to flip.

% -- NARRATE: What to SAY while showing this slide
Read each objective aloud, pausing on ``latency budget'' and ``queuing
theory'' --- these are the new quantitative anchors replacing samples/hour.
ANALOGY: ``Training is a factory running 24/7. Serving is an ER --- every
patient has a deadline.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``How many of you have deployed a model to production? What was your
biggest surprise?'' Cold-call one responder.

% -- WARN: What students will get wrong on THIS topic
Common error: students assume serving is just calling model.predict().
Correct framing: serving is a six-stage pipeline where the model is one
stage consuming less than 50\% of the budget.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask students to predict which objective will be hardest.
IF SHORT: Read objectives without discussion; move to content.
}

\footnotesize
\begin{enumerate}\setlength\itemsep{0pt}
@@ -85,8 +106,16 @@ Ask: ``How many of you have deployed a model to production?''}
\end{frame}

\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: What to SAY while showing this slide
Point to each card: ``Blue is compute --- GPU forward passes. Green is data
flow and memory. Orange is routing and scheduling. Red flags bottlenecks
and cost.'' In serving diagrams, you will see orange load balancers feeding
blue inference runners, with red marking decode bottlenecks.

% -- FLEX: [OPTIONAL] Skip if students already know the palette from earlier chapters.
IF SHORT: Say ``same colors as always'' and advance.
}

\small
Throughout this course, colors carry meaning:
@@ -124,11 +153,34 @@ Throughout this course, colors carry meaning:
% =============================================================================

\begin{frame}{Why Serving Is Different}
\note{[3 min] Core thesis of the chapter. Training maximizes throughput; serving
minimizes latency. Same hardware, opposite priorities. Walk through the DAM
inversion: Data goes from Volume to Freshness, Algorithm from Mutable to
Frozen, Machine from Saturation to Headroom.
Ask: ``If training saturates GPUs at 95\%, why would serving aim for 50\%?''}
\note{
% -- LINK: What prior concept connects to this slide
In every prior chapter, success meant saturating the GPU. This slide
reveals why that strategy fails in production serving.

% -- NARRATE: What to SAY while showing this slide
Point to the diagram: ``Left side is training --- maximize throughput,
saturate hardware. Right side is serving --- minimize latency, maintain
headroom.'' Walk through the DAM inversion: Data shifts from Volume to
Freshness, Algorithm from Mutable to Frozen, Machine from Saturation to
Headroom.
ANALOGY: ``Training is a freight train --- pack it full. Serving is an
ambulance --- it must always be ready to go.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``If training saturates GPUs at 95\%, why would serving aim for
50\%?'' Give 15 seconds, then cold-call. [Expected: queuing theory ---
high utilization causes latency spikes.]

% -- WARN: What students will get wrong on THIS topic
Common error: students think serving is just training with batch size 1.
Correct framing: serving inverts the optimization objective itself ---
latency replaces throughput as the primary metric.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: ``What happens to cost when you run GPUs at 50\% instead of 95\%?''
IF SHORT: Show diagram, state the inversion, skip the DAM walkthrough.
}
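The queuing-theory answer can be made quantitative with the simplest possible model (an M/M/1 sketch; real traffic is burstier, which makes the effect worse): time spent waiting in queue scales as $\rho/(1-\rho)$ service times, so

\[
\rho = 0.50 \;\Rightarrow\; \frac{\rho}{1-\rho} = 1\times,
\qquad
\rho = 0.95 \;\Rightarrow\; \frac{\rho}{1-\rho} = 19\times
\]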

% --- Full-width diagram ---
\centering
@@ -140,9 +192,34 @@ Ask: ``If training saturates GPUs at 95\%, why would serving aim for 50\%?''}
\end{frame}

\begin{frame}{The D\raisebox{0.04em}{\tiny$\bullet$}A\raisebox{0.04em}{\tiny$\bullet$}M Inversion}
\note{[2 min] Formalize the inversion along DAM axes. Students already know DAM
from Ch1; here we show how every axis flips. The Iron Law shifts from the
compute term dominating (training) to the latency term dominating (serving).}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide showed the inversion visually. This slide formalizes
it along the three DAM axes students learned in Ch1.

% -- NARRATE: What to SAY while showing this slide
Walk down the table row by row: ``Data: training ingests billions of
samples; serving handles one request at a time --- freshness replaces
volume. Algorithm: training runs backprop; serving is forward-only ---
no optimizer state needed. Machine: training saturates at 95\%; serving
holds headroom at 40--60\% to absorb traffic spikes.'' End on the Iron
Law row: ``The dominant term flips from compute to latency overhead.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Before showing the bottom card, ask: ``If both training and serving use
the same GPU, why does the Iron Law's dominant term change?''
[Expected: serving processes one request at low arithmetic intensity,
making the latency/overhead term dominate.]

% -- WARN: What students will get wrong on THIS topic
Common error: students think removing backprop makes serving trivially
easy. Correct framing: removing backprop frees memory but exposes the
latency term that training's large batches amortized away.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: ``What happens to the Machine row during a traffic spike?''
IF SHORT: Cover Data and Machine rows; skip Algorithm row.
}

\scriptsize
\renewcommand{\arraystretch}{1.1}
@@ -167,10 +244,33 @@ compute term dominating (training) to the latency term dominating (serving).}
\end{frame}

\begin{frame}{Static vs.\ Dynamic Inference}
\note{[2 min] First architectural decision: when to compute predictions.
Static = pre-compute overnight (photo classification). Dynamic = on-demand
(content moderation). Most production systems use a hybrid.
If short on time: cover the table quickly and move on.}
\note{
% -- LINK: What prior concept connects to this slide
The DAM inversion showed that serving prioritizes freshness over volume.
This slide presents the first design decision that follows: when to
compute predictions.

% -- NARRATE: What to SAY while showing this slide
Point to the green card: ``Static inference pre-computes overnight ---
10,000 photos times 5 ms equals 50 seconds total. Zero runtime latency,
but it cannot handle novel inputs.'' Then the red card: ``Dynamic
inference computes on demand under a 100 ms budget. Flexible but
expensive.'' Finish with the hybrid insight at the bottom.

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``A search engine autocomplete --- static or dynamic?''
[Expected: hybrid --- common queries are cached, novel queries computed
on demand.]

% -- WARN: What students will get wrong on THIS topic
Common error: students dismiss static inference as outdated. Correct
framing: recommendation systems pre-compute candidate sets for millions
of users nightly; only the final ranking is dynamic.

% -- FLEX: [OPTIONAL] This slide provides context but is not load-bearing.
IF SHORT: Cover the two cards quickly and move on to Server Anatomy.
IF AHEAD: ``What determines the boundary between cached and dynamic?''
}

\footnotesize
\begin{columns}[T]
@@ -71,11 +71,38 @@
\section{Welcome}

\begin{frame}{Welcome to Volume II}
\note{[3 min] Set the tone: this is the advanced course. Students already know
how one machine works; now they learn how thousands coordinate. The metaphor:
``The fleet is the computer'' --- like ``The network is the computer'' (Sun
Microsystems), but for ML clusters. Ask: ``How many of you have SSH'd into a
multi-GPU cluster?''}
\note{
% -- LINK: What prior concept connects to this slide
Volume I taught the single-machine mental model: one node, 1--8 GPUs,
shared memory, PCIe/NVLink. This slide establishes that everything
students learned still applies --- but a new scale axis changes the rules.

% -- NARRATE: What to SAY while showing this slide
Point to the comparison table on the right: ``Every row flips when you
cross the node boundary. The bus becomes a network. Shared memory becomes
message passing. Rare failures become daily certainty.'' Use the tagline:
``The fleet is the computer'' --- echoing Sun Microsystems' ``The network
is the computer,'' but applied to ML clusters.

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``How many of you have SSH'd into a multi-GPU cluster?''
Hands up. Then: ``How many have debugged a training job that stalled
because one of 1,000 GPUs went silent?'' The gap between those two
counts is what this course fills.

% -- WARN: What students will get wrong on THIS topic
Common error: students assume ``distributed = just add more GPUs.''
Correct framing: crossing a node boundary changes failure modes,
communication patterns, and programming models qualitatively.
IF STUCK: Ask them to compare restarting a crashed browser tab vs.\
restarting one node in a 1,000-node synchronized training job.

% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] This is the opening frame --- it sets the tone for the entire course.
IF AHEAD: Ask what specific Vol I concept they found most surprising;
connect it to the fleet version.
IF SHORT: Skip the hand-raise, go straight to the table walkthrough.
}

\small
\begin{columns}[T]
@@ -122,10 +149,38 @@ multi-GPU cluster?''}
% =============================================================================

\begin{frame}{From Single Node to Fleet}
\note{[3 min] The transition slide. Walk through left vs.\ right.
Key insight: at fleet scale, the network replaces the memory bus as the
critical interconnect. Latency goes from nanoseconds to milliseconds.
Ask: ``What happens to your training job when one of 10,000 GPUs dies?''}
\note{
% -- LINK: What prior concept connects to this slide
The Welcome slide introduced the fleet concept verbally. This diagram
makes the transition visual --- left side is Vol I territory, right side
is Vol II.

% -- NARRATE: What to SAY while showing this slide
Walk left-to-right through the diagram: ``On the left, one node ---
everything connected by NVLink at 900 GB/s, failures are rare, latency
is nanoseconds. Cross the dotted line to the right: the bus becomes
InfiniBand at 50 GB/s, latency jumps to microseconds, and with 10,000
GPUs, one fails every few hours. The network replaces the memory bus as
the critical interconnect.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``What happens to your training job when one of 10,000 GPUs dies?''
Give 15 seconds of think time. Expected answer: the entire job stalls
or crashes --- which is exactly why fault tolerance is first-class.

% -- WARN: What students will get wrong on THIS topic
Students will underestimate the latency jump: nanoseconds to microseconds
sounds small, but it is a 1,000x increase. At 350 GB of gradients per
step, that 1,000x turns into minutes of synchronization overhead per hour.
IF STUCK: Compare it to a highway going from 300 mph to 0.3 mph at a
toll booth --- the booth is the node boundary.

% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] This diagram anchors the entire course narrative.
IF AHEAD: ``At what fleet size does the probability of zero failures
during a 1-hour training window drop below 50\%?''
IF SHORT: Point to the diagram, read the insight callout, move on.
}
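To make the WARN concrete, a rough per-step synchronization sketch using the numbers quoted above (350 GB of gradients; ignores topology, overlap, and reduction algorithms):

\[
\frac{350\ \text{GB}}{900\ \text{GB/s}} \approx 0.4\ \text{s (NVLink, in-node)}
\qquad\text{vs.}\qquad
\frac{350\ \text{GB}}{50\ \text{GB/s}} = 7\ \text{s (InfiniBand, cross-node)}
\]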

% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -137,10 +192,41 @@ Ask: ``What happens to your training job when one of 10,000 GPUs dies?''}
\end{frame}

\begin{frame}{Why Scale Changes Everything}
\note{[3 min] Three fundamental changes. These are NOT just ``more of the same.''
Each one requires new engineering disciplines that do not exist in single-node ML.
Common error: students think distributed = ``just add more GPUs.''
Ask: ``Which of these three surprises you most?''}
\note{
% -- LINK: What prior concept connects to this slide
The previous diagram showed the physical transition. This slide names
the three qualitative changes that make fleet engineering a different
discipline from single-node optimization.

% -- NARRATE: What to SAY while showing this slide
Point to each card in order. ``First, communication dominates ---
at 10,000 GPUs, AllReduce takes longer than the forward pass. Second,
failure is routine --- 10,000 GPUs times 100,000h MTBF equals one failure
every 10 hours. Third, emergent behavior --- stragglers, hot spots, and
cascading failures that no single component predicts.''
ANALOGY: ``A single car can break down. A fleet of 10,000 taxis has
at least one broken down at any moment --- and every taxi must wait for
every other taxi before the next fare.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Which of these three surprises you most?'' Cold-call one student.
Most will say failure frequency --- use that to preview the reliability
math coming later.

% -- WARN: What students will get wrong on THIS topic
Students will treat these as independent problems. In reality they
interact: a straggler (emergent behavior) that triggers a timeout
(failure) during an AllReduce (communication) cascades across all three.
IF STUCK: Walk through a concrete cascade: slow GPU triggers BSP
barrier timeout, which triggers checkpoint, which stalls all 10,000 GPUs.

% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] These three categories structure the entire course.
IF AHEAD: ``Can you think of a fourth qualitative change at scale?''
(Answer: cost --- 1\% inefficiency at 10,000 GPUs is millions of dollars.)
IF SHORT: Read the three card titles and the bottom-line callout, skip
the analogy.
}
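The failure arithmetic from the NARRATE section, plus the IF AHEAD follow-up, as a sketch assuming independent failures at the quoted 100,000-hour per-GPU MTBF:

\[
\text{MTBF}_{\text{fleet}} = \frac{100{,}000\ \text{h}}{10{,}000\ \text{GPUs}} = 10\ \text{h},
\qquad
P(\text{no failure in 1 h}) = e^{-10{,}000/100{,}000} \approx 0.90
\]

Setting $e^{-N/100{,}000} = 0.5$ gives $N \approx 69{,}000$ GPUs for the fleet-size question.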

\small
\begin{columns}[T]
@@ -175,11 +261,41 @@ Ask: ``Which of these three surprises you most?''}
% =============================================================================

\begin{frame}{The \Cthree{} Taxonomy: Compute, Communication, Coordination}
\note{[3 min] Introduce the diagnostic framework for Vol 2. Every performance
problem in a fleet can be traced to one of these three axes. This replaces
the single-node D-A-M taxonomy with a fleet-scale lens.
Ask: ``If training stalls for 30 seconds every hour, which C is the culprit?''
(Answer: Coordination --- likely checkpointing or straggler mitigation.)}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide named three qualitative changes at scale. The C3
taxonomy formalizes these into a diagnostic framework --- it replaces
the single-node D-A-M taxonomy with a fleet-scale lens.

% -- NARRATE: What to SAY while showing this slide
Point to the diagram: ``Every performance problem in a fleet traces to
one of these three axes. Compute: are the GPUs doing useful math?
Communication: how fast can gradients and activations move? Coordination:
who decides what runs where, when to checkpoint, how to handle stragglers?''
Draw the parallel explicitly: ``Vol I asked `is this workload Data-bound,
Algorithm-bound, or Memory-bound?' Vol II asks `is the fleet bottlenecked
on Compute, Communication, or Coordination?'''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``If training stalls for 30 seconds every hour, which C is the
culprit?'' Give 15 seconds. Cold-call. Expected answer: Coordination ---
likely synchronous checkpointing or straggler mitigation, not raw
compute or bandwidth.

% -- WARN: What students will get wrong on THIS topic
Students will conflate Communication and Coordination. Communication is
moving bytes (AllReduce, gradient sync). Coordination is making decisions
(scheduling, checkpointing, barrier management). A straggler that slows
AllReduce is a Coordination problem manifesting through Communication.
IF STUCK: ``Communication is the pipe. Coordination is the traffic cop.''

% -- FLEX: [CORE] or [OPTIONAL] + contingency
[CORE] C3 is the diagnostic backbone of the entire volume.
IF AHEAD: ``Can a problem be bottlenecked on two C's simultaneously?
Give an example.'' (Yes: gradient compression reduces Communication but
adds Compute overhead for encode/decode.)
IF SHORT: Show diagram, ask the 30-second stall question, move on.
}

% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -191,9 +307,37 @@ Ask: ``If training stalls for 30 seconds every hour, which C is the culprit?''
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{\Cthree{} in Practice}
|
||||
\note{[2 min] Concrete examples mapping real problems to the C3 axes.
|
||||
Walk through the table row by row. The point: every fleet problem has
|
||||
a C3 diagnosis that guides the engineering response.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The previous slide defined C3 abstractly. This table maps real fleet
|
||||
symptoms to C3 axes, showing the framework in diagnostic action.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Walk through the table row by row. ``Low GPU utilization? That is
|
||||
Compute --- memory-bound kernels, fix with operator fusion. Throughput
|
||||
plateau? Communication --- AllReduce saturated, fix with gradient
|
||||
compression. Periodic 30-second stalls? Coordination --- synchronous
|
||||
checkpointing, fix with async checkpointing.'' Emphasize the pattern:
|
||||
diagnosis precedes optimization.
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Cover the ``Response'' column with your hand. Point to ``Throughput
|
||||
variance'' and ask: ``Which C axis and what would you do?'' Give 20
|
||||
seconds. Expected: Coordination (straggler nodes), response is straggler
|
||||
mitigation or redundant computation.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will jump to solutions before diagnosing. ``Just buy faster
|
||||
GPUs'' is the default instinct. The table shows that 4 of 6 symptoms
|
||||
are NOT solved by faster hardware --- they require algorithmic or
|
||||
systems-level interventions.
|
||||
IF STUCK: Ask which rows would NOT be helped by doubling GPU TFLOPS.
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] The C3 concept was introduced on the previous slide.
|
||||
IF SHORT: Show the table briefly, read the insight callout, move on.
|
||||
IF AHEAD: Ask students to propose a 7th row with a novel symptom.
|
||||
}
|
||||
|
||||
\scriptsize
|
||||
\renewcommand{\arraystretch}{1.15}
|
||||
@@ -217,9 +361,27 @@ a C3 diagnosis that guides the engineering response.}
|
||||
|
||||
% --- ACTIVE LEARNING 1: Predict ---
|
||||
\begin{frame}{Predict: Where Is the Bottleneck?}
|
||||
\note{[2 min] Prediction exercise. Give students 60 seconds. Do NOT reveal
|
||||
the answer yet. The point: build intuition for C3 diagnosis before the
|
||||
course teaches the formal tools. Ask 2-3 students to share.}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Read the scenario aloud. ``You are training a 70B model across 4,096
|
||||
GPUs. Throughput is only 40\% of theoretical peak. Which C3 axis is
|
||||
most likely the bottleneck?'' Emphasize: 40\% means 60\% of silicon
|
||||
is doing nothing useful.
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Give 60 seconds for individual writing. Then 30 seconds of pair
|
||||
discussion. Cold-call 2--3 students. Expected answer: Communication ---
|
||||
at 4,096 GPUs, AllReduce overhead for 70B parameters (140 GB gradients)
|
||||
is massive. Some students will say Compute (kernel inefficiency) ---
|
||||
acknowledge it but redirect: at 40\% peak with 4,096 GPUs, the
|
||||
Communication term is almost certainly dominant.
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] This is the first active learning moment --- establishes the
|
||||
predict-before-reveal pattern for the entire course.
|
||||
IF SHORT: Reduce to 30 seconds writing, skip pair discussion, cold-call
|
||||
one student.
|
||||
}
|
||||
|
||||
\centering
|
||||
\vspace{0.8cm}
|
||||
@@ -243,12 +405,42 @@ Write your answer and one reason why. \textcolor{midgray}{(60 seconds)}}
|
||||
% =============================================================================
|
||||
|
||||
\begin{frame}{The Fleet Stack: Four Layers}
|
||||
\note{[3 min] The organizing framework for the entire course. Walk through
|
||||
bottom to top: infrastructure provides the physical substrate, distributed ML
|
||||
adds the training algorithms, deployment puts models into production,
|
||||
governance ensures responsible operation.
|
||||
Key insight: you cannot skip layers. A governance failure is ultimately
|
||||
an infrastructure failure.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
C3 diagnoses WHERE the bottleneck is. The Fleet Stack organizes HOW the
|
||||
course addresses each layer of the system --- from physical silicon to
|
||||
societal governance.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Walk bottom-to-top through the diagram: ``Infrastructure provides the
|
||||
physical substrate --- compute, network, storage. Distributed ML adds
|
||||
the training algorithms --- data, tensor, pipeline parallelism.
|
||||
Deployment puts models into production --- inference optimization,
|
||||
scheduling, SLAs. Governance ensures responsible operation --- security,
|
||||
sustainability, fairness.'' Pause on the insight: ``You cannot skip
|
||||
layers. A governance failure is ultimately an infrastructure failure ---
|
||||
if the cluster cannot audit which data trained which model, no amount
|
||||
of policy fixes that.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Which layer do most ML courses stop at?'' Expected answer:
|
||||
Layer 2 (Distributed ML). ``This course covers all four --- production
|
||||
ML fails when any layer is neglected.''
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will treat layers as independent. In reality, decisions at the
|
||||
bottom constrain possibilities at the top: a network topology that
|
||||
cannot isolate tenants makes multi-tenant governance impossible.
|
||||
IF STUCK: Give the concrete example: ``If your IB fabric has no SR-IOV,
|
||||
you cannot do secure multi-tenant serving.''
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] The Fleet Stack is the course roadmap --- every chapter maps
|
||||
to a layer.
|
||||
IF AHEAD: ``Which layer do you think has the highest dollar-cost of
|
||||
getting wrong?''
|
||||
IF SHORT: Name the four layers, read the insight callout, move on.
|
||||
}
|
||||
|
||||
% --- Layout: FULL-WIDTH diagram ---
|
||||
\centering
|
||||
@@ -260,9 +452,38 @@ an infrastructure failure.}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Fleet Stack: What Each Layer Teaches}
|
||||
\note{[2 min] Quick overview of what students will learn in each layer.
|
||||
This is the ``what's in it for me'' slide. Emphasize that the course
|
||||
covers the full stack, not just distributed training.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The previous slide showed the Fleet Stack as a diagram. This table
|
||||
translates it into concrete skills students will develop in each layer.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Read each row as a promise: ``In the Infrastructure layer, you will
|
||||
learn to reason about cluster hardware --- GPU selection, InfiniBand
|
||||
topology, storage hierarchy. In Distributed ML, you will design parallel
|
||||
training --- DP, TP, PP, collective ops, fault-tolerant training.''
|
||||
Emphasize the crimson card: ``Most courses stop at Layer 2. This course
|
||||
covers all four.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Which layer interests you most and why?'' Quick show of hands
|
||||
for each layer. Use the distribution to preview which chapters will
|
||||
resonate most.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will discount Governance as ``not technical.'' Counter with:
|
||||
the EU AI Act requires auditable training provenance --- that is a
|
||||
distributed systems problem (data lineage across a 10,000-GPU fleet),
|
||||
not a policy problem.
|
||||
IF STUCK: ``If you cannot prove which data trained your model, you
|
||||
cannot deploy in Europe. That is infrastructure.''
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] The Fleet Stack was covered on the previous slide.
|
||||
IF SHORT: Skip this slide entirely --- the diagram carries the message.
|
||||
IF AHEAD: Ask which layer they think is most under-invested at real
|
||||
companies.
|
||||
}
|
||||
|
||||
\scriptsize
|
||||
\renewcommand{\arraystretch}{1.15}
|
||||
@@ -289,10 +510,39 @@ covers the full stack, not just distributed training.}
|
||||
% =============================================================================
|
||||
|
||||
\begin{frame}{17 Chapters in Four Parts}
|
||||
\note{[3 min] The map of the semester. Walk through each part briefly.
|
||||
Emphasize that C3 threads through all four parts. Point out the chapter
|
||||
numbers so students can look ahead. If short: just name the four parts
|
||||
and move on.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The Fleet Stack named four layers. This roadmap maps those layers to
|
||||
17 specific chapters across the semester, showing the learning arc.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Point to each part in the diagram: ``Part I is Infrastructure ---
|
||||
compute, network, storage. Part II is Distributed ML --- parallelism
|
||||
strategies, collective communication, fault tolerance. Part III is
|
||||
Deployment --- inference, scheduling, serving. Part IV is Governance ---
|
||||
security, sustainability, responsible AI.'' Highlight the color coding:
|
||||
``Notice how C3 threads through all four parts --- Compute constraints
|
||||
dominate Parts I and II, Communication in Parts II and III, Coordination
|
||||
in Parts III and IV.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Looking at this map, which chapter title are you most curious
|
||||
about?'' Quick poll. This surfaces student interests early.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will assume the parts are independent. In reality, Part III
|
||||
(Deployment) depends heavily on Part I (Infrastructure) decisions ---
|
||||
you cannot optimize inference serving without understanding the memory
|
||||
hierarchy from Chapter 2.
|
||||
IF STUCK: ``Think of it as a building: you cannot furnish the penthouse
|
||||
before pouring the foundation.''
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] This is a reference slide that students will revisit.
|
||||
IF SHORT: Show the diagram, name the four parts, move on.
|
||||
IF AHEAD: Ask students to predict which part will be most relevant
|
||||
to their career goals.
|
||||
}
|
||||
|
||||
% --- Layout: FULL-WIDTH diagram ---
|
||||
\centering
|
||||
@@ -306,10 +556,40 @@ and move on.}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{The Numbers That Define Fleet Scale}
|
||||
\note{[3 min] Let the numbers speak. Pause on each one. The failure rate
|
||||
calculation is the most surprising: 10,000 GPUs with 100,000h MTBF means
|
||||
one failure every 10 hours. Ask: ``How does this change how you think about
|
||||
writing training code?''}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The roadmap showed the course structure. This slide grounds it in
|
||||
physical reality --- the numbers that make fleet engineering different
|
||||
from single-node work.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Let the numbers speak. Pause on each one: ``10,000 GPUs. 100 MW of
|
||||
power --- that is a small city. 350 GB of gradients synchronized every
|
||||
few seconds. And the one that changes everything: 10,000 GPUs with
|
||||
100,000h MTBF means one failure every 10 hours. Not one failure per
|
||||
year. Every 10 hours.''
|
||||
ANALOGY: ``Imagine a 10,000-person orchestra where one musician
|
||||
collapses every 10 hours, and the entire orchestra must stop and restart
|
||||
from a checkpoint.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``How does this failure rate change how you think about writing
|
||||
training code?'' Give 20 seconds. Expected insight: checkpointing
|
||||
becomes the most critical code path, not the model architecture.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will try to prevent failures rather than engineer for
|
||||
resilience. At 10,000 GPUs, prevention is impossible --- the math
|
||||
guarantees failures. The correct framing is: minimize recovery time,
|
||||
not failure probability.
|
||||
IF STUCK: ``You cannot prevent rain. You build a roof.''
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] These numbers recur throughout the entire course as anchors.
|
||||
IF AHEAD: ``Calculate the fleet MTBF for 25,000 GPUs with 100,000h
|
||||
per-GPU MTBF.'' (Answer: 4 hours.)
|
||||
IF SHORT: Highlight the failure rate number and the bottom-line callout.
|
||||
}
|
||||
|
||||
% --- Layout: FULL-WIDTH diagram ---
|
||||
\centering
|
||||
@@ -322,7 +602,17 @@ writing training code?''}
|
||||
|
||||
% --- ACTIVE LEARNING: Micro-Retrieval Cue ---
|
||||
\begin{frame}{Quick Check}
|
||||
\note{[1 min] Answer: ~once per hour (10000/8760).}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Read the question aloud: ``If one GPU fails once per year, how often
|
||||
does a 10,000-GPU cluster fail?'' Pause 15 seconds. Cold-call.
|
||||
Answer: approximately once per hour (10,000 failures/year divided by
|
||||
8,760 hours/year is about 1.14 per hour).
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] This cements the failure-rate intuition before moving on.
|
||||
IF SHORT: Ask, pause 10 seconds, give the answer, move on.
|
||||
}
|
||||
|
||||
\centering
|
||||
\vspace{1.0cm}
|
||||
@@ -343,9 +633,37 @@ writing training code?''}
|
||||
% =============================================================================
|
||||
|
||||
\begin{frame}{Learning Outcomes}
|
||||
\note{[2 min] Read through outcomes. These map to assessable skills.
|
||||
Emphasize that every outcome is \emph{quantitative} --- ``design'' means
|
||||
calculate, not just describe.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The roadmap and scale numbers established what the course covers and
|
||||
why. This slide translates that into measurable skills students will
|
||||
demonstrate.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Read through outcomes emphasizing the verbs: ``Design --- not describe,
|
||||
design. Calculate --- not estimate, calculate. Every outcome is
|
||||
quantitative. When we say `design distributed training pipelines,' we
|
||||
mean you will specify TP degree, PP stages, and DP replicas for a given
|
||||
model and cluster, then calculate the expected MFU.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Which of these seven outcomes do you currently feel least
|
||||
prepared for?'' Quick show of hands per outcome. Use the distribution
|
||||
to calibrate pacing for early chapters.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will underestimate outcome 7 (governance). It requires the
|
||||
same quantitative rigor as the others --- calculating carbon footprint
|
||||
per training run, auditing data provenance across a distributed pipeline.
|
||||
IF STUCK: ``Governance is not an essay. It is a systems design problem
|
||||
with measurable constraints.''
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] Outcomes are a reference --- students will revisit on the
|
||||
syllabus.
|
||||
IF SHORT: Skim the list, emphasize the quantitative verbs, move on.
|
||||
IF AHEAD: Ask students to rank the outcomes by difficulty.
|
||||
}
|
||||
|
||||
\footnotesize
|
||||
By the end of this course, you will be able to:
|
||||
@@ -364,8 +682,19 @@ By the end of this course, you will be able to:
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Visual Language}
|
||||
\note{[1 min] Explain the semantic color system used throughout the course.
|
||||
These colors are consistent across all diagrams and slides.}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Point to each card: ``Blue means compute or processing --- GPU ops,
|
||||
forward/backward pass. Green means data flow or healthy paths. Orange
|
||||
means routing or scheduling. Red means error, cost, or bottleneck.
|
||||
These colors are consistent across every diagram and slide in the
|
||||
course. When you see red in a figure, something is wrong or expensive.''
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] Reference slide for the color system.
|
||||
IF SHORT: Skip entirely --- students will absorb the colors through
|
||||
exposure.
|
||||
}
|
||||
|
||||
\small
|
||||
Throughout this course, colors carry meaning:
|
||||
@@ -400,10 +729,43 @@ Throughout this course, colors carry meaning:
|
||||
|
||||
|
||||
\begin{frame}{A Taste of What's Coming}
|
||||
\note{[3 min] The hook. Make it visceral. Walk through the scenario step by step.
|
||||
The point: every one of these failure modes is a chapter in the course.
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The learning outcomes listed abstract skills. This scenario makes them
|
||||
visceral --- a concrete frontier training run where every failure mode
|
||||
maps to a course chapter.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Walk through the scenario step by step: ``25,000 GPUs. A GPU dies every
|
||||
4 hours. AllReduce across 25,000 GPUs takes 500 ms per step. A rack
|
||||
switch fails and 128 GPUs go dark. Gradient staleness causes loss
|
||||
spikes. Power budget limits utilization to 80\%.'' Then map each
|
||||
problem to a chapter on the right: ``Fault-tolerant checkpointing ---
|
||||
Chapter 7. Collective communication optimization --- Chapter 6. Network
|
||||
fabric redundancy --- Chapter 3. Async training --- Chapter 5.
|
||||
Sustainability and power budgets --- Chapter 15.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Which of these problems would you know how to solve today?''
|
||||
(Expected answer: none of them --- that's why they're taking this course.)}
|
||||
Expected answer: none of them --- and that gap is exactly why they are
|
||||
taking this course. If a student claims to know one, press for
|
||||
specifics.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will try to solve each problem in isolation. The real challenge
|
||||
is that these problems interact: a GPU failure triggers checkpointing,
|
||||
which saturates the network, which causes gradient staleness, which
|
||||
spikes loss. The system view matters more than any individual fix.
|
||||
IF STUCK: ``These are not five separate problems. They are one system
|
||||
with five failure modes.''
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] This is the motivational hook that justifies the entire course.
|
||||
IF AHEAD: ``What is the dollar cost of 4 hours of 25,000 idle H100s
|
||||
at \$3/GPU-hour?'' (Answer: \$300K wasted per failure event.)
|
||||
IF SHORT: Read the left column only, point to the right column as
|
||||
``what this course teaches,'' move on.
|
||||
}
|
||||
|
||||
\footnotesize
|
||||
\textbf{Scenario: Training a Frontier Model Across 25,000 GPUs}
|
||||
@@ -441,10 +803,40 @@ Ask: ``Which of these problems would you know how to solve today?''
|
||||
|
||||
% --- ACTIVE LEARNING 2: Discussion ---
|
||||
\begin{frame}{Discussion: What Breaks First?}
|
||||
\note{[3 min] Turn-and-talk. Students discuss in pairs for 90 seconds.
|
||||
Cold-call 2-3 pairs. No single right answer --- the point is that
|
||||
``more GPUs'' is never the full answer. Common student answer: ``the network''
|
||||
--- press them on whether that's communication or coordination.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The scenario slide listed five failure modes. This discussion forces
|
||||
students to reason about which one strikes first --- building intuition
|
||||
for failure ordering at scale.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Read the scenario: ``Your startup raised \$100M. You buy 10,000 H100
|
||||
GPUs, connect them with InfiniBand. What breaks first?'' Point to the
|
||||
five options along the bottom.
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Turn-and-talk for 90 seconds. Then cold-call 2--3 pairs. There is no
|
||||
single right answer --- the point is that ``more GPUs'' is never the
|
||||
full answer. Common student answer: ``the network'' --- press them on
|
||||
whether that is Communication (bandwidth saturation) or Coordination
|
||||
(scheduling, straggler management). Some will say ``your budget'' ---
|
||||
that is a valid and insightful answer worth exploring.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will focus on GPU hardware failures. In practice, the software
|
||||
stack (NCCL hangs, CUDA OOM, driver crashes) breaks far more often than
|
||||
physical hardware. Meta's Grand Teton paper reports that software issues
|
||||
cause more downtime than hardware failures.
|
||||
IF STUCK: ``Think about what you have to set up BEFORE the GPUs start
|
||||
computing.''
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] This is the second active learning moment and the most
|
||||
interactive slide in the deck.
|
||||
IF SHORT: Reduce to 60-second pairs, cold-call one pair.
|
||||
IF AHEAD: After discussion, ask: ``What would you monitor to detect
|
||||
the failure before it happens?''
|
||||
}
|
||||
|
||||
\centering
|
||||
\vspace{0.8cm}
|
||||
@@ -471,10 +863,40 @@ You buy 10,000 H100 GPUs and connect them with InfiniBand.\\[0.3cm]
|
||||
% =============================================================================
|
||||
|
||||
\begin{frame}{Prerequisites}
|
||||
\note{[2 min] Set expectations clearly. Students need Vol 1 or equivalent.
|
||||
Distributed systems basics (consensus, message passing) are helpful but
|
||||
will be introduced as needed. Programming: PyTorch and basic Linux/cluster
|
||||
experience expected.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The course content and motivation are established. Now students need to
|
||||
know: am I prepared for this?
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Walk through the Required column: ``Vol I or equivalent --- you must
|
||||
understand single-GPU training, the memory hierarchy, and basic
|
||||
profiling. PyTorch proficiency --- you will write torch.distributed
|
||||
code from week 2. Linux and SSH --- you will be running jobs on
|
||||
multi-node clusters.'' Then the Helpful column: ``Distributed systems
|
||||
concepts like consensus and RPC will be introduced as needed. If you
|
||||
know Slurm or Kubernetes, you will have a head start on the scheduling
|
||||
chapters, but it is not required.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Raise your hand if you are comfortable explaining what AllReduce
|
||||
does.'' The fraction of hands up calibrates how much Ch.~5--6 review
|
||||
is needed.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students without cluster experience will feel behind. Reassure them:
|
||||
the course introduces cluster concepts from scratch. The real
|
||||
prerequisite is single-GPU fluency, not distributed systems expertise.
|
||||
IF STUCK: ``If you can train a ResNet on one GPU and profile where time
|
||||
goes, you are ready.''
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] Logistical slide.
|
||||
IF SHORT: Name the top 3 required skills, mention the ``helpful but not
|
||||
required'' list exists, move on.
|
||||
IF AHEAD: Ask a student with cluster experience to share one surprising
|
||||
lesson from their first multi-node job.
|
||||
}
|
||||
|
||||
\small
|
||||
\begin{columns}[T]
|
||||
@@ -505,11 +927,40 @@ experience expected.}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Relationship to Volume I}
|
||||
\note{[2 min] Address the key question: ``Is this harder than Vol I?''
|
||||
Answer: not harder --- wider. The distinction is SCOPE, not DEPTH.
|
||||
Vol I went deep on one machine; Vol II goes wide across the fleet.
|
||||
Both are equally rigorous. Vol II re-derives key frameworks (like the
|
||||
Iron Law) at fleet scale.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
Prerequisites established what students need. This slide addresses the
|
||||
elephant in the room: ``Is this harder than Vol I?''
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Point to the two cards side by side: ``Vol I: one machine, 1--8 GPUs,
|
||||
shared memory, DataParallel, the Iron Law. Vol II: 10,000+ GPUs,
|
||||
InfiniBand, message passing, torch.distributed, C3.'' Then the key
|
||||
message: ``The distinction is SCOPE, not DEPTH. Vol I went deep on one
|
||||
machine. Vol II goes wide across the fleet. Both are equally rigorous.
|
||||
We re-derive key frameworks like the Iron Law at fleet scale, adding
|
||||
communication and coordination terms.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``What Vol I concept do you think will change most at fleet scale?''
|
||||
Cold-call. Any answer works --- use it to preview how that concept
|
||||
evolves. (e.g., ``the memory wall'' becomes the ``network wall.'')
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students will assume Vol II content is ``harder.'' It is not --- it is
|
||||
different. A student who struggled with cache hierarchies in Vol I may
|
||||
excel at distributed fault tolerance in Vol II. The thinking patterns
|
||||
are complementary, not hierarchical.
|
||||
IF STUCK: ``Think of it as learning to fly after learning to drive.
|
||||
Different skills, not harder skills.''
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] Context-setting slide.
|
||||
IF SHORT: Read the crimson card at the bottom, skip the detailed
|
||||
comparison.
|
||||
IF AHEAD: Ask: ``Which Vol I equation do you think we will generalize
|
||||
first?'' (Answer: the Iron Law, in Chapter 1.)
|
||||
}
|
||||
|
||||
\small
|
||||
\begin{columns}[T]
|
||||
@@ -559,9 +1010,17 @@ Iron Law) at fleet scale.}
|
||||
|
||||
% --- MUDDIEST POINT ---
|
||||
\begin{frame}{Muddiest Point}
|
||||
\note{[2 min] Quick anonymous poll. Students write on a slip of paper or submit
|
||||
digitally. Collect and scan for patterns. Address the top 2--3 confusions in the
|
||||
next lecture's opening. This closes the feedback loop.}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
``Before we close, I want to know what confused you most. Write one
|
||||
sentence --- the concept you found muddiest. Anonymous. Submit before
|
||||
you leave --- slip of paper or digital form.'' Scan responses after
|
||||
class and address the top 2--3 confusions at the start of next lecture.
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] Closes the feedback loop --- essential for adaptive teaching.
|
||||
IF SHORT: Reduce to ``write one word on a slip of paper.''
|
||||
}
|
||||
|
||||
\centering
|
||||
\vspace{1.0cm}
|
||||
@@ -576,8 +1035,18 @@ next lecture's opening. This closes the feedback loop.}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{What Were the Key Ideas?}
|
||||
\note{[2 min] Retrieval practice. Students write 90 seconds, no notes.
|
||||
Do NOT show next slide yet. Walk around the room.}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
``Close your notes. 90 seconds. Write down the 3 most important ideas
|
||||
from today. No peeking.'' Walk around the room while students write.
|
||||
Do NOT show the next slide yet --- the struggle to recall is where
|
||||
learning happens.
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] Retrieval practice is the highest-impact learning technique
|
||||
(Rosenshine). Never skip.
|
||||
IF SHORT: Reduce to 60 seconds.
|
||||
}
|
||||
|
||||
\centering
|
||||
\vspace{1.5cm}
|
||||
@@ -592,9 +1061,24 @@ Do NOT show next slide yet. Walk around the room.}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Key Takeaways}
|
||||
\note{[2 min] Reveal. Walk through each bullet. Emphasize that every concept
|
||||
here will recur throughout the course. The C3 framework is the lens; the
|
||||
Fleet Stack is the map.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
Students just attempted recall. This slide reveals the answers and
|
||||
fills gaps.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Walk through each bullet, pausing on the quantitative anchors: ``The
|
||||
fleet is the computer --- thousands of accelerators as one unit. Scale
|
||||
is qualitative --- not just more, but fundamentally different. C3
|
||||
framework --- every bottleneck maps to Compute, Communication, or
|
||||
Coordination. Fleet Stack --- four layers from infrastructure to
|
||||
governance. Failure math --- one failure every 10 hours at 10,000 GPUs.
|
||||
Scope not depth --- Vol II is wider, not harder.''
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] Consolidation slide.
|
||||
IF SHORT: Read just the first three bullets.
|
||||
}
|
||||
|
||||
\scriptsize
|
||||
\begin{itemize}\setlength\itemsep{0pt}
|
||||
@@ -610,7 +1094,17 @@ Fleet Stack is the map.}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{References}
|
||||
\note{[1 min] Point students to foundational readings for the course.}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
``These five references are the foundational readings for the course.
|
||||
Verbraeken is the survey that maps the landscape. Narayanan and Jiang
|
||||
show how frontier labs actually train at scale. Dean is the historical
|
||||
origin. Patterson and Hennessy is the pedagogical model we follow.''
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] Reference slide.
|
||||
IF SHORT: Skip verbal walkthrough --- students can read it.
|
||||
}
|
||||
|
||||
\small
|
||||
\mlsysref{Verbraeken+20}{Verbraeken et al. ``A Survey on Distributed ML.'' ACM Computing Surveys, 2020.}
|
||||
@@ -622,10 +1116,24 @@ Fleet Stack is the map.}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Next Lecture: The Distributed Landscape}
|
||||
\note{[1 min] Forward hook. The next lecture introduces the fleet as a system:
|
||||
what hardware is in a modern GPU cluster, how nodes are connected, and why
|
||||
the topology matters. The question ``how do 10,000 GPUs talk to each other?''
|
||||
is the central puzzle.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
Today established why scale matters and introduced C3 and the Fleet
|
||||
Stack. The next lecture dives into the first concrete question: what
|
||||
is actually inside a modern GPU cluster?
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
``The fleet is the computer. But what hardware makes up this computer?
|
||||
Next lecture: Chapter 1 introduces the fleet as a system --- what
|
||||
accelerators, what interconnects, what topology, and why the answer to
|
||||
`how do 10,000 GPUs talk to each other?' is the central puzzle of
|
||||
fleet engineering.'' Point to the three columns: Compute, Communication,
|
||||
Coordination --- the C3 lens applied to hardware.
|
||||
|
||||
% -- FLEX: [CORE] or [OPTIONAL] + contingency
|
||||
[CORE] Forward hook that creates anticipation.
|
||||
IF SHORT: Read the central question and move on.
|
||||
}
|
||||
|
||||
\small
|
||||
\centering
|
||||
@@ -664,7 +1172,14 @@ is the central puzzle.}
|
||||
\appendix
|
||||
|
||||
\begin{frame}{Backup: Extended Reference}
|
||||
\note{Backup slide with additional reference material for this chapter.}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Backup reference slide. Only show if students ask for additional
|
||||
resources or problem set reference material.
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] Backup slide --- do not present unless needed.
|
||||
}
|
||||
|
||||
\footnotesize
|
||||
This slide provides extended reference material for students who want to go deeper.
|
||||
@@ -679,7 +1194,14 @@ textbook's summary tables. Use them as a quick reference during problem sets.
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Backup: Further Reading}
|
||||
\note{Backup slide. Point students to additional resources beyond the references slide.}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Backup slide pointing to additional resources. Only present if a
|
||||
student asks ``where can I learn more before next class?''
|
||||
|
||||
% -- FLEX: [OPTIONAL] + contingency
|
||||
[OPTIONAL] Backup slide --- do not present unless needed.
|
||||
}
|
||||
|
||||
\footnotesize
|
||||
\textbf{For deeper exploration:}
|
||||
|
||||
@@ -68,10 +68,23 @@
|
||||
% LEARNING OBJECTIVES
|
||||
% =============================================================================
|
||||
\begin{frame}{Learning Objectives}
|
||||
\note{[2 min] Walk through objectives. Emphasize that this chapter bridges
|
||||
the physical network (Ch5) with the algorithms that run on it. Every concept
|
||||
reduces to the alpha-beta model. Ask: ``How long does it take to send 140 GB
|
||||
across a datacenter?''}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
Ch5 established the physical network --- NVLink, InfiniBand, fat-tree topologies. This chapter asks: what traffic patterns actually flow over those wires during training?
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Read each objective aloud, pausing on ``alpha-beta model'' and ``gradient compression.'' Emphasize that every concept in this chapter reduces to one question: how long does it take to move N bytes across P GPUs?
|
||||
ANALOGY: ``Ch5 built the highway system; today we study the traffic patterns.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``How long does it take to send 140 GB of gradients across a datacenter?'' Accept guesses --- we will calculate the exact answer shortly.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students often think communication is a minor overhead. Set up the surprise: at scale, GPUs spend most of their time waiting, not computing.
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- do not skip.
|
||||
IF SHORT: Read only objectives 1, 2, and 6 aloud; let students read the rest.
|
||||
}
|
||||
|
||||
\small
|
||||
\begin{enumerate}
|
||||
@@ -86,8 +99,13 @@ across a datacenter?''}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Visual Language}
|
||||
\note{[1 min] Explain the semantic color system used throughout the course.
|
||||
These colors are consistent across all diagrams and slides.}
|
||||
\note{
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Point to each card: ``Blue is compute --- GPU ops, forward and backward passes. Green is data flow and healthy paths. Orange is routing and scheduling. Red is error, cost, or bottleneck.'' These colors are consistent across every diagram in this course.
|
||||
|
||||
% -- FLEX: [OPTIONAL] Skip if students have seen this in a prior lecture.
|
||||
IF SHORT: Say ``same color system as last lecture'' and move on in 15 seconds.
|
||||
}
|
||||
|
||||
\small
|
||||
Throughout this course, colors carry meaning:
|
||||
@@ -126,9 +144,23 @@ Throughout this course, colors carry meaning:
|
||||
% =============================================================================
|
||||
|
||||
\begin{frame}{The Communication Bottleneck}
|
||||
\note{[3 min] Open with the visceral fact: at scale, GPUs spend most of their
|
||||
time waiting for data, not computing. The 70B model example grounds this.
|
||||
Ask: ``If compute is cheap but communication is expensive, what should we optimize?''}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
Ch5 showed that InfiniBand NDR delivers 50 GB/s per port. This slide reveals what happens when you actually need to move hundreds of gigabytes of gradients across that fabric.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Point to the left column: ``Adding GPUs is easy --- compute scales linearly. But coordination scales quadratically or worse.'' Walk through the 70B example: 70 billion params times 4 bytes = 280 GB of gradients per step. Point to the red card: ``Ring AllReduce across 64 GPUs at 50 GB/s costs 11.2 seconds of pure communication per training step.''
|
||||
ANALOGY: ``Imagine 64 people in a room each holding a 4 GB file. Everyone needs a copy of the merged result. That is AllReduce.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``If compute is cheap but communication is expensive, which term in the Iron Law should we optimize?'' Expected answer: the data movement term.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students assume adding more GPUs always speeds up training. At 64+ GPUs, communication can consume 40--70\% of the step, making additional GPUs counterproductive without communication optimization.
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- do not skip.
|
||||
IF AHEAD: Ask ``At what GPU count does communication exceed 50\% of the step?''
|
||||
}
|
||||
|
||||
\footnotesize
|
||||
\begin{columns}[T]
|
||||
@@ -156,9 +188,22 @@ Ask: ``If compute is cheap but communication is expensive, what should we optimi
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{The Compute-Communication Timeline}
|
||||
\note{[3 min] Walk through the stacked bars. Key transition: from NVLink-dominated
|
||||
(8 GPUs, 25\% comm) to InfiniBand-limited (4096 GPUs, 65\% comm + 15\% sync).
|
||||
Ask: ``At what point does buying more GPUs stop helping?''}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The previous slide stated 40--70\% overhead abstractly. This diagram shows the concrete breakdown as GPU count grows from 8 to 4,096.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Point to the stacked bars left-to-right: ``At 8 GPUs, NVLink keeps communication to about 25\% of the step. At 256 GPUs, InfiniBand dominates and communication rises to 50\%. At 4,096 GPUs, communication plus sync consume 80\% of the training step.'' Trace the inflection point where the blue bar shrinks relative to the red and orange bars.
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``At what point does buying more GPUs stop helping?'' Expected answer: when communication exceeds the compute time gained by adding GPUs --- roughly the 512--1024 GPU range without hierarchical AllReduce.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students read ``80\% communication'' and think the hardware is slow. The hardware is near wire-speed --- the problem is algorithmic (flat Ring over too many hops).
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- do not skip.
|
||||
IF SHORT: Point to the 4,096-GPU bar and state the 80\% figure; skip the intermediate data points.
|
||||
}
|
||||
|
||||
% --- Layout: FULL-WIDTH IMAGE + annotation ---
|
||||
\centering
|
||||
@@ -170,10 +215,23 @@ Ask: ``At what point does buying more GPUs stop helping?''}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{The Physics of Data Movement}
|
||||
\note{[2 min] Three physical constraints: speed of light (latency floor),
|
||||
bandwidth-distance product, and energy per bit. Emphasize that these are
|
||||
not software problems --- they are physics constraints.
|
||||
Common error: students think faster NICs solve everything.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The timeline showed communication growing with scale. This slide explains the three physical constraints that make communication fundamentally expensive, regardless of the algorithm.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Walk through the table row by row. ``Latency: light in fiber travels at 5 microseconds per kilometer. A 500-meter datacenter round-trip costs thousands of GPU cycles.'' ``Bandwidth: PAM4 signaling limits copper to about 2 meters at NDR speeds --- that is why optical cables cost more.'' ``Energy: moving a bit over InfiniBand costs 20--50 picojoules, 40--100 times more than an SRAM access.'' Then highlight RDMA: ``GPUDirect RDMA bypasses the kernel, cutting latency from 10--20 microseconds to 1--3 microseconds.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Which of these three constraints is a software problem that we can engineer away?'' Expected answer: none --- all three are physics. Software can only minimize exposure, not eliminate the constraints.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students think faster NICs solve everything. Even with 800G networking, latency is bounded by speed of light and energy per bit scales with distance. The constraints are multiplicative, not additive.
|
||||
|
||||
% -- FLEX: [OPTIONAL] Can be compressed to 1 minute.
|
||||
IF SHORT: State ``three physics constraints --- latency, bandwidth, energy --- none are software problems'' and move on.
|
||||
IF AHEAD: Discuss the energy implications at 10K GPUs where communication power approaches compute power.
|
||||
}
|
||||
|
||||
\footnotesize
|
||||
\textbf{Three constraints interact multiplicatively:}
|
||||
@@ -197,10 +255,22 @@ Common error: students think faster NICs solve everything.}
|
||||
|
||||
% --- ACTIVE LEARNING 1: Predict ---
|
||||
\begin{frame}{Predict: What Determines Communication Cost?}
|
||||
\note{[2 min] Prediction exercise before revealing the alpha-beta model.
|
||||
Give students 60 seconds. Do NOT reveal the answer yet.
|
||||
Ask 2--3 students to share. Most will say ``bandwidth'' --- set up the reveal
|
||||
that latency matters too.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
Students just learned about the three physics constraints. This prediction exercise primes them for the alpha-beta model by asking them to reason about message size before seeing the formula.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Read the prompt aloud. ``A 4 KB message and a 140 GB message, same 50 GB/s network. Which takes longer relative to its theoretical minimum?'' Emphasize ``relative'' --- the 140 GB message takes longer in absolute time, but the question is about overhead ratio.
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Give 60 seconds to write. Do NOT reveal the answer. Ask 2--3 students to share. Most will say ``the small message'' --- this is correct. The 4 KB message is dominated by startup latency (alpha), making its actual time orders of magnitude above the bandwidth limit. The 140 GB message is almost entirely bandwidth-bound, achieving near-theoretical throughput.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Some students will say ``the large message'' because it takes longer in absolute time. Redirect: the question is about relative overhead, not absolute time. This distinction is exactly what the alpha-beta model formalizes.
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- it primes the alpha-beta model.
|
||||
IF SHORT: Reduce to 30 seconds of think time and skip pair sharing.
|
||||
}
|
||||
|
||||
\centering
|
||||
\vspace{1.0cm}
|
||||
@@ -225,10 +295,23 @@ $T(n) = \underbrace{\alpha}_{\text{Startup latency}} + \underbrace{\dfrac{n}{\be
|
||||
}
|
||||
|
||||
\begin{frame}{Two Regimes, One Crossover}
|
||||
\note{[3 min] Walk through the alpha-beta model. The critical message size
|
||||
n* = alpha * beta separates two regimes. For IB NDR: n* = 100 KB.
|
||||
MoE tokens (4 KB) are latency-bound; LLM gradients (140 GB) are bandwidth-bound.
|
||||
Ask: ``Which optimization helps MoE but not LLMs?''}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The prediction exercise showed that small and large messages behave very differently. The alpha-beta model formalizes this into two regimes separated by the critical message size n*.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Point to the diagram: ``Left of n-star, latency dominates --- the message is so small that startup overhead dwarfs transfer time. Right of n-star, bandwidth dominates --- the message is large enough that transfer time dwarfs startup.'' State the crossover: ``For IB NDR, n-star is about 100 KB. MoE tokens at 4 KB are deeply latency-bound. LLM gradients at 140 GB are deeply bandwidth-bound.''
|
||||
ANALOGY: ``Alpha is the cost of picking up the phone. Beta is how fast you can talk. A one-word message is dominated by dialing time. A novel is dominated by reading speed.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Which optimization helps MoE routing tokens but not LLM gradients?'' Expected answer: latency reduction (RDMA, kernel bypass, topology optimization). Bandwidth compression helps LLMs but not MoE.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students conflate ``latency'' with ``slowness.'' Clarify: a latency-bound message is not slow in absolute terms --- it just cannot be sped up by increasing bandwidth. The optimization target is different for each regime.
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- do not skip.
|
||||
IF AHEAD: Ask students to calculate n* for PCIe Gen5 (alpha=5us, beta=64 GB/s) and compare with IB NDR.
|
||||
}
|
||||
|
||||
% --- Layout: FULL-WIDTH diagram ---
|
||||
\centering
|
||||
@@ -256,9 +339,22 @@ Ask: ``Which optimization helps MoE but not LLMs?''}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Your Turn: Critical Message Size}
|
||||
\note{[3 min] Give students 90 seconds. Answer: n* = 2e-6 * 50e9 = 100 KB.
|
||||
A 140 GB gradient is 1.4 million times above n*, so bandwidth optimization
|
||||
dominates. A 4 KB MoE token is 25x below n*, so latency optimization dominates.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
Students just learned the alpha-beta model conceptually. This exercise makes them apply it quantitatively for the first time.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Read the problem aloud. ``InfiniBand NDR 400G: alpha is 2 microseconds, beta is 50 GB/s.'' Give 90 seconds. After the pause, walk through: ``n-star equals alpha times beta equals 2 times 10 to the minus 6 times 50 times 10 to the 9 equals 100,000 bytes equals 100 KB.'' Then: ``A 140 GB gradient is 1.4 million times above n-star --- bandwidth optimization dominates. A 4 KB MoE token is 25 times below n-star --- latency optimization dominates.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Students calculate n* and classify two workloads. After solving, ask neighbors to compare. Cold-call one pair to present.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Unit errors: students will forget to convert microseconds to seconds or gigabytes to bytes. Emphasize writing out the full exponents: 2e-6 times 50e9.
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- first quantitative application of alpha-beta.
|
||||
IF SHORT: Show the solution immediately (skip the 90-second work period) and narrate through it.
|
||||
}
|
||||
|
||||
\small
|
||||
\begin{columns}[T]
|
||||
@@ -300,9 +396,22 @@ dominates. A 4 KB MoE token is 25x below n*, so latency optimization dominates.}
|
||||
% =============================================================================
|
||||
|
||||
\begin{frame}{The Six Core Primitives}
|
||||
\note{[3 min] Walk through all six. Key insight: AllReduce = ReduceScatter + AllGather.
|
||||
FSDP exploits this decomposition. AllToAll is the hardest to scale (O(N\^2) connections).
|
||||
Ask: ``Why can't we use AllReduce for MoE?''}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The alpha-beta model tells us how expensive a single message is. But distributed training does not send single messages --- it uses collective operations involving all GPUs simultaneously. This slide catalogs the six primitives.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Walk through the diagram left-to-right: ``Broadcast: one-to-all. Reduce: all-to-one with aggregation. AllReduce: everyone gets the aggregated result --- this is the workhorse of data parallelism. AllGather: everyone gets everyone's shard. ReduceScatter: reduce then distribute shards. AllToAll: everyone sends a unique piece to everyone else --- the hardest to scale.'' Then state the key decomposition: ``AllReduce equals ReduceScatter plus AllGather. FSDP exploits this by splitting the two phases in time.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``Why can't we use AllReduce for Mixture-of-Experts routing?'' Expected answer: MoE needs to send different tokens to different experts --- that is a personalized exchange (AllToAll), not a global aggregation (AllReduce).
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students assume AllReduce is always the right choice because it is the most discussed. MoE and RecSys require AllToAll, which has fundamentally different scaling properties (O(N^2) connections).
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- do not skip.
|
||||
IF AHEAD: Ask students to sketch how FSDP uses ReduceScatter during backward and AllGather during forward.
|
||||
}
|
||||
|
||||
% --- Layout: FULL-WIDTH diagram ---
|
||||
\centering
|
||||
@@ -314,10 +423,22 @@ Ask: ``Why can't we use AllReduce for MoE?''}
|
||||
\end{frame}
|
||||
|
||||
\begin{frame}{Matching Primitives to Parallelism}
|
||||
\note{[3 min] This is the ``travel manifest'' --- the parallelism strategy
|
||||
determines the communication pattern. Data parallelism = AllReduce (bandwidth-bound).
|
||||
MoE = AllToAll (latency + contention). Wrong primitive = wrong scaling ceiling.
|
||||
Common error: using AllReduce for everything.}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The six primitives are tools. This slide is the ``travel manifest'' that maps each parallelism strategy to its required primitive.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Walk through the table: ``Data parallel uses AllReduce --- bandwidth-bound because gradients are large. Tensor parallel also uses AllReduce but is latency-bound because it happens within each layer, requiring NVLink speeds. Pipeline parallel uses point-to-point --- latency-bound because stages are sequential. MoE uses AllToAll --- the hardest, because it creates O(N-squared) logical connections.'' Point to the red card: ``AllToAll hits a communication wall much earlier than AllReduce because contention grows quadratically.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``At 1024 GPUs, which parallelism strategy hits the communication wall first: data parallel or expert parallel?'' Expected answer: expert parallel, because AllToAll creates O(N^2) connections while AllReduce is O(1) per-node bandwidth.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students think using AllReduce for everything is safe. For MoE and RecSys workloads, AllReduce cannot express the required communication pattern --- using it would require redundant computation or incorrect results.
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- do not skip.
|
||||
IF SHORT: Focus on data parallel (AllReduce) and MoE (AllToAll) rows; skip the middle rows.
|
||||
}
|
||||
|
||||
\footnotesize
|
||||
\renewcommand{\arraystretch}{1.15}
|
||||
@@ -346,10 +467,22 @@ Common error: using AllReduce for everything.}
|
||||
% =============================================================================
|
||||
|
||||
\begin{frame}{Ring AllReduce: Bandwidth-Optimal}
|
||||
\note{[3 min] Walk through the two phases: Scatter-Reduce + AllGather.
|
||||
Key property: every link active every step. Bandwidth-optimal but O(N) latency.
|
||||
For 10,000 nodes, 20,000 sequential hops is devastating.
|
||||
Ask: ``What breaks when N = 10,000?''}
|
||||
\note{
|
||||
% -- LINK: What prior concept connects to this slide
|
||||
The primitives table showed AllReduce is the workhorse for data-parallel and tensor-parallel training. This slide examines the simplest bandwidth-optimal implementation: the Ring algorithm.
|
||||
|
||||
% -- NARRATE: What to SAY while showing this slide
|
||||
Point to the diagram: ``Phase 1, Scatter-Reduce: N minus 1 steps where each GPU sends one chunk clockwise and accumulates partial sums. Phase 2, AllGather: N minus 1 more steps where the completed sums circulate to all nodes.'' Write the formula: ``Total time is 2(N-1) alpha plus 2 times (N-1)/N times M over beta.'' Then highlight: ``The bandwidth term approaches 2M/beta as N grows --- optimal! But the latency term is O(N) --- 2 times (N-1) startup delays. For 10,000 nodes, that is 20,000 sequential hops.''
|
||||
|
||||
% -- ENGAGE: Specific question, prediction, or task for THIS slide
|
||||
Ask: ``What breaks when N equals 10,000?'' Expected answer: the O(N) latency term. At alpha=2 microseconds and N=10,000, latency alone costs 40 milliseconds --- before any data moves.
|
||||
|
||||
% -- WARN: What students will get wrong on THIS topic
|
||||
Students see ``bandwidth-optimal'' and assume Ring is always the best choice. It is optimal only in the bandwidth term; the O(N) latency term makes it catastrophic for small messages or large clusters.
|
||||
|
||||
% -- FLEX: [CORE] This slide is essential --- do not skip.
|
||||
IF AHEAD: Ask students to calculate the Ring latency overhead for N=10,000 at alpha=2 microseconds.
|
||||
}

\small
\begin{columns}[T]
@@ -379,10 +512,23 @@ Ask: ``What breaks when N = 10,000?''}
\end{frame}

\begin{frame}{Algorithm Comparison}
\note{[3 min] Walk through the four algorithms. Ring = BW-optimal but O(N) latency.
Tree = O(log N) latency but log N BW penalty. Butterfly = best of both but needs N=2^k.
Double Binary Tree = NCCL default. Crossover formula determines the winner.
If short: focus on Ring vs Tree.}
\note{
% -- LINK: What prior concept connects to this slide
Ring AllReduce is bandwidth-optimal but has O(N) latency. This slide introduces three alternatives that trade bandwidth efficiency for lower latency, and identifies when each wins.

% -- NARRATE: What to SAY while showing this slide
Walk through the table: ``Ring: O(N) latency, bandwidth-optimal --- best for large gradients on small-to-medium clusters. Tree: O(log N) latency, but log N bandwidth penalty --- best for small messages. Butterfly: best of both but requires N to be a power of 2. Double Binary Tree: NCCL's default, near-optimal in both.'' Then point to the crossover formula: ``M-crossover equals N times alpha times beta. Below this message size, Tree wins. Above it, Ring wins. For 64 GPUs on IB NDR, the crossover is about 6.4 MB.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``For a 1 MB AllReduce across 64 GPUs, which algorithm wins?'' Expected answer: Tree, because 1 MB is below the 6.4 MB crossover.

% -- WARN: What students will get wrong on THIS topic
Students memorize ``Ring is optimal'' without qualifying it. Ring is bandwidth-optimal but latency-poor. The crossover formula quantifies exactly when Ring loses to Tree.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Focus on Ring vs Tree only. Skip Butterfly and Double Tree rows.
IF AHEAD: Ask students to calculate the crossover for 256 GPUs.
}
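A small optional sketch of the crossover formula from the narration; the alpha and beta values are assumptions chosen so that N = 64 reproduces the quoted 6.4 MB, and N = 256 answers the IF AHEAD exercise.

# Crossover message size: M_crossover = N * alpha * beta.
# Below it a tree (O(log N) latency) wins; above it, ring (bandwidth-optimal) wins.
alpha = 2e-6   # per-message latency in seconds (assumed)
beta = 50e9    # link bandwidth in bytes/s (assumed InfiniBand-class figure)

for n in (64, 256):
    m_crossover = n * alpha * beta
    print(f"N={n:3d}: crossover ~ {m_crossover / 1e6:.1f} MB")
# N=64 -> ~6.4 MB (the slide's number); N=256 -> ~25.6 MB.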

\footnotesize
\renewcommand{\arraystretch}{1.15}
@@ -411,10 +557,22 @@ For 64 GPUs on IB NDR: $M_{\text{crossover}} \approx 64 \times 2\ \mu\text{s} \t
% =============================================================================

\begin{frame}{The Bandwidth Hierarchy}
\note{[3 min] Real clusters are NOT flat. NVLink is 18x faster than InfiniBand.
A flat Ring wastes NVLink by routing data over IB when NVLink suffices.
The 3-phase hierarchical approach confines expensive IB traffic.
Ask: ``How much does inter-node traffic drop with 8 GPUs per node?''}
\note{
% -- LINK: What prior concept connects to this slide
The algorithm comparison assumed a flat network where every link has the same bandwidth. Real clusters have a hierarchy: NVLink at 900 GB/s within a node, InfiniBand at 50 GB/s between nodes. Ignoring this hierarchy wastes NVLink bandwidth.

% -- NARRATE: What to SAY while showing this slide
Point to the diagram phases: ``Phase 1: ReduceScatter within each node using NVLink at 900 GB/s. Phase 2: AllReduce across nodes using InfiniBand at 50 GB/s --- but now each node sends only 1/G of the data, where G is GPUs per node. Phase 3: AllGather within each node using NVLink again.'' Emphasize: ``The expensive IB traffic is confined to 1/G of the data. With 8 GPUs per node, inter-node traffic drops 8 times.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``How much does inter-node traffic drop with 8 GPUs per node?'' Expected answer: 8 times, because each node reduces locally before sending across the network.

% -- WARN: What students will get wrong on THIS topic
Students think hierarchical AllReduce is an optimization trick. It is actually the default in NCCL --- flat Ring across nodes is the anti-pattern. The hierarchy is not optional; it matches the physical bandwidth tiers.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: State the three phases and the 1/G reduction; skip the detailed bandwidth numbers.
}

% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -440,9 +598,22 @@ Ask: ``How much does inter-node traffic drop with 8 GPUs per node?''}
\end{frame}

\begin{frame}{Hierarchical AllReduce: Worked Example}
\note{[3 min] Walk through the numbers: flat = 40 ms, hierarchical = 7 ms.
5.7x speedup from respecting the bandwidth hierarchy. The key: inter-node
traffic drops by 8x (GPUs per node). This is why NCCL defaults to hierarchical.}
\note{
% -- LINK: What prior concept connects to this slide
The hierarchy concept was introduced abstractly. This worked example puts concrete numbers to each phase and shows a 5.7x speedup.

% -- NARRATE: What to SAY while showing this slide
Walk through the table: ``Flat Ring sends 2 GB over InfiniBand at 50 GB/s: 40 ms. Hierarchical: Phase 1, ReduceScatter sends 875 MB over NVLink at 900 GB/s: about 1 ms. Phase 2, inter-node AllReduce sends only 125 MB over IB: about 5 ms. Phase 3, AllGather sends 875 MB over NVLink: about 1 ms. Total: 7 ms. That is a 5.7x speedup just from respecting the bandwidth hierarchy.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Where did the 125 MB come from?'' Expected answer: 1 GB divided by 8 GPUs per node. Each node reduces locally first, so only the reduced shard crosses the network.

% -- WARN: What students will get wrong on THIS topic
Students forget that the inter-node data volume shrinks by 1/G. They apply the original 1 GB to the IB bandwidth and get the wrong answer. Emphasize: local reduction is the key step.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask ``What happens with 16 GPUs per node instead of 8?'' (Answer: inter-node traffic drops to 62.5 MB, further speedup.)
}
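A minimal sketch reproducing the worked example's arithmetic; it uses the bandwidth figures quoted in the narration and ignores latency terms, so it is a simplification rather than the slide's exact model.

# Hierarchical AllReduce worked example (1 GB gradient, 8 nodes x 8 GPUs).
M = 1e9            # gradient bytes
G = 8              # GPUs per node
nvlink = 900e9     # intra-node bandwidth, B/s (from the narration)
ib = 50e9          # inter-node bandwidth, B/s (from the narration)

flat = 2 * M / ib                        # flat ring pays the full 2*M over IB
phase1 = (G - 1) / G * M / nvlink        # ReduceScatter: 875 MB over NVLink
phase2 = 2 * (M / G) / ib                # inter-node AllReduce on the 125 MB shard
phase3 = (G - 1) / G * M / nvlink        # AllGather: 875 MB over NVLink
hier = phase1 + phase2 + phase3
print(f"flat ~{flat*1e3:.0f} ms, hierarchical ~{hier*1e3:.1f} ms, "
      f"speedup ~{flat/hier:.1f}x")
# -> flat ~40 ms, hierarchical ~7 ms, roughly the 5.7x quoted on the slide.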

\footnotesize
\textbf{1 GB gradient, 8 nodes $\times$ 8 GPUs (64 total)}
@@ -472,7 +643,13 @@ traffic drops by 8x (GPUs per node). This is why NCCL defaults to hierarchical.}

% --- ACTIVE LEARNING: Micro-Retrieval Cue ---
\begin{frame}{Quick Check}
\note{[1 min] Answer: confines expensive IB traffic to 1/G of the data.}
\note{
% -- NARRATE: What to SAY while showing this slide
Read the question. Give 15 seconds of silence. Then cold-call. Expected answer: hierarchical AllReduce confines expensive IB traffic to 1/G of the data by reducing locally first within each NVLink domain.

% -- FLEX: [CORE] Quick retrieval cue --- takes only 1 minute.
IF SHORT: Ask the question aloud and answer it yourself in 20 seconds.
}

\centering
\vspace{1.0cm}
@@ -490,10 +667,22 @@ traffic drops by 8x (GPUs per node). This is why NCCL defaults to hierarchical.}

% --- ACTIVE LEARNING 2: Discussion ---
\begin{frame}{Discussion: AllReduce vs.\ AllToAll Scaling}
\note{[3 min] Turn-and-talk. AllReduce scales gracefully (O(1) per-node BW).
AllToAll creates O(N^2) connections --- network contention is the wall.
This is why MoE hits limits earlier than dense LLMs.
Cold-call 2--3 pairs.}
\note{
% -- LINK: What prior concept connects to this slide
Students learned that AllReduce is bandwidth-bound and AllToAll creates O(N^2) connections. This discussion forces them to reason about which scaling limit is hit first.

% -- NARRATE: What to SAY while showing this slide
Read the prompt. Set the timer for 90 seconds. Walk around the room listening to pairs. After time, cold-call 2--3 pairs.

% -- ENGAGE: Specific question, prediction, or task for THIS slide
The MoE model hits the communication wall first because AllToAll creates O(N^2) logical connections, causing network contention to grow quadratically. AllReduce maintains O(1) per-node bandwidth regardless of cluster size. At 512 GPUs, AllToAll contention becomes the dominant bottleneck while AllReduce remains manageable.

% -- WARN: What students will get wrong on THIS topic
Some students will say ``both are equally hard'' because they have the same number of GPUs. Redirect: it is not the hardware that differs --- it is the communication pattern. AllReduce is a structured reduction; AllToAll is a full permutation.

% -- FLEX: [CORE] This slide is essential --- builds intuition for MoE scaling limits.
IF SHORT: Do a show-of-hands poll (AllReduce vs AllToAll) instead of pair discussion.
}

\centering
\vspace{0.8cm}
@@ -514,10 +703,22 @@ and a MoE model using AllToAll on the same 512-GPU cluster.\\[0.3cm]
% =============================================================================

\begin{frame}{Gradient Compression Techniques}
\note{[3 min] When even the fastest wires aren't enough, send fewer bits.
Walk through the compression spectrum: FP16 (2x), INT8 (4x), Top-K (100x),
1-bit (32x). Key: always use Error Feedback beyond FP16.
Ask: ``What happens if you discard 99\% of gradients without error feedback?''}
\note{
% -- LINK: What prior concept connects to this slide
Hierarchical AllReduce minimizes wasted bandwidth. But when even the fastest wires are not enough, the next strategy is to send fewer bits per gradient element.

% -- NARRATE: What to SAY while showing this slide
Walk through the diagram left-to-right: ``FP16 gives 2x compression with almost no quality loss --- this is the baseline. INT8 gives 4x. Top-K sparsification sends only the largest 1\% of gradients for 100x compression. 1-bit quantization sends only the sign of each gradient for 32x compression.'' Then point to the rule: ``Beyond FP16, always use Error Feedback. Without it, small gradients below the threshold are permanently lost, causing 1--3\% accuracy degradation.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``What happens if you discard 99\% of gradients without error feedback?'' Expected answer: small but persistent gradients are permanently lost, causing the model to converge to a worse optimum (1--3\% accuracy loss).

% -- WARN: What students will get wrong on THIS topic
Students assume ``99\% compression with only 1\% accuracy loss'' is free. The accuracy loss compounds over training --- 1\% on a benchmark can mean significantly worse real-world performance. Error Feedback is the mechanism that makes aggressive compression safe.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Discuss how PowerSGD achieves better compression ratios than Top-K by projecting gradients into a low-rank subspace.
}
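An optional, illustrative tally of bytes on the wire per million FP32 gradient values under each scheme in the narration; the Top-K row counts kept values only (as the 100x figure appears to) and ignores index metadata, which roughly halves the effective ratio in practice.

# Approximate bytes sent per 1M FP32 gradient values (illustrative only).
n = 1_000_000
schemes = {
    "FP32 baseline": 4 * n,
    "FP16":          2 * n,               # 2x
    "INT8":          1 * n,               # 4x
    "Top-1% values": int(0.01 * n) * 4,   # 100x, kept values only
    "1-bit signs":   n // 8,              # 32x
}
for name, nbytes in schemes.items():
    print(f"{name:14s} {nbytes / 1e6:6.2f} MB  ({4 * n / nbytes:5.1f}x vs FP32)")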

% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -529,10 +730,23 @@ Ask: ``What happens if you discard 99\% of gradients without error feedback?''}
\end{frame}

\begin{frame}{Error Feedback: No Information Lost}
\note{[3 min] Walk through the error feedback mechanism step by step.
Without EF: small gradients below threshold are permanently lost.
With EF: residuals accumulate until they cross the threshold.
Sum(transmitted) + error = true gradient. This is the key mathematical guarantee.}
\note{
% -- LINK: What prior concept connects to this slide
The previous slide stated the Error Feedback rule. This slide proves why it works by walking through the mathematical guarantee step by step.

% -- NARRATE: What to SAY while showing this slide
Point to the equation: ``Error feedback stores the residual: what we wanted to send minus what we actually sent.'' Walk through the table row by row: ``Step 1: gradient 0.4, error 0, sum 0.4, below threshold, send 0, new error 0.4. Step 2: gradient 0.3 plus error 0.4 equals 0.7, above threshold, send 1, new error -0.3.'' Continue through all 5 steps. Then: ``After 5 steps, we sent 2 and have error -0.4. The true sum of all gradients is 1.6. And 2 plus -0.4 equals 1.6 --- nothing was lost, just delayed.''
ANALOGY: ``Error feedback is like a jar where you save your loose change. Each day you might not have enough for a coffee, but the jar accumulates until you do.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``Without error feedback, how much of the 1.6 total gradient would have been transmitted?'' Expected answer: none --- each individual gradient is below the send threshold on its own, so naive compression transmits nothing.

% -- WARN: What students will get wrong on THIS topic
Students think error feedback is approximate. It is mathematically exact: sum of transmitted values plus the final error always equals the true gradient sum. The information is delayed, not lost.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Skip rows 3--4 in the table walkthrough; show rows 1, 2, and 5 to demonstrate the pattern.
}
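A minimal sketch of the mechanism for instructors who want to show it live. Only the first two gradients (0.4 and 0.3) come from the note; the last three values are hypothetical, chosen so the totals match the narration (sum 1.6, two sends, final residual -0.4), and the round-to-nearest quantizer is an assumption inferred from the step-2 behaviour.

# Error feedback with a round-to-nearest-integer quantizer (assumed).
# grads[0:2] come from the note; grads[2:] are hypothetical filler values.
grads = [0.4, 0.3, 0.2, 0.4, 0.3]
error = 0.0
sent_total = 0.0
for step, g in enumerate(grads, 1):
    wanted = g + error          # what we would like to transmit
    sent = round(wanted)        # coarse quantization (0 or 1 here)
    error = wanted - sent       # residual carried to the next step
    sent_total += sent
    print(f"step {step}: wanted {wanted:+.1f}, sent {sent}, residual {error:+.1f}")
print(f"sent {sent_total:.0f} + residual {error:+.1f} = {sent_total + error:.1f} "
      f"(true sum {sum(grads):.1f}): nothing lost, only delayed")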

\small
\begin{columns}[T]
@@ -583,11 +797,23 @@ Sum(transmitted) + error = true gradient. This is the key mathematical guarantee
% =============================================================================

\begin{frame}{Communication-Computation Overlap}
\note{[3 min] The final optimization: hide communication behind computation.
Layer-by-layer overlap launches AllReduce for completed layers while
earlier layers still compute. Bucket fusion amortizes alpha overhead.
Walk through the pipelined timeline vs sequential.
Ask: ``When does overlap fail?''}
\note{
% -- LINK: What prior concept connects to this slide
Hierarchical AllReduce and gradient compression reduce communication time. Overlap is the final strategy: hide whatever communication remains behind computation.

% -- NARRATE: What to SAY while showing this slide
Point to the diagram: ``The sequential timeline shows backward pass completing fully, then AllReduce starting. The pipelined timeline interleaves them: as layer N finishes its backward pass, its AllReduce launches immediately while layer N-1 continues computing.'' Point to the bucket fusion detail: ``Bucket fusion groups small per-layer AllReduces into larger chunks to amortize the alpha overhead --- typically 25--100 MB buckets.''
ANALOGY: ``Overlap is like washing dishes while the next pot boils. You are doing two things in parallel using different resources.''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``When does overlap fail?'' Expected answer: when AllReduce per layer takes longer than the backward pass per layer --- the communication is ``exposed'' and cannot be hidden.

% -- WARN: What students will get wrong on THIS topic
Students assume overlap hides all communication. It only hides communication that fits within the computation window. If AllReduce per layer exceeds backward per layer, the excess is exposed and adds to total time.

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF AHEAD: Ask students to derive the condition for full overlap: T_bwd/layer must exceed T_AR/layer.
}
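An optional sketch of why bucket fusion helps; the per-launch overhead (alpha), bus bandwidth (beta), and byte count are illustrative assumptions, not measured NCCL numbers.

# Per-collective startup cost is paid once per launch, so fusing many small
# AllReduces into larger buckets amortizes it while the bandwidth term stays fixed.
alpha = 20e-6        # effective per-AllReduce launch cost, seconds (assumed)
beta = 50e9          # effective bus bandwidth, bytes/s (assumed)
total_bytes = 880e6  # gradients to reduce in this window (illustrative)

def allreduce_time(launches, total=total_bytes):
    per_launch = total / launches
    return launches * (alpha + per_launch / beta)

for launches in (1, 8, 64, 512):
    print(f"{launches:4d} launches: {allreduce_time(launches) * 1e3:6.2f} ms")
# The bandwidth term is identical in every row; only the alpha term grows.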

% --- Layout: FULL-WIDTH diagram ---
\centering
@@ -599,9 +825,22 @@ Ask: ``When does overlap fail?''}
\end{frame}

\begin{frame}{Overlap Budget: Worked Example}
\note{[2 min] 32-layer 7B model: without overlap = 1325 ms, with overlap = 360 ms.
73\% savings. The remaining exposed comm comes from AllReduce being slower than
per-layer backward. To eliminate: increase batch size or reduce AllReduce time.}
\note{
% -- LINK: What prior concept connects to this slide
The overlap concept was introduced qualitatively. This slide quantifies it for a 32-layer 7B model to show 73\% savings.

% -- NARRATE: What to SAY while showing this slide
Walk through the table: ``Backward per layer is 15 ms. AllReduce per layer is 26 ms for 880 MB of gradients in 100 MB buckets. Without overlap: 480 ms backward plus 832 ms AllReduce --- about 1.3 seconds end to end. With overlap: the backward pass starts, and each layer's AllReduce launches immediately. But 26 ms exceeds 15 ms, so 11 ms per layer is exposed. Total exposed: about 360 ms --- a 73\% savings.'' Then: ``The remaining exposed communication can be reduced by increasing batch size (longer backward) or using faster networking (shorter AllReduce).''

% -- ENGAGE: Specific question, prediction, or task for THIS slide
Ask: ``What two knobs reduce the 11 ms exposed gap per layer?'' Expected answer: increase batch size (makes backward longer) or reduce AllReduce time (faster network, compression, hierarchical).

% -- WARN: What students will get wrong on THIS topic
Students read ``73\% savings'' and think the problem is solved. The remaining 360 ms is still a significant cost at scale. The full optimization stack (hierarchical + compression + overlap) reduces effective overhead to 5--15\%, not zero.

% -- FLEX: [OPTIONAL] Can be compressed if running behind schedule.
IF SHORT: State the 73\% savings result and the condition T_bwd/layer > T_AR/layer; skip the detailed numbers.
}
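A minimal sketch of the overlap budget using the per-layer numbers from the narration; the model is deliberately simple (it ignores the pipeline fill and bucket boundaries), so treat the outputs as approximations of the slide's table rather than a reproduction of it.

# Overlap budget: 32 layers, 15 ms backward and 26 ms AllReduce per layer.
layers, t_bwd, t_ar = 32, 15e-3, 26e-3

sequential = layers * (t_bwd + t_ar)              # no overlap at all
exposed = layers * max(0.0, t_ar - t_bwd)         # what overlap cannot hide
overlapped = layers * t_bwd + exposed             # compute plus leftover comm

print(f"sequential ~{sequential*1e3:.0f} ms, overlapped ~{overlapped*1e3:.0f} ms, "
      f"exposed comm ~{exposed*1e3:.0f} ms")
# Exposed comm drops to zero only when t_bwd >= t_ar for every layer.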

\footnotesize
\textbf{32-layer transformer, 7B parameters, 64 GPUs}
@@ -630,7 +869,16 @@ per-layer backward. To eliminate: increase batch size or reduce AllReduce time.}
% =============================================================================

\begin{frame}{Fallacies}
\note{[2 min] Four common misconceptions with quantitative evidence.}
\note{
% -- LINK: What prior concept connects to this slide
Students have seen the full optimization stack. These fallacies test whether they internalized the key distinctions: latency vs bandwidth, Ring vs Tree, sync vs async, flat vs hierarchical.

% -- NARRATE: What to SAY while showing this slide
Read each fallacy and its rebuttal. For the first: ``Bandwidth is not the only metric --- for 4 KB MoE tokens, latency dominates and 400G networking gives zero benefit.'' For the second: ``Ring pays O(N) latency --- for small messages across 64 GPUs, Tree wins.'' For the third: ``The LogP overhead o is non-overlappable --- if GPU compute is less than o, the GPU still stalls.'' For the fourth: ``Hierarchical achieves 5--6x speedup by cutting inter-node traffic 8x.''

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Cover fallacies 1 and 4 only; skip 2 and 3.
}

\footnotesize
\textbf{Fallacy:} \textit{Bandwidth is the only metric that matters.}\\
@@ -651,7 +899,16 @@ Hierarchical achieves 5--6$\times$ speedup on 8-node clusters by cutting inter-n
\end{frame}

\begin{frame}{Pitfalls}
\note{[2 min] Three operational pitfalls.}
\note{
% -- LINK: What prior concept connects to this slide
Fallacies addressed conceptual errors. Pitfalls address operational mistakes that teams make when deploying collective communication in production.

% -- NARRATE: What to SAY while showing this slide
Read each pitfall. For the first: ``MoE and DLRM need AllToAll, which creates O(N^2) connections and hits contention at smaller cluster sizes than AllReduce.'' For the second: ``Without error feedback, Top-K permanently discards small gradients, causing 1--3\% accuracy loss.'' For the third: ``nccl-tests reports theoretical peak bandwidth, but real training sees 50--60\% if ranks are topology-misaligned --- for example, a tensor-parallel group spanning two nodes instead of staying within NVLink.'' For the fourth: ``At 10K nodes, a 1-in-10^15 bit-flip rate means multiple corruptions per day. These appear as unexplained NaN gradients.''

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Cover pitfalls 1 and 2 only.
}

\footnotesize
\textbf{Pitfall:} \textit{Assuming AllReduce works for everything.}\\
@@ -675,9 +932,13 @@ At 10K nodes, ``rare'' bit flips ($1$ in $10^{15}$) happen multiple times per da

% --- MUDDIEST POINT ---
\begin{frame}{Muddiest Point}
\note{[2 min] Quick anonymous poll. Students write on a slip of paper or submit
digitally. Collect and scan for patterns. Address the top 2--3 confusions in the
next lecture's opening. This closes the feedback loop.}
\note{
% -- NARRATE: What to SAY while showing this slide
Say: ``Before we close, write down the one concept from today that you found most confusing. This is anonymous --- one sentence, submit before you leave. I will address the top two or three confusions at the start of next lecture.''

% -- FLEX: [CORE] Always include --- closes the feedback loop.
IF SHORT: Reduce to 30 seconds. Students can submit digitally after class.
}

\centering
\vspace{1.0cm}
@@ -692,8 +953,13 @@ next lecture's opening. This closes the feedback loop.}
\end{frame}

\begin{frame}{What Were the Key Ideas?}
\note{[2 min] Retrieval practice. Students write 90 seconds, no notes.
Do NOT show next slide yet. Walk around the room.}
\note{
% -- NARRATE: What to SAY while showing this slide
Say: ``Close your notes. No screens. Write down the four most important concepts from today's lecture. You have 90 seconds.'' Walk around the room to observe. Do NOT show the next slide yet --- the retrieval effort is the learning event.

% -- FLEX: [CORE] Always include --- retrieval practice is the highest-impact learning activity.
IF SHORT: Reduce to 60 seconds but do not skip entirely.
}

\centering
\vspace{1.5cm}
@@ -708,8 +974,16 @@ Do NOT show next slide yet. Walk around the room.}
\end{frame}

\begin{frame}{Key Takeaways}
\note{[2 min] Reveal. Walk through each bullet. Emphasize quantitative anchors:
n* = 100 KB, 11s AllReduce, 5.7x hierarchical speedup, 73\% overlap savings.}
\note{
% -- LINK: What prior concept connects to this slide
Students just attempted retrieval. This slide reveals the answers so they can compare and fill gaps.

% -- NARRATE: What to SAY while showing this slide
Walk through each bullet, pausing on the quantitative anchors: ``n-star equals 100 KB --- that separates latency-bound from bandwidth-bound. 11 seconds of AllReduce for a 70B model. 5.7x speedup from hierarchical AllReduce. 73\% overlap savings from layer pipelining. The full stack reduces overhead from 50--80\% to 5--15\%.''

% -- FLEX: [CORE] This slide is essential --- do not skip.
IF SHORT: Read only bullets 1, 3, and 7.
}

\scriptsize
\begin{itemize}\setlength\itemsep{0pt}
@@ -725,7 +999,12 @@ n* = 100 KB, 11s AllReduce, 5.7x hierarchical speedup, 73\% overlap savings.}
\end{frame}

\begin{frame}{References}
\note{[1 min] Point students to canonical papers.}
\note{
% -- NARRATE: What to SAY while showing this slide
Point students to the Patarasuk paper for Ring AllReduce theory and the Gibiansky blog post for an intuitive explanation. Sergeev for Horovod. Tang for 1-bit Adam. Stich for error feedback theory. Rajbhandari for ZeRO.

% -- FLEX: [OPTIONAL] Can be skipped in lecture; students read on their own.
}

\small
\mlsysref{Patarasuk+09}{Patarasuk \& Yuan. ``Bandwidth Optimal All-Reduce Algorithms.'' 2009.}
@@ -738,9 +1017,15 @@ n* = 100 KB, 11s AllReduce, 5.7x hierarchical speedup, 73\% overlap savings.}
\end{frame}

\begin{frame}{Next Lecture: Fault Tolerance}
\note{[1 min] Forward hook. The fleet has its traffic patterns, but the roads
are crumbling. GPUs overheat, networks drop packets, nodes fail mid-training.
How do we maintain the illusion of a perfect supercomputer on imperfect hardware?}
\note{
% -- LINK: What prior concept connects to this slide
This chapter built the communication patterns for distributed training. The next chapter asks: what happens when the infrastructure breaks?

% -- NARRATE: What to SAY while showing this slide
Say: ``The fleet has its traffic patterns, but the roads are crumbling. GPUs overheat, networks drop packets, nodes fail mid-training. At 10,000 GPUs, failure is not exceptional --- it is the steady state. Next lecture: how do we maintain the illusion of a perfect supercomputer on imperfect hardware?''

% -- FLEX: [CORE] Always include --- forward hooks maintain narrative continuity.
}

\footnotesize
\begin{columns}[c]
@@ -779,7 +1064,10 @@ How do we maintain the illusion of a perfect supercomputer on imperfect hardware
\appendix

\begin{frame}{Backup: Extended Reference}
\note{Backup slide with additional reference material for this chapter.}
\note{
% -- NARRATE: Backup slide with additional reference material.
% -- FLEX: [OPTIONAL] Use only if students request deeper material.
}

\footnotesize
This slide provides extended reference material for students who want to go deeper.
@@ -794,7 +1082,10 @@ textbook's summary tables. Use them as a quick reference during problem sets.
\end{frame}

\begin{frame}{Backup: Further Reading}
\note{Backup slide. Point students to additional resources beyond the references slide.}
\note{
% -- NARRATE: Backup slide pointing to additional resources.
% -- FLEX: [OPTIONAL] Use only if students request further reading.
}

\footnotesize
\textbf{For deeper exploration:}

@@ -68,9 +68,29 @@
% LEARNING OBJECTIVES
% =============================================================================
\begin{frame}{Learning Objectives}
\note{[2 min] Walk through objectives. Emphasize that this chapter is about
the management layer of the fleet. Ask: ``How many of you have deployed
more than one model to production?''}
\note{
% -- LINK: Connect to prior chapters
Students built serving infrastructure in Part III. This chapter asks:
what happens when you manage not one model, but a hundred?

% -- NARRATE: What to SAY
Read each objective aloud, pausing on ``platform ROI'' and ``TCO framework.''
These two anchor the quantitative reasoning for the entire chapter.

% -- ENGAGE: Specific question
Ask: ``How many of you have deployed more than one model to production?
At what count did ad hoc practices start breaking?''
Give 10 seconds for a show of hands.

% -- WARN: Specific misconception
Students assume operations scale linearly with model count.
Correct: dependencies grow as O(N^2), alerts as O(N*M).

% -- FLEX: [CORE]
[CORE] Never skip --- objectives frame the entire lecture.
IF AHEAD: Ask students to rank which objective they most want to master.
IF SHORT: Read objectives without elaboration, move on.
}

\small
\begin{enumerate}
@@ -86,8 +106,16 @@ more than one model to production?''}
\end{frame}

\begin{frame}{Visual Language}
\note{[1 min] Explain the semantic color system used throughout the course.
These colors are consistent across all diagrams and slides.}
\note{
% -- NARRATE: What to SAY
Point to each card: ``Blue = compute, green = data, orange = routing,
red = error. These are consistent across every diagram in this course.
When you see red in a pipeline diagram, something is bottlenecked.''

% -- FLEX: [OPTIONAL]
[OPTIONAL] Skip if students have seen this in a previous chapter deck.
IF SHORT: Say ``same color system as last lecture'' and advance.
}

\small
Throughout this course, colors carry meaning:
@@ -126,9 +154,30 @@ Throughout this course, colors carry meaning:
% =============================================================================

\begin{frame}{The N-Models Problem}
\note{[3 min] Core insight: managing 100 models is not 100$\times$ the work.
Dependencies grow quadratically. Ask: ``At your organization, how many
models share the same data sources?'' Common error: assuming linear scaling.}
\note{
% -- LINK: Connect to prior concept
Students just saw the learning objectives listing platform ROI and dependency
management. This slide makes the problem visceral: why platforms exist.

% -- NARRATE: What to SAY
Point to the diagram: ``At 10 models, a few shared data sources. At 50,
the dependency graph is a hairball. At 100, a single upstream change
cascades unpredictably.'' Trace the quadratic growth curve.

% -- ENGAGE: Specific question
Ask: ``At your organization, how many models share the same data sources?''
Expected answer: most students underestimate --- typical is 5--10 shared sources.

% -- WARN: Specific misconception
Students assume managing 100 models is 100x the work of managing one.
Correct: dependencies grow as O(N^2), so 100 models is closer to 10,000x
the coordination complexity.

% -- FLEX: [CORE]
[CORE] This motivates the entire chapter.
IF AHEAD: Ask ``At what N does your dependency graph become unmanageable?''
IF SHORT: Show diagram, state O(N^2), move on.
}
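A tiny sketch an instructor could show to make the quadratic claim concrete; it simply counts potential model-to-model pairs and is illustrative only.

# If any pair of models can share a data source or feature, potential
# dependencies grow quadratically with the model count.
for n in (10, 50, 100):
    pairs = n * (n - 1) // 2
    print(f"{n:3d} models -> up to {pairs:,} pairwise dependencies")
# 10 -> 45, 50 -> 1,225, 100 -> 4,950: the graph stops fitting in one head.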

% --- Full-width diagram ---
\centering
@@ -140,9 +189,30 @@ models share the same data sources?'' Common error: assuming linear scaling.}
\end{frame}

\begin{frame}{Operational Complexity Growth}
\note{[2 min] Walk through the table. Emphasize that monitoring becomes
unmanageable and debugging requires distributed tracing. If short on time,
focus on the deployment coordination column.}
\note{
% -- LINK: Connect to prior concept
The N-models diagram showed the complexity curve. This table puts
concrete operational labels on each step of that curve.

% -- NARRATE: What to SAY
Walk column by column: ``At 1 model, monitoring is a single dashboard.
At 100, you have 100 dashboards nobody reads. Debugging shifts from
local to distributed tracing --- a qualitatively different skill.''

% -- ENGAGE: Specific question
Ask: ``Which column transitions from manageable to unmanageable first?''
Expected answer: monitoring (it is the first to break because alert
volume grows as N times M metrics).

% -- WARN: Specific misconception
Students focus on deployment coordination but miss that monitoring
breaks first. Alert fatigue precedes deployment chaos.

% -- FLEX: [OPTIONAL]
[OPTIONAL] This table reinforces the N-models diagram.
IF SHORT: Point to the monitoring row, state the key insight, advance.
IF AHEAD: Ask students to fill in a ``1000 models'' column mentally.
}

\footnotesize
\renewcommand{\arraystretch}{1.15}
@@ -165,9 +235,30 @@ focus on the deployment coordination column.}
\end{frame}

\begin{frame}{Quantifying Platform ROI}
\note{[3 min] Walk through the ROI equation. The key insight: platforms
exhibit a scaling threshold. At 20 models, a \$2M platform breaks even.
Ask: ``At what model count does your organization need a platform team?''}
\note{
% -- LINK: Connect to prior concept
The complexity table showed operations becoming unmanageable. This slide
answers: what is the economic case for investing in a platform?

% -- NARRATE: What to SAY
Point to the equation: ``N is model count, T-saved is hours saved per model,
C-eng is engineer cost. The numerator grows linearly with N; the
denominator is fixed.'' Walk through the worked example: ``50 models,
40 hours each per month, \$150/hr = \$3.6M a year before, \$1.6M after.
56\% savings.''

% -- ENGAGE: Specific question
Ask: ``At what model count does your organization need a platform team?''
Give 10 seconds. Expected: most say 50--100; the surprise is 20--50.

% -- WARN: Specific misconception
Students think platform ROI is linear. Correct: it is superlinear because
shared infrastructure amortizes over all models simultaneously.

% -- FLEX: [CORE]
[CORE] The ROI equation is the quantitative anchor for the chapter.
IF AHEAD: Ask ``What if T-saved is only 10 hours instead of 30?''
IF SHORT: Show equation, state the 56\% number, skip the worked example detail.
}
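A small sketch of the ROI arithmetic. Note the assumption made explicit here: the narration's figures only pencil out if T-saved is read as hours per model per month, annualized; the \$1.6M after-platform figure is taken from the note as given.

# Platform ROI arithmetic (assumed reading: T_saved is hours/model/month).
n_models = 50
hours_per_model_per_month = 40
rate = 150          # $/engineer-hour
before = n_models * hours_per_model_per_month * 12 * rate   # $3.6M/year
after = 1.6e6       # post-platform run cost quoted in the note
savings = 1 - after / before
print(f"before ${before/1e6:.1f}M/yr, after ${after/1e6:.1f}M/yr, "
      f"savings {savings:.0%}")   # ~56%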

\small
\begin{columns}[T]
@@ -196,7 +287,16 @@ Ask: ``At what model count does your organization need a platform team?''}

% --- ACTIVE LEARNING: Micro-Retrieval Cue ---
\begin{frame}{Quick Check}
\note{[1 min] Answer: 20-50 models.}
\note{
% -- NARRATE: What to SAY
Pause 15 seconds. Then cold-call. Answer: 20--50 models.
Say: ``Most students guess 100+. The surprise is the threshold is
much lower because shared infrastructure amortizes across all models.''

% -- FLEX: [OPTIONAL]
[OPTIONAL] Micro-retrieval cue reinforcing the ROI slide.
IF SHORT: Skip entirely --- the predict exercise covers this ground.
}

\centering
\vspace{1.0cm}
@@ -213,9 +313,30 @@ Ask: ``At what model count does your organization need a platform team?''}

\begin{frame}{MLOps Maturity Hierarchy}
\note{[2 min] Four levels from manual to enterprise. Most organizations
are at Level 1. The jump from Level 1 to Level 2 provides superlinear
returns. If short: just show the staircase and describe the transition.}
\note{
% -- LINK: Connect to prior concept
The ROI equation showed that platforms pay for themselves. This slide
asks: what does the maturity journey look like?

% -- NARRATE: What to SAY
Point to each level: ``L0 = scripts on laptops. L1 = per-model CI/CD,
the most common state. L2 = shared platform, where the superlinear
returns kick in. L3 = enterprise governance across the org.'' Emphasize
the L1-to-L2 transition: ``This is where most organizations stall.''

% -- ENGAGE: Specific question
Ask: ``Where is your organization on this staircase?''
Show of hands for each level. Most will cluster at L0--L1.

% -- WARN: Specific misconception
Students think the jump from L0 to L1 is the hard part. Correct:
L0-to-L1 is just adding CI/CD. The L1-to-L2 jump requires
organizational change --- shared infrastructure, common APIs, platform team.

% -- FLEX: [OPTIONAL]
[OPTIONAL] Reinforces the platform investment argument.
IF SHORT: Show staircase, name the four levels, emphasize L1-to-L2, advance.
}

% --- Full-width diagram ---
\centering