mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-09 07:15:51 -05:00
style: Vol2 register pass follow-up — fix missed violations in distributed_training and sustainable_ai
Post-commit verification found 6 additional violations not caught by the initial audit agents:

distributed_training (4 fixes):
- line 108: second person "If you could purchase a single GPU" → impersonal
- line 280: rhetorical Q "How exactly do 1,024 GPUs...agree" → declarative
- line 784: second person "Your AllReduce...Where do you look?" in .callout-perspective → impersonal problem statement
- line 1347: rhetorical Q "where did the missing 25%...go?" → declarative

sustainable_ai (2 fixes):
- line 2047: embedded rhetorical Q "where does the dominant share of energy go?" → declarative
- line 2414: closing rhetorical Q "what happens to these clusters...?" → declarative noun phrase
@@ -105,7 +105,7 @@ from mlsys.formatting import fmt, sci, check
 
 Part I built the physical fleet: @sec-compute-infrastructure established the accelerator hierarchy, @sec-network-fabrics wired nodes into a high-bandwidth fabric, and @sec-data-storage completed the infrastructure with storage pipelines that keep the fleet fed. With the physical foundation in place, the algorithmic challenge that defines Part II is splitting a single training job across this hardware.
 
-If you could purchase a single GPU with 100 terabytes of memory and an exaflop of compute, distributed training would not exist. Because the laws of physics prevent this, we are forced to shatter our models across thousands of independent chips. In the **Fleet Stack** framework (@sec-vol2-introduction), Distributed Training represents the **Distribution Layer** — the logic that partitions the mathematical workload across the physical fleet.
+A single GPU with 100 terabytes of memory and an exaflop of compute would make distributed training unnecessary. Because the laws of physics prevent this, training must be shattered across thousands of independent chips. In the **Fleet Stack** framework (@sec-vol2-introduction), Distributed Training represents the **Distribution Layer** — the logic that partitions the mathematical workload across the physical fleet.
 
 ### The Physics of the Cluster
 
@@ -277,7 +277,7 @@ The decision tree reveals that parallelism strategy selection is not a preferenc
 
 ## The Distributed Training Step {#sec-distributed-training-systems-systems-distributed-training-fundamentals-97da}
 
-How exactly do 1,024 GPUs, operating completely independently, agree on a single, mathematically rigorous set of updated weights at the end of a training iteration? The single-machine optimization techniques discussed in the previous section only delay the inevitable; eventually, the computation must span multiple devices.
+The central challenge of distributed training is ensuring that 1,024 GPUs, operating completely independently, agree on a single, mathematically rigorous set of updated weights at the end of each training iteration. The single-machine optimization techniques discussed in the previous section only delay the inevitable; eventually, the computation must span multiple devices.
 
 ::: {.callout-definition title="Distributed Training"}
 
@@ -781,9 +781,7 @@ When synchronization performance deviates from theoretical expectations, the Fle
 
 ::: {.callout-perspective title="Debugging Slow Gradient Synchronization"}
 
-**Problem Statement**: Your AllReduce operation takes 100 ms when you expected 50 ms based on bandwidth calculations. Where do you look?
-
-The Fleet Stack framework provides a systematic debugging methodology by examining each layer:
+**Problem Statement**: An AllReduce operation takes 100 ms when the bandwidth calculation predicts 50 ms. The Fleet Stack framework provides a systematic debugging methodology by examining each layer:
 
 **Infrastructure Layer**:
 
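The 50 ms expectation in the problem statement above can be reproduced with the standard ring-AllReduce cost model. A minimal sketch, with all concrete numbers hypothetical (the 2*(N-1)/N traffic factor is the textbook ring bound; the 1.25 GB payload and 50 GB/s link are illustrative, not figures from this commit):

```python
# Ring AllReduce lower bound: each GPU sends and receives
# 2 * (N - 1) / N times the payload size over its bus link.

def ring_allreduce_time_ms(payload_gb: float, bandwidth_gb_s: float, n_gpus: int) -> float:
    """Ideal AllReduce latency (ms) from the ring cost model."""
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / bandwidth_gb_s * 1000.0

# Hypothetical gradient bucket: 1.25 GB over 50 GB/s links on 1,024 GPUs.
expected_ms = ring_allreduce_time_ms(1.25, 50.0, 1024)   # close to the 50 ms expectation

# A measured 100 ms would therefore imply roughly half the assumed bandwidth,
# pointing the debugging effort at the Infrastructure Layer first.
effective_bw_gb_s = 2 * (1024 - 1) / 1024 * 1.25 / 0.100
```

The same arithmetic run in reverse (observed time to effective bandwidth) is what makes the layer-by-layer debugging walk in the callout systematic rather than guesswork.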
@@ -1344,7 +1342,7 @@ In 2017, Facebook AI Research shattered the "batch size ceiling" by training Res
 
 ## Scaling Efficiency and Convergence {#sec-distributed-training-systems-systems-distributed-training-efficiency-metrics-9488}
 
-If doubling the number of GPUs in your cluster only makes your training run 1.5 times faster, where did the missing 25% of your multi-million dollar compute budget go? Data parallelism revealed the practical mechanics of gradient synchronization and memory sharding, but to understand *why* scaling efficiency degrades and *how* convergence changes with parallelism, we need a quantitative framework. The metrics and convergence theory in this section apply to all parallelism strategies — data, model, pipeline, and hybrid — governing the fundamental trade-offs between throughput, communication cost, and optimization quality.
+When doubling the number of GPUs yields only 1.5× speedup, the missing 25% of compute budget has been consumed by communication overhead and synchronization barriers. Data parallelism revealed the practical mechanics of gradient synchronization and memory sharding, but to understand *why* scaling efficiency degrades and *how* convergence changes with parallelism, a quantitative framework is needed. The metrics and convergence theory in this section apply to all parallelism strategies — data, model, pipeline, and hybrid — governing the fundamental trade-offs between throughput, communication cost, and optimization quality.
 
 Communication overhead represents the primary bottleneck in distributed training systems. AllReduce operations consume 10--40% of total training time in data parallel systems, and this overhead grows with cluster size. BERT-Large on 128 GPUs experiences communication overhead reaching 35% of total runtime, while GPT-3 scale models experience 55% overhead on 1,024 GPUs.
 
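The 1.5×/75% arithmetic in the changed line above is the standard scaling-efficiency ratio. A minimal sketch, with the function name and timing values purely illustrative (not taken from the book source):

```python
def scaling_efficiency(t_baseline_s: float, t_scaled_s: float, scale_factor: int) -> float:
    """Observed speedup divided by ideal linear speedup."""
    speedup = t_baseline_s / t_scaled_s
    return speedup / scale_factor

# Doubling the GPU count but finishing only 1.5x faster:
eff = scaling_efficiency(100.0, 100.0 / 1.5, 2)
# eff == 0.75: the remaining 25% of the budget went to communication
# overhead and synchronization barriers, as the declarative rewrite states.
```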
@@ -2044,7 +2044,7 @@ You are auditing the carbon footprint of a Machine Learning platform. Classify t
 4. Emissions from the end-user's smartphone battery while running your mobile inference app.
 :::
 
-Accurately classifying these hidden emissions forces engineering teams to take responsibility for the entire value chain of their deployments. The comprehensive accounting framework also reveals a critical operational question: where does the dominant share of energy go once a model moves from the training phase to global inference?
+Accurately classifying these hidden emissions forces engineering teams to take responsibility for the entire value chain of their deployments. The comprehensive accounting framework also reveals that the dominant share of energy shifts once a model moves from the training phase to global inference.
 
 ## Training vs Inference Energy Analysis {#sec-sustainable-ai-training-vs-inference-energy-analysis-4cb5}
 
@@ -2411,7 +2411,7 @@ AI hardware depends on a suite of scarce and geopolitically sensitive **critical
 
 The construction and operation of fabs and data centers also directly impacts natural ecosystems through habitat destruction, water stress from aquifer depletion, and pollution from chemical discharge. In Hsinchu, Taiwan, extensive water extraction by fabs has led to falling water tables and seawater intrusion, affecting both agriculture and aquatic biodiversity [@hsu2016accumulation]. Waste generation from fabrication---including gaseous emissions, VOC-laden air, and metal-contaminated wastewater---requires advanced treatment systems, and the end-of-life disposal of AI hardware contributes to a growing e-waste crisis, with only 17.4% of global e-waste properly recycled [@singh2022disentangling].
 
-The environmental toll of our computational demands extends far beyond atmospheric carbon, manifesting as severe water stress and ecological disruption around manufacturing hubs. This sobering reality brings us to the ultimate physical consequence of the AI arms race: what happens to these massive, resource-intensive hardware clusters when they become obsolete just three years later?
+The environmental toll of our computational demands extends far beyond atmospheric carbon, manifesting as severe water stress and ecological disruption around manufacturing hubs. This sobering reality converges on the ultimate physical consequence of the AI arms race: the disposition of massive, resource-intensive hardware clusters that become obsolete within three years.
 
 ## Hardware Lifecycle and E-Waste {#sec-sustainable-ai-hardware-lifecycle-environmental-assessment-66ee}
 