mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
style(prose): eliminate 'the fact that' where possible (book-prose)
- Replace with 'that' or rephrase clause as subject; fix one remaining But→However in nn_architectures
@@ -1477,7 +1477,7 @@ An `{python} cost_reduction_str`$\times$ cost reduction for ~`{python} acc_loss_
These gains are substantial, but semi-supervised learning is not universally applicable. The technique assumes that unlabeled data comes from the same distribution as labeled data, and it struggles when unlabeled data contains out-of-distribution samples (the model confidently mislabels them), when class imbalance is severe (pseudo-labels amplify majority class bias), or when the labeled set does not cover all classes (preventing label propagation for unseen classes). Always validate on a held-out set with true labels to catch distribution mismatch.
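The confidence-thresholded pseudo-labeling described above can be sketched in a few lines (an illustrative snippet, not the book's code; the threshold value is an assumption):

```python
import numpy as np

def pseudo_label(probs, threshold=0.95):
    """Keep only high-confidence predictions as pseudo-labels.

    The threshold limits, but does not eliminate, the failure modes above:
    an out-of-distribution sample can still be confidently mislabeled.
    """
    confidence = probs.max(axis=1)
    keep = confidence >= threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]

probs = np.array([
    [0.98, 0.02],  # confident -> pseudo-labeled as class 0
    [0.60, 0.40],  # uncertain -> discarded
    [0.03, 0.97],  # confident -> pseudo-labeled as class 1
])
kept_idx, labels = pseudo_label(probs)
print(kept_idx, labels)  # [0 2] [0 1]
```

Validating on a held-out labeled set, as the text recommends, is what catches the case where the discarded middle row is actually the typical input.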
-Despite these limitations, semi-supervised learning reduces label requirements by 5–10$\times$ while maintaining accuracy. We have now progressively reduced labeling demands through a clear trajectory: coreset selection and deduplication prune low-value samples before training; curriculum learning optimizes the order of presentation during training; active learning queries only the most informative samples for human annotation; and semi-supervised learning exploits unlabeled data to stretch those annotations further. Each technique has pushed the label requirement lower, but none has eliminated it. This raises a deeper question: do we need *any* task-specific labels at all? What if the structure of data itself---the fact that cat images resemble other cat images, that coherent sentences follow grammatical patterns---could provide the supervision signal?
+Despite these limitations, semi-supervised learning reduces label requirements by 5–10$\times$ while maintaining accuracy. We have now progressively reduced labeling demands through a clear trajectory: coreset selection and deduplication prune low-value samples before training; curriculum learning optimizes the order of presentation during training; active learning queries only the most informative samples for human annotation; and semi-supervised learning exploits unlabeled data to stretch those annotations further. Each technique has pushed the label requirement lower, but none has eliminated it. This raises a deeper question: do we need *any* task-specific labels at all? What if the structure of data itself---that cat images resemble other cat images and coherent sentences follow grammatical patterns---could provide the supervision signal?
## Self-Supervised Learning {#sec-data-selection-selfsupervised-learning-7518}
@@ -1458,7 +1458,7 @@ CNNs succeed because they match the structure of image data. Verify you understa
- [ ] Can you calculate why a conv layer is typically **Compute-Bound** (high arithmetic intensity) compared to other layers?
:::
-CNNs naturally implement hierarchical representation learning through their layered structure. Early layers detect low-level features\index{Feature Extraction!hierarchical} like edges and textures with small receptive fields, while deeper layers combine these into increasingly complex patterns with larger receptive fields. This hierarchical organization enables CNNs to build compositional representations\index{Compositional Representation}: complex objects are represented as compositions of simpler parts. The mathematical foundation for this emerges from the fact that stacking convolutional layers creates a tree-like dependency structure, where each deep neuron depends on an exponentially large set of input pixels, enabling efficient representation of hierarchical patterns.
+CNNs naturally implement hierarchical representation learning through their layered structure. Early layers detect low-level features\index{Feature Extraction!hierarchical} like edges and textures with small receptive fields, while deeper layers combine these into increasingly complex patterns with larger receptive fields. This hierarchical organization enables CNNs to build compositional representations\index{Compositional Representation}: complex objects are represented as compositions of simpler parts. The mathematical foundation for this emerges from stacking convolutional layers, which creates a tree-like dependency structure, where each deep neuron depends on an exponentially large set of input pixels, enabling efficient representation of hierarchical patterns.
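The growth of a deep neuron's input dependency can be made concrete with a small receptive-field calculation (an illustrative helper, not from the book):

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers, listed input-to-output.

    layers: list of (kernel_size, stride) pairs.
    Each layer widens the field by (k - 1) times the cumulative stride.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1)] * 5))  # 11: five 3x3 convs see an 11x11 input patch
print(receptive_field([(3, 2)] * 4))  # 31: striding grows the field much faster
```

With strides greater than 1 the cumulative stride compounds, which is the exponential dependency growth the paragraph refers to.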
The parameter sharing introduced earlier dramatically reduces complexity compared to MLPs. This sharing embodies the assumption that useful features can appear anywhere in an image, making the same feature detector valuable across all spatial positions.
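A back-of-the-envelope comparison shows the scale of the reduction (the layer sizes here are assumptions chosen for illustration):

```python
# One 3x3 conv layer: 3 input channels, 64 output channels, 224x224 image.
H = W = 224
c_in, c_out, k = 3, 64, 3

# Conv: one shared kernel per output channel, reused at every spatial position.
conv_params = c_out * (c_in * k * k + 1)  # 1,792 (weights + biases)

# Fully connected layer producing the same output volume: one weight per
# input-output pair, so nothing is shared.
dense_params = (H * W * c_in) * (H * W * c_out) + H * W * c_out  # ~4.8e11

print(conv_params, dense_params)
```

Roughly eight orders of magnitude separate the two, which is why the translation-invariance assumption is so valuable.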
@@ -1736,7 +1736,7 @@ From ResNet-50's compute-heavy standard convolutions through MobileNet's efficie
## RNNs: Sequential Pattern Processing {#sec-network-architectures-rnns-sequential-pattern-processing-f804}
-Convolutional networks exploit spatial structure—the fact that nearby pixels are more related than distant ones. But many real-world signals have *temporal* structure instead: words in a sentence, samples in an audio stream, sensor readings over time. Processing sequences requires architectures that maintain state across time steps.
+Convolutional networks exploit spatial structure—nearby pixels are more related than distant ones. However, many real-world signals have *temporal* structure instead: words in a sentence, samples in an audio stream, sensor readings over time. Processing sequences requires architectures that maintain state across time steps.
This limitation manifests concretely in domains such as natural language processing, where word meaning depends on sentential context, and time-series analysis, where future values depend on historical patterns. Sequential data presents a challenge distinct from spatial processing: patterns can span arbitrary temporal distances, rendering fixed-size kernels ineffective. While spatial convolution exploits the principle that nearby pixels are typically related, temporal relationships operate differently—important connections may span hundreds or thousands of time steps with no correlation to proximity. Traditional feedforward architectures, including CNNs, process each input independently and cannot maintain the temporal context necessary for these long-range dependencies.
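The state-maintenance idea fits in a few lines: a recurrent cell applies the same weights at every time step and carries a hidden vector forward (a minimal NumPy sketch; the dimensions and initialization are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W_xh = rng.normal(0, 0.1, (d_h, d_in))  # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (d_h, d_h))   # hidden-to-hidden weights (the recurrence)
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    # h_t depends on x_t and, through h_prev, on every earlier input.
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(d_h)
for x_t in rng.normal(size=(20, d_in)):  # a 20-step sequence
    h = rnn_step(h, x_t)
print(h.shape)  # (16,) -- a fixed-size summary of the whole sequence
```

The same two weight matrices process a sequence of any length, which is exactly what a fixed-size convolution kernel cannot do across arbitrary temporal distances.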
@@ -1900,7 +1900,7 @@ Cooling system failures have more severe consequences in ML clusters than in tra
This rapid thermal runaway drives several design decisions. Coolant loops are designed with N+1 redundancy: each CDU has a backup pump, and the piping manifold includes bypass valves that can reroute coolant around a failed CDU. Temperature sensors at each cold plate trigger immediate alerts when the coolant outlet temperature exceeds a threshold (typically 65 degrees Celsius), and the GPU firmware will throttle power within milliseconds if the junction temperature approaches the 83-degree limit.
-Some facilities also maintain an emergency air cooling capability as a last-resort backup. Even though air cooling cannot sustain full-power operation at ML rack densities, it can keep the hardware below damage thresholds (at reduced clock speeds) long enough for operators to repair the liquid cooling system. This defense-in-depth approach to cooling reliability reflects the fact that a cooling failure in a 10,000-GPU cluster can simultaneously affect hundreds of GPUs, making the potential financial impact of a cooling outage far greater than the cost of the redundancy.
+Some facilities also maintain an emergency air cooling capability as a last-resort backup. Even though air cooling cannot sustain full-power operation at ML rack densities, it can keep the hardware below damage thresholds (at reduced clock speeds) long enough for operators to repair the liquid cooling system. This defense-in-depth approach to cooling reliability reflects that a cooling failure in a 10,000-GPU cluster can simultaneously affect hundreds of GPUs, making the potential financial impact of a cooling outage far greater than the cost of the redundancy.
The failure modes of liquid cooling systems are qualitatively different from those of air cooling. Air cooling fails gracefully: a fan failure reduces airflow, causing temperatures to rise slowly over minutes, providing ample time for automated load shedding. Liquid cooling can fail catastrophically: a coolant leak can simultaneously damage hardware (if the coolant is conductive) and remove cooling capacity (if the leak drains the loop). Quick-disconnect fittings, which allow hot-swapping of server nodes without draining the entire coolant loop, are a critical design feature that reduces maintenance downtime from hours to minutes. However, these fittings are also the most common point of failure in the coolant loop, as the O-ring seals degrade over thousands of connect/disconnect cycles. Facilities that perform frequent hardware swaps (common in research environments where nodes are regularly reconfigured) must budget for quarterly O-ring replacement and maintain a stock of spare fittings.
@@ -2447,7 +2447,7 @@ fig = plt.gcf()
\end{tikzpicture}
```
-**Gradient Compression Techniques**. **(a) Standard** updates transmit full-precision values. **(b) Quantization** maps values to low-precision buckets (e.g., FP32 to INT8), reducing bandwidth. **(c) Sparsification** transmits only the most significant (Top-$k$) gradients, exploiting the fact that many updates are near-zero.
+**Gradient Compression Techniques**. **(a) Standard** updates transmit full-precision values. **(b) Quantization** maps values to low-precision buckets (e.g., FP32 to INT8), reducing bandwidth. **(c) Sparsification** transmits only the most significant (Top-$k$) gradients, exploiting that many updates are near-zero.
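The Top-$k$ sparsification in panel (c) amounts to a few lines (an illustrative sketch; production systems typically also accumulate the dropped residuals locally so no gradient signal is permanently lost):

```python
import numpy as np

def topk_sparsify(grad, k):
    """Select the k largest-magnitude gradient entries.

    Transmitting (indices, values) instead of the dense vector cuts
    bandwidth when most updates are near zero.
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

grad = np.array([0.01, -2.0, 0.003, 0.5, -0.02, 1.2])
idx, vals = topk_sparsify(grad, k=2)
print(sorted(idx.tolist()))  # [1, 5]: only -2.0 and 1.2 are transmitted
```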
:::
### Federated Personalization {#sec-edge-intelligence-federated-personalization-3c73}
@@ -1066,7 +1066,7 @@ To maximize spot instance utility while minimizing disruption, sophisticated sch
**Availability zone diversification** reduces the probability of simultaneous fleet-wide reclamation. Cloud providers typically manage spot pools independently per availability zone (AZ). If a training job requires 1,024 GPUs, allocating them as a single block in one zone exposes the job to a complete stop if that specific zone's spot pool is reclaimed. Spreading the job across multiple AZs ensures that a reclamation event in one zone only affects a fraction of the fleet. Training our 175B model on 1,024 spot GPUs across 3 availability zones, where each AZ has a 5 percent hourly interruption probability, the probability of losing more than 128 GPUs (one full data-parallel group) in a single hour is less than 0.1 percent, making elastic recovery sufficient for most interruption events.
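The sub-0.1-percent figure follows from a binomial tail, treating each GPU's interruption as independent at 5 percent per hour (a quick numerical check, summed in log-space for stability; the independence assumption is ours):

```python
import math

def binom_tail(n, p, k):
    """P(X > k) for X ~ Binomial(n, p)."""
    total = 0.0
    for i in range(k + 1, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return total

# 1,024 spot GPUs at a 5% hourly interruption rate: the chance of losing
# more than 128 (one full data-parallel group) in an hour is vanishingly small.
p_lose_group = binom_tail(n=1024, p=0.05, k=128)
print(f"{p_lose_group:.1e}")  # far below the 0.1% threshold
```

The expected loss is about 51 GPUs per hour; 128 sits many standard deviations above that, which is why elastic recovery suffices.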
-**Instance type diversification** exploits the fact that different GPU SKUs have independent spot markets. A job that strictly requests one instance type competes in a single, crowded market. A job that accepts multiple equivalent instance types dramatically increases its scheduling probability. The scheduler should maintain a priority-ordered list of acceptable instance types, falling back to larger-memory instances when the primary type is unavailable. While this mix requires careful peer-to-peer bandwidth management, it effectively uses the "upgrade" to maintain progress at a blended cost still far below on-demand rates.
+**Instance type diversification** exploits that different GPU SKUs have independent spot markets. A job that strictly requests one instance type competes in a single, crowded market. A job that accepts multiple equivalent instance types dramatically increases its scheduling probability. The scheduler should maintain a priority-ordered list of acceptable instance types, falling back to larger-memory instances when the primary type is unavailable. While this mix requires careful peer-to-peer bandwidth management, it effectively uses the "upgrade" to maintain progress at a blended cost still far below on-demand rates.
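The priority-ordered fallback list reduces to a first-fit walk (hypothetical SKU names; a real scheduler would also weigh price and interconnect topology):

```python
def pick_instance(preferences, spot_capacity):
    """Return the first acceptable instance type with spot capacity."""
    for itype in preferences:
        if spot_capacity.get(itype, 0) > 0:
            return itype
    return None  # no spot capacity anywhere: fall back to on-demand

prefs = ["gpu-8x-40gb", "gpu-8x-80gb", "gpu-8x-hbm"]  # hypothetical SKUs
print(pick_instance(prefs, {"gpu-8x-40gb": 0, "gpu-8x-80gb": 12}))  # gpu-8x-80gb
```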
**Spot interruption prediction** leverages the data that cloud providers expose about reclamation likelihood, such as the AWS Spot Placement Score or GCP preemptibility data. Advanced schedulers ingest this feed to estimate the expected interruptions per day for a given instance type and region. If the expected interruption rate rises above a threshold where the overhead of restarts exceeds the cost savings (typically more than 4 interruptions per day for large models), the scheduler should automatically migrate the job to on-demand capacity or a different region, preventing thrashing where a job spends more time recovering than training.