mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-03-11 17:49:25 -05:00
Fix Vol1 standalone PDF build errors across 4 chapters
- hw_acceleration: escape % in callout title 'The Five-Percent Utilization Mystery' (LaTeX treats % as comment char in div attribute titles, truncating the box)
- data_selection: escape % in callout title 'The Ninety-Nine Percent Sparsity Trap' (same \fbxSimple runaway argument error)
- model_compression: remove 28-line orphaned stale class body (merge artifact); add missing mat_dim=4096 to LowRankFactorization class parameters
- model_serving: move littles-law-calc code cell before the prose that references its exported variables (serving_qps_str etc. used before they were defined)
@@ -2750,7 +2750,7 @@ The mechanism relates to how models encode information. A model trained on repet
 
 Data selection and model compression are therefore *complementary*. The techniques in this chapter can reduce both training cost *and* post-training compression effort. When planning an efficiency pipeline, apply data selection first; the resulting model will be easier to compress.
 
-::: {.callout-war-story title="The 99% Sparsity Trap"}
+::: {.callout-war-story title="The Ninety-Nine Percent Sparsity Trap"}
 
 **The Context**: Researchers at Google Brain investigated the impact of pruning on model performance. They pruned a ResNet model to 90%+ sparsity, removing the vast majority of weights.
 
 **The Failure**: They found that while FLOPs decreased by 90%, the inference latency on standard hardware (GPUs/TPUs) often *increased*.
@@ -2514,7 +2514,7 @@ To reduce the need for constant data movement between registers and external mem
 
 For larger working datasets, many AI accelerators include scratchpad memory, which offers more storage than caches but with a key difference: it allows explicit software control over what data is stored and when it is evicted. Unlike caches, which rely on hardware-based eviction policies, scratchpad memory enables machine learning workloads to retain key values such as activations and filter weights for multiple layers of computation. This capability is useful in models like convolutional neural networks, where the same input feature maps and filter weights are reused across multiple operations. By keeping this data in scratchpad memory rather than reloading it from external memory, accelerators can significantly reduce unnecessary memory transfers and improve overall efficiency [@Chen2016].
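The traffic savings from pinning weights can be made concrete with a back-of-envelope sketch. Everything below is illustrative: the function, tile sizes, and weight sizes are hypothetical, not drawn from the chapter's notebooks.

```python
# Hypothetical sketch: count bytes moved from external memory for a
# convolution-like layer that applies the same filter weights to many tiles.
def external_traffic_bytes(num_tiles, tile_bytes, weight_bytes, scratchpad):
    input_traffic = num_tiles * tile_bytes        # inputs always stream in once
    if scratchpad:
        # Software pins the weights in scratchpad: loaded once, reused per tile.
        weight_traffic = weight_bytes
    else:
        # Worst case for a small cache: weights evicted and reloaded per tile.
        weight_traffic = num_tiles * weight_bytes
    return input_traffic + weight_traffic

pinned = external_traffic_bytes(64, 16_384, 1_048_576, scratchpad=True)
evicted = external_traffic_bytes(64, 16_384, 1_048_576, scratchpad=False)
print(pinned, evicted)  # 2097152 68157440
```

With these made-up sizes, explicit reuse cuts total external traffic by roughly 32x, which is the effect the paragraph above attributes to scratchpad control.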
 
-::: {.callout-war-story title="The 5% Utilization Mystery"}
+::: {.callout-war-story title="The Five-Percent Utilization Mystery"}
 
 **The Context**: Engineers at Tencent deployed a massive Transformer model on NVIDIA A100 GPUs, expecting a $10 \times$ speedup over their old V100s due to the new Tensor Cores.
 
 **The Failure**: The model ran only $1.2 \times$ faster. Profiling revealed the Tensor Cores were active 0% of the time. The team had implemented their custom accumulation kernel in FP32 (32-bit float) to maintain precision.
@@ -1386,22 +1386,6 @@ Serving engineers routinely face a concrete question: given a latency SLO\index{
 
 ### Little's Law {#sec-model-serving-littles-law-9352}
 
-Serving engineers need a tool that connects observable metrics to capacity requirements. The most celebrated result in queuing theory is Little's Law,\index{Little's Law!concurrency calculation}\index{Little's Law!L=λW formula}[^fn-littles-law] [^fn-littles-law-intuition] which @eq-littles-law expresses as a simple relationship between three quantities in any stable system: Concretely, a server targeting `{python} serving_qps_str` QPS with a `{python} serving_slo_str` SLO requires `{python} serving_concurrency_slots_str` concurrent request slots, which sets the hard memory floor for activation storage on that node.
-
-[^fn-littles-law]: **Little's Law**: Proven by John D.C. Little in 1961 [@little1961proof], this theorem establishes that $L = \lambda W$ holds for any stable queuing system regardless of arrival patterns, service distributions, or scheduling policies. The remarkable generality makes it one of the most useful results in operations research. For serving systems, it enables capacity planning from observable metrics: measuring queue depth and arrival rate directly yields average latency without instrumenting individual requests.
-
-[^fn-littles-law-intuition]: **Little's Law in the Coffee Shop**: Throughput ($\lambda$) is the rate of arriving customers; Latency ($W$) is the time to make one drink; Queue ($L$) is the number of people waiting. If the barista takes 1 minute per drink ($W=1$) and customers arrive every 30 seconds ($\lambda=2$), the queue ($L$) will grow indefinitely unless more baristas are added.
-
-$$L = \lambda \cdot W$$ {#eq-littles-law}
-
-where $L$ is the average number of requests in the system, $\lambda$ is the arrival rate (requests per second), and $W$ is the average time each request spends in the system.
-
-::: {.callout-perspective title="Notation Alert: L vs. Latency"}
-In queuing theory, $L$ traditionally denotes the *length* of the queue (number of items in the system), and $W$ denotes *wait time* (time in system per request). Elsewhere in this book, we use $L_{\text{lat}}$ for latency with descriptive subscripts ($L_{\text{lat,wait}}$, $L_{\text{lat,compute}}$) to denote latency components. To preserve standard queuing notation, we retain $L$ for queue length and $W$ for time in system in this section. In the batching analysis that follows (@sec-model-serving-dynamic-batching-latencythroughput-tradeoffs-986d), $L_{\text{lat,wait}}$ corresponds to the queueing wait component $W_q$, and $L_{\text{lat,compute}}$ includes inference time.
-:::
-
-This relationship holds regardless of arrival distribution, service time distribution, or scheduling policy. The following notebook quantifies this capacity relationship through a practical application of *Little's Law*.
-
 ```{python}
 #| label: littles-law-calc
 #| echo: false
@@ -1469,6 +1453,23 @@ littles_w_str = CapacityPlanning.littles_w_str
 littles_l_str = CapacityPlanning.littles_l_str
 ```
 
+Serving engineers need a tool that connects observable metrics to capacity requirements. The most celebrated result in queuing theory is Little's Law,\index{Little's Law!concurrency calculation}\index{Little's Law!L=λW formula}[^fn-littles-law] [^fn-littles-law-intuition] which @eq-littles-law expresses as a simple relationship between three quantities in any stable system. Concretely, a server targeting `{python} serving_qps_str` QPS with a `{python} serving_slo_str` SLO requires `{python} serving_concurrency_slots_str` concurrent request slots, which sets the hard memory floor for activation storage on that node.
+
+[^fn-littles-law]: **Little's Law**: Proven by John D.C. Little in 1961 [@little1961proof], this theorem establishes that $L = \lambda W$ holds for any stable queuing system regardless of arrival patterns, service distributions, or scheduling policies. The remarkable generality makes it one of the most useful results in operations research. For serving systems, it enables capacity planning from observable metrics: measuring queue depth and arrival rate directly yields average latency without instrumenting individual requests.
+
+[^fn-littles-law-intuition]: **Little's Law in the Coffee Shop**: Throughput ($\lambda$) is the rate of arriving customers; Latency ($W$) is the time to make one drink; Queue ($L$) is the number of people waiting. If the barista takes 1 minute per drink ($W=1$) and customers arrive every 30 seconds ($\lambda=2$), the queue ($L$) will grow indefinitely unless more baristas are added.
+
+$$L = \lambda \cdot W$$ {#eq-littles-law}
+
+where $L$ is the average number of requests in the system, $\lambda$ is the arrival rate (requests per second), and $W$ is the average time each request spends in the system.
+
+::: {.callout-perspective title="Notation Alert: L vs. Latency"}
+In queuing theory, $L$ traditionally denotes the *length* of the queue (number of items in the system), and $W$ denotes *wait time* (time in system per request). Elsewhere in this book, we use $L_{\text{lat}}$ for latency with descriptive subscripts ($L_{\text{lat,wait}}$, $L_{\text{lat,compute}}$) to denote latency components. To preserve standard queuing notation, we retain $L$ for queue length and $W$ for time in system in this section. In the batching analysis that follows (@sec-model-serving-dynamic-batching-latencythroughput-tradeoffs-986d), $L_{\text{lat,wait}}$ corresponds to the queueing wait component $W_q$, and $L_{\text{lat,compute}}$ includes inference time.
+:::
+
+This relationship holds regardless of arrival distribution, service time distribution, or scheduling policy. The following notebook quantifies this capacity relationship through a practical application of *Little's Law*.
+
 
 ::: {.callout-notebook #notebook-littles-law title="Little's Law"}
 
 **The Capacity Physics**: How much memory do you need to serve 1,000 queries per second?
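An aside on the question the notebook poses: the concurrency arithmetic follows directly from $L = \lambda W$. Below is a minimal standalone sketch with hypothetical numbers (1,000 QPS, a 250 ms SLO, and a made-up 64 MB per-request activation footprint); the notebook's real values flow through the exported `serving_qps_str` and friends instead.

```python
# Little's Law: L = lambda * W. Hypothetical numbers, not the notebook's exports.
arrival_rate_qps = 1000        # lambda: request arrivals per second
time_in_system_s = 0.250       # W: average seconds each request spends in the system

concurrency_slots = arrival_rate_qps * time_in_system_s   # L: requests in flight

# Each in-flight request pins activation memory, so the product sets the
# memory floor for the node (per-request footprint is a made-up figure).
activation_mb_per_request = 64
memory_floor_gb = concurrency_slots * activation_mb_per_request / 1024
print(concurrency_slots, memory_floor_gb)  # 250.0 15.625
```

The point of the exercise: halving the SLO halves the slots (and the memory floor) at the same QPS, which is why the prose calls it a *hard* floor tied to the latency target.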
@@ -2262,6 +2262,7 @@ class LowRankFactorization:
 """
 
 # ┌── 1. PARAMETERS (Inputs) ───────────────────────────────────────────────
+mat_dim = 4096
 rank_k = 128
 bytes_per_param = 4 # FP32
 
@@ -2292,34 +2293,6 @@ rank_k_str = LowRankFactorization.rank_k_str
 full_mb_str = LowRankFactorization.full_mb_str
 factored_mb_str = LowRankFactorization.factored_mb_str
 data_reduction_str = LowRankFactorization.data_reduction_str
-rank_k = 128
-bytes_fp32 = 4
-
-# ┌── 2. CALCULATION (The Physics) ─────────────────────────────────────────
-full_bytes = mat_dim * mat_dim * bytes_fp32
-full_mb = full_bytes / MIB_TO_BYTES
-factored_bytes = (mat_dim * rank_k + rank_k * mat_dim) * bytes_fp32
-factored_mb = factored_bytes / MIB_TO_BYTES
-
-reduction_factor = full_mb / factored_mb
-
-# ┌── 3. INVARIANTS (Guardrails) ───────────────────────────────────────────
-check(reduction_factor >= 10, f"Low-rank reduction ({reduction_factor:.1f}x) is too small.")
-
-
-# ┌── 4. OUTPUTS (Formatting) ──────────────────────────────────────────────
-full_mb_str = f"{int(full_mb)}"
-factored_mb_str = f"{int(factored_mb)}"
-data_reduction_str = f"{int(reduction_factor)}"
-mat_dim_str = f"{mat_dim}"
-rank_k_str = f"{rank_k}"
-
-# ┌── EXPORTS (Bridge to Text) ─────────────────────────────────────────────────
-full_mb_str = LowRankFactorization.full_mb_str
-factored_mb_str = LowRankFactorization.factored_mb_str
-data_reduction_str = LowRankFactorization.data_reduction_str
-mat_dim_str = LowRankFactorization.mat_dim_str
-rank_k_str = LowRankFactorization.rank_k_str
 ```
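A quick sanity check on the parameters this diff restores: with `mat_dim = 4096` and `rank_k = 128` in FP32, storing the two factors instead of the full matrix is exactly a $16\times$ reduction, so the `check(reduction_factor >= 10, ...)` guardrail passes with room to spare. A standalone sketch of the same arithmetic:

```python
# Standalone restatement of the diff's low-rank memory arithmetic.
MIB_TO_BYTES = 1024 * 1024
mat_dim, rank_k, bytes_fp32 = 4096, 128, 4

# Dense W is mat_dim x mat_dim; the factored form stores A (mat_dim x rank_k)
# plus B (rank_k x mat_dim).
full_mb = mat_dim * mat_dim * bytes_fp32 / MIB_TO_BYTES
factored_mb = (mat_dim * rank_k + rank_k * mat_dim) * bytes_fp32 / MIB_TO_BYTES
print(full_mb, factored_mb, full_mb / factored_mb)  # 64.0 4.0 16.0
```

In general the reduction factor is `mat_dim / (2 * rank_k)`, so the `>= 10` invariant holds whenever the rank is at most a twentieth of the matrix dimension.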
 
 #### Low-Rank Factorization {#sec-model-compression-lowrank-factorization-955e}
Block a user