# Mirror of https://github.com/harvard-edge/cs249r_book.git
# Synced 2026-05-10 15:49:25 -05:00
# Commit: Apply targeted fixes to the 802 still-failing major and blocker items identified by re-auditing the corpus after wave-5. Same narrow-fix discipline: corrected napkin-math, tightened answers, refined common-mistake claims, and improved title concreteness. Per-track files: cloud 273, edge 125, mobile 106, tinyml 63. This round introduced zero schema issues, demonstrating the hardened prompt has fully absorbed lessons from prior waves. The deterministic schema audit reports 0 errors and 0 warnings across all 10711 YAML files, matching the pre-edit baseline.
schema_version: '1.0'
id: tinyml-0785
track: tinyml
level: L4
zone: diagnosis
topic: pruning-sparsity
competency_area: optimization
bloom_level: analyze
phase: training
title: The Depthwise Separable Arithmetic Intensity Drop
scenario: You replace standard convolutions with depthwise separable convolutions on a Jetson Orin to speed up a vision model. Despite FLOPs dropping by 9x, the end-to-end latency only improves by 10%.
question: Why does replacing standard convolutions with depthwise separable convolutions barely improve Jetson Orin latency?
details:
  realistic_solution: Depthwise convolutions have notoriously low arithmetic intensity because they don't reuse input channels across output channels. While you slashed the FLOPs, you pushed the operation into a memory-bound region of the roofline model. Using an INT8 convention of 1 TOPS = 10^12 operations/s, a 275 TOPS peak and 204 GB/s memory bandwidth imply a very high compute-to-bandwidth balance of about 1348 ops/byte. A 3x3 depthwise output does 9 MACs, or 18 ops, while reading at least 9 input bytes, 9 weight bytes, and writing 1 output byte before counting scale metadata or imperfect cache reuse. That gives at most about 0.95 ops/byte, so bandwidth alone caps sustained work near 194 GOPS before overheads. The GPU is now idling waiting for memory. The solution is to fuse the depthwise convolution with the pointwise convolution or activation layers to increase arithmetic intensity and reduce memory traffic.
  common_mistake: |
    **The Pitfall:** Assuming that reducing theoretical FLOPs will yield a linearly proportional decrease in latency on modern bandwidth-constrained hardware.

    **The Rationale:** Engineers evaluate algorithms entirely on FLOPs instead of factoring in the cost of memory movement and arithmetic intensity.

    **The Consequence:** Heavy refactoring effort yields practically zero real-world speedup because the bottleneck has shifted entirely to memory bandwidth.
  napkin_math: |
    **Assumptions & Constraints:**

    - Convention: 1 TOPS = 10^12 INT8 operations/s and 1 GB/s = 10^9 bytes/s.
    - Orin compute-to-bandwidth balance: 275e12 ops/s / 204e9 bytes/s = about 1348 ops/byte.
    - Depthwise 3x3 per output element: 9 MACs = 18 ops.
    - Conservative lower-bound byte traffic: 9 input bytes + 9 weight bytes + 1 output byte = 19 bytes, before scale metadata, activation re-reads, or cache inefficiency.

    **Calculations:**

    - Arithmetic intensity upper bound: 18 ops / 19 bytes = 0.95 ops/byte.
    - Bandwidth-limited throughput upper bound: 0.95 ops/byte * 204 GB/s = about 194 GOPS.
    - 0.95 ops/byte is far below the balance point of about 1348 ops/byte, so the layer is memory-bound under this model.

    **Conclusion & Interpretation:**

    - **Result: at most about 194 GOPS before additional overheads (Memory-Bound)**. The exact sustained number depends on cache reuse, layout, and kernel fusion, but the qualitative diagnosis is robust: the depthwise layer has too little arithmetic work per byte to approach the 275 TOPS compute ceiling.
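    The arithmetic above can be reproduced with a short Python sketch. The 275 TOPS peak and 204 GB/s bandwidth are the assumed INT8 figures stated in the constraints, not measured values:

    ```python
    # Roofline napkin math for a 3x3 INT8 depthwise conv on Jetson Orin.
    peak_ops = 275e12   # assumed peak compute: 275 TOPS (INT8 convention)
    bandwidth = 204e9   # assumed memory bandwidth: 204 GB/s

    # Ops/byte required to become compute-bound (the roofline balance point).
    balance = peak_ops / bandwidth

    # Per output element: 9 MACs = 18 ops over a 19-byte traffic lower bound.
    ops_per_output = 9 * 2
    bytes_per_output = 9 + 9 + 1
    intensity = ops_per_output / bytes_per_output

    # Roofline cap: min of compute ceiling and bandwidth-limited throughput.
    sustained = min(peak_ops, intensity * bandwidth)

    print(f"balance point:  {balance:.0f} ops/byte")
    print(f"intensity:      {intensity:.2f} ops/byte")
    print(f"throughput cap: {sustained / 1e9:.0f} GOPS")
    ```

    With exact 18/19 intensity the cap lands near 193 GOPS; the prose's "about 194 GOPS" uses the rounded 0.95 ops/byte figure. Either way the layer sits far below the compute ceiling.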
status: published
provenance: imported
requires_explanation: false
expected_time_minutes: 10
validated: true
validation_status: OK
validation_date: '2026-04-01'
validation_model: gemini-2.5-flash
math_verified: true
math_status: CORRECTED
math_date: '2026-04-03'
math_model: gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
  notes: null