Vijay Janapa Reddi 53c15b1b85 fix(interviews): wave-6 semantic-audit corrections across 567 question YAMLs
Apply targeted fixes to the 802 still-failing major and blocker items
identified by re-auditing the corpus after wave-5. Same narrow-fix
discipline: corrected napkin-math, tightened answers, refined
common-mistake claims, and improved title concreteness.

Per-track files: cloud 273, edge 125, mobile 106, tinyml 63.

This round introduced zero schema issues, demonstrating that the hardened
prompt has fully absorbed the lessons of prior waves.

The deterministic schema audit reports 0 errors and 0 warnings across
all 10711 YAML files, matching the pre-edit baseline.
2026-05-05 07:38:03 -04:00


schema_version: '1.0'
id: tinyml-0785
track: tinyml
level: L4
zone: diagnosis
topic: pruning-sparsity
competency_area: optimization
bloom_level: analyze
phase: training
title: The Depthwise Separable Arithmetic Intensity Drop
scenario: You replace standard convolutions with depthwise separable convolutions on a Jetson Orin to speed up a vision model. Despite FLOPs dropping by 9x, the end-to-end latency only improves by 10%.
question: Why does replacing standard convolutions with depthwise separable convolutions barely improve Jetson Orin latency?
details:
  realistic_solution: Depthwise convolutions have notoriously low arithmetic intensity because they don't reuse input channels across output channels. While you slashed the FLOPs, you pushed the operation into a memory-bound region of the roofline model. Using an INT8 convention of 1 TOPS = 10^12 operations/s, a 275 TOPS peak and 204 GB/s memory bandwidth imply a very high compute-to-bandwidth balance of about 1348 ops/byte. A 3x3 depthwise output does 9 MACs, or 18 ops, while reading at least 9 input bytes, 9 weight bytes, and writing 1 output byte before counting scale metadata or imperfect cache reuse. That gives at most about 0.95 ops/byte, so bandwidth alone caps sustained work near 194 GOPS before overheads. The GPU is now idling waiting for memory. The solution is to fuse the depthwise convolution with the pointwise convolution or activation layers to increase arithmetic intensity and reduce memory traffic.
  common_mistake: |
    **The Pitfall:** Assuming that reducing theoretical FLOPs will yield a linearly proportional decrease in latency on modern bandwidth-constrained hardware.
    **The Rationale:** Engineers evaluate algorithms entirely on FLOPs instead of factoring in the cost of memory movement and arithmetic intensity.
    **The Consequence:** Heavy refactoring effort yields practically zero real-world speedup because the bottleneck shifted entirely to memory bandwidth.
  napkin_math: |
    **Assumptions & Constraints:**
    - Convention: 1 TOPS = 10^12 INT8 operations/s and 1 GB/s = 10^9 bytes/s.
    - Orin compute-to-bandwidth balance: 275e12 ops/s / 204e9 bytes/s = about 1348 ops/byte.
    - Depthwise 3x3 per output element: 9 MACs = 18 ops.
    - Conservative lower-bound byte traffic: 9 input bytes + 9 weight bytes + 1 output byte = 19 bytes, before scale metadata, activation re-reads, or cache inefficiency.
    **Calculations:**
    - Arithmetic intensity upper bound: 18 ops / 19 bytes = 0.95 ops/byte.
    - Bandwidth-limited throughput upper bound: 0.95 ops/byte * 204 GB/s = about 194 GOPS.
    - 0.95 ops/byte is far below the about 1348 ops/byte balance point, so the layer is memory-bound under this model.
    **Conclusion & Interpretation:**
    - **Result: at most about 194 GOPS before additional overheads (Memory-Bound)**. The exact sustained number depends on cache reuse, layout, and kernel fusion, but the qualitative diagnosis is robust: the depthwise layer has too little arithmetic work per byte to approach the 275 TOPS compute ceiling.
status: published
provenance: imported
requires_explanation: false
expected_time_minutes: 10
validated: true
validation_status: OK
validation_date: '2026-04-01'
validation_model: gemini-2.5-flash
math_verified: true
math_status: CORRECTED
math_date: '2026-04-03'
math_model: gemini-3.1-pro-preview
human_reviewed:
  status: not-reviewed
  by: null
  date: null
  notes: null
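The roofline arithmetic in `napkin_math` can be checked with a short script. This is a sketch that takes the 275 TOPS and 204 GB/s figures from the YAML as given nominal peaks; the 19-byte traffic figure is the same lower bound the question uses, not a measured value.

```python
# Roofline napkin math for a 3x3 depthwise conv, using the Jetson Orin
# figures quoted in the YAML above (nominal datasheet peaks, INT8).

PEAK_OPS = 275e12   # 275 TOPS, convention: 1 TOPS = 1e12 INT8 ops/s
MEM_BW = 204e9      # 204 GB/s, convention: 1 GB/s = 1e9 bytes/s

# Machine balance: ops the chip can issue per byte moved from DRAM.
balance = PEAK_OPS / MEM_BW            # ~1348 ops/byte

# Per output element of a 3x3 depthwise conv:
ops = 9 * 2                            # 9 MACs = 18 ops
bytes_moved = 9 + 9 + 1                # input + weights + output, lower bound

intensity = ops / bytes_moved          # ~0.95 ops/byte (upper bound)

# Roofline: attainable throughput is the lower of the compute ceiling
# and the bandwidth-limited slope at this arithmetic intensity.
attainable = min(PEAK_OPS, intensity * MEM_BW)

print(f"balance    = {balance:.0f} ops/byte")
print(f"intensity  = {intensity:.2f} ops/byte")
print(f"attainable = {attainable / 1e9:.0f} GOPS "
      f"({'memory-bound' if intensity < balance else 'compute-bound'})")
```

Running this reproduces the question's numbers: an intensity near 0.95 ops/byte against a balance point near 1348 ops/byte, so the layer sits deep in the memory-bound region and bandwidth caps it around 193-194 GOPS, three orders of magnitude below the compute ceiling.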