Chapter 12 - Attention #493

Closed
opened 2026-03-22 15:43:48 -05:00 by GiteaMirror · 7 comments

Originally created by @ngbolin on GitHub (Feb 2, 2026).

Hi,

Wanted to clarify the following computation when deriving the complexity for Attention:

Shouldn't it be n * n * d instead of n * n, since we are taking the dot product between Q and K, which requires d operations for each query-key pair?

[Screenshot: the attention complexity derivation in question]
GiteaMirror added the area: book, type: question labels 2026-03-22 15:43:48 -05:00

@ngbolin commented on GitHub (Feb 2, 2026):

To add on, the value for Total Memory should be the sum of all previous columns, not the same as the Optimizer memory.

[Screenshot: the memory requirements table in question]

@profvjreddi commented on GitHub (Feb 4, 2026):

Hi @ngbolin — great catches on both points! 💯

  1. Attention Complexity: You're absolutely right. The time complexity of QK^T is O(n² × d), not O(n²), since each of the n² query-key pairs requires a d-dimensional dot product. O(n²) only describes the memory complexity (the attention weight matrix has n² entries). I've updated the explanation to clearly distinguish between time and memory complexity; a quick sanity check is sketched right after this list.

  2. Total Memory Column: Also correct — the Total Memory column was just reprinting the Optimizer value instead of summing Forward + Gradients + Optimizer. Thanks for catching that! I've fixed it too, and the multiplier should now read ~7× inference memory.
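Here's that quick sanity check (a minimal NumPy sketch; n and d are made-up sizes, not anything from the book's table):

```python
import numpy as np

# Illustrative sizes only: sequence length n, per-head dimension d.
n, d = 512, 64
Q = np.random.randn(n, d).astype(np.float32)
K = np.random.randn(n, d).astype(np.float32)

scores = Q @ K.T              # attention logits, shape (n, n)

# Time: each of the n*n query-key pairs needs a d-length dot product,
# so roughly n * n * d multiply-adds -> O(n^2 * d).
flops = n * n * d             # 16,777,216 for these sizes
# Memory: the score matrix itself only holds n*n entries -> O(n^2).
entries = scores.size         # 262,144
print(flops, entries)
```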

Both fixes are in PR #1155. Thanks for the careful reading! 🙏

@all-contributors please add @ngbolin for bug, doc in TinyTorch


@github-actions[bot] commented on GitHub (Feb 4, 2026):

I've added @ngbolin as a contributor to tinytorch! 🎉

Recognized for: doc
Based on: @all-contributors please add @ngbolin for bug, doc in TinyTorch

The contributor list has been updated in:

  • tinytorch/.all-contributorsrc
  • tinytorch/README.md
  • Main README.md

We love recognizing our contributors! ❤️


@ngbolin commented on GitHub (Feb 6, 2026):

Hi @profvjreddi - for Module 15 on Quantization, can you explain why the Zero Point is at 88? Intuitively, because the input ranges from -1.5 to 2.8, the ZP should be (-)0.65/0.017 = -38 (where 0.65 is the midpoint between -1.5 and 2.8), and 0.2 should be -26.5 instead of 12...

[Screenshot: the Module 15 quantization example in question]

@profvjreddi commented on GitHub (Feb 13, 2026):

Great catch @ngbolin, you're right! The zero point of 88 was wrong.

Let me explain what's going on so the fix makes sense.

What the zero point actually is

The goal is to squeeze a continuous FP32 range into 256 discrete INT8 values (-128 to 127). The zero point answers one question: which INT8 value represents FP32 zero?

This matters because zero is special in neural networks. [ReLU](https://mlsysbook.ai/book/contents/core/dl_primer/dl_primer.html#fig-activation-functions) outputs zero for all negative inputs, sparse weights are zero, and dropout masks values to zero. If FP32 zero doesn't map cleanly to an integer, you get quantization error at the most important value.
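A tiny round trip makes this concrete (my own sketch with arbitrary numbers, not this module's values):

```python
# With an *integer* zero point, FP32 zero survives quantize -> dequantize exactly.
scale, zp = 0.1, 7                # arbitrary illustrative values; zp is an integer
q0 = round(0.0 / scale + zp)      # quantize(0.0) lands exactly on the zero point
print((q0 - zp) * scale)          # 0.0 -- zero is reconstructed with no error

# If zp were fractional (say 7.4), the same round trip would give
# (round(7.4) - 7.4) * scale = -0.04 instead of exactly 0.0.
```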

Walking through the example

Our tensor has values in [-1.5, 2.8]. We need to map that onto [-128, 127]:

FP32:   -1.5          0.0              2.8
         |             |                |
         v             v                v
INT8:  -128           ???              127
                       |
               Where does 0.0 land?

Scale tells us how wide each INT8 step is:

scale = (2.8 - (-1.5)) / 255 = 4.3 / 255 ≈ 0.017

Zero point tells us where FP32 zero lands. We know -1.5 must map to -128 (the bottom of INT8 range), and the quantization formula is q = round(value / scale + zp). Plugging in:

-128 = round(-1.5 / 0.017 + zp)
-128 = round(-89 + zp)
zp = -128 + 89 = -39

So FP32 zero sits at INT8 value -39:

FP32:   -1.5          0.0              2.8
         |             |                |
INT8:  -128           -39              127

Now we can quantize 0.2:

q(0.2) = round(0.2 / 0.017 + (-39)) = round(12 - 39) = round(-27) = -27

Result: [-128, -27, 127] with ZP = -39.
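In code, the whole walkthrough above is just a few lines (a sketch of the arithmetic, not the module's actual helper):

```python
import numpy as np

# Asymmetric signed-INT8 quantization of the example tensor.
x = np.array([-1.5, 0.2, 2.8], dtype=np.float32)

scale = float(x.max() - x.min()) / 255     # 4.3 / 255 ~= 0.017
zp = round(-128 - float(x.min()) / scale)  # -128 + 89 = -39

q = np.clip(np.round(x / scale + zp), -128, 127).astype(np.int8)
print(round(scale, 4), zp, q)              # 0.0169 -39 [-128  -27  127]
```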

Where 88 came from

I mixed up the UINT8 (0 to 255) and signed INT8 (-128 to 127) conventions when writing the documentation. The UINT8 formula gives zp = round(-min_val / scale) = round(1.5 / 0.017) ≈ 88. That's the right answer if your integer range starts at 0 instead of -128. Off by exactly 128. The code was always using the correct signed INT8 formula, but I wrote the diagram numbers with the wrong one.

This bug was pervasive across the module. The same mistake was in several other diagrams and docstrings too. I've fixed all of them in 99b0eb1 and verified nothing else in the codebase was affected. Thanks for the careful reading, really appreciate it!
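For anyone comparing the two conventions side by side (again just a sketch):

```python
# Same tensor range, two zero-point conventions.
min_val, max_val = -1.5, 2.8
scale = (max_val - min_val) / 255

zp_uint8 = round(0 - min_val / scale)         # UINT8 range [0, 255]    -> 89
zp_int8 = round(-128 - min_val / scale)       # signed INT8 [-128, 127] -> -39
print(zp_uint8, zp_int8, zp_uint8 - zp_int8)  # 89 -39 128: the conventions differ by exactly 128
# (The old diagram's 88, rather than 89, presumably comes from using the
# rounded scale: 1.5 / 0.017 ~= 88.2.)
```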


@profvjreddi commented on GitHub (Feb 13, 2026):

@all-contributors please add @ as a contributor for ✍️ Doc in TinyTorch


@github-actions[bot] commented on GitHub (Feb 13, 2026):

I've added @ngbolin as a contributor to tinytorch! 🎉

Recognized for: doc
Project: tinytorch (explicitly mentioned in comment)
Based on: @all-contributors please add @ as a contributor for ✍️ Doc in TinyTorch

The contributor list has been updated in:

  • tinytorch/.all-contributorsrc
  • tinytorch/README.md
  • Main README.md

We love recognizing our contributors! ❤️

Reference: github-starred/cs249r_book#493