fix(tinytorch): correct INT8 zero-point values in Module 15 quantization docs

Documentation examples were computed with the UINT8 (0-255) zero-point
formula, but the code implements signed INT8 (-128 to 127). Fixed all
hardcoded diagram values and docstring examples to match the actual
code output. The code logic was always correct; only the documentation
numbers were wrong.

Fixes: zero-point 88 -> -39, 64 -> -64, 42 -> -43
Fixes: quantized result [-128, 12, 127] -> [-128, -27, 127]
Fixes: dequantize docstring example with correct parameters
Ref: https://github.com/harvard-edge/cs249r_book/issues/1150
Author: Vijay Janapa Reddi
Date: 2026-02-13 17:06:29 -05:00
Commit: 99b0eb1387 (parent: a9c2ba0180)
2 changed files with 17 additions and 17 deletions


@@ -312,11 +312,11 @@ Small Scale (high precision): Large Scale (low precision):
Symmetric Range: Asymmetric Range:
FP32: [-2.0, 2.0] FP32: [-1.0, 3.0]
↓ ↓ ↓ ↓ ↓ ↓
-INT8: -128 0 127 INT8: -128 64 127
+INT8: -128 0 127 INT8: -128 -64 127
│ │ │ │ │ │
-2.0 0.0 2.0 -1.0 0.0 3.0
-Zero Point = 0 Zero Point = 64
+Zero Point = 0 Zero Point = -64
```
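The corrected zero-points above can be sanity-checked with a short sketch, assuming the signed-INT8 affine convention `zero_point = -128 - round(min / scale)` that the corrected numbers follow (the module's actual helper may differ in details):

```python
import numpy as np

def int8_zero_point(fp_min, fp_max):
    """Zero-point for an affine signed-INT8 mapping over [fp_min, fp_max]."""
    scale = (fp_max - fp_min) / 255.0          # 255 steps span the range
    return int(-128 - np.round(fp_min / scale))  # fp_min lands on -128

print(int8_zero_point(-2.0, 2.0))  # symmetric [-2, 2]  -> 0
print(int8_zero_point(-1.0, 3.0))  # asymmetric [-1, 3] -> -64
```

For the symmetric range the zero-point is 0 (FP32 zero maps to INT8 zero); shifting the range to [-1, 3] moves it to -64, matching the corrected diagram.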
### Visual Example: Weight Quantization
@@ -324,8 +324,8 @@ Symmetric Range: Asymmetric Range:
```
Original FP32 Weights: Quantized INT8 Mapping:
┌─────────────────────────┐ ┌─────────────────────────┐
-│ -0.8 -0.3 0.0 0.5 │ → │ -102 -38 0 64
-│ 0.9 1.2 -0.1 0.7 │ │ 115 153 -13 89
+│ -0.8 -0.3 0.0 0.5 │ → │ -128 -64 -26 38
+│ 0.9 1.2 -0.1 0.7 │ │ 89 127 -39 63
└─────────────────────────┘ └─────────────────────────┘
4 bytes each 1 byte each
Total: 32 bytes Total: 8 bytes
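The corrected INT8 grid can be reproduced with per-tensor asymmetric quantization (an illustrative sketch, not the module's API):

```python
import numpy as np

weights = np.array([[-0.8, -0.3, 0.0, 0.5],
                    [ 0.9,  1.2, -0.1, 0.7]])
scale = (weights.max() - weights.min()) / 255.0           # 2.0 / 255
zero_point = int(-128 - np.round(weights.min() / scale))  # -> -26
q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
print(q)  # rows: [-128 -64 -26 38] and [89 127 -39 63]
```

Note the old diagram's 153 was not even representable in signed INT8, which is exactly the UINT8/INT8 mix-up this commit fixes.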
@@ -425,10 +425,10 @@ Quantization Process Visualization:
Step 1: Analyze Range Step 2: Calculate Parameters Step 3: Apply Formula
┌─────────────────────────┐ ┌─────────────────────────┐ ┌─────────────────────────┐
│ Input: [-1.5, 0.2, 2.8] │ │ Min: -1.5 │ │ quantized = round( │
-│ │ │ Max: 2.8 │ │ (value - zp*scale)
-│ Find min/max values │ → │ Range: 4.3 │ →│ / scale)
+│ │ │ Max: 2.8 │ │ value / scale + zp)
+│ Find min/max values │ → │ Range: 4.3 │ →│
│ │ │ Scale: 4.3/255 = 0.017 │ │ │
-│ │ │ Zero Point: 88 │ │ Result: [-128, 12, 127] │
+│ │ │ Zero Point: -39 │ │ Result: [-128,-27, 127] │
└─────────────────────────┘ └─────────────────────────┘ └─────────────────────────┘
```
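The three corrected steps can be replayed in a few lines (a sketch; variable names are illustrative):

```python
import numpy as np

values = np.array([-1.5, 0.2, 2.8])
scale = (values.max() - values.min()) / 255.0            # 4.3 / 255 ≈ 0.0169
zero_point = int(-128 - np.round(values.min() / scale))  # -> -39
quantized = np.round(values / scale) + zero_point
print(zero_point, quantized.astype(int))  # -39 and [-128, -27, 127]
```

This reproduces the corrected diagram: zero-point -39 (not 88, which assumed UINT8) and result [-128, -27, 127].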
@@ -472,7 +472,7 @@ def quantize_int8(tensor: Tensor) -> Tuple[Tensor, float, int]:
>>> tensor = Tensor([[-1.0, 0.0, 2.0], [0.5, 1.5, -0.5]])
>>> q_tensor, scale, zero_point = quantize_int8(tensor)
>>> print(f"Scale: {scale:.4f}, Zero point: {zero_point}")
-Scale: 0.0118, Zero point: 42
+Scale: 0.0118, Zero point: -43
HINTS:
- Use np.round() for quantization
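A minimal sketch of the scheme the corrected docstring describes (the module's real `quantize_int8` works on its `Tensor` type and may handle edge cases differently):

```python
import numpy as np

def quantize_int8(data):
    """Asymmetric signed-INT8 quantization (sketch of the documented scheme)."""
    lo, hi = float(data.min()), float(data.max())
    scale = (hi - lo) / 255.0
    zero_point = int(-128 - np.round(lo / scale))
    q = np.clip(np.round(data / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

_, scale, zp = quantize_int8(np.array([[-1.0, 0.0, 2.0], [0.5, 1.5, -0.5]]))
print(f"Scale: {scale:.4f}, Zero point: {zp}")  # Scale: 0.0118, Zero point: -43
```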
@@ -566,9 +566,9 @@ Dequantization Process:
INT8 Values + Parameters → FP32 Reconstruction
┌───────────────────────────────────┐
-│ Quantized: [-128, 12, 127] │
+│ Quantized: [-128, -27, 127] │
│ Scale: 0.017 │
-│ Zero Point: 88 │
+│ Zero Point: -39 │
└───────────────────────────────────┘
▼ Apply Formula
@@ -579,9 +579,9 @@ INT8 Values + Parameters → FP32 Reconstruction
┌───────────────────────────────────┐
-│ Result: [-1.496, 0.204, 2.799] │
+│ Result: [-1.501, 0.202, 2.799] │
│ Original: [-1.5, 0.2, 2.8] │
-│ Error: [0.004, 0.004, 0.001] │
+│ Error: [0.001, 0.002, 0.001] │
└───────────────────────────────────┘
Excellent approximation!
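The corrected reconstruction and error values can be verified directly, keeping the scale at full precision (4.3/255) rather than the rounded 0.017 shown in the diagram:

```python
import numpy as np

original = np.array([-1.5, 0.2, 2.8])
scale = 4.3 / 255.0                      # diagram shows this rounded to 0.017
zero_point = -39
quantized = np.array([-128, -27, 127])

reconstructed = (quantized - zero_point) * scale
error = np.abs(reconstructed - original)
print(np.round(reconstructed, 3))  # [-1.501  0.202  2.799]
print(np.round(error, 3))          # [0.001 0.002 0.001]
```

Each reconstruction error is below the scale (~0.017), as the lossy-but-bounded argument in the docs predicts.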
@@ -620,11 +620,11 @@ def dequantize_int8(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:
Reconstructed FP32 tensor
EXAMPLE:
->>> q_tensor = Tensor([[-42, 0, 85]]) # INT8 values
->>> scale, zero_point = 0.0314, 64
+>>> q_tensor = Tensor([[-100, 0, 50]]) # INT8 values
+>>> scale, zero_point = 0.02, -25
>>> fp32_tensor = dequantize_int8(q_tensor, scale, zero_point)
>>> print(fp32_tensor.data)
-[[-1.31, 2.01, 2.67]] # Approximate original values
+[[-1.5, 0.5, 1.5]] # Reconstructed FP32 values
HINT:
- Formula: dequantized = (quantized - zero_point) * scale
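A minimal `dequantize_int8` sketch matching the corrected docstring example (illustrative; the module's version operates on its own `Tensor` type rather than raw NumPy arrays):

```python
import numpy as np

def dequantize_int8(q, scale, zero_point):
    """Invert the affine mapping: fp32 ≈ (quantized - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

fp32 = dequantize_int8(np.array([[-100, 0, 50]], dtype=np.int8), 0.02, -25)
print(fp32)  # [[-1.5  0.5  1.5]]
```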


@@ -477,7 +477,7 @@ The algorithm finds the minimum and maximum values in the tensor, then calculate
The scale parameter determines how large each INT8 step is in FP32 space. A scale of 0.01 means each INT8 increment represents 0.01 in the original FP32 values. Smaller scales provide finer precision but can only represent a narrower range; larger scales cover wider ranges but sacrifice precision.
-The zero-point is an integer offset that shifts the quantization range. For a symmetric distribution like [-2, 2], the zero-point is 0, mapping FP32 zero to INT8 zero. For an asymmetric range like [-1, 3], the zero-point might be 64, ensuring the quantization levels are distributed optimally across the actual data range.
+The zero-point is an integer offset that shifts the quantization range. For a symmetric distribution like [-2, 2], the zero-point is 0, mapping FP32 zero to INT8 zero. For an asymmetric range like [-1, 3], the zero-point is -64, ensuring the quantization levels are distributed optimally across the actual data range.
Here's how dequantization reverses the process:
@@ -488,7 +488,7 @@ def dequantize_int8(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:
return Tensor(dequantized_data)
```
-The formula `(quantized - zero_point) × scale` inverts the quantization mapping. If you quantized 2.5 to INT8 value 85 with scale 0.02 and zero-point 60, dequantization computes `(85 - 60) × 0.02 = 0.5`. The round-trip isn't perfect due to quantization being lossy compression, but the error is bounded by the scale value.
+The formula `(quantized - zero_point) × scale` inverts the quantization mapping. If you quantized 1.5 to INT8 value 50 with scale 0.02 and zero-point -25, dequantization computes `(50 - (-25)) × 0.02 = 1.5`. The round-trip isn't perfect due to quantization being lossy compression, but the error is bounded by the scale value.
### Post-Training Quantization