Mirror of https://github.com/harvard-edge/cs249r_book.git (synced 2026-04-29 17:20:21 -05:00)
fix(tinytorch): correct INT8 zero-point values in Module 15 quantization docs
Documentation examples were computed using the UINT8 (0-255) zero-point formula, but the code implements signed INT8 (-128 to 127). Fixed all hardcoded diagram values and docstring examples to match the actual code output. The code logic was always correct; only the documentation numbers were wrong.

Fixes: zero-point 88 -> -39, 64 -> -64, 42 -> -43
Fixes: quantized result [-128, 12, 127] -> [-128, -27, 127]
Fixes: dequantize docstring example with correct parameters

Ref: https://github.com/harvard-edge/cs249r_book/issues/1150
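The corrected zero-points can be reproduced directly. Below is a standalone sanity-check sketch: `compute_qparams` is a hypothetical helper, and the formula `zero_point = round(-128 - min/scale)` is an assumption about what the module implements for signed INT8 (a UINT8 scheme would use `round(-min/scale)` instead, which is where the old numbers came from).

```python
import numpy as np

def compute_qparams(x_min: float, x_max: float) -> tuple[float, int]:
    """Hypothetical helper: asymmetric quantization params for signed INT8 [-128, 127]."""
    scale = (x_max - x_min) / 255.0
    # Signed INT8 maps x_min to -128; UINT8 would use round(-x_min / scale) instead.
    zero_point = int(np.round(-128.0 - x_min / scale))
    return scale, zero_point

# The three ranges whose documented zero-points this commit fixes:
print(compute_qparams(-1.5, 2.8)[1])  # -39 (docs previously showed UINT8-style 88)
print(compute_qparams(-1.0, 3.0)[1])  # -64 (previously 64)
print(compute_qparams(-1.0, 2.0)[1])  # -43 (previously 42)
```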
@@ -312,11 +312,11 @@ Small Scale (high precision): Large Scale (low precision):
 Symmetric Range:              Asymmetric Range:

 FP32: [-2.0, 2.0]             FP32: [-1.0, 3.0]
         ↓    ↓    ↓                   ↓    ↓    ↓
-INT8: -128   0   127          INT8: -128   64  127
+INT8: -128   0   127          INT8: -128  -64  127
         │    │    │                   │    │    │
       -2.0  0.0  2.0                -1.0  0.0  3.0

-Zero Point = 0                Zero Point = 64
+Zero Point = 0                Zero Point = -64
 ```

 ### Visual Example: Weight Quantization
|
||||
@@ -324,8 +324,8 @@ Symmetric Range: Asymmetric Range:
 ```
 Original FP32 Weights:           Quantized INT8 Mapping:
 ┌─────────────────────────┐      ┌─────────────────────────┐
-│ -0.8  -0.3   0.0   0.5  │  →   │ -102  -38    0    64    │
-│  0.9   1.2  -0.1   0.7  │      │  115  153  -13    89    │
+│ -0.8  -0.3   0.0   0.5  │  →   │ -128  -64  -26    38    │
+│  0.9   1.2  -0.1   0.7  │      │   89  127  -39    63    │
 └─────────────────────────┘      └─────────────────────────┘
   4 bytes each                     1 byte each
   Total: 32 bytes                  Total: 8 bytes
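The corrected right-hand matrix can be reproduced in a few lines of NumPy. This is a sketch assuming per-tensor asymmetric quantization into signed INT8 with `zero_point = round(-128 - min/scale)`; the formulas are assumptions standing in for TinyTorch's actual implementation.

```python
import numpy as np

w = np.array([[-0.8, -0.3,  0.0, 0.5],
              [ 0.9,  1.2, -0.1, 0.7]])

scale = (w.max() - w.min()) / 255.0            # 2.0 / 255 ≈ 0.00784
zero_point = np.round(-128 - w.min() / scale)  # -26.0
q = np.clip(np.round(w / scale + zero_point), -128, 127).astype(np.int8)

print(q.tolist())  # [[-128, -64, -26, 38], [89, 127, -39, 63]]
```

This matches the eight INT8 values in the corrected diagram exactly.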
|
||||
@@ -425,10 +425,10 @@ Quantization Process Visualization:
 Step 1: Analyze Range          Step 2: Calculate Parameters    Step 3: Apply Formula
 ┌─────────────────────────┐    ┌─────────────────────────┐     ┌─────────────────────────┐
 │ Input: [-1.5, 0.2, 2.8] │    │ Min: -1.5               │     │ quantized = round(      │
-│                         │    │ Max: 2.8                │     │   (value - zp*scale)    │
-│ Find min/max values     │ →  │ Range: 4.3              │  →  │   / scale)              │
+│                         │    │ Max: 2.8                │     │   value / scale + zp)   │
+│ Find min/max values     │ →  │ Range: 4.3              │  →  │                         │
 │                         │    │ Scale: 4.3/255 = 0.017  │     │                         │
-│                         │    │ Zero Point: 88          │     │ Result: [-128, 12, 127] │
+│                         │    │ Zero Point: -39         │     │ Result: [-128,-27, 127] │
 └─────────────────────────┘    └─────────────────────────┘     └─────────────────────────┘
 ```
|
||||
|
||||
@@ -472,7 +472,7 @@ def quantize_int8(tensor: Tensor) -> Tuple[Tensor, float, int]:
     >>> tensor = Tensor([[-1.0, 0.0, 2.0], [0.5, 1.5, -0.5]])
     >>> q_tensor, scale, zero_point = quantize_int8(tensor)
     >>> print(f"Scale: {scale:.4f}, Zero point: {zero_point}")
-    Scale: 0.0118, Zero point: 42
+    Scale: 0.0118, Zero point: -43

     HINTS:
     - Use np.round() for quantization
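The corrected docstring output can be checked with a minimal standalone sketch of the function, using plain NumPy arrays in place of TinyTorch's `Tensor`; the internal formulas are assumed, not copied from the module.

```python
import numpy as np

def quantize_int8(data: np.ndarray) -> tuple[np.ndarray, float, int]:
    """Sketch: asymmetric quantization of FP32 data to signed INT8 (assumed formulas)."""
    x_min, x_max = float(data.min()), float(data.max())
    scale = (x_max - x_min) / 255.0
    zero_point = int(np.round(-128 - x_min / scale))
    q = np.clip(np.round(data / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

data = np.array([[-1.0, 0.0, 2.0], [0.5, 1.5, -0.5]])
q, scale, zero_point = quantize_int8(data)
print(f"Scale: {scale:.4f}, Zero point: {zero_point}")  # Scale: 0.0118, Zero point: -43
```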
|
||||
@@ -566,9 +566,9 @@ Dequantization Process:
 INT8 Values + Parameters → FP32 Reconstruction

 ┌───────────────────────────────────┐
-│ Quantized: [-128, 12, 127]        │
+│ Quantized: [-128, -27, 127]       │
 │ Scale: 0.017                      │
-│ Zero Point: 88                    │
+│ Zero Point: -39                   │
 └───────────────────────────────────┘
                   │
                   ▼ Apply Formula
|
||||
@@ -579,9 +579,9 @@ INT8 Values + Parameters → FP32 Reconstruction
                   │
                   ▼
 ┌───────────────────────────────────┐
-│ Result: [-1.496, 0.204, 2.799]    │
+│ Result: [-1.501, 0.202, 2.799]    │
 │ Original: [-1.5, 0.2, 2.8]        │
-│ Error: [0.004, 0.004, 0.001]      │
+│ Error: [0.001, 0.002, 0.001]      │
 └───────────────────────────────────┘
                   ↑
          Excellent approximation!
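The corrected numbers in this diagram check out end-to-end. A sketch, with the same assumed formulas as above; note the full-precision scale is 4.3/255 ≈ 0.0169, not the rounded 0.017 shown in the box.

```python
import numpy as np

x = np.array([-1.5, 0.2, 2.8])
scale = (x.max() - x.min()) / 255.0             # 4.3 / 255 ≈ 0.0169
zp = np.round(-128 - x.min() / scale)           # -39.0
q = np.clip(np.round(x / scale + zp), -128, 127)

print(q.tolist())                               # [-128.0, -27.0, 127.0]
x_hat = (q - zp) * scale                        # dequantize
print(np.round(x_hat, 3).tolist())              # [-1.501, 0.202, 2.799]
print(np.round(np.abs(x_hat - x), 3).tolist())  # [0.001, 0.002, 0.001]
```

The reconstruction error stays below one scale step, as expected for a lossy round-trip.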
|
||||
@@ -620,11 +620,11 @@ def dequantize_int8(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:
     Reconstructed FP32 tensor

     EXAMPLE:
-    >>> q_tensor = Tensor([[-42, 0, 85]])  # INT8 values
-    >>> scale, zero_point = 0.0314, 64
+    >>> q_tensor = Tensor([[-100, 0, 50]])  # INT8 values
+    >>> scale, zero_point = 0.02, -25
     >>> fp32_tensor = dequantize_int8(q_tensor, scale, zero_point)
     >>> print(fp32_tensor.data)
-    [[-1.31, 2.01, 2.67]]  # Approximate original values
+    [[-1.5, 0.5, 1.5]]  # Reconstructed FP32 values

     HINT:
     - Formula: dequantized = (quantized - zero_point) * scale
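The corrected example works out exactly: (-100 - (-25)) × 0.02 = -1.5, and so on. A standalone sketch of the hint's formula, with NumPy standing in for TinyTorch's `Tensor`:

```python
import numpy as np

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Invert the affine mapping: each INT8 step is worth `scale` in FP32 space.
    return (q.astype(np.float64) - zero_point) * scale

q = np.array([[-100, 0, 50]], dtype=np.int8)
print(dequantize_int8(q, scale=0.02, zero_point=-25).tolist())  # [[-1.5, 0.5, 1.5]]
```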
|
||||
|
||||
@@ -477,7 +477,7 @@ The algorithm finds the minimum and maximum values in the tensor, then calculate
 The scale parameter determines how large each INT8 step is in FP32 space. A scale of 0.01 means each INT8 increment represents 0.01 in the original FP32 values. Smaller scales provide finer precision but can only represent a narrower range; larger scales cover wider ranges but sacrifice precision.

-The zero-point is an integer offset that shifts the quantization range. For a symmetric distribution like [-2, 2], the zero-point is 0, mapping FP32 zero to INT8 zero. For an asymmetric range like [-1, 3], the zero-point might be 64, ensuring the quantization levels are distributed optimally across the actual data range.
+The zero-point is an integer offset that shifts the quantization range. For a symmetric distribution like [-2, 2], the zero-point is 0, mapping FP32 zero to INT8 zero. For an asymmetric range like [-1, 3], the zero-point is -64, ensuring the quantization levels are distributed optimally across the actual data range.

 Here's how dequantization reverses the process:
|
||||
|
||||
@@ -488,7 +488,7 @@ def dequantize_int8(q_tensor: Tensor, scale: float, zero_point: int) -> Tensor:
     return Tensor(dequantized_data)
 ```

-The formula `(quantized - zero_point) × scale` inverts the quantization mapping. If you quantized 2.5 to INT8 value 85 with scale 0.02 and zero-point 60, dequantization computes `(85 - 60) × 0.02 = 0.5`. The round-trip isn't perfect due to quantization being lossy compression, but the error is bounded by the scale value.
+The formula `(quantized - zero_point) × scale` inverts the quantization mapping. If you quantized 1.5 to INT8 value 50 with scale 0.02 and zero-point -25, dequantization computes `(50 - (-25)) × 0.02 = 1.5`. The round-trip isn't perfect due to quantization being lossy compression, but the error is bounded by the scale value.

 ### Post-Training Quantization