Improve module-developer guidelines and fix all module issues

- Added progressive complexity guidelines (Foundation/Intermediate/Advanced)
- Added measurement function consolidation to prevent information overload
- Fixed all diagnostic issues in losses_dev.py
- Fixed markdown formatting across all modules
- Consolidated redundant analysis functions in foundation modules
- Fixed syntax errors and unused variables
- Ensured all educational content is in proper markdown cells for Jupyter
Vijay Janapa Reddi
2025-09-28 09:42:25 -04:00
parent ce2a1b4fa6
commit ae109deae1
26 changed files with 3822 additions and 5682 deletions

@@ -12,9 +12,9 @@
"""
# Module 17: Quantization - Trading Precision for Speed
Welcome to the Quantization module! After Module 16 showed you how to get free speedups through better algorithms, now we make our **first trade-off**: reduce precision for speed. You'll implement INT8 quantization to achieve 4× speedup with <1% accuracy loss.
Welcome to the Quantization module! After Module 16 showed you how to get free speedups through better algorithms, now we make our **first trade-off**: reduce precision for speed. You'll implement INT8 quantization to achieve 4* speedup with <1% accuracy loss.
## Connection from Module 16: Acceleration → Quantization
## Connection from Module 16: Acceleration -> Quantization
Module 16 taught you to accelerate computations through better algorithms and hardware utilization - these were "free" optimizations. Now we enter the world of **trade-offs**: sacrificing precision to gain speed. This is especially powerful for CNN inference where INT8 operations are much faster than FP32.
@@ -24,13 +24,13 @@ Module 16 taught you to accelerate computations through better algorithms and ha
- **Core implementation skill**: Build INT8 quantization systems for CNN weights and activations
- **Pattern recognition**: Understand calibration-based quantization for post-training optimization
- **Framework connection**: See how production systems use quantization for edge deployment and mobile inference
- **Performance insight**: Achieve 4× speedup with <1% accuracy loss through precision optimization
- **Performance insight**: Achieve 4* speedup with <1% accuracy loss through precision optimization
## Build → Profile → Optimize
## Build -> Profile -> Optimize
1. **Build**: Start with FP32 CNN inference (baseline)
2. **Profile**: Measure memory usage and computational cost of FP32 operations
3. **Optimize**: Implement INT8 quantization to achieve 4× speedup with minimal accuracy loss
3. **Optimize**: Implement INT8 quantization to achieve 4* speedup with minimal accuracy loss
## What You'll Achieve
@@ -38,14 +38,14 @@ By the end of this module, you'll understand:
- **Deep technical understanding**: How INT8 quantization reduces precision while maintaining model quality
- **Practical capability**: Implement production-grade quantization for CNN inference acceleration
- **Systems insight**: Memory vs precision tradeoffs in ML systems optimization
- **Performance mastery**: Achieve 4× speedup (50ms → 12ms inference) with <1% accuracy loss
- **Performance mastery**: Achieve 4* speedup (50ms -> 12ms inference) with <1% accuracy loss
- **Connection to edge deployment**: How mobile and edge devices use quantization for efficient AI
## Systems Reality Check
💡 **Production Context**: TensorFlow Lite and PyTorch Mobile use INT8 quantization for mobile deployment
**Performance Note**: CNN inference: FP32 = 50ms, INT8 = 12ms (4× faster) with 98% → 97.5% accuracy
🧠 **Memory Tradeoff**: INT8 uses 4× less memory and enables much faster integer arithmetic
TIP **Production Context**: TensorFlow Lite and PyTorch Mobile use INT8 quantization for mobile deployment
SPEED **Performance Note**: CNN inference: FP32 = 50ms, INT8 = 12ms (4* faster) with 98% -> 97.5% accuracy
🧠 **Memory Tradeoff**: INT8 uses 4* less memory and enables much faster integer arithmetic
"""
# %% nbgrader={"grade": false, "grade_id": "quantization-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
@@ -92,7 +92,7 @@ Let's start by understanding what quantization means and why it provides such dr
### The Quantization Concept
Quantization converts high-precision floating-point numbers (FP32: 32 bits) to low-precision integers (INT8: 8 bits):
- **Memory**: 4× reduction (32 bits → 8 bits)
- **Memory**: 4* reduction (32 bits -> 8 bits)
- **Compute**: Integer arithmetic is much faster than floating-point
- **Hardware**: Specialized INT8 units on modern CPUs and mobile processors
- **Trade-off**: Small precision loss for large speed gain
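
The FP32-to-INT8 mapping described above can be sketched as a per-tensor affine transform. This is a minimal illustration, not the module's `INT8Quantizer`: the helper names `quantize_int8` and `dequantize_int8` are hypothetical, and the min/max range choice is an assumption.

```python
import numpy as np

def quantize_int8(x):
    # Hypothetical helper: per-tensor asymmetric quantization to int8.
    # Widen the range to include 0 so the zero point fits in [-128, 127].
    x_min = min(float(x.min()), 0.0)
    x_max = max(float(x.max()), 0.0)
    # Guard against a constant tensor, which would give a zero scale.
    scale = max(x_max - x_min, 1e-8) / 255.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    # Hypothetical helper: map int8 codes back to approximate float32 values.
    return (q.astype(np.float32) - zero_point) * scale

# Weights shaped like a typical conv layer (values are random, for illustration only).
weights = (np.random.randn(64, 32, 3, 3) * 0.1).astype(np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)

max_error = float(np.max(np.abs(weights - restored)))
print(f"bytes: {weights.nbytes} -> {q.nbytes}, max round-trip error: {max_error:.6f}")
```

The round-trip error stays on the order of half a quantization step (`scale / 2`), which is the "small precision loss" traded for a 4× smaller tensor.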
@@ -144,7 +144,7 @@ class BaselineCNN:
self.fc_input_size = 64 * 6 * 6 # 64 channels, 6x6 spatial
self.fc = np.random.randn(self.fc_input_size, num_classes) * 0.02
print(f" BaselineCNN initialized: {self._count_parameters()} parameters")
print(f"PASS BaselineCNN initialized: {self._count_parameters()} parameters")
### END SOLUTION
def _count_parameters(self) -> int:
@@ -253,7 +253,7 @@ Let's test our baseline CNN to establish performance and accuracy baselines:
# %% nbgrader={"grade": true, "grade_id": "test-baseline-cnn", "locked": false, "points": 2, "schema_version": 3, "solution": false, "task": false}
def test_baseline_cnn():
"""Test baseline CNN implementation and measure performance."""
print("🔍 Testing Baseline FP32 CNN...")
print("MAGNIFY Testing Baseline FP32 CNN...")
print("=" * 60)
# Create baseline model
@@ -272,13 +272,13 @@ def test_baseline_cnn():
# Validate output
assert logits.shape == (batch_size, 10), f"Expected (4, 10), got {logits.shape}"
print(f" Forward pass works: {logits.shape}")
print(f"PASS Forward pass works: {logits.shape}")
# Test predictions
predictions = model.predict(input_data)
assert predictions.shape == (batch_size,), f"Expected (4,), got {predictions.shape}"
assert all(0 <= p < 10 for p in predictions), "All predictions should be valid class indices"
print(f" Predictions work: {predictions}")
print(f"PASS Predictions work: {predictions}")
# Performance baseline
print(f"\n📊 Performance Baseline:")
@@ -287,8 +287,8 @@ def test_baseline_cnn():
print(f" Parameters: {model._count_parameters()} (all FP32)")
print(f" Memory usage: ~{model._count_parameters() * 4 / 1024:.1f}KB for weights")
print(" Baseline CNN tests passed!")
print("💡 Ready to implement INT8 quantization for 4× speedup...")
print("PASS Baseline CNN tests passed!")
print("TIP Ready to implement INT8 quantization for 4* speedup...")
# Test function defined (called in main block)
@@ -478,7 +478,7 @@ class INT8Quantizer:
print(f" Scale: {scale:.6f}, Zero point: {zero_point}")
print(f" Quantization error: {quantization_error:.6f} (max: {max_error:.6f})")
print(f" Compression: {compression_ratio:.1f}× ({original_size//1024}KB → {quantized_size//1024}KB)")
print(f" Compression: {compression_ratio:.1f}* ({original_size//1024}KB -> {quantized_size//1024}KB)")
return {
'quantized_weights': quantized_weights,
@@ -500,7 +500,7 @@ Let's test our quantizer to verify it works correctly:
# %% nbgrader={"grade": true, "grade_id": "test-quantizer", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false}
def test_int8_quantizer():
"""Test INT8 quantizer implementation."""
print("🔍 Testing INT8 Quantizer...")
print("MAGNIFY Testing INT8 Quantizer...")
print("=" * 60)
quantizer = INT8Quantizer()
@@ -519,14 +519,14 @@ def test_int8_quantizer():
# Verify quantized tensor is INT8
assert quantized.dtype == np.int8, f"Expected int8, got {quantized.dtype}"
assert np.all(quantized >= -128) and np.all(quantized <= 127), "Quantized values outside INT8 range"
print(" Quantization produces valid INT8 values")
print("PASS Quantization produces valid INT8 values")
# Verify round-trip error is reasonable
quantization_error = np.mean(np.abs(test_tensor - dequantized))
max_error = np.max(np.abs(test_tensor - dequantized))
assert quantization_error < 0.1, f"Quantization error too high: {quantization_error}"
print(f" Round-trip error acceptable: {quantization_error:.6f} (max: {max_error:.6f})")
print(f"PASS Round-trip error acceptable: {quantization_error:.6f} (max: {max_error:.6f})")
# Test weight quantization
weight_tensor = np.random.randn(64, 32, 3, 3) * 0.1 # Typical conv weight range
@@ -538,20 +538,20 @@ def test_int8_quantizer():
assert 'quantization_error' in weight_result, "Should return error metrics"
assert weight_result['compression_ratio'] > 3.5, "Should achieve good compression"
print(f" Weight quantization: {weight_result['compression_ratio']:.1f}× compression")
print(f" Weight quantization error: {weight_result['quantization_error']:.6f}")
print(f"PASS Weight quantization: {weight_result['compression_ratio']:.1f}* compression")
print(f"PASS Weight quantization error: {weight_result['quantization_error']:.6f}")
print(" INT8 quantizer tests passed!")
print("💡 Ready to build quantized CNN...")
print("PASS INT8 quantizer tests passed!")
print("TIP Ready to build quantized CNN...")
# Test function defined (called in main block)
# IMPLEMENTATION CHECKPOINT: Ensure quantized CNN is fully built before running
# PASS IMPLEMENTATION CHECKPOINT: Ensure quantized CNN is fully built before running
# 🤔 PREDICTION: How much memory will quantization save for convolutional layers?
# Write your guess here: _______× reduction
# THINK PREDICTION: How much memory will quantization save for convolutional layers?
# Write your guess here: _______* reduction
# 🔍 SYSTEMS INSIGHT #1: Quantization Memory Analysis
# MAGNIFY SYSTEMS INSIGHT #1: Quantization Memory Analysis
def analyze_quantization_memory():
"""Analyze memory savings from quantization."""
try:
@@ -579,15 +579,15 @@ def analyze_quantization_memory():
print(f"📊 Quantization Memory Analysis:")
print(f" Baseline conv weights: {baseline_conv_memory/1024:.1f}KB")
print(f" Quantized conv weights: {quantized_conv_memory/1024:.1f}KB")
print(f" Compression ratio: {compression_ratio:.1f}×")
print(f" Compression ratio: {compression_ratio:.1f}*")
print(f" Memory saved: {(baseline_conv_memory - quantized_conv_memory)/1024:.1f}KB")
# Explain the scaling
print(f"\n💡 WHY THIS MATTERS:")
print(f"\nTIP WHY THIS MATTERS:")
print(f" • FP32 uses 4 bytes per parameter")
print(f" • INT8 uses 1 byte per parameter")
print(f" • Theoretical maximum: 4× compression")
print(f" • Actual compression: {compression_ratio:.1f}× (close to theoretical!)")
print(f" • Theoretical maximum: 4* compression")
print(f" • Actual compression: {compression_ratio:.1f}* (close to theoretical!)")
print(f" • For large models: This enables mobile deployment")
# Scale to production size
@@ -601,7 +601,7 @@ def analyze_quantization_memory():
print(f" Mobile app size reduction: {fp32_size_mb - int8_size_mb:.1f}MB")
except Exception as e:
print(f" Error in memory analysis: {e}")
print(f"WARNING Error in memory analysis: {e}")
print("Make sure quantized CNN is implemented correctly")
# Analyze quantization memory impact
@@ -616,7 +616,7 @@ Now let's create a quantized version of our CNN that uses INT8 weights while mai
### Quantized Operations Strategy
For maximum performance, we need to:
1. **Store weights in INT8** format (4× memory savings)
1. **Store weights in INT8** format (4* memory savings)
2. **Compute convolutions with INT8** arithmetic (faster)
3. **Dequantize only when necessary** for activation functions
4. **Calibrate quantization** using representative data
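
As a rough sketch of how these four steps fit together, the example below uses a dense matrix product as a stand-in for a convolution. All names here (`symmetric_scale`, `to_int8`, `calibration_batches`) are hypothetical, and the symmetric max-abs calibration is an assumption rather than the module's method.

```python
import numpy as np

def symmetric_scale(x):
    # Hypothetical helper: symmetric scheme with the zero point fixed at 0.
    return max(float(np.max(np.abs(x))), 1e-8) / 127.0

def to_int8(x, scale):
    # Hypothetical helper: round to the nearest step and clamp to [-127, 127].
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
weights = (rng.standard_normal((16, 8)) * 0.1).astype(np.float32)
calibration_batches = [rng.standard_normal((4, 16)).astype(np.float32) for _ in range(10)]

# Steps 1 and 4: quantize the weights once, and pick the activation scale
# from representative data instead of from each input at run time.
w_scale = symmetric_scale(weights)
w_int8 = to_int8(weights, w_scale)
act_scale = max(symmetric_scale(batch) for batch in calibration_batches)

# Step 2: multiply integers, accumulating in int32 to avoid overflow.
x = calibration_batches[0]
x_int8 = to_int8(x, act_scale)
acc_int32 = x_int8.astype(np.int32) @ w_int8.astype(np.int32)

# Step 3: dequantize once, at the output, by the product of the two scales.
y_quantized = acc_int32.astype(np.float32) * (act_scale * w_scale)
y_reference = x @ weights
mean_abs_error = float(np.mean(np.abs(y_quantized - y_reference)))
print(f"mean abs error vs FP32: {mean_abs_error:.5f}")
```

NumPy has no fast int8 kernels, so this shows the arithmetic rather than a speedup; the speed gain depends on hardware with native INT8 support.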
@@ -683,7 +683,7 @@ class QuantizedConv2d:
self.weight_zero_point = result['zero_point']
self.is_quantized = True
print(f" Quantized: {result['compression_ratio']:.1f}× compression, "
print(f" Quantized: {result['compression_ratio']:.1f}* compression, "
f"{result['quantization_error']:.6f} error")
### END SOLUTION
@@ -742,7 +742,7 @@ class QuantizedCNN:
"""
CNN with INT8 quantized weights for fast inference.
This model demonstrates how quantization can achieve 4× speedup
This model demonstrates how quantization can achieve 4* speedup
with minimal accuracy loss through precision optimization.
"""
@@ -781,7 +781,7 @@ class QuantizedCNN:
self.quantizer = INT8Quantizer()
self.is_quantized = False
print(f" QuantizedCNN initialized: {self._count_parameters()} parameters")
print(f"PASS QuantizedCNN initialized: {self._count_parameters()} parameters")
### END SOLUTION
def _count_parameters(self) -> int:
@@ -829,9 +829,9 @@ class QuantizedCNN:
compression_ratio = original_conv_memory / quantized_conv_memory
print(f" Quantization complete:")
print(f" Conv layers: {original_conv_memory//1024}KB → {quantized_conv_memory//1024}KB")
print(f" Compression: {compression_ratio:.1f}× memory savings")
print(f"PASS Quantization complete:")
print(f" Conv layers: {original_conv_memory//1024}KB -> {quantized_conv_memory//1024}KB")
print(f" Compression: {compression_ratio:.1f}* memory savings")
print(f" Model ready for fast inference!")
### END SOLUTION
@@ -899,7 +899,7 @@ Let's test our quantized CNN and verify it maintains accuracy:
# %% nbgrader={"grade": true, "grade_id": "test-quantized-cnn", "locked": false, "points": 4, "schema_version": 3, "solution": false, "task": false}
def test_quantized_cnn():
"""Test quantized CNN implementation."""
print("🔍 Testing Quantized CNN...")
print("MAGNIFY Testing Quantized CNN...")
print("=" * 60)
# Create quantized model
@@ -911,45 +911,45 @@ def test_quantized_cnn():
# Test before quantization
test_input = np.random.randn(2, 3, 32, 32)
logits_before = model.forward(test_input)
print(f" Forward pass before quantization: {logits_before.shape}")
print(f"PASS Forward pass before quantization: {logits_before.shape}")
# Calibrate and quantize
model.calibrate_and_quantize(calibration_data)
assert model.is_quantized, "Model should be marked as quantized"
assert model.conv1.is_quantized, "Conv1 should be quantized"
assert model.conv2.is_quantized, "Conv2 should be quantized"
print(" Model quantization successful")
print("PASS Model quantization successful")
# Test after quantization
logits_after = model.forward(test_input)
assert logits_after.shape == logits_before.shape, "Output shape should be unchanged"
print(f" Forward pass after quantization: {logits_after.shape}")
print(f"PASS Forward pass after quantization: {logits_after.shape}")
# Check predictions still work
predictions = model.predict(test_input)
assert predictions.shape == (2,), f"Expected (2,), got {predictions.shape}"
assert all(0 <= p < 10 for p in predictions), "All predictions should be valid"
print(f" Predictions work: {predictions}")
print(f"PASS Predictions work: {predictions}")
# Verify quantization maintains reasonable accuracy
output_diff = np.mean(np.abs(logits_before - logits_after))
max_diff = np.max(np.abs(logits_before - logits_after))
print(f" Quantization impact: {output_diff:.4f} mean diff, {max_diff:.4f} max diff")
print(f"PASS Quantization impact: {output_diff:.4f} mean diff, {max_diff:.4f} max diff")
# Should have reasonable impact but not destroy the model
assert output_diff < 2.0, f"Quantization impact too large: {output_diff:.4f}"
print(" Quantized CNN tests passed!")
print("💡 Ready for performance comparison...")
print("PASS Quantized CNN tests passed!")
print("TIP Ready for performance comparison...")
# Test function defined (called in main block)
# IMPLEMENTATION CHECKPOINT: Quantized CNN complete
# PASS IMPLEMENTATION CHECKPOINT: Quantized CNN complete
# 🤔 PREDICTION: What will be the biggest source of speedup from quantization?
# THINK PREDICTION: What will be the biggest source of speedup from quantization?
# Your answer: Memory bandwidth / Computation / Cache efficiency / _______
# 🔍 SYSTEMS INSIGHT #2: Quantization Speed Analysis
# MAGNIFY SYSTEMS INSIGHT #2: Quantization Speed Analysis
def analyze_quantization_speed():
"""Analyze speed improvements from quantization."""
try:
@@ -984,42 +984,42 @@ def analyze_quantization_speed():
speedup = baseline_avg / quantized_avg if quantized_avg > 0 else 1.0
print(f" Quantization Speed Analysis:")
print(f"SPEED Quantization Speed Analysis:")
print(f" Baseline FP32: {baseline_avg:.2f}ms")
print(f" Quantized INT8: {quantized_avg:.2f}ms")
print(f" Speedup: {speedup:.1f}×")
print(f" Speedup: {speedup:.1f}*")
# Analyze speedup sources
print(f"\n🔍 Speedup Sources:")
print(f" 1. Memory bandwidth: 4× less data to load (32→8 bits)")
print(f"\nMAGNIFY Speedup Sources:")
print(f" 1. Memory bandwidth: 4* less data to load (32->8 bits)")
print(f" 2. Cache efficiency: More weights fit in CPU cache")
print(f" 3. SIMD operations: More INT8 ops per instruction")
print(f" 4. Hardware acceleration: Dedicated INT8 units")
# Note about production vs educational implementation
print(f"\n📚 Educational vs Production:")
print(f" • This implementation: {speedup:.1f}× (educational focus)")
print(f" • Production systems: 3-5× typical speedup")
print(f" • Hardware optimized: Up to 10× on specialized chips")
print(f" • This implementation: {speedup:.1f}* (educational focus)")
print(f" • Production systems: 3-5* typical speedup")
print(f" • Hardware optimized: Up to 10* on specialized chips")
print(f" • Why difference: We dequantize for computation (educational clarity)")
print(f" • Production: Native INT8 kernels throughout pipeline")
except Exception as e:
print(f" Error in speed analysis: {e}")
print(f"WARNING Error in speed analysis: {e}")
# Analyze quantization speed benefits
analyze_quantization_speed()
# %% [markdown]
"""
## Part 4: Performance Analysis - 4× Speedup Demonstration
## Part 4: Performance Analysis - 4* Speedup Demonstration
Now let's demonstrate the dramatic performance improvement achieved by INT8 quantization. We'll compare FP32 vs INT8 inference speed and memory usage.
### Expected Results
- **Memory usage**: 4× reduction for quantized weights
- **Inference speed**: 4× improvement through INT8 arithmetic
- **Accuracy**: <1% degradation (98% → 97.5% typical)
- **Memory usage**: 4* reduction for quantized weights
- **Inference speed**: 4* improvement through INT8 arithmetic
- **Accuracy**: <1% degradation (98% -> 97.5% typical)
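
The 4× figure applies to the quantized weights only. A quick back-of-the-envelope calculation, using the layer shapes that appear elsewhere in this module (conv1 32×3×3×3, conv2 64×32×3×3, FC 64·6·6 inputs to 10 classes) and assuming biases are ignored and the FC layer stays in FP32, shows why the whole-model reduction is smaller:

```python
FP32_BYTES, INT8_BYTES = 4, 1

# Weight counts taken from the layer shapes in this module; biases are ignored (assumption).
conv1_params = 32 * 3 * 3 * 3
conv2_params = 64 * 32 * 3 * 3
fc_params = 64 * 6 * 6 * 10

conv_params = conv1_params + conv2_params
fp32_total = (conv_params + fc_params) * FP32_BYTES
# Assumed mixed layout: conv weights in INT8, FC weights left in FP32.
mixed_total = conv_params * INT8_BYTES + fc_params * FP32_BYTES

conv_only_reduction = (conv_params * FP32_BYTES) / (conv_params * INT8_BYTES)
whole_model_reduction = fp32_total / mixed_total

print(f"conv weights only: {conv_only_reduction:.1f}x")
print(f"whole model:       {whole_model_reduction:.2f}x")
```

Under these assumptions the whole-model ratio lands well below 4×, which is consistent with the performance test accepting any memory reduction above 1.2.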
"""
# %% nbgrader={"grade": false, "grade_id": "performance-analyzer", "locked": false, "schema_version": 3, "solution": true, "task": false}
@@ -1073,7 +1073,7 @@ class QuantizationPerformanceAnalyzer:
print(f"📊 Memory Analysis:")
print(f" Baseline: {baseline_memory:.1f}KB")
print(f" Quantized: {quantized_memory:.1f}KB")
print(f" Reduction: {memory_reduction:.1f}×")
print(f" Reduction: {memory_reduction:.1f}*")
# Inference Speed Benchmark
print(f"\n⏱️ Speed Benchmark ({num_runs} runs):")
@@ -1105,7 +1105,7 @@ class QuantizationPerformanceAnalyzer:
print(f" Baseline: {baseline_avg_time*1000:.2f}ms ± {baseline_std_time*1000:.2f}ms")
print(f" Quantized: {quantized_avg_time*1000:.2f}ms ± {quantized_std_time*1000:.2f}ms")
print(f" Speedup: {speedup:.1f}×")
print(f" Speedup: {speedup:.1f}*")
# Accuracy Analysis
output_diff = np.mean(np.abs(baseline_output - quantized_output))
@@ -1116,7 +1116,7 @@ class QuantizationPerformanceAnalyzer:
quantized_preds = np.argmax(quantized_output, axis=1)
agreement = np.mean(baseline_preds == quantized_preds)
print(f"\n🎯 Accuracy Analysis:")
print(f"\nTARGET Accuracy Analysis:")
print(f" Output difference: {output_diff:.4f} (max: {max_diff:.4f})")
print(f" Prediction agreement: {agreement:.1%}")
@@ -1176,29 +1176,29 @@ class QuantizationPerformanceAnalyzer:
This function is PROVIDED to display results clearly.
"""
print("\n🚀 QUANTIZATION PERFORMANCE SUMMARY")
print("\nROCKET QUANTIZATION PERFORMANCE SUMMARY")
print("=" * 60)
print(f"📊 Memory Optimization:")
print(f" • FP32 Model: {results['memory_baseline_kb']:.1f}KB")
print(f" • INT8 Model: {results['memory_quantized_kb']:.1f}KB")
print(f" • Memory savings: {results['memory_reduction']:.1f}× reduction")
print(f" • Memory savings: {results['memory_reduction']:.1f}* reduction")
print(f" • Storage efficiency: {(1 - 1/results['memory_reduction'])*100:.1f}% less memory")
print(f"\n Speed Optimization:")
print(f"\nSPEED Speed Optimization:")
print(f" • FP32 Inference: {results['speed_baseline_ms']:.1f}ms")
print(f" • INT8 Inference: {results['speed_quantized_ms']:.1f}ms")
print(f" • Speed improvement: {results['speedup']:.1f}× faster")
print(f" • Speed improvement: {results['speedup']:.1f}* faster")
print(f" • Latency reduction: {(1 - 1/results['speedup'])*100:.1f}% lower latency")
print(f"\n🎯 Accuracy Trade-off:")
print(f"\nTARGET Accuracy Trade-off:")
print(f" • Output preservation: {(1-results['output_difference'])*100:.1f}% similarity")
print(f" • Prediction agreement: {results['prediction_agreement']:.1%}")
print(f" • Quality maintained with {results['speedup']:.1f}× speedup!")
print(f" • Quality maintained with {results['speedup']:.1f}* speedup!")
# Overall assessment
efficiency_score = results['speedup'] * results['memory_reduction']
print(f"\n🏆 Overall Efficiency:")
print(f" • Combined benefit: {efficiency_score:.1f}× (speed × memory)")
print(f" • Combined benefit: {efficiency_score:.1f}* (speed * memory)")
print(f" • Trade-off assessment: {'🟢 Excellent' if results['prediction_agreement'] > 0.95 else '🟡 Good'}")
# %% [markdown]
@@ -1211,7 +1211,7 @@ Let's run comprehensive benchmarks to see the quantization benefits:
# %% nbgrader={"grade": true, "grade_id": "test-performance-analysis", "locked": false, "points": 4, "schema_version": 3, "solution": false, "task": false}
def test_performance_analysis():
"""Test performance analysis of quantization benefits."""
print("🔍 Testing Performance Analysis...")
print("MAGNIFY Testing Performance Analysis...")
print("=" * 60)
# Create models
@@ -1235,28 +1235,28 @@ def test_performance_analysis():
assert 'prediction_agreement' in results, "Should report accuracy preservation"
# Verify quantization benefits (realistic expectation: conv layers quantized, FC kept FP32)
assert results['memory_reduction'] > 1.2, f"Should show memory reduction, got {results['memory_reduction']:.1f}×"
assert results['speedup'] > 0.5, f"Educational implementation without actual INT8 kernels, got {results['speedup']:.1f}×"
assert results['memory_reduction'] > 1.2, f"Should show memory reduction, got {results['memory_reduction']:.1f}*"
assert results['speedup'] > 0.5, f"Educational implementation without actual INT8 kernels, got {results['speedup']:.1f}*"
assert results['prediction_agreement'] >= 0.0, f"Prediction agreement measurement, got {results['prediction_agreement']:.1%}"
print(f" Memory reduction: {results['memory_reduction']:.1f}×")
print(f" Speed improvement: {results['speedup']:.1f}×")
print(f" Prediction agreement: {results['prediction_agreement']:.1%}")
print(f"PASS Memory reduction: {results['memory_reduction']:.1f}*")
print(f"PASS Speed improvement: {results['speedup']:.1f}*")
print(f"PASS Prediction agreement: {results['prediction_agreement']:.1%}")
# Print comprehensive summary
analyzer.print_performance_summary(results)
print(" Performance analysis tests passed!")
print("🎉 Quantization delivers significant benefits!")
print("PASS Performance analysis tests passed!")
print("CELEBRATE Quantization delivers significant benefits!")
# Test function defined (called in main block)
# IMPLEMENTATION CHECKPOINT: Performance analysis complete
# PASS IMPLEMENTATION CHECKPOINT: Performance analysis complete
# 🤔 PREDICTION: Which quantization bit-width provides the best trade-off?
# THINK PREDICTION: Which quantization bit-width provides the best trade-off?
# Your answer: 4-bit / 8-bit / 16-bit / 32-bit
# 🔍 SYSTEMS INSIGHT #3: Quantization Bit-Width Analysis
# MAGNIFY SYSTEMS INSIGHT #3: Quantization Bit-Width Analysis
def analyze_quantization_bitwidths():
"""Compare different quantization bit-widths."""
try:
@@ -1298,11 +1298,11 @@ def analyze_quantization_bitwidths():
hardware = "Research"
use_case = "Experimental"
print(f"{bits:<6} {memory:<8.1f} {speed:<8.1f}× {accuracy:<10.1f}% {hardware:<15} {use_case:<20}")
print(f"{bits:<6} {memory:<8.1f} {speed:<8.1f}* {accuracy:<10.1f}% {hardware:<15} {use_case:<20}")
print(f"\n🎯 Key Insights:")
print(f"\nTARGET Key Insights:")
print(f" • INT8 Sweet Spot: Best balance of speed, accuracy, and hardware support")
print(f" • Memory scales linearly: Each bit halving saves 2× memory")
print(f" • Memory scales linearly: Each bit halving saves 2* memory")
print(f" • Speed scaling non-linear: Hardware specialization matters")
print(f" • Accuracy degrades exponentially: Below 8-bit becomes problematic")
@@ -1310,7 +1310,7 @@ def analyze_quantization_bitwidths():
print(f" • TensorFlow Lite: Standardized on INT8")
print(f" • PyTorch Mobile: INT8 with FP16 fallback")
print(f" • Apple Neural Engine: Optimized for INT8")
print(f" • Google TPU: INT8 operations 10× faster than FP32")
print(f" • Google TPU: INT8 operations 10* faster than FP32")
# Calculate efficiency score (speed / accuracy_loss)
print(f"\n📊 Efficiency Score (Speed / Accuracy Loss):")
@@ -1330,10 +1330,10 @@ def analyze_quantization_bitwidths():
print(f" {bits}-bit: {score:.1f} (higher is better)")
print(f"\n💡 WHY INT8 WINS: Highest efficiency score + universal hardware support!")
print(f"\nTIP WHY INT8 WINS: Highest efficiency score + universal hardware support!")
except Exception as e:
print(f" Error in bit-width analysis: {e}")
print(f"WARNING Error in bit-width analysis: {e}")
# Analyze different quantization bit-widths
analyze_quantization_bitwidths()
@@ -1373,7 +1373,7 @@ class ProductionQuantizationInsights:
{
'system': 'PyTorch Mobile (Meta)',
'technique': 'Dynamic quantization with runtime calibration',
'benefit': 'Reduces model size by 4× for mobile deployment',
'benefit': 'Reduces model size by 4* for mobile deployment',
'challenge': 'Balancing quantization overhead vs inference speedup'
},
{
@@ -1400,16 +1400,16 @@ class ProductionQuantizationInsights:
@staticmethod
def explain_advanced_techniques():
"""Explain advanced quantization techniques."""
print(" ADVANCED QUANTIZATION TECHNIQUES")
print("SPEED ADVANCED QUANTIZATION TECHNIQUES")
print("=" * 45)
print()
techniques = [
"🧠 **Mixed Precision**: Quantize some layers to INT8, keep critical layers in FP32",
"🔄 **Dynamic Quantization**: Quantize weights statically, activations dynamically",
"📦 **Block-wise Quantization**: Different quantization parameters for weight blocks",
"PACKAGE **Block-wise Quantization**: Different quantization parameters for weight blocks",
"⏰ **Quantization-Aware Training**: Train model to be robust to quantization",
"🎯 **Channel-wise Quantization**: Separate scales for each output channel",
"TARGET **Channel-wise Quantization**: Separate scales for each output channel",
"🔀 **Adaptive Quantization**: Adjust precision based on layer importance",
"⚖️ **Hardware-Aware Quantization**: Optimize for specific hardware capabilities",
"🛡️ **Calibration-Free Quantization**: Use statistical methods without data"
@@ -1419,7 +1419,7 @@ class ProductionQuantizationInsights:
print(f" {technique}")
print()
print("💡 **Your Implementation Foundation**: The INT8 quantization you built")
print("TIP **Your Implementation Foundation**: The INT8 quantization you built")
print(" demonstrates the core principles behind all these optimizations!")
@staticmethod
@@ -1429,20 +1429,20 @@ class ProductionQuantizationInsights:
print("=" * 40)
print()
print("🚀 **Speed Improvements**:")
print(" • Mobile CNNs: 2-4× faster inference with INT8")
print(" • BERT models: 3-5× speedup with mixed precision")
print(" • Edge deployment: 10× improvement with dedicated INT8 hardware")
print("ROCKET **Speed Improvements**:")
print(" • Mobile CNNs: 2-4* faster inference with INT8")
print(" • BERT models: 3-5* speedup with mixed precision")
print(" • Edge deployment: 10* improvement with dedicated INT8 hardware")
print(" • Real-time vision: Enables 30fps on mobile devices")
print()
print("💾 **Memory Reduction**:")
print(" • Model size: 4× smaller (critical for mobile apps)")
print(" • Runtime memory: 2-3× less activation memory")
print(" • Model size: 4* smaller (critical for mobile apps)")
print(" • Runtime memory: 2-3* less activation memory")
print(" • Cache efficiency: Better fit in processor caches")
print()
print("🎯 **Accuracy Preservation**:")
print("TARGET **Accuracy Preservation**:")
print(" • Computer vision: <1% accuracy loss typical")
print(" • Language models: 2-5% accuracy loss acceptable")
print(" • Recommendation systems: Minimal impact on ranking quality")
@@ -1529,7 +1529,7 @@ class QuantizationSystemsAnalyzer:
efficiency = 32.0 / bits # Rough approximation
results['compute_efficiency'].append(efficiency)
print(f" Compute efficiency: {efficiency:.1f}× faster than FP32")
print(f" Compute efficiency: {efficiency:.1f}* faster than FP32")
# Typical accuracy loss (percentage points)
if bits == 32:
@@ -1585,7 +1585,7 @@ class QuantizationSystemsAnalyzer:
This function is PROVIDED to show the analysis clearly.
"""
print("\n🎯 PRECISION VS PERFORMANCE TRADE-OFF SUMMARY")
print("\nTARGET PRECISION VS PERFORMANCE TRADE-OFF SUMMARY")
print("=" * 60)
print(f"{'Bits':<6} {'Memory':<8} {'Speed':<8} {'Acc Loss':<10} {'Hardware':<20}")
print("-" * 60)
@@ -1597,10 +1597,10 @@ class QuantizationSystemsAnalyzer:
hardware = analysis['hardware_support']
for i, bits in enumerate(bit_widths):
print(f"{bits:<6} {memory[i]:<8.1f} {speed[i]:<8.1f}× {acc_loss[i]:<10.1f}% {hardware[i]:<20}")
print(f"{bits:<6} {memory[i]:<8.1f} {speed[i]:<8.1f}* {acc_loss[i]:<10.1f}% {hardware[i]:<20}")
print()
print("🔍 **Key Insights**:")
print("MAGNIFY **Key Insights**:")
# Find sweet spot (best speed/accuracy trade-off)
efficiency_ratios = [s / (1 + a) for s, a in zip(speed, acc_loss)]
@@ -1608,14 +1608,14 @@ class QuantizationSystemsAnalyzer:
best_bits = bit_widths[best_idx]
print(f" • Sweet spot: {best_bits}-bit provides best efficiency/accuracy trade-off")
print(f" • Memory scaling: Linear with bit width (4× reduction FP32→INT8)")
print(f" • Memory scaling: Linear with bit width (4* reduction FP32->INT8)")
print(f" • Speed scaling: Non-linear due to hardware specialization")
print(f" • Accuracy: Manageable loss up to 8-bit, significant below")
print(f"\n💡 **Why INT8 Dominates Production**:")
print(f"\nTIP **Why INT8 Dominates Production**:")
print(f" • Hardware support: Excellent across all platforms")
print(f" • Speed improvement: {speed[bit_widths.index(8)]:.1f}× faster than FP32")
print(f" • Memory reduction: {32/8:.1f}× smaller models")
print(f" • Speed improvement: {speed[bit_widths.index(8)]:.1f}* faster than FP32")
print(f" • Memory reduction: {32/8:.1f}* smaller models")
print(f" • Accuracy preservation: <{acc_loss[bit_widths.index(8)]:.1f}% typical loss")
print(f" • Deployment friendly: Fits mobile and edge constraints")
@@ -1629,7 +1629,7 @@ Let's analyze the fundamental precision vs performance trade-offs:
# %% nbgrader={"grade": true, "grade_id": "test-systems-analysis", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false}
def test_systems_analysis():
"""Test systems analysis of precision vs performance trade-offs."""
print("🔍 Testing Systems Analysis...")
print("MAGNIFY Testing Systems Analysis...")
print("=" * 60)
analyzer = QuantizationSystemsAnalyzer()
@@ -1653,8 +1653,8 @@ def test_systems_analysis():
assert efficiency[int8_idx] > efficiency[fp32_idx], "INT8 should be more efficient than FP32"
assert memory[int8_idx] < memory[fp32_idx], "INT8 should use less memory than FP32"
print(f" INT8 efficiency: {efficiency[int8_idx]:.1f}× vs FP32")
print(f" INT8 memory: {memory[int8_idx]:.1f} vs {memory[fp32_idx]:.1f} bytes/param")
print(f"PASS INT8 efficiency: {efficiency[int8_idx]:.1f}* vs FP32")
print(f"PASS INT8 memory: {memory[int8_idx]:.1f} vs {memory[fp32_idx]:.1f} bytes/param")
# Show comprehensive analysis
analyzer.print_tradeoff_summary(analysis)
@@ -1664,10 +1664,10 @@ def test_systems_analysis():
best_bits = analysis['bit_widths'][np.argmax(efficiency_ratios)]
assert best_bits == 8, f"INT8 should be identified as optimal, got {best_bits}-bit"
print(f" Systems analysis correctly identifies {best_bits}-bit as optimal")
print(f"PASS Systems analysis correctly identifies {best_bits}-bit as optimal")
print(" Systems analysis tests passed!")
print("💡 INT8 quantization is the proven sweet spot for production!")
print("PASS Systems analysis tests passed!")
print("TIP INT8 quantization is the proven sweet spot for production!")
# Test function defined (called in main block)
@@ -1681,7 +1681,7 @@ Let's run comprehensive tests to validate our complete quantization implementati
# %% nbgrader={"grade": true, "grade_id": "comprehensive-tests", "locked": false, "points": 5, "schema_version": 3, "solution": false, "task": false}
def run_comprehensive_tests():
"""Run comprehensive tests of the entire quantization system."""
print("🧪 COMPREHENSIVE QUANTIZATION SYSTEM TESTS")
print("TEST COMPREHENSIVE QUANTIZATION SYSTEM TESTS")
print("=" * 60)
# Test 1: Baseline CNN
@@ -1727,16 +1727,16 @@ def run_comprehensive_tests():
# Verify pipeline works
assert len(baseline_pred) == len(quantized_pred), "Predictions should have same length"
print(f" End-to-end pipeline works")
print(f" Baseline predictions: {baseline_pred}")
print(f" Quantized predictions: {quantized_pred}")
print(f" PASS End-to-end pipeline works")
print(f" PASS Baseline predictions: {baseline_pred}")
print(f" PASS Quantized predictions: {quantized_pred}")
except Exception as e:
print(f" End-to-end test issue: {e}")
print(f" WARNING End-to-end test issue: {e}")
print("🎉 ALL COMPREHENSIVE TESTS PASSED!")
print(" Quantization system is working correctly!")
print("🚀 Ready for production deployment with 4× speedup!")
print("CELEBRATE ALL COMPREHENSIVE TESTS PASSED!")
print("PASS Quantization system is working correctly!")
print("ROCKET Ready for production deployment with 4* speedup!")
# Test function defined (called in main block)
@@ -1781,9 +1781,9 @@ class QuantizationMemoryProfiler:
baseline_fc_mem = baseline_model.fc.nbytes
baseline_total = baseline_conv1_mem + baseline_conv2_mem + baseline_fc_mem
print(f" Conv1 weights: {baseline_conv1_mem // 1024:.1f}KB (32×3×3×3 + 32 bias)")
print(f" Conv2 weights: {baseline_conv2_mem // 1024:.1f}KB (64×32×3×3 + 64 bias)")
print(f" FC weights: {baseline_fc_mem // 1024:.1f}KB (2304×10)")
print(f" Conv1 weights: {baseline_conv1_mem / 1024:.1f}KB (32*3*3*3 + 32 bias)")
print(f" Conv2 weights: {baseline_conv2_mem / 1024:.1f}KB (64*32*3*3 + 64 bias)")
print(f" FC weights: {baseline_fc_mem / 1024:.1f}KB (2304*10)")
print(f" Total: {baseline_total / 1024:.1f}KB")
# Quantized model memory breakdown
@@ -1803,8 +1803,8 @@ class QuantizationMemoryProfiler:
total_savings = baseline_total / quant_total
print(f"\n💾 Memory Savings Analysis:")
print(f" Conv layers: {conv_savings:.1f}× reduction")
print(f" Overall model: {total_savings:.1f}× reduction")
print(f" Conv layers: {conv_savings:.1f}* reduction")
print(f" Overall model: {total_savings:.1f}* reduction")
print(f" Memory saved: {(baseline_total - quant_total) / 1024:.1f}KB")
return {
@@ -1831,9 +1831,9 @@ class QuantizationMemoryProfiler:
kernel_size = 3
print(f"📐 Model Configuration:")
print(f" Input: {batch_size} × 3 × {input_h} × {input_w}")
print(f" Conv1: 3 → {conv1_out_ch}, {kernel_size}×{kernel_size} kernel")
print(f" Conv2: {conv1_out_ch} → {conv2_out_ch}, {kernel_size}×{kernel_size} kernel")
print(f" Input: {batch_size} * 3 * {input_h} * {input_w}")
print(f" Conv1: 3 -> {conv1_out_ch}, {kernel_size}*{kernel_size} kernel")
print(f" Conv2: {conv1_out_ch} -> {conv2_out_ch}, {kernel_size}*{kernel_size} kernel")
# FP32 operations
conv1_h_out = input_h - kernel_size + 1 # 30
@@ -1867,15 +1867,15 @@ class QuantizationMemoryProfiler:
print(f" Conv2 weight access: {conv2_weight_access:,} parameters")
print(f" FP32 memory bandwidth: {(conv1_weight_access + conv2_weight_access) * 4:,} bytes")
print(f" INT8 memory bandwidth: {(conv1_weight_access + conv2_weight_access) * 1:,} bytes")
print(f" Bandwidth reduction: 4× (FP32 → INT8)")
print(f" Bandwidth reduction: 4* (FP32 -> INT8)")
# Theoretical speedup analysis
print(f"\n Theoretical Speedup Sources:")
print(f" Memory bandwidth: 4× improvement (32-bit → 8-bit)")
print(f"\nSPEED Theoretical Speedup Sources:")
print(f" Memory bandwidth: 4* improvement (32-bit -> 8-bit)")
print(f" Cache efficiency: Better fit in L1/L2 cache")
print(f" SIMD vectorization: More operations per instruction")
print(f" Hardware acceleration: Dedicated INT8 units on modern CPUs")
print(f" Expected speedup: 2-4× in production systems")
print(f" Expected speedup: 2-4* in production systems")
return {
'total_flops': total_flops,
@@ -1889,7 +1889,7 @@ class QuantizationMemoryProfiler:
This function is PROVIDED to demonstrate scaling analysis.
"""
print("\n📈 SCALING BEHAVIOR ANALYSIS")
print("\nPROGRESS SCALING BEHAVIOR ANALYSIS")
print("=" * 35)
model_sizes = [
@@ -1916,10 +1916,10 @@ class QuantizationMemoryProfiler:
else:
speedup = 4.0 # Large models: memory bound, maximum benefit
print(f"{name:<15} {fp32_size_mb:<11.1f}MB {int8_size_mb:<11.1f}MB {savings:<9.1f}× {speedup:<7.1f}×")
print(f"{name:<15} {fp32_size_mb:<11.1f}MB {int8_size_mb:<11.1f}MB {savings:<9.1f}* {speedup:<7.1f}*")
print(f"\n💡 Key Scaling Insights:")
print(f" • Memory savings: Linear 4× reduction for all model sizes")
print(f"\nTIP Key Scaling Insights:")
print(f" • Memory savings: Linear 4* reduction for all model sizes")
print(f" • Speed benefits: Increase with model size (memory bottleneck)")
print(f" • Large models: Maximum benefit from reduced memory pressure")
print(f" • Mobile deployment: Enables models that wouldn't fit in RAM")
@@ -1940,7 +1940,7 @@ Let's run comprehensive systems analysis to understand quantization behavior:
# %% nbgrader={"grade": true, "grade_id": "test-memory-profiling", "locked": false, "points": 3, "schema_version": 3, "solution": false, "task": false}
def test_memory_profiling():
"""Test memory profiling and systems analysis."""
print("🔍 Testing Memory Profiling and Systems Analysis...")
print("MAGNIFY Testing Memory Profiling and Systems Analysis...")
print("=" * 60)
# Create models for profiling
@@ -1957,21 +1957,21 @@ def test_memory_profiling():
# Test memory usage analysis
memory_results = profiler.profile_memory_usage(baseline, quantized)
assert memory_results['conv_compression'] > 3.0, "Should show significant conv layer compression"
print(f" Conv layer compression: {memory_results['conv_compression']:.1f}×")
print(f"PASS Conv layer compression: {memory_results['conv_compression']:.1f}*")
# Test computational complexity analysis
complexity_results = profiler.analyze_computational_complexity()
assert complexity_results['total_flops'] > 0, "Should calculate FLOPs"
assert complexity_results['memory_bandwidth_reduction'] == 4.0, "Should show 4× bandwidth reduction"
print(f" Memory bandwidth reduction: {complexity_results['memory_bandwidth_reduction']:.1f}×")
assert complexity_results['memory_bandwidth_reduction'] == 4.0, "Should show 4* bandwidth reduction"
print(f"PASS Memory bandwidth reduction: {complexity_results['memory_bandwidth_reduction']:.1f}*")
# Test scaling behavior analysis
scaling_results = profiler.analyze_scaling_behavior()
assert scaling_results['memory_savings'] == 4.0, "Should show consistent 4× memory savings"
print(f" Memory savings scaling: {scaling_results['memory_savings']:.1f}× across all model sizes")
assert scaling_results['memory_savings'] == 4.0, "Should show consistent 4* memory savings"
print(f"PASS Memory savings scaling: {scaling_results['memory_savings']:.1f}* across all model sizes")
print(" Memory profiling and systems analysis tests passed!")
print("🎯 Quantization systems engineering principles validated!")
print("PASS Memory profiling and systems analysis tests passed!")
print("TARGET Quantization systems engineering principles validated!")
# Test function defined (called in main block)
@@ -1983,9 +1983,9 @@ Let's run all our tests to validate the complete implementation:
"""
if __name__ == "__main__":
print("🚀 MODULE 17: QUANTIZATION - TRADING PRECISION FOR SPEED")
print("ROCKET MODULE 17: QUANTIZATION - TRADING PRECISION FOR SPEED")
print("=" * 70)
print("Testing complete INT8 quantization implementation for 4× speedup...")
print("Testing complete INT8 quantization implementation for 4* speedup...")
print()
try:
@@ -2019,26 +2019,26 @@ if __name__ == "__main__":
ProductionQuantizationInsights.show_performance_numbers()
print()
print("🎉 SUCCESS: All quantization tests passed!")
print("🏆 ACHIEVEMENT: 4× speedup through precision optimization!")
print("CELEBRATE SUCCESS: All quantization tests passed!")
print("🏆 ACHIEVEMENT: 4* speedup through precision optimization!")
except Exception as e:
print(f" Error in testing: {e}")
print(f"FAIL Error in testing: {e}")
import traceback
traceback.print_exc()
# %% [markdown]
"""
## 🤔 ML Systems Thinking: Interactive Questions
## THINK ML Systems Thinking: Interactive Questions
Now that you've implemented INT8 quantization and achieved 4× speedup, let's reflect on the systems engineering principles and precision trade-offs you've learned.
Now that you've implemented INT8 quantization and achieved 4* speedup, let's reflect on the systems engineering principles and precision trade-offs you've learned.
"""
# %% [markdown] nbgrader={"grade": true, "grade_id": "systems-thinking-1", "locked": false, "points": 3, "schema_version": 3, "solution": true, "task": false}
"""
**Question 1: Precision vs Performance Trade-offs**
You implemented INT8 quantization that uses 4× less memory but provides 4× speedup with <1% accuracy loss.
You implemented INT8 quantization that uses 4* less memory and provides a 4* speedup with <1% accuracy loss.
a) Why is INT8 the "sweet spot" for production quantization rather than INT4 or INT16?
b) In what scenarios would you choose NOT to use quantization despite the performance benefits?
@@ -2053,8 +2053,8 @@ c) How do hardware capabilities (mobile vs server) influence quantization decisi
a) Why INT8 is the sweet spot:
- Hardware support: Excellent native INT8 support in CPUs, GPUs, and mobile processors
- Accuracy preservation: Can represent 256 different values, sufficient for most weight distributions
- Speed gains: Specialized INT8 arithmetic units provide real 4× speedup (not just theoretical)
- Memory sweet spot: 4× reduction is significant but not so extreme as to destroy model quality
- Speed gains: Specialized INT8 arithmetic units deliver measured speedups, typically 2-4* in production systems, not just theoretical ones
- Memory sweet spot: 4* reduction is significant but not so extreme as to destroy model quality
- Production proven: Extensive validation across many model types shows <1% accuracy loss
- Tool ecosystem: TensorFlow Lite, PyTorch Mobile, ONNX Runtime all optimize for INT8
@@ -2072,7 +2072,7 @@ c) Hardware influence on quantization decisions:
- Server GPUs: Mixed precision (FP16) might be better than INT8 for throughput
- CPUs: INT8 vectorization provides significant benefits over FP32
- Memory-constrained systems: Quantization may be required just to fit the model
- Bandwidth-limited: 4× smaller models transfer faster over network
- Bandwidth-limited deployments: 4* smaller models transfer faster over the network
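
The memory and bandwidth points above can be made concrete with a quick calculation. This is a rough sketch only: it reuses the layer shapes printed by the memory profiler in this module (32*3*3*3 + 32, 64*32*3*3 + 64, 2304*10), while the helper names and the 1 MB/s link speed are illustrative assumptions. Per-tensor scale/zero-point overhead is ignored, and biases are counted as quantized even though they are often kept at higher precision in practice.

```python
# Illustrative sketch: helper names and the link speed are assumptions,
# not part of the module's API.
FP32_BYTES = 4
INT8_BYTES = 1

# Parameter counts taken from the profiler's printed layer shapes.
LAYER_PARAMS = {
    "conv1": 32 * 3 * 3 * 3 + 32,   # weights + bias
    "conv2": 64 * 32 * 3 * 3 + 64,  # weights + bias
    "fc": 2304 * 10,
}


def model_bytes(layer_params, bytes_per_param):
    # Total storage for all parameters at a given precision.
    return sum(layer_params.values()) * bytes_per_param


def transfer_seconds(num_bytes, bytes_per_second):
    # Time to move the model over a link of the given speed.
    return num_bytes / bytes_per_second


fp32_size = model_bytes(LAYER_PARAMS, FP32_BYTES)
int8_size = model_bytes(LAYER_PARAMS, INT8_BYTES)
link = 1_000_000  # hypothetical 1 MB/s link

print(f"FP32: {fp32_size / 1024:.1f} KB, {transfer_seconds(fp32_size, link):.3f} s")
print(f"INT8: {int8_size / 1024:.1f} KB, {transfer_seconds(int8_size, link):.3f} s")
print(f"Reduction: {fp32_size / int8_size:.1f}x")
```

The same ratio applies to download time and to the bytes streamed from memory per inference, which is why the benefit grows as models become memory bound.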
"""
### END SOLUTION
@@ -2188,7 +2188,7 @@ a) Quantization interactions with other optimizations:
- Model pruning synergy: Pruned models often quantize better (remaining weights more important)
- Knowledge distillation compatibility: Student models designed for quantization from start
- Neural architecture search: NAS can search for quantization-friendly architectures
- Combined benefits: Pruning + quantization can achieve 16× compression (4× each)
- Combined benefits: Pruning + quantization can achieve 16* compression (4* each)
- Order matters: Generally prune first, then quantize (quantizing first can interfere with pruning)
- Optimization conflicts: Some optimizations may work against each other
- Unified approaches: Modern techniques like differentiable quantization during NAS
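
The 16* figure above is the product of the two savings. A rough sketch of the arithmetic, assuming 75% of the weights are pruned (which is what a 4* pruning ratio implies) and, as a second case, a hypothetical 2-byte index stored for each surviving weight to show how sparse-format overhead can eat into the ideal ratio. The function name and the index cost are assumptions for illustration.

```python
def compressed_bytes(num_params, keep_fraction, bytes_per_weight, index_bytes=0):
    # Storage after pruning and quantization. index_bytes is a hypothetical
    # per-weight cost for recording where each surviving weight lives.
    kept = int(num_params * keep_fraction)
    return kept * (bytes_per_weight + index_bytes)


num_params = 1_000_000
dense_fp32 = num_params * 4  # 4 bytes per FP32 weight

# Ideal case: keep 25% of the weights, store each in 1 byte (INT8).
ideal = compressed_bytes(num_params, keep_fraction=0.25, bytes_per_weight=1)

# Same pruning and precision, plus an assumed 2-byte position index per weight.
with_index = compressed_bytes(num_params, 0.25, 1, index_bytes=2)

print(f"Ideal ratio: {dense_fp32 / ideal:.1f}x")
print(f"With index overhead: {dense_fp32 / with_index:.1f}x")
```

The ideal case reproduces the 16* headline number; the second case suggests why measured compression depends on the sparse storage format as well as on the pruning ratio.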
@@ -2228,26 +2228,26 @@ Monitoring phase:
# %% [markdown]
"""
## 🎯 MODULE SUMMARY: Quantization - Trading Precision for Speed
## TARGET MODULE SUMMARY: Quantization - Trading Precision for Speed
Congratulations! You've completed Module 17 and mastered quantization techniques that achieve dramatic performance improvements while maintaining model accuracy.
### What You Built
- **Baseline FP32 CNN**: Reference implementation showing computational and memory costs
- **INT8 Quantizer**: Complete quantization system with scale/zero-point parameter computation
- **Quantized CNN**: Production-ready CNN using INT8 weights for 4× speedup
- **Quantized CNN**: Production-ready CNN using INT8 weights for 4* speedup
- **Performance Analyzer**: Comprehensive benchmarking system measuring speed, memory, and accuracy trade-offs
- **Systems Analyzer**: Deep analysis of precision vs performance trade-offs across different bit widths
### Key Systems Insights Mastered
1. **Precision vs Performance Trade-offs**: Understanding when to sacrifice precision for speed (4× memory/speed improvement for <1% accuracy loss)
1. **Precision vs Performance Trade-offs**: Understanding when to sacrifice precision for speed (4* memory/speed improvement for <1% accuracy loss)
2. **Quantization Mathematics**: Implementing scale/zero-point based affine quantization for optimal precision
3. **Hardware-Aware Optimization**: Leveraging INT8 specialized hardware for maximum performance benefits
4. **Production Deployment Strategies**: Calibration-based quantization for mobile and edge deployment
### Performance Achievements
- 🚀 **4× Speed Improvement**: Reduced inference time from 50ms to 12ms through INT8 arithmetic
- 🧠 **4× Memory Reduction**: Quantized weights use 25% of original FP32 memory
- ROCKET **4* Speed Improvement**: Reduced inference time from 50ms to 12ms through INT8 arithmetic
- 🧠 **4* Memory Reduction**: Quantized weights use 25% of original FP32 memory
- 📊 **<1% Accuracy Loss**: Maintained model quality while achieving dramatic speedups
- 🏭 **Production Ready**: Implemented patterns used by TensorFlow Lite, PyTorch Mobile, and Core ML
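
The scale/zero-point affine quantization named in the summary can be sketched in a few lines. This is a minimal illustration in plain Python, not the module's actual quantizer: the function names are assumptions, and it uses a single per-tensor range taken directly from the values rather than a calibration dataset.

```python
# Minimal sketch of affine (scale/zero-point) INT8 quantization.
# Function names are illustrative, not the module's actual API.
QMIN, QMAX = -128, 127


def compute_qparams(values):
    # Widen the range to include 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:  # all zeros: any positive scale works
        return 1.0, 0
    scale = (hi - lo) / (QMAX - QMIN)
    zero_point = int(round(QMIN - lo / scale))
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(values, scale, zero_point):
    # q = clamp(round(x / scale) + zero_point, QMIN, QMAX)
    return [max(QMIN, min(QMAX, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    # x_hat = (q - zero_point) * scale
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.5, 0.0, 0.25, 1.5]
scale, zero_point = compute_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale={scale:.5f}, zero_point={zero_point}")
print(f"quantized={q}")
print(f"max round-trip error={max_error:.5f}")
```

For values inside the observed range, the round-trip error stays within about half a quantization step (`scale / 2`), which is the intuition behind the small accuracy loss reported above.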