mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-06-03 17:30:53 -05:00

Vijay Janapa Reddi 4d70e308ff refactor: Update embeddings module to match tokenization style

2025-10-25 14:58:30 -04:00

17. Quantization

Reducing Model Size Without Losing Accuracy

Quantization is a key technique for deploying ML models in production, especially on edge devices. In this module, you'll learn how to reduce model size and speed up inference by converting floating-point weights to lower-precision formats.

What You'll Build

INT8 Quantization: Convert 32-bit floats to 8-bit integers
Quantization-Aware Training: Train models that quantize well
Dynamic Quantization: Quantize activations at runtime
Static Quantization: Pre-compute quantization parameters from calibration data
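
Since the module is still under development, the following is only a minimal sketch of what INT8 quantization could look like, not the TinyTorch implementation. It assumes an affine (scale plus zero-point) scheme over plain Python lists; the names `compute_qparams`, `quantize`, and `dequantize` are hypothetical. The last few lines illustrate the difference between static parameters (fixed ahead of time from sample data) and dynamic parameters (recomputed from each runtime input).

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed INT8 range


def compute_qparams(values: List[float]) -> Tuple[float, int]:
    """Derive an affine (scale, zero_point) pair covering the observed range."""
    # Extend the range to include 0.0 so that zero maps to an exact integer.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:  # every value is zero; any positive scale works
        return 1.0, 0
    scale = (hi - lo) / (QMAX - QMIN)
    zero_point = round(QMIN - lo / scale)
    zero_point = max(QMIN, min(QMAX, zero_point))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to INT8 codes, clamping anything outside the range."""
    return [max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values]


def dequantize(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map INT8 codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


# Round-trip a small, made-up weight vector.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
w_scale, w_zero = compute_qparams(weights)
codes = quantize(weights, w_scale, w_zero)
restored = dequantize(codes, w_scale, w_zero)

# Static quantization: parameters are fixed once from sample (calibration) data.
calibration_activations = [-2.0, 0.5, 3.0]
static_scale, static_zero = compute_qparams(calibration_activations)

# Dynamic quantization: parameters are recomputed from each runtime input.
runtime_activations = [0.1, 0.7, 1.2]
dynamic_scale, dynamic_zero = compute_qparams(runtime_activations)
```

In this sketch the dynamic scale is smaller than the static one because the runtime input spans a narrower range than the calibration data, so it gets finer resolution at the cost of computing the parameters on every call.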

Why This Matters

Modern ML models are often too large for deployment:

Large GPT-style models can require hundreds of gigabytes for their weights alone
Mobile devices have limited memory
Edge computing requires efficient models
Quantizing FP32 weights to INT8 cuts weight storage by about 75% (4 bytes down to 1 byte per weight), often with minimal accuracy loss
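
The 75% figure follows from bytes per weight: 32-bit floats take 4 bytes and 8-bit integers take 1. The sketch below works through that arithmetic; the helper names are hypothetical, and it ignores the small overhead of storing scales and zero points.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_bytes(num_params: int, dtype: str) -> int:
    """Storage needed for the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


def size_reduction(num_params: int, src: str = "fp32", dst: str = "int8") -> float:
    """Fraction of weight storage saved by converting from src to dst."""
    before = model_size_bytes(num_params, src)
    after = model_size_bytes(num_params, dst)
    return 1.0 - after / before


# Illustrative parameter count: 175 billion, the published size of GPT-3.
num_params = 175_000_000_000
fp32_gb = model_size_bytes(num_params, "fp32") / 1e9  # 700 GB
int8_gb = model_size_bytes(num_params, "int8") / 1e9  # 175 GB
saved = size_reduction(num_params)                    # 0.75
```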

Learning Objectives

By the end of this module, you will:

Understand the trade-offs between model size and accuracy
Implement INT8 quantization from scratch
Build quantization-aware training pipelines
Measure the impact on model performance
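
As a rough illustration of the last two objectives, the sketch below shows one common way quantization-aware training is approximated: a "fake quantize" step in the forward pass, a straight-through estimator for gradients, and mean squared error as a simple impact measure. This is an assumption about the general technique, not the module's planned code; the function names are hypothetical and the scheme is symmetric INT8 over plain Python lists.

```python
from typing import List

QMAX = 127  # symmetric signed INT8 uses the range [-127, 127]


def symmetric_scale(values: List[float]) -> float:
    """Scale that maps the largest magnitude onto QMAX."""
    max_abs = max(abs(v) for v in values)
    return max_abs / QMAX if max_abs > 0 else 1.0


def fake_quantize(values: List[float], scale: float) -> List[float]:
    """Snap floats to the INT8 grid, then return them as floats again."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]


def ste_grad(values: List[float], upstream: List[float], scale: float) -> List[float]:
    """Straight-through estimator: pass gradients through unless the value was clipped."""
    limit = QMAX * scale
    return [g if -limit <= v <= limit else 0.0 for v, g in zip(values, upstream)]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up weights, used only to exercise the functions.
weights = [0.02, -0.5, 0.31, 1.27]
scale = symmetric_scale(weights)
quantized_view = fake_quantize(weights, scale)
error = mse(weights, quantized_view)

# Gradients flow for in-range values and are zeroed for clipped ones.
grads = ste_grad([0.5, 5.0], [1.0, 1.0], scale)
```

Because rounding moves each unclipped value by at most half a step, the error here is bounded by `(scale / 2) ** 2`; comparing that error (or task accuracy) before and after quantization is one simple way to measure impact.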

Prerequisites

Before starting this module, you should have completed:

Module 02: Tensor (for basic operations)
Module 04: Layers (for model structure)
Module 08: Training (for fine-tuning quantized models)

Real-World Applications

Quantization is widely used in production ML:

Mobile Apps: TensorFlow Lite supports INT8 quantization for on-device inference
Edge Devices: Raspberry Pi and Arduino deployment
Cloud Inference: Reducing serving costs at scale
Neural Processors: Apple Neural Engine, Google Edge TPU

Coming Up Next

After mastering quantization, you'll explore:

Module 18: Compression - Further model size reduction techniques
Module 19: Caching - Optimizing inference latency
Module 20: Benchmarking - Measuring the impact of optimizations

This module is currently under development. The implementation will cover practical quantization techniques used in production ML systems.
