# Module 11: Tokenization - Text Processing for Language Models

## Overview
This module implements the text processing systems that turn raw text into the numerical sequences neural networks consume. You'll build character-level and subword tokenizers from scratch and examine the trade-off between vocabulary size and sequence length that shapes model performance.

## What You'll Learn

### Core Implementations
- Character Tokenizer: Simple character-level tokenization with special tokens (a minimal sketch follows this list)
- BPE Tokenizer: Byte Pair Encoding for efficient subword units
- Vocabulary Management: Bidirectional mappings between text and indices
- Padding & Truncation: Batch processing utilities for uniform sequences
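
To make these pieces concrete, here is a minimal character-level sketch. It is not the module's implementation, and the names (`SimpleCharTokenizer`, `pad_sequence`) are illustrative only, but it shows the bidirectional vocabulary mapping, special-token handling, and padding/truncation you will build in `tokenization_dev.py`.

```python
# Minimal sketch of the ideas above -- not the module's actual implementation.
# Class and helper names (SimpleCharTokenizer, pad_sequence) are illustrative.

class SimpleCharTokenizer:
    """Character-level tokenizer with a bidirectional vocabulary and special tokens."""

    def __init__(self, pad_token="<PAD>", unk_token="<UNK>"):
        self.pad_token, self.unk_token = pad_token, unk_token
        # Special tokens occupy the first vocabulary slots.
        self.token_to_id = {pad_token: 0, unk_token: 1}
        self.id_to_token = {0: pad_token, 1: unk_token}

    def train(self, texts):
        """Build the vocabulary from every character seen in the training texts."""
        for ch in sorted(set("".join(texts))):
            if ch not in self.token_to_id:
                idx = len(self.token_to_id)
                self.token_to_id[ch] = idx
                self.id_to_token[idx] = ch

    def encode(self, text):
        """Map characters to integer ids, falling back to <UNK> for unseen characters."""
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(ch, unk_id) for ch in text]

    def decode(self, ids):
        """Map ids back to characters, skipping padding."""
        return "".join(self.id_to_token[i] for i in ids if self.id_to_token[i] != self.pad_token)


def pad_sequence(ids, max_length, pad_id=0):
    """Truncate or right-pad a token sequence to a fixed length for batching."""
    return ids[:max_length] + [pad_id] * max(0, max_length - len(ids))


tok = SimpleCharTokenizer()
tok.train(["Hello world"])
ids = tok.encode("Hello!")            # '!' was never seen, so it maps to <UNK>
print(ids, tok.decode(ids))
print(pad_sequence(ids, max_length=10))
```

The tokenizers you build in this module follow the same encode/decode contract; the BPE tokenizer additionally learns subword merges during `train()`.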

### ML Systems Concepts
- Memory Efficiency: How vocabulary size affects model parameters (estimated below)
- Performance Optimization: Tokenization throughput and caching strategies
- Scaling Trade-offs: Vocabulary size vs sequence length vs compute
- Production Patterns: Efficient text processing for large-scale systems
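
A back-of-envelope calculation makes the memory point concrete. The embedding dimension (512) and float32 precision below are illustrative assumptions, not values fixed by the module:

```python
# Back-of-envelope estimate of how vocabulary size drives embedding-table memory.
# The embedding dimension (512) and float32 precision are illustrative assumptions.

def embedding_table_mb(vocab_size, embedding_dim=512, bytes_per_param=4):
    """Memory for a vocab_size x embedding_dim table of float32 parameters."""
    return vocab_size * embedding_dim * bytes_per_param / 1e6

print(f"char-level (~256 tokens):  {embedding_table_mb(256):8.2f} MB")
print(f"BPE (~50,000 tokens):      {embedding_table_mb(50_000):8.2f} MB")
```

The larger BPE vocabulary costs two orders of magnitude more embedding memory, which it buys back through shorter sequences, as quantified later in this README.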

### Performance Engineering
- Tokenization Profiling: Measuring speed and memory usage (see the timing sketch after this list)
- Cache Optimization: Reducing repeated tokenization overhead
- Batch Processing: Efficient handling of multiple texts
- Scaling Analysis: Understanding performance with large texts
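
A rough throughput measurement is usually the first profiling step. The `encode` function below is a placeholder standing in for any tokenizer's encode method; swap in a real tokenizer to profile it:

```python
# Minimal throughput profile for a tokenizer: characters per second and tokens produced.
# `encode` is a stand-in for any encode method; replace it with a real tokenizer to profile.
import time

def encode(text):
    return [ord(ch) for ch in text]   # placeholder character-level encoding

text = "the quick brown fox jumps over the lazy dog " * 10_000  # ~440k characters

start = time.perf_counter()
tokens = encode(text)
elapsed = time.perf_counter() - start

print(f"{len(text):,} chars -> {len(tokens):,} tokens in {elapsed * 1e3:.1f} ms")
print(f"throughput: {len(text) / elapsed / 1e6:.1f} M chars/sec")
```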

## Key Learning Outcomes
By completing this module, you'll understand:
- Text-to-Numbers Pipeline: How raw text becomes neural network input
- Tokenization Strategies: Character vs subword vs word-level approaches
- Systems Trade-offs: Vocabulary size impacts on memory and compute
- Performance Engineering: Optimizing text processing for production
- Language Model Foundation: How tokenization affects model capabilities

## Files in This Module

- `tokenization_dev.py` - Main implementation file with all tokenizers
- `tokenization_dev.ipynb` - Jupyter notebook (auto-generated)
- `module.yaml` - Module configuration and metadata
- `README.md` - This documentation file

## Usage Example

```python
from tinytorch.core.tokenization import CharTokenizer, BPETokenizer

# Character-level tokenization
char_tokenizer = CharTokenizer()
tokens = char_tokenizer.encode("Hello world!")
text = char_tokenizer.decode(tokens)

# BPE tokenization
bpe_tokenizer = BPETokenizer(vocab_size=1000)
bpe_tokenizer.train(["Hello world", "World hello", "Hello hello world"])
tokens = bpe_tokenizer.encode("Hello world!")
```

## Integration with TinyTorch

This module exports to `tinytorch.core.tokenization` and provides the text processing foundation for:
- Embedding layers (Module 12) - Converting tokens to vectors (previewed in the sketch below)
- Language models (Module 14+) - Processing text sequences
- Training pipelines - Efficient batch text processing
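
As a preview of the hand-off to Module 12: token ids are simply row indices into an embedding table. The sketch below uses NumPy and arbitrary sizes as stand-ins; the real embedding layer is built on TinyTorch tensors:

```python
# Preview of how token ids feed an embedding layer (built properly in Module 12).
# NumPy stands in for TinyTorch tensors; the vocabulary size and dimension are arbitrary.
import numpy as np

vocab_size, embedding_dim = 1_000, 64
embedding_table = np.random.randn(vocab_size, embedding_dim).astype(np.float32)

token_ids = [17, 42, 42, 3]              # output of a tokenizer's encode()
vectors = embedding_table[token_ids]     # one row per token id
print(vectors.shape)                     # (4, 64): sequence_length x embedding_dim
```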

## Systems Engineering Focus
This module emphasizes the systems engineering aspects of tokenization:

### Performance Characteristics
- Character tokenization: Small vocab (~256), long sequences
- BPE tokenization: Medium vocab (~50k), shorter sequences
- Memory scaling: O(vocab_size × embedding_dim) for embedding tables
- Attention scaling: O(sequence_length²) for transformer models (compared numerically below)
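
Putting the two scaling rules together shows the trade-off for a single document. The sequence lengths, vocabulary sizes, and embedding dimension below are illustrative orders of magnitude, not measured values:

```python
# Rough comparison of the two scaling rules above for the same piece of text.
# Sequence lengths, vocabulary sizes, and the embedding dimension are illustrative.

def embedding_params(vocab_size, dim=512):
    return vocab_size * dim              # O(vocab_size x embedding_dim)

def attention_cost(seq_len):
    return seq_len ** 2                  # O(sequence_length^2) per attention layer

# A ~4,000-character document: char-level keeps every character as a token,
# while BPE typically compresses it to roughly a quarter as many tokens.
for name, vocab, seq_len in [("char", 256, 4_000), ("BPE", 50_000, 1_000)]:
    print(f"{name:>4}: {embedding_params(vocab):>12,} embedding params, "
          f"{attention_cost(seq_len):>12,} attention score entries")
```

Character-level tokenization minimizes embedding memory but pays quadratically at attention time; BPE does the reverse.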

### Production Considerations
- Tokenization can become a bottleneck in training pipelines
- Efficient string processing is critical for high-throughput systems
- Caching strategies provide significant speedups for repeated texts (see the memoization sketch below)
- Vocabulary size affects model download size and memory usage
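
One simple caching strategy is to memoize `encode` behind `functools.lru_cache`. The `CachedTokenizer` wrapper below is a sketch, not part of the module's API:

```python
# One way to cache repeated tokenizations: memoize encode() with functools.lru_cache.
# The CachedTokenizer wrapper is a sketch and not part of the module's API.
from functools import lru_cache
from tinytorch.core.tokenization import CharTokenizer

class CachedTokenizer:
    """Wraps any tokenizer so repeated texts skip re-tokenization."""

    def __init__(self, tokenizer, cache_size=100_000):
        # lru_cache needs hashable results, so cache tuples and convert back on the way out.
        self._cached = lru_cache(maxsize=cache_size)(
            lambda text: tuple(tokenizer.encode(text))
        )

    def encode(self, text):
        return list(self._cached(text))

tokenizer = CachedTokenizer(CharTokenizer())
tokenizer.encode("same log line, seen millions of times")   # computed once
tokenizer.encode("same log line, seen millions of times")   # served from the cache
```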

## Prerequisites
- Module 02: Tensor (for basic data structures)
- Understanding of string processing and algorithms

## Estimated Time
4-5 hours including implementation, testing, and analysis

## Next Steps
After completing this module, you'll be ready for:
- Module 12: Embeddings - Converting tokens to dense vector representations
- Module 13: Attention - Processing sequences with attention mechanisms
- Module 14: Transformers - Complete language model architectures