Vijay Janapa Reddi a9fed98b66 Clean up repository: remove temp files, organize modules, prepare for PyPI publication
- Removed temporary test files and audit reports
- Deleted backup and temp_holding directories
- Reorganized module structure (07->09 spatial, 09->07 dataloader)
- Added new modules: 11-14 (tokenization, embeddings, attention, transformers)
- Updated examples with historical ML milestones
- Cleaned up documentation structure
2025-09-24 10:13:37 -04:00

Module 11: Tokenization - Text Processing for Language Models

Overview

This module implements the text processing pipeline that converts raw text into the integer sequences that neural networks consume. You'll build character-level and subword tokenizers from scratch and examine the trade-off between vocabulary size and sequence length, which drives a model's memory use and compute cost.

What You'll Learn

Core Implementations

  • Character Tokenizer: Simple character-level tokenization with special tokens
  • BPE Tokenizer: Byte Pair Encoding for efficient subword units
  • Vocabulary Management: Bidirectional mappings between text and indices
  • Padding & Truncation: Batch processing utilities for uniform sequences
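To make the BPE item above concrete, here is a minimal sketch of how merge rules can be learned and applied. It is an illustration only and not the module's BPETokenizer: the function names learn_bpe_merges, apply_merge, and bpe_encode are hypothetical, words are split on whitespace, and no end-of-word marker or special tokens are used.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings (hypothetical helper)."""
    # Each whitespace-separated word starts as a tuple of single characters.
    word_freqs = Counter(tuple(word) for text in corpus for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the newly merged symbol.
        updated = Counter()
        for symbols, freq in word_freqs.items():
            updated[apply_merge(symbols, best)] += freq
        word_freqs = updated
    return merges


def bpe_encode(word, merges):
    """Split `word` into subword symbols by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["low low low lower lowest"], num_merges=2)
print(bpe_encode("lower", merges))
```

Each merge adds one entry to the vocabulary and shortens every sequence that contains the merged pair, which is the vocabulary-size versus sequence-length trade-off in miniature.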

ML Systems Concepts

  • Memory Efficiency: How vocabulary size affects model parameters
  • Performance Optimization: Tokenization throughput and caching strategies
  • Scaling Trade-offs: Vocabulary size vs sequence length vs compute
  • Production Patterns: Efficient text processing for large-scale systems
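As a rough illustration of the memory point above, the following sketch computes the size of an embedding table for two vocabulary sizes. The embedding dimension of 512 and the 4 bytes per parameter (float32) are assumptions chosen for the example, and embedding_table_bytes is a hypothetical helper, not part of TinyTorch.

```python
def embedding_table_bytes(vocab_size, embedding_dim, bytes_per_param=4):
    """Bytes needed for a vocab_size x embedding_dim table (4 bytes assumes float32)."""
    return vocab_size * embedding_dim * bytes_per_param


EMBEDDING_DIM = 512  # assumed for illustration

for name, vocab_size in [("char-level", 256), ("BPE", 50_000)]:
    params = vocab_size * EMBEDDING_DIM
    size_mb = embedding_table_bytes(vocab_size, EMBEDDING_DIM) / 1e6
    print(f"{name}: {params:,} parameters, {size_mb:.1f} MB")
```

Under these assumptions the table grows linearly with vocabulary size, so moving from 256 to 50,000 entries multiplies its footprint by roughly 195.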

Performance Engineering

  • Tokenization Profiling: Measuring speed and memory usage
  • Cache Optimization: Reducing repeated tokenization overhead
  • Batch Processing: Efficient handling of multiple texts
  • Scaling Analysis: Understanding performance with large texts
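One way to approach the profiling and caching items above is sketched below using only the standard library. The encode_chars function is a stand-in tokenizer written for this example, not the module's implementation, and the measured numbers will vary by machine.

```python
import time
from functools import lru_cache


def encode_chars(text):
    """Stand-in tokenizer: map each character to its Unicode code point."""
    return tuple(ord(ch) for ch in text)


@lru_cache(maxsize=10_000)
def encode_cached(text):
    """Same encoding, but results for repeated texts are served from a cache."""
    return encode_chars(text)


def tokens_per_second(encode, texts):
    """Measure throughput of `encode` over `texts` in tokens per second."""
    start = time.perf_counter()
    total_tokens = sum(len(encode(text)) for text in texts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


# A workload with heavy repetition, where caching has the most to offer.
texts = ["the same sentence appears many times"] * 10_000

print(f"uncached: {tokens_per_second(encode_chars, texts):,.0f} tokens/s")
print(f"cached:   {tokens_per_second(encode_cached, texts):,.0f} tokens/s")
print(encode_cached.cache_info())
```

The cache returns tuples rather than lists so that cached results cannot be mutated by callers. On workloads with few repeated texts the cache adds hashing overhead and memory without a matching benefit, so it is worth measuring the hit rate with cache_info() before relying on it.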

Key Learning Outcomes

By completing this module, you'll understand:

  1. Text-to-Numbers Pipeline: How raw text becomes neural network input
  2. Tokenization Strategies: Character vs subword vs word-level approaches
  3. Systems Trade-offs: Vocabulary size impacts on memory and compute
  4. Performance Engineering: Optimizing text processing for production
  5. Language Model Foundation: How tokenization affects model capabilities
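The text-to-numbers pipeline in the first outcome can be sketched end to end as follows. SimpleCharTokenizer is a hypothetical class written for this example; its constructor arguments, special token names, and encode_batch method are assumptions and may differ from the module's CharTokenizer.

```python
class SimpleCharTokenizer:
    """Hypothetical character tokenizer with padding and unknown tokens."""

    def __init__(self, texts, pad_token="<PAD>", unk_token="<UNK>"):
        # Special tokens take the lowest ids, followed by characters in sorted order.
        self.id_to_token = [pad_token, unk_token] + sorted(set("".join(texts)))
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}
        self.pad_id = self.token_to_id[pad_token]
        self.unk_id = self.token_to_id[unk_token]

    def encode(self, text):
        """Map each character to its id; unseen characters map to the unknown id."""
        return [self.token_to_id.get(ch, self.unk_id) for ch in text]

    def decode(self, ids):
        """Map ids back to text, dropping padding."""
        return "".join(self.id_to_token[i] for i in ids if i != self.pad_id)

    def encode_batch(self, texts, max_length):
        """Encode several texts, truncating or padding each to max_length ids."""
        batch = []
        for text in texts:
            ids = self.encode(text)[:max_length]
            ids += [self.pad_id] * (max_length - len(ids))
            batch.append(ids)
        return batch


tokenizer = SimpleCharTokenizer(["hello world"])
batch = tokenizer.encode_batch(["hello", "hello world"], max_length=8)
print(batch)                       # two rows, each exactly 8 ids long
print(tokenizer.decode(batch[0]))  # hello
```

Every row of the batch has the same length, which is what lets the result be stacked into a single rectangular tensor for a model.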

Files in This Module

  • tokenization_dev.py - Main implementation file with all tokenizers
  • tokenization_dev.ipynb - Jupyter notebook (auto-generated)
  • module.yaml - Module configuration and metadata
  • README.md - This documentation file

Usage Example

from tinytorch.core.tokenization import CharTokenizer, BPETokenizer

# Character-level tokenization
char_tokenizer = CharTokenizer()
tokens = char_tokenizer.encode("Hello world!")
text = char_tokenizer.decode(tokens)

# BPE tokenization
bpe_tokenizer = BPETokenizer(vocab_size=1000)
bpe_tokenizer.train(["Hello world", "World hello", "Hello hello world"])
tokens = bpe_tokenizer.encode("Hello world!")

Integration with TinyTorch

This module exports to tinytorch.core.tokenization and provides the text processing foundation for:

  • Embedding layers (Module 12) - Converting tokens to vectors
  • Language models (Module 14+) - Processing text sequences
  • Training pipelines - Efficient batch text processing

Systems Engineering Focus

This module emphasizes the systems engineering aspects of tokenization:

Performance Characteristics

  • Character tokenization: Small vocab (~256), long sequences
  • BPE tokenization: Medium vocab (~50k), shorter sequences
  • Memory scaling: O(vocab_size × embedding_dim) for embedding tables
  • Attention scaling: O(sequence_length²) compute and memory for self-attention in transformer models
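The quadratic attention term above is why shorter token sequences matter. The sketch below compares the number of attention score entries for the same text under two tokenizations. The text length of 2,000 characters and the average of 4 characters per subword token are assumptions chosen for illustration, not measurements.

```python
def attention_score_entries(sequence_length):
    """Entries in one sequence_length x sequence_length attention score matrix."""
    return sequence_length ** 2


NUM_CHARS = 2_000    # assumed length of the input text in characters
CHARS_PER_TOKEN = 4  # assumed average for a subword tokenizer

char_level = attention_score_entries(NUM_CHARS)
subword = attention_score_entries(NUM_CHARS // CHARS_PER_TOKEN)

print(f"char-level: {char_level:,} entries")
print(f"subword:    {subword:,} entries")
print(f"ratio:      {char_level // subword}x")  # 16x
```

Under these assumptions a 4x reduction in sequence length yields a 16x reduction in attention score entries per head, which is the compute side of the trade-off against the larger embedding table.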

Production Considerations

  • Tokenization can become a bottleneck in training pipelines
  • Efficient string processing is critical for high-throughput systems
  • Caching tokenized output avoids re-encoding repeated texts and can provide significant speedups
  • Vocabulary size affects model download size and memory usage

Prerequisites

  • Module 02: Tensor (for basic data structures)
  • Understanding of string processing and algorithms

Estimated Time

4-5 hours including implementation, testing, and analysis

Next Steps

After completing this module, you'll be ready for:

  • Module 12: Embeddings - Converting tokens to dense vector representations
  • Module 13: Attention - Processing sequences with attention mechanisms
  • Module 14: Transformers - Complete language model architectures