🤖 TinyGPT (2018) - Transformer Architecture
What This Demonstrates
Complete transformer language model using YOUR TinyTorch! The architecture that powers ChatGPT, built from YOUR implementations.
Prerequisites
Complete ALL these TinyTorch modules:
- Module 02 (Tensor) - Data structures
- Module 03 (Activations) - ReLU
- Module 04 (Layers) - Linear layers
- Module 05 (Networks) - Module base class
- Module 06 (Autograd) - Backprop through attention
- Module 08 (Optimizers) - Adam optimizer
- Module 12 (Embeddings) - Token embeddings, positional encoding
- Module 13 (Attention) - Multi-head self-attention
- Module 14 (Transformers) - LayerNorm, TransformerBlock
🚀 Quick Start
# Run transformer demo
python train_gpt.py
# This is a validation demo - no real training data needed
📊 Dataset Information
Demo Tokens Only
- No Real Dataset: Uses random tokens for architecture validation
- Purpose: Demonstrates the transformer works, not full training
- No Download Required: Synthetic data only
Why No Real Dataset?
Full language-model training requires:
- Large text corpora (GBs of data)
- Significant compute (GPU hours or days)
This example skips both and instead validates that YOUR architecture works, using synthetic tokens like the sketch below.
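A minimal sketch of how such a synthetic batch can be generated. This uses plain NumPy rather than the actual code in train_gpt.py, so the variable names and shapes here are illustrative assumptions, not the demo's API:
# Generate a batch of random token IDs for architecture validation (NumPy sketch)
import numpy as np
vocab_size, batch_size, seq_len = 100, 8, 16
tokens = np.random.randint(0, vocab_size, size=(batch_size, seq_len))
targets = np.roll(tokens, -1, axis=1)   # next-token targets (last position wraps, fine for a smoke test)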
🏗️ Architecture
Output Logits (Vocabulary Predictions)
↑
Output Projection
↑
Layer Norm
↑
╔══════════════════════════════╗
║    Transformer Block × 2     ║
║  ┌────────────────────────┐  ║
║  │       Layer Norm       │  ║
║  │           ↑            │  ║
║  │    Feed Forward Net    │  ║
║  │           ↑            │  ║
║  │       Layer Norm       │  ║
║  │           ↑            │  ║
║  │  Multi-Head Attention  │  ║
║  └────────────────────────┘  ║
╚══════════════════════════════╝
↑
Positional Encoding
↑
Token Embeddings
↑
Input Tokens
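Read bottom-to-top, the diagram corresponds to a forward pass like the following. This is only a structural sketch in NumPy with random weights and a placeholder block body; the names are illustrative and are not TinyTorch's API:
# Structural sketch of the forward pass (illustrative NumPy, not TinyTorch's API)
import numpy as np
vocab_size, embed_dim, seq_len, num_layers = 100, 32, 16, 2
embed = np.random.randn(vocab_size, embed_dim) * 0.02     # token embedding table
pos = np.random.randn(seq_len, embed_dim) * 0.02          # stand-in for positional encoding
W_out = np.random.randn(embed_dim, vocab_size) * 0.02     # output projection
def transformer_block(x):
    return x + np.tanh(x)                                  # placeholder for attention + FFN sublayers (with residual)
token_ids = np.random.randint(0, vocab_size, size=seq_len)
x = embed[token_ids] + pos                                 # Token Embeddings + Positional Encoding
for _ in range(num_layers):                                # Transformer Block × 2
    x = transformer_block(x)
x = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)   # final Layer Norm
logits = x @ W_out                                         # Output Projection -> (seq_len, vocab_size)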
📈 Demo Configuration
- Vocab Size: 100 tokens (tiny for demo)
- Embedding Dim: 32
- Attention Heads: 4
- Layers: 2 transformer blocks
- Context Length: 16 tokens
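The same settings collected into a plain dictionary (values copied from the list above; the key names are illustrative, not necessarily those used in train_gpt.py):
# Demo hyperparameters from the list above (key names are illustrative)
config = {
    "vocab_size": 100,   # tiny vocabulary for the demo
    "embed_dim": 32,     # token embedding dimension
    "num_heads": 4,      # attention heads (head_dim = 32 / 4 = 8)
    "num_layers": 2,     # transformer blocks
    "context_len": 16,   # maximum sequence length
}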
💡 What Makes Transformers Special
Self-Attention
Each token can "look at" all other tokens to understand context:
"The cat sat on the [MASK]"
↓
Attention looks at all words
↓
"mat" (understands context!)
Key Innovations YOUR Implementation Shows
- Attention: Context-aware representations
- Positional Encoding: Order matters in sequences
- Layer Norm: Stable deep network training
- Residual Connections: Information flow through layers
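Two of those pieces in a few lines of NumPy, as an illustrative sketch (the tanh is only a stand-in for a real attention or feed-forward sublayer):
# Sinusoidal positional encoding, plus a residual + layer-norm step (NumPy sketch)
import numpy as np
def positional_encoding(seq_len, dim):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))   # (seq_len, dim)
def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)
x = np.random.randn(16, 32)                    # (seq_len, embed_dim) token embeddings
x = x + positional_encoding(16, 32)            # order information injected by addition
sublayer_out = np.tanh(x)                      # stand-in for an attention or feed-forward sublayer
x = layer_norm(x + sublayer_out)               # residual connection, then layer norm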
📚 What You Learn
- Complete transformer architecture from scratch
- How attention creates contextual understanding
- YOUR implementations power modern LLMs
- Foundation for GPT, BERT, ChatGPT, etc.
🔬 Systems Insights
- Memory: O(n²) for attention (sequence length squared)
- Compute: Highly parallelizable (unlike RNNs)
- Scaling: Stack more layers for more capability
- YOUR Version: Core math is identical to production!
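A quick back-of-the-envelope illustration of that O(n²) cost: the attention weight matrix stores one float32 score per token pair, per head, per example:
# Attention-matrix memory grows with the square of sequence length
for n in [16, 512, 4096, 32768]:
    bytes_per_matrix = n * n * 4          # one float32 score per token pair
    print(f"seq_len={n:>6}: {bytes_per_matrix / 1e6:.1f} MB per head per example")
# 16 -> ~0.001 MB, 512 -> ~1 MB, 4096 -> ~67 MB, 32768 -> ~4295 MB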
🚀 Real Training (Advanced)
To train a real language model:
1. Get a text dataset (WikiText, BookCorpus, etc.)
2. Tokenize the text into a vocabulary
3. Create a data loader that yields token sequences
4. Train for many epochs (GPU recommended)
5. Generate text autoregressively (see the sketch below)
This demo validates the architecture; real training is a much larger undertaking!
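For step 5, autoregressive generation is just a loop that feeds the model its own output. A hedged sketch (the model here is a random stand-in, and greedy argmax decoding is one choice among several; none of these names come from train_gpt.py):
# Greedy autoregressive generation (illustrative; `model` is a placeholder)
import numpy as np
def generate(model, prompt_ids, max_new_tokens=20, context_len=16):
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = tokens[-context_len:]              # keep only what fits in the context window
        logits = model(np.array([context]))          # (1, len(context), vocab_size)
        next_id = int(np.argmax(logits[0, -1]))      # greedy pick of the next token; sampling also works
        tokens.append(next_id)
    return tokens
dummy_model = lambda ids: np.random.randn(1, ids.shape[1], 100)   # stand-in for a trained model
print(generate(dummy_model, [1, 2, 3]))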