Milestone 05: The Transformer Era (2017)

Historical Context

In 2017, Vaswani et al. published "Attention Is All You Need," showing that attention mechanisms alone (no RNNs, no convolutions!) could achieve state-of-the-art results on sequence tasks. This breakthrough:

  • Replaced RNNs/LSTMs for sequence modeling
  • Enabled parallel training (unlike sequential RNNs)
  • Scaled to massive datasets and model sizes
  • Launched the era of GPT, BERT, and modern LLMs

Transformers didn't just improve NLP - they unified vision, language, and multimodal AI. Now it's your turn to build one from scratch using YOUR Tiny🔥Torch!

What You're Building

Character-level transformer models for text generation:

  1. Question Answering - Train on TinyTalks Q&A dataset
  2. Dialogue Generation - Generate coherent conversational responses

Required Modules

Run after Module 13 (Complete transformer stack)

Module     Component     What It Provides
---------  ------------  ---------------------------------------
Module 01  Tensor        YOUR data structure with autograd
Module 02  Activations   YOUR ReLU/GELU activations
Module 03  Layers        YOUR Linear layers
Module 04  Losses        YOUR CrossEntropyLoss
Module 05  DataLoader    YOUR data batching
Module 06  Autograd      YOUR automatic differentiation
Module 07  Optimizers    YOUR Adam optimizer
Module 10  Tokenization  YOUR CharTokenizer
Module 11  Embeddings    YOUR token + positional embeddings
Module 12  Attention     YOUR multi-head self-attention
Module 13  Transformers  YOUR LayerNorm + TransformerBlock + GPT

Milestone Structure

This milestone ramps up in difficulty across three scripts:

00_vaswani_attention_proof.py (START HERE!)

Purpose: PROVE your attention mechanism works

  • Dataset: Auto-generated sequences (no files needed!)
  • Task: Reverse sequences [1,2,3,4] → [4,3,2,1]
  • Inspired By: "Attention Is All You Need" (sequence reversal is a classic sanity check for attention, not a benchmark from the paper itself)
  • Training Time: ~30 seconds
  • Expected: 95%+ accuracy
  • Key Learning: "My attention is computing relationships!"

Why This Is THE Test:

  • IMPOSSIBLE without attention working
  • Trains in 30 seconds (instant gratification!)
  • Binary pass/fail (95%+ or broken)
  • Proves Q·K·V computation works

🎯 Run this FIRST to verify your attention before complex tasks!
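
To see why reversal isolates attention: output position i must copy input position (length - 1 - i), which is exactly the kind of cross-position lookup attention provides. A minimal sketch of the training pairs (illustrative names, not the script's actual code):

import random

def make_reversal_pair(length=4, vocab_size=10):
    """One (input, target) pair: the target is the input reversed."""
    seq = [random.randrange(1, vocab_size) for _ in range(length)]
    return seq, seq[::-1]  # e.g. [3, 7, 1, 4] -> [4, 1, 7, 3]

src, tgt = make_reversal_pair()
print(src, "->", tgt)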

01_vaswani_generation.py

Purpose: Apply attention to real language (Q&A)

  • Dataset: TinyTalks (17.5 KB, 5 difficulty levels)
  • Task: Learn to answer questions (Q: ... A: ...)
  • Architecture: Character-level GPT with attention
  • Expected: Coherent responses in 3-5 minutes
  • Key Learning: "Attention learns long-range dependencies!"

Why TinyTalks?

  • Fast training = instant feedback
  • Clear Q&A format = easy to verify learning
  • Progressive difficulty = see capability growth
  • Ships with TinyTorch = no downloads
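
Under the hood, the script reduces Q&A text to next-character prediction. A rough sketch of how text becomes training pairs (hypothetical variable names; YOUR CharTokenizer from Module 10 plays the tokenization role):

# Character-level next-token prediction on Q&A-formatted text.
text = "Q: What color is the sky?\nA: The sky is blue.\n"

vocab = sorted(set(text))                     # one id per unique character
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = [stoi[ch] for ch in text]

# Input is a window of characters; the target is the same window
# shifted one character to the right.
context_len = 16
x, y = ids[:context_len], ids[1:context_len + 1]
print("".join(vocab[i] for i in x), "->", "".join(vocab[i] for i in y))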

02_vaswani_dialogue.py

Purpose: Generate natural conversational text

  • Dataset: Same TinyTalks, different framing
  • Task: Multi-turn dialogue generation
  • Expected: Context-aware responses
  • Key Learning: "Transformers capture conversation flow!"

What Makes This Special:

  • Same model architecture as GPT/ChatGPT (scaled down)
  • YOUR implementation from scratch (no magic!)
  • Proves attention mechanism works
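
Generation itself is a simple loop: predict a distribution over the next character, sample, append, repeat. A sketch of the idea (the model(ids) and tokenizer interfaces here are assumptions, not YOUR exact API):

import numpy as np

def generate(model, tokenizer, prompt, max_new=100, temperature=1.0):
    """Autoregressive sampling: each prediction is fed back as input."""
    ids = tokenizer.encode(prompt)        # assumed: returns a list of token ids
    for _ in range(max_new):
        logits = model(ids)[-1]           # assumed: next-char logits as np.ndarray
        logits = logits - logits.max()    # numerical stability
        probs = np.exp(logits / temperature)
        probs = probs / probs.sum()
        ids.append(int(np.random.choice(len(probs), p=probs)))
    return tokenizer.decode(ids)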

Expected Results

Script         Task              Context Length  Success Criteria                       Training Time
-------------  ----------------  --------------  -------------------------------------  -------------
01 (Q&A)       Answer questions  128 chars       Loss < 1.5, sensible word choices      3-5 min
02 (Dialogue)  Multi-turn chat   128 chars       Loss < 1.5, maintains topic coherence  3-5 min

Key Learning: Why Attention Revolutionized AI

Transformers solve the fundamental problems of RNNs:

Problem with RNNs:

  • Sequential processing → Can't parallelize (slow training)
  • Vanishing gradients → Struggles with long sequences
  • Fixed hidden state → Information bottleneck

Transformer Solution:

  • Attention mechanism → "Look at ANY position, weighted by relevance"
  • Parallel processing → Process entire sequence at once
  • Direct connections → Every position can attend to every other position

The Key Insight:

RNN:       Hidden state carries ALL information (bottleneck!)
           h[t] = f(h[t-1], x[t])                       ← Sequential, lossy

Attention: Directly access ANY past position (no bottleneck!)
           output[i] = Σ_j attention[i,j] × value[j]    ← Parallel, lossless

This is why GPT, BERT, T5, and modern LLMs all use transformers!
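
To ground that formula, here is a minimal single-head scaled dot-product attention in NumPy (shapes and names are illustrative, not YOUR Module 12 API):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq, seq) relevance scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V                                    # output[i] = Σ_j weights[i,j] · V[j]

seq_len, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8): every position mixes all positions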

Running the Milestone

cd milestones/05_2017_transformer

# Step 0: Attention sanity check (~30 seconds; start here)
python 00_vaswani_attention_proof.py

# Step 1: Q&A generation (run after Module 13)
python 01_vaswani_generation.py --epochs 5 --batch-size 4

# Step 2: Dialogue generation (run after Module 13)
python 02_vaswani_dialogue.py --epochs 5 --batch-size 4

Optional flags:

  • --levels N - Use first N difficulty levels (1-5)
  • --embed-dim D - Embedding dimension (default: 64)
  • --num-layers L - Number of transformer blocks (default: 3)
  • --num-heads H - Attention heads (default: 4)
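
For example, combining the flags above to train a larger model on only the two easiest difficulty levels:

python 01_vaswani_generation.py --epochs 5 --batch-size 4 \
    --levels 2 --embed-dim 128 --num-layers 4 --num-heads 8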

Further Reading

  • The Paper: Vaswani et al. (2017). "Attention Is All You Need"
  • Illustrated Transformer: http://jalammar.github.io/illustrated-transformer/
  • GPT Evolution: Radford et al. (2018, 2019); Brown et al. (2020). GPT-1/2/3 papers
  • BERT: Devlin et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"

Achievement Unlocked

After completing this milestone, you'll understand:

  • How self-attention computes context-aware representations
  • Why transformers parallelize better than RNNs
  • What positional embeddings do (inject position information; sketched below)
  • How GPT-style autoregressive generation works
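
Positional embeddings in particular fit in a few lines: they are just a second lookup table added to the token embeddings, so identical characters at different positions get different vectors (a NumPy sketch with illustrative shapes):

import numpy as np

vocab_size, context_len, embed_dim = 50, 128, 64
rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, embed_dim))  # one row per token id
pos_table = rng.normal(size=(context_len, embed_dim))   # one row per position

ids = np.array([3, 17, 17, 3])             # same token at different positions
x = token_table[ids] + pos_table[np.arange(len(ids))]
print(np.allclose(x[1], x[2]))             # False: position disambiguates repeats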

You've built the architecture powering modern AI!


Note for Next Milestone: You can now BUILD transformers, but can you OPTIMIZE them for production? Milestone 06 (MLPerf) teaches systematic optimization: profiling → compression → acceleration!