TinyTorch/modules/13_transformers_ABOUT.html
2025-12-05 00:52:38 +00:00

<!DOCTYPE html>
<html lang="en" data-content_root="../" >
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<title>13. Transformers - Complete GPT Architecture &#8212; Tiny🔥Torch</title>
<script data-cfasync="false">
document.documentElement.dataset.mode = localStorage.getItem("mode") || "";
document.documentElement.dataset.theme = localStorage.getItem("theme") || "";
</script>
<!-- Loaded before other Sphinx assets -->
<link href="../_static/styles/theme.css?digest=dfe6caa3a7d634c4db9b" rel="stylesheet" />
<link href="../_static/styles/bootstrap.css?digest=dfe6caa3a7d634c4db9b" rel="stylesheet" />
<link href="../_static/styles/pydata-sphinx-theme.css?digest=dfe6caa3a7d634c4db9b" rel="stylesheet" />
<link href="../_static/vendor/fontawesome/6.5.2/css/all.min.css?digest=dfe6caa3a7d634c4db9b" rel="stylesheet" />
<link rel="preload" as="font" type="font/woff2" crossorigin href="../_static/vendor/fontawesome/6.5.2/webfonts/fa-solid-900.woff2" />
<link rel="preload" as="font" type="font/woff2" crossorigin href="../_static/vendor/fontawesome/6.5.2/webfonts/fa-brands-400.woff2" />
<link rel="preload" as="font" type="font/woff2" crossorigin href="../_static/vendor/fontawesome/6.5.2/webfonts/fa-regular-400.woff2" />
<link rel="stylesheet" type="text/css" href="../_static/pygments.css?v=03e43079" />
<link rel="stylesheet" type="text/css" href="../_static/styles/sphinx-book-theme.css?v=eba8b062" />
<link rel="stylesheet" type="text/css" href="../_static/togglebutton.css?v=13237357" />
<link rel="stylesheet" type="text/css" href="../_static/copybutton.css?v=76b2166b" />
<link rel="stylesheet" type="text/css" href="../_static/mystnb.8ecb98da25f57f5357bf6f572d296f466b2cfe2517ffebfabe82451661e28f02.css" />
<link rel="stylesheet" type="text/css" href="../_static/sphinx-thebe.css?v=4fa983c6" />
<link rel="stylesheet" type="text/css" href="../_static/sphinx-design.min.css?v=95c83b7e" />
<link rel="stylesheet" type="text/css" href="../_static/custom.css?v=009d37f4" />
<!-- Pre-loaded scripts that we'll load fully later -->
<link rel="preload" as="script" href="../_static/scripts/bootstrap.js?digest=dfe6caa3a7d634c4db9b" />
<link rel="preload" as="script" href="../_static/scripts/pydata-sphinx-theme.js?digest=dfe6caa3a7d634c4db9b" />
<script src="../_static/vendor/fontawesome/6.5.2/js/all.min.js?digest=dfe6caa3a7d634c4db9b"></script>
<script src="../_static/documentation_options.js?v=9eb32ce0"></script>
<script src="../_static/doctools.js?v=9a2dae69"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/clipboard.min.js?v=a7894cd8"></script>
<script src="../_static/copybutton.js?v=f281be69"></script>
<script src="../_static/scripts/sphinx-book-theme.js?v=887ef09a"></script>
<script>let toggleHintShow = 'Click to show';</script>
<script>let toggleHintHide = 'Click to hide';</script>
<script>let toggleOpenOnPrint = 'true';</script>
<script src="../_static/togglebutton.js?v=4a39c7ea"></script>
<script>var togglebuttonSelector = '.toggle, .admonition.dropdown';</script>
<script src="../_static/design-tabs.js?v=f930bc37"></script>
<script>const THEBE_JS_URL = "https://unpkg.com/thebe@0.8.2/lib/index.js"; const thebe_selector = ".thebe,.cell"; const thebe_selector_input = "pre"; const thebe_selector_output = ".output, .cell_output"</script>
<script async="async" src="../_static/sphinx-thebe.js?v=c100c467"></script>
<script>DOCUMENTATION_OPTIONS.pagename = 'modules/13_transformers_ABOUT';</script>
<script src="../_static/ml-timeline.js?v=76e9b3e3"></script>
<script src="../_static/wip-banner.js?v=04a7e74d"></script>
<script src="../_static/marimo-badges.js?v=e6289128"></script>
<script src="../_static/sidebar-link.js?v=404b701b"></script>
<script src="../_static/hero-carousel.js?v=10341d2a"></script>
<script src="../_static/subscribe-modal.js?v=42919b64"></script>
<link rel="icon" href="../_static/favicon.svg"/>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="⏱️ Optimization Tier (Modules 14-19)" href="../tiers/optimization.html" />
<link rel="prev" title="12. Attention - The Mechanism That Powers Modern AI" href="12_attention_ABOUT.html" />
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<meta name="docsearch:language" content="en"/>
</head>
<body data-bs-spy="scroll" data-bs-target=".bd-toc-nav" data-offset="180" data-bs-root-margin="0px 0px -60%" data-default-mode="">
<div id="pst-skip-link" class="skip-link d-print-none"><a href="#main-content">Skip to main content</a></div>
<div id="pst-scroll-pixel-helper"></div>
<button type="button" class="btn rounded-pill" id="pst-back-to-top">
<i class="fa-solid fa-arrow-up"></i>Back to top</button>
<input type="checkbox"
class="sidebar-toggle"
id="pst-primary-sidebar-checkbox"/>
<label class="overlay overlay-primary" for="pst-primary-sidebar-checkbox"></label>
<input type="checkbox"
class="sidebar-toggle"
id="pst-secondary-sidebar-checkbox"/>
<label class="overlay overlay-secondary" for="pst-secondary-sidebar-checkbox"></label>
<div class="search-button__wrapper">
<div class="search-button__overlay"></div>
<div class="search-button__search-container">
<form class="bd-search d-flex align-items-center"
action="../search.html"
method="get">
<i class="fa-solid fa-magnifying-glass"></i>
<input type="search"
class="form-control"
name="q"
id="search-input"
placeholder="Search..."
aria-label="Search..."
autocomplete="off"
autocorrect="off"
autocapitalize="off"
spellcheck="false"/>
<span class="search-button__kbd-shortcut"><kbd class="kbd-shortcut__modifier">Ctrl</kbd>+<kbd>K</kbd></span>
</form></div>
</div>
<div class="pst-async-banner-revealer d-none">
<aside id="bd-header-version-warning" class="d-none d-print-none" aria-label="Version warning"></aside>
</div>
<header class="bd-header navbar navbar-expand-lg bd-navbar d-print-none">
</header>
<div class="bd-container">
<div class="bd-container__inner bd-page-width">
<div class="bd-sidebar-primary bd-sidebar">
<div class="sidebar-header-items sidebar-primary__section">
</div>
<div class="sidebar-primary-items__start sidebar-primary__section">
<div class="sidebar-primary-item">
<a class="navbar-brand logo" href="../intro.html">
<img src="../_static/logo-tinytorch.png" class="logo__image only-light" alt="Tiny🔥Torch - Home"/>
<script>document.write(`<img src="../_static/logo-tinytorch.png" class="logo__image only-dark" alt="Tiny🔥Torch - Home"/>`);</script>
</a></div>
<div class="sidebar-primary-item">
<script>
document.write(`
<button class="btn search-button-field search-button__button" title="Search" aria-label="Search" data-bs-placement="bottom" data-bs-toggle="tooltip">
<i class="fa-solid fa-magnifying-glass"></i>
<span class="search-button__default-text">Search</span>
<span class="search-button__kbd-shortcut"><kbd class="kbd-shortcut__modifier">Ctrl</kbd>+<kbd class="kbd-shortcut__modifier">K</kbd></span>
</button>
`);
</script></div>
<div class="sidebar-primary-item"><nav class="bd-links bd-docs-nav" aria-label="Main">
<div class="bd-toc-item navbar-nav active">
<p aria-level="2" class="caption" role="heading"><span class="caption-text">🚀 Getting Started</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../getting-started.html">Complete Guide</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">🏗 Foundation Tier (01-07)</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../tiers/foundation.html">📖 Tier Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="01_tensor_ABOUT.html">01. Tensor</a></li>
<li class="toctree-l1"><a class="reference internal" href="02_activations_ABOUT.html">02. Activations</a></li>
<li class="toctree-l1"><a class="reference internal" href="03_layers_ABOUT.html">03. Layers</a></li>
<li class="toctree-l1"><a class="reference internal" href="04_losses_ABOUT.html">04. Losses</a></li>
<li class="toctree-l1"><a class="reference internal" href="05_autograd_ABOUT.html">05. Autograd</a></li>
<li class="toctree-l1"><a class="reference internal" href="06_optimizers_ABOUT.html">06. Optimizers</a></li>
<li class="toctree-l1"><a class="reference internal" href="07_training_ABOUT.html">07. Training</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">🏛️ Architecture Tier (08-13)</span></p>
<ul class="current nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../tiers/architecture.html">📖 Tier Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="08_dataloader_ABOUT.html">08. DataLoader</a></li>
<li class="toctree-l1"><a class="reference internal" href="09_spatial_ABOUT.html">09. Convolutions</a></li>
<li class="toctree-l1"><a class="reference internal" href="10_tokenization_ABOUT.html">10. Tokenization</a></li>
<li class="toctree-l1"><a class="reference internal" href="11_embeddings_ABOUT.html">11. Embeddings</a></li>
<li class="toctree-l1"><a class="reference internal" href="12_attention_ABOUT.html">12. Attention</a></li>
<li class="toctree-l1 current active"><a class="current reference internal" href="#">13. Transformers</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">⏱️ Optimization Tier (14-19)</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../tiers/optimization.html">📖 Tier Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="14_profiling_ABOUT.html">14. Profiling</a></li>
<li class="toctree-l1"><a class="reference internal" href="15_quantization_ABOUT.html">15. Quantization</a></li>
<li class="toctree-l1"><a class="reference internal" href="16_compression_ABOUT.html">16. Compression</a></li>
<li class="toctree-l1"><a class="reference internal" href="17_memoization_ABOUT.html">17. Memoization</a></li>
<li class="toctree-l1"><a class="reference internal" href="18_acceleration_ABOUT.html">18. Acceleration</a></li>
<li class="toctree-l1"><a class="reference internal" href="19_benchmarking_ABOUT.html">19. Benchmarking</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">🏅 Capstone Competition</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../tiers/olympics.html">📖 Competition Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="20_capstone_ABOUT.html">20. Torch Olympics</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">🧭 Course Orientation</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../chapters/00-introduction.html">Course Structure</a></li>
<li class="toctree-l1"><a class="reference internal" href="../prerequisites.html">Prerequisites &amp; Resources</a></li>
<li class="toctree-l1"><a class="reference internal" href="../chapters/learning-journey.html">Learning Journey</a></li>
<li class="toctree-l1"><a class="reference internal" href="../chapters/milestones.html">Historical Milestones</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faq.html">FAQ</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">🛠️ TITO CLI Reference</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../tito/overview.html">Command Overview</a></li>
<li class="toctree-l1"><a class="reference internal" href="../tito/modules.html">Module Workflow</a></li>
<li class="toctree-l1"><a class="reference internal" href="../tito/milestones.html">Milestone System</a></li>
<li class="toctree-l1"><a class="reference internal" href="../tito/data.html">Progress &amp; Data</a></li>
<li class="toctree-l1"><a class="reference internal" href="../tito/troubleshooting.html">Troubleshooting</a></li>
<li class="toctree-l1"><a class="reference internal" href="../datasets.html">Datasets Guide</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">🤝 Community</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../community.html">Ecosystem</a></li>
<li class="toctree-l1"><a class="reference internal" href="../resources.html">Learning Resources</a></li>
<li class="toctree-l1"><a class="reference internal" href="../credits.html">Credits &amp; Acknowledgments</a></li>
</ul>
</div>
</nav></div>
</div>
<div class="sidebar-primary-items__end sidebar-primary__section">
</div>
<div id="rtd-footer-container"></div>
</div>
<main id="main-content" class="bd-main" role="main">
<div class="sbt-scroll-pixel-helper"></div>
<div class="bd-content">
<div class="bd-article-container">
<div class="bd-header-article d-print-none">
<div class="header-article-items header-article__inner">
<div class="header-article-items__start">
<div class="header-article-item"><button class="sidebar-toggle primary-toggle btn btn-sm" title="Toggle primary sidebar" data-bs-placement="bottom" data-bs-toggle="tooltip">
<span class="fa-solid fa-bars"></span>
</button></div>
</div>
<div class="header-article-items__end">
<div class="header-article-item">
<div class="article-header-buttons">
<div class="dropdown dropdown-download-buttons">
<button class="btn dropdown-toggle" type="button" data-bs-toggle="dropdown" aria-expanded="false" aria-label="Download this page">
<i class="fas fa-download"></i>
</button>
<ul class="dropdown-menu">
<li><a href="../_sources/modules/13_transformers_ABOUT.md" target="_blank"
class="btn btn-sm btn-download-source-button dropdown-item"
title="Download source file"
data-bs-placement="left" data-bs-toggle="tooltip"
>
<span class="btn__icon-container">
<i class="fas fa-file"></i>
</span>
<span class="btn__text-container">.md</span>
</a>
</li>
<li>
<button onclick="window.print()"
class="btn btn-sm btn-download-pdf-button dropdown-item"
title="Print to PDF"
data-bs-placement="left" data-bs-toggle="tooltip"
>
<span class="btn__icon-container">
<i class="fas fa-file-pdf"></i>
</span>
<span class="btn__text-container">.pdf</span>
</button>
</li>
</ul>
</div>
<button onclick="toggleFullScreen()"
class="btn btn-sm btn-fullscreen-button"
title="Fullscreen mode"
data-bs-placement="bottom" data-bs-toggle="tooltip"
>
<span class="btn__icon-container">
<i class="fas fa-expand"></i>
</span>
</button>
<script>
document.write(`
<button class="btn btn-sm nav-link pst-navbar-icon theme-switch-button" title="light/dark" aria-label="light/dark" data-bs-placement="bottom" data-bs-toggle="tooltip">
<i class="theme-switch fa-solid fa-sun fa-lg" data-mode="light"></i>
<i class="theme-switch fa-solid fa-moon fa-lg" data-mode="dark"></i>
<i class="theme-switch fa-solid fa-circle-half-stroke fa-lg" data-mode="auto"></i>
</button>
`);
</script>
<script>
document.write(`
<button class="btn btn-sm pst-navbar-icon search-button search-button__button" title="Search" aria-label="Search" data-bs-placement="bottom" data-bs-toggle="tooltip">
<i class="fa-solid fa-magnifying-glass fa-lg"></i>
</button>
`);
</script>
<button class="sidebar-toggle secondary-toggle btn btn-sm" title="Toggle secondary sidebar" data-bs-placement="bottom" data-bs-toggle="tooltip">
<span class="fa-solid fa-list"></span>
</button>
</div></div>
</div>
</div>
</div>
<div id="jb-print-docs-body" class="onlyprint">
<h1>13. Transformers - Complete GPT Architecture</h1>
<!-- Table of contents -->
<div id="print-main-content">
<div id="jb-print-toc">
<div>
<h2> Contents </h2>
</div>
<nav aria-label="Page">
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#overview">Overview</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#learning-objectives">Learning Objectives</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#build-use-reflect">Build → Use → Reflect</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#implementation-guide">Implementation Guide</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#layernorm-training-stability-for-deep-networks">LayerNorm - Training Stability for Deep Networks</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#mlp-position-wise-feed-forward-network">MLP - Position-Wise Feed-Forward Network</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#transformerblock-complete-layer-with-attention-and-mlp">TransformerBlock - Complete Layer with Attention and MLP</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#gpt-complete-decoder-only-architecture">GPT - Complete Decoder-Only Architecture</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#decoder-only-architecture-choice">Decoder-Only Architecture Choice</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#getting-started">Getting Started</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#prerequisites">Prerequisites</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#development-workflow">Development Workflow</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#testing">Testing</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#comprehensive-test-suite">Comprehensive Test Suite</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#test-coverage-areas">Test Coverage Areas</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#inline-testing-architecture-validation">Inline Testing &amp; Architecture Validation</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#manual-testing-examples">Manual Testing Examples</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#where-this-code-lives-in-the-final-package">Where This Code Lives in the Final Package</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#systems-thinking-questions">Systems Thinking Questions</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#real-world-applications">Real-World Applications</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#architectural-foundations">Architectural Foundations</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#performance-characteristics">Performance Characteristics</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#reflection-questions">Reflection Questions</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#ready-to-build">Ready to Build?</a></li>
</ul>
</nav>
</div>
</div>
</div>
<div id="searchbox"></div>
<article class="bd-article">
<section id="transformers-complete-gpt-architecture">
<h1>13. Transformers - Complete GPT Architecture<a class="headerlink" href="#transformers-complete-gpt-architecture" title="Link to this heading">#</a></h1>
<p><strong>ARCHITECTURE TIER</strong> | Difficulty: ⭐⭐⭐⭐ (4/4) | Time: 6-8 hours</p>
<section id="overview">
<h2>Overview<a class="headerlink" href="#overview" title="Link to this heading">#</a></h2>
<p>You'll build the complete GPT transformer architecture: the decoder-only foundation behind ChatGPT, GPT-4, Claude, and most modern large language models. This module combines everything you've learned about attention, embeddings, and neural networks into a working autoregressive language model capable of text generation. You'll implement layer normalization, feed-forward networks, transformer blocks with residual connections, and a complete GPT model whose layered design follows PyTorch's <code class="docutils literal notranslate"><span class="pre">nn.TransformerDecoder</span></code>, minus the cross-attention sub-layer that a decoder-only model does not need.</p>
</section>
<section id="learning-objectives">
<h2>Learning Objectives<a class="headerlink" href="#learning-objectives" title="Link to this heading">#</a></h2>
<p>By the end of this module, you will be able to:</p>
<ul class="simple">
<li><p><strong>Implement complete transformer blocks</strong> with multi-head self-attention, position-wise feed-forward networks (4x expansion), layer normalization, and residual connections for gradient highways enabling deep networks (12+ layers)</p></li>
<li><p><strong>Build decoder-only GPT architecture</strong> with causal masking preventing future token leakage, autoregressive generation with temperature sampling, and embeddings combining token and positional information</p></li>
<li><p><strong>Understand pre-norm architecture and residual connections</strong> critical for training stability—pre-norm placement before sub-layers (not after) enables 100+ layer networks by providing clean normalized inputs and direct gradient paths</p></li>
<li><p><strong>Analyze parameter scaling and memory complexity</strong> including quadratic attention memory growth O(n²) with sequence length, linear parameter scaling with layers, and techniques like gradient checkpointing for memory reduction</p></li>
<li><p><strong>Apply transformer architecture to language modeling</strong> using real-world patterns from PyTorch <code class="docutils literal notranslate"><span class="pre">nn.Transformer</span></code>, understanding decoder-only vs encoder-only vs encoder-decoder choices, and production optimizations like KV caching</p></li>
</ul>
</section>
<section id="build-use-reflect">
<h2>Build → Use → Reflect<a class="headerlink" href="#build-use-reflect" title="Link to this heading">#</a></h2>
<p>This module follows TinyTorch's <strong>Build → Use → Reflect</strong> framework:</p>
<ol class="arabic simple">
<li><p><strong>Build</strong>: Implement LayerNorm with learnable scale/shift, MLP feed-forward networks with 4x expansion and GELU activation, TransformerBlock combining attention+MLP with pre-norm residual connections, complete GPT decoder with causal masking and generation</p></li>
<li><p><strong>Use</strong>: Train GPT-style decoder on character-level text generation, implement autoregressive generation with temperature sampling (conservative vs creative), analyze parameter scaling across model sizes (Tiny → GPT-3 scale), measure attention memory quadratic growth</p></li>
<li><p><strong>Reflect</strong>: Why are residual connections critical for deep transformers (gradients vanish without them)? How does pre-norm differ from post-norm (training stability for &gt;12 layers)? What's the compute/memory trade-off in stacking layers vs widening dimensions? Why does attention memory scale quadratically with sequence length (an n×n score matrix per head, with O(n²d) compute)?</p></li>
</ol>
<div class="tip admonition">
<p class="admonition-title">Systems Reality Check</p>
<p><strong>Production Context</strong>: The decoder-only GPT architecture you're implementing underlies most modern LLMs. GPT-3 (175B) uses a 96-layer decoder stack, and the largest Llama 2 model (70B) uses 80 layers. Layer counts for GPT-4, the GPT-3.5 models behind ChatGPT, and Claude have not been published, though these models are widely understood to be transformer decoders with causal attention. This architecture came to dominate because it scales predictably with parameters and data.</p>
<p><strong>Performance Note</strong>: Self-attention costs O(n²d) compute per layer (n=sequence length, d=model dimension). For GPT-3 with a 2048-token context, each attention layer scores 2048² ≈ 4.2M token pairs (once per head). Memory scales linearly with the number of layers but quadratically with sequence length. Production systems use KV caching (reuse key-value pairs during generation), FlashAttention (memory-efficient attention), and gradient checkpointing (trade compute for memory) to manage this. Understanding these trade-offs is critical for ML systems engineering.</p>
</div>
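<p>The quadratic growth described in the note above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the TinyTorch package: the helper name <code class="docutils literal notranslate"><span class="pre">attention_score_stats</span></code> and the assumption of 4 bytes per score (float32) are illustrative choices.</p>

```python
def attention_score_stats(seq_len, num_heads=1, bytes_per_value=4):
    """Size of the (seq_len x seq_len) attention score matrix for one layer.

    Hypothetical helper for illustration; assumes float32 scores by default.
    """
    pairs_per_head = seq_len * seq_len          # every token attends to every token
    total_values = pairs_per_head * num_heads   # one score matrix per head
    return {
        "pairs_per_head": pairs_per_head,
        "total_values": total_values,
        "megabytes": total_values * bytes_per_value / 2**20,
    }


stats = attention_score_stats(2048)
print(stats["pairs_per_head"])  # 4194304, the roughly 4M token pairs for 2048 tokens
print(stats["megabytes"])       # 16.0 (MiB for a single head at 4 bytes per score)

doubled = attention_score_stats(4096)
print(doubled["pairs_per_head"] // stats["pairs_per_head"])  # 4: doubling n quadruples the pairs
```

<p>Passing a larger <code class="docutils literal notranslate"><span class="pre">num_heads</span></code> multiplies the total, which is why the score matrices, rather than the weights, tend to dominate memory at long sequence lengths.</p>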
</section>
<section id="implementation-guide">
<h2>Implementation Guide<a class="headerlink" href="#implementation-guide" title="Link to this heading">#</a></h2>
<section id="layernorm-training-stability-for-deep-networks">
<h3>LayerNorm - Training Stability for Deep Networks<a class="headerlink" href="#layernorm-training-stability-for-deep-networks" title="Link to this heading">#</a></h3>
<p>Layer normalization stabilizes training by normalizing activations across the feature dimension for each sample independently. Unlike batch normalization (normalizes across batch), LayerNorm works with any batch size including batch=1 during inference—essential for variable-length sequences.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">numpy</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">np</span>

<span class="c1"># Tensor is the class built in the earlier TinyTorch modules.</span>

<span class="k">class</span><span class="w"> </span><span class="nc">LayerNorm</span><span class="p">:</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;Layer normalization for transformer training stability.</span>

<span class="sd">    Normalizes across feature dimension (last axis) for each sample independently.</span>
<span class="sd">    Includes learnable scale (gamma) and shift (beta) parameters.</span>

<span class="sd">    Formula: output = gamma * (x - mean) / sqrt(variance + eps) + beta</span>

<span class="sd">    Why LayerNorm for Transformers:</span>
<span class="sd">    - Batch-independent: Works with any batch size (good for inference)</span>
<span class="sd">    - Variable-length sequences: Each sample normalized independently</span>
<span class="sd">    - Better gradients: Empirically superior to BatchNorm for NLP tasks</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">normalized_shape</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-5</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">gamma</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">normalized_shape</span><span class="p">))</span>  <span class="c1"># Learnable scale (starts at 1.0)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">beta</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">normalized_shape</span><span class="p">))</span>  <span class="c1"># Learnable shift (starts at 0.0)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">eps</span> <span class="o">=</span> <span class="n">eps</span>  <span class="c1"># Keeps the denominator away from zero</span>

    <span class="k">def</span><span class="w"> </span><span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># Compute statistics across last dimension (features)</span>
        <span class="n">mean</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
        <span class="n">variance</span> <span class="o">=</span> <span class="p">((</span><span class="n">x</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
        <span class="c1"># Normalize: (x - μ) / sqrt(σ² + eps); the square root is written as ** 0.5 so it stays a Tensor operation</span>
        <span class="n">normalized</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">variance</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">eps</span><span class="p">)</span> <span class="o">**</span> <span class="mf">0.5</span>
        <span class="c1"># Apply learnable transformation: γ * norm + β</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">gamma</span> <span class="o">*</span> <span class="n">normalized</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">beta</span>
</pre></div>
</div>
<p><strong>Key Design Decisions:</strong></p>
<ul class="simple">
<li><p><strong>Per-sample normalization</strong>: Each sequence position normalized independently across features (batch-independent)</p></li>
<li><p><strong>Learnable parameters</strong>: Gamma/beta allow model to recover any desired distribution after normalization</p></li>
<li><p><strong>Epsilon for stability</strong>: Small constant (1e-5) prevents division by zero in variance calculation</p></li>
</ul>
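<p>The design decisions above can be checked numerically. The following is a minimal sketch that uses plain NumPy arrays in place of TinyTorch's <code class="docutils literal notranslate"><span class="pre">Tensor</span></code>; the standalone <code class="docutils literal notranslate"><span class="pre">layer_norm</span></code> function is an illustrative stand-in, not the module's API.</p>

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """NumPy stand-in for the LayerNorm forward pass (illustrative only)."""
    mean = x.mean(axis=-1, keepdims=True)
    variance = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(variance + eps) + beta


rng = np.random.default_rng(0)
# (batch, seq_len, features) with a deliberately off-center, wide distribution
x = rng.normal(loc=3.0, scale=5.0, size=(2, 4, 8))
gamma, beta = np.ones(8), np.zeros(8)

out = layer_norm(x, gamma, beta)
print(out.shape)                                       # (2, 4, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))  # True: each position has mean 0
print(np.allclose(out.var(axis=-1), 1.0, atol=1e-3))   # True: and variance close to 1

# Non-default gamma/beta rescale and shift the normalized features.
shifted = layer_norm(x, 2.0 * gamma, beta + 1.0)
print(np.allclose(shifted.mean(axis=-1), 1.0, atol=1e-6))  # True
print(np.allclose(shifted.std(axis=-1), 2.0, atol=1e-3))   # True
```

<p>Each of the 2×4 sequence positions is normalized using only its own 8 features, which is why the result does not depend on what else is in the batch.</p>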
<p><strong>LayerNorm vs BatchNorm:</strong></p>
<div class="pst-scrollable-table-container"><table class="table">
<thead>
<tr class="row-odd"><th class="head"><p>Aspect</p></th>
<th class="head"><p>LayerNorm</p></th>
<th class="head"><p>BatchNorm</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>Normalizes across</p></td>
<td><p>Features (per sample)</p></td>
<td><p>Batch (per feature)</p></td>
</tr>
<tr class="row-odd"><td><p>Batch size dependency</p></td>
<td><p>Independent</p></td>
<td><p>Dependent</p></td>
</tr>
<tr class="row-even"><td><p>Inference behavior</p></td>
<td><p>Same as training</p></td>
<td><p>Requires running statistics</p></td>
</tr>
<tr class="row-odd"><td><p>Best for</p></td>
<td><p>Transformers, NLP</p></td>
<td><p>CNNs, Computer Vision</p></td>
</tr>
</tbody>
</table>
</div>
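<p>The "Batch size dependency" row of the table can be demonstrated directly. This sketch uses NumPy on 2D inputs of shape (batch, features), omits the learnable gamma/beta for brevity, and implements only training-mode batch statistics (no running averages). Both function names are hypothetical.</p>

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Statistics over features (last axis), computed separately for each sample.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def batch_norm_train(x, eps=1e-5):
    # Statistics over the batch (axis 0), computed separately for each feature.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 8))              # the sample we track
others_a = rng.normal(size=(3, 8))            # batch-mates drawn around 0
others_b = rng.normal(loc=10.0, size=(3, 8))  # batch-mates drawn around 10

batch_a = np.concatenate([sample, others_a])
batch_b = np.concatenate([sample, others_b])

# Compare the tracked sample's output when only its batch-mates change.
ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
print(ln_same)  # True: LayerNorm ignores the rest of the batch
print(bn_same)  # False: BatchNorm output shifts with the batch statistics
```

<p>This is the property that makes LayerNorm behave identically during training and single-sequence generation, whereas BatchNorm needs stored running statistics at inference time.</p>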
</section>
<section id="mlp-position-wise-feed-forward-network">
<h3>MLP - Position-Wise Feed-Forward Network<a class="headerlink" href="#mlp-position-wise-feed-forward-network" title="Link to this heading">#</a></h3>
<p>The MLP provides non-linear transformation capacity in each transformer block. It's a simple two-layer network with a 4x expansion pattern applied identically to each sequence position.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">class</span><span class="w"> </span><span class="nc">MLP</span><span class="p">:</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;Multi-Layer Perceptron (Feed-Forward Network) for transformer blocks.</span>

<span class="sd">    Standard pattern: Linear(expand) → GELU → Linear(contract)</span>
<span class="sd">    Expansion ratio: 4:1 (embed_dim → 4*embed_dim → embed_dim)</span>

<span class="sd">    This provides the &quot;thinking&quot; capacity after attention computes relationships.</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">hidden_dim</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
            <span class="n">hidden_dim</span> <span class="o">=</span> <span class="mi">4</span> <span class="o">*</span> <span class="n">embed_dim</span>  <span class="c1"># Standard 4x expansion</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">linear1</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="p">)</span>  <span class="c1"># Expansion: 512 → 2048</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">gelu</span> <span class="o">=</span> <span class="n">GELU</span><span class="p">()</span>  <span class="c1"># Smooth activation</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">linear2</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">hidden_dim</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">)</span>  <span class="c1"># Contraction: 2048 → 512</span>

    <span class="k">def</span><span class="w"> </span><span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="c1"># x: (batch, seq_len, embed_dim)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">linear1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># Expand to hidden_dim</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">gelu</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># Nonlinearity (smoother than ReLU)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">linear2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># Contract back to embed_dim</span>
        <span class="k">return</span> <span class="n">x</span>
</pre></div>
</div>
<p><strong>Why 4x Expansion?</strong></p>
<ul class="simple">
<li><p><strong>Parameter capacity</strong>: The two linear layers hold about 8·embed_dim² weights, roughly twice the 4·embed_dim² of the attention projections, so most of a block's parameters sit in the MLP</p></li>
<li><p><strong>Expand, then contract</strong>: The wider hidden layer gives each position room to compute intermediate features, and the second projection compresses them back to embed_dim</p></li>
<li><p><strong>Empirical success</strong>: A 4x ratio has been found to work well across model sizes (some models experiment with 2x-8x)</p></li>
</ul>
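<p>The parameter-capacity point can be made concrete with a quick count. This is a back-of-the-envelope sketch: the helper names are hypothetical, and it assumes the attention layer consists of four <code class="docutils literal notranslate"><span class="pre">embed_dim</span> <span class="pre">→</span> <span class="pre">embed_dim</span></code> projections (Q, K, V, output) with biases, which may not match the module's exact layout.</p>

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

def mlp_params(embed_dim, ratio=4):
    """Two linear layers: embed_dim -> ratio*embed_dim -> embed_dim."""
    hidden_dim = ratio * embed_dim
    return linear_params(embed_dim, hidden_dim) + linear_params(hidden_dim, embed_dim)

def attention_params(embed_dim):
    """Assumed layout: Q, K, V and output projections, each embed_dim x embed_dim."""
    return 4 * linear_params(embed_dim, embed_dim)

embed_dim = 512
mlp_count = mlp_params(embed_dim)
attn_count = attention_params(embed_dim)

print(f"MLP parameters:       {mlp_count:,}")
print(f"Attention parameters: {attn_count:,}")
print(f"MLP / attention:      {mlp_count / attn_count:.2f}")
```

<p>Under these assumptions the MLP holds about twice as many parameters as attention at any embedding size, because 8·embed_dim² dominates the bias terms.</p>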
<p><strong>GELU vs ReLU:</strong></p>
<ul class="simple">
<li><p><strong>ReLU</strong>: Hard cutoff at zero <code class="docutils literal notranslate"><span class="pre">max(0,</span> <span class="pre">x)</span></code> - simple but non-smooth</p></li>
<li><p><strong>GELU</strong>: Smooth probabilistic activation <code class="docutils literal notranslate"><span class="pre">x</span> <span class="pre">*</span> <span class="pre">Φ(x)</span></code> where Φ is Gaussian CDF</p></li>
<li><p><strong>Why GELU</strong>: Smooth everywhere (no kink at zero), and small negative inputs still pass a small signal; it is the activation used in GPT-2 and BERT</p></li>
</ul>
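<p>To see the difference between the two activations, the definitions above can be evaluated directly. This sketch uses only the standard library; the exact form follows <code class="docutils literal notranslate"><span class="pre">x</span> <span class="pre">*</span> <span class="pre">Φ(x)</span></code>, and the tanh approximation is included as a commonly used alternative. Which of the two the module's <code class="docutils literal notranslate"><span class="pre">GELU</span></code> class implements is not stated here, so treat that as an open assumption.</p>

```python
import math

def relu(x):
    """Hard cutoff at zero."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

points = [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]
for x in points:
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}  gelu_tanh={gelu_tanh(x):+.4f}")
```

<p>Note that GELU returns small negative values for negative inputs, where ReLU returns exactly zero, and that the two GELU variants agree closely over this range.</p>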
</section>
<section id="transformerblock-complete-layer-with-attention-and-mlp">
<h3>TransformerBlock - Complete Layer with Attention and MLP<a class="headerlink" href="#transformerblock-complete-layer-with-attention-and-mlp" title="Link to this heading">#</a></h3>
<p>A single transformer layer combining multi-head self-attention with feed-forward processing in a pre-norm residual architecture. This is the core building block, stacked from 12 times (GPT-2 small) to 96 times (GPT-3 175B) in published models.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">class</span><span class="w"> </span><span class="nc">TransformerBlock</span><span class="p">:</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;Complete transformer layer with self-attention, MLP, and residual connections.</span>

<span class="sd">    Pre-Norm Architecture (Modern Standard):</span>
<span class="sd">    x → LayerNorm → MultiHeadAttention → Add(x) →</span>
<span class="sd">        LayerNorm → MLP → Add → Output</span>

<span class="sd">    Each sub-layer (attention, MLP) gets normalized input but adds to residual stream.</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">mlp_ratio</span><span class="o">=</span><span class="mi">4</span><span class="p">):</span>
        <span class="c1"># Attention sub-layer components</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">attention</span> <span class="o">=</span> <span class="n">MultiHeadAttention</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">ln1</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">)</span>  <span class="c1"># Pre-norm: before attention</span>
        <span class="c1"># MLP sub-layer components</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span> <span class="o">=</span> <span class="n">MLP</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">embed_dim</span> <span class="o">*</span> <span class="n">mlp_ratio</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">ln2</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">)</span>  <span class="c1"># Pre-norm: before MLP</span>

    <span class="k">def</span><span class="w"> </span><span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="w">        </span><span class="sd">&quot;&quot;&quot;Forward pass with residual connections.</span>

<span class="sd">        Args:</span>
<span class="sd">            x: (batch, seq_len, embed_dim) input</span>
<span class="sd">            mask: Optional attention mask (causal mask for GPT)</span>

<span class="sd">        Returns:</span>
<span class="sd">            output: (batch, seq_len, embed_dim) transformed sequence</span>
<span class="sd">        &quot;&quot;&quot;</span>
        <span class="c1"># Attention sub-layer with residual</span>
        <span class="n">normed</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">ln1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># Normalize input</span>
        <span class="n">attended</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">attention</span><span class="p">(</span><span class="n">normed</span><span class="p">,</span> <span class="n">mask</span><span class="p">)</span>  <span class="c1"># Self-attention</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">attended</span>  <span class="c1"># Residual connection</span>
        <span class="c1"># MLP sub-layer with residual</span>
        <span class="n">normed</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">ln2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># Normalize again</span>
        <span class="n">mlp_out</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">mlp</span><span class="p">(</span><span class="n">normed</span><span class="p">)</span>  <span class="c1"># Feed-forward</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">+</span> <span class="n">mlp_out</span>  <span class="c1"># Residual connection</span>
        <span class="k">return</span> <span class="n">x</span>
</pre></div>
</div>
<p><strong>Pre-Norm vs Post-Norm:</strong></p>
<p><strong>Pre-Norm (What We Implement):</strong></p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>x → LayerNorm → Attention → Add(x) → output
</pre></div>
</div>
<ul class="simple">
<li><p>LayerNorm <strong>before</strong> sub-layers (attention, MLP)</p></li>
<li><p>Better gradient flow for deep models (&gt;12 layers)</p></li>
<li><p>Standard in GPT-2, GPT-3, and LLaMA (LLaMA substitutes RMSNorm for LayerNorm)</p></li>
</ul>
<p><strong>Post-Norm (Original Transformer Paper):</strong></p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>x → Attention → Add(x) → LayerNorm → output
</pre></div>
</div>
<ul class="simple">
<li><p>LayerNorm <strong>after</strong> sub-layers</p></li>
<li><p>Used in original “Attention is All You Need” paper</p></li>
<li><p>Struggles with very deep networks (gradient issues)</p></li>
</ul>
<p><strong>Why Pre-Norm Wins:</strong></p>
<ol class="arabic simple">
<li><p><strong>Clean inputs</strong>: Each sub-layer receives normalized input (stable mean/variance)</p></li>
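<li><p><strong>Direct gradient path</strong>: The residual stream never passes through a normalization layer inside the block, so gradients reach earlier layers through the identity connections during backprop</p></li>
<li><p><strong>Deeper networks</strong>: Makes stacks of many dozens of layers practical to train (GPT-3 175B uses 96 layers)</p></li>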
</ol>
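<p>The "direct gradient path" argument comes down to where the normalization sits relative to the residual addition. The sketch below contrasts the two orderings using a hypothetical parameter-free <code class="docutils literal notranslate"><span class="pre">layer_norm</span></code> and a sub-layer that contributes nothing; these stand-ins are assumptions for illustration, not the module's classes.</p>

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free layer norm over the last axis (gamma=1, beta=0)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_step(x, sublayer):
    """Pre-norm: normalize the sub-layer input, add to the untouched residual."""
    return x + sublayer(layer_norm(x))

def post_norm_step(x, sublayer):
    """Post-norm: add first, then normalize the sum (residual included)."""
    return layer_norm(x + sublayer(x))

def zero_sublayer(h):
    """Stand-in for a sub-layer whose output is zero."""
    return np.zeros_like(h)

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(2, 5, 16))  # (batch, seq_len, embed_dim)

pre_is_identity = np.array_equal(pre_norm_step(x, zero_sublayer), x)
post_is_identity = np.allclose(post_norm_step(x, zero_sublayer), x)

print("Pre-norm passes x through unchanged: ", pre_is_identity)
print("Post-norm passes x through unchanged:", post_is_identity)
```

<p>In the pre-norm ordering the block reduces to an exact identity when the sub-layer is silent, which is the path gradients can follow through a deep stack. In the post-norm ordering even the residual signal is rescaled at every layer.</p>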
</section>
<section id="gpt-complete-decoder-only-architecture">
<h3>GPT - Complete Decoder-Only Architecture<a class="headerlink" href="#gpt-complete-decoder-only-architecture" title="Link to this heading">#</a></h3>
<p>GPT (Generative Pre-trained Transformer) is the complete autoregressive language model, combining embeddings, transformer blocks, and generation capability. It is <strong>decoder-only</strong>: causal masking prevents each position from seeing future tokens.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">class</span><span class="w"> </span><span class="nc">GPT</span><span class="p">:</span>
<span class="w">    </span><span class="sd">&quot;&quot;&quot;Complete GPT decoder for autoregressive language modeling.</span>

<span class="sd">    Architecture:</span>
<span class="sd">        Input tokens → Token Embedding + Positional Embedding →</span>
<span class="sd">        TransformerBlocks (with causal masking) →</span>
<span class="sd">        LayerNorm → Linear(embed_dim → vocab_size) → Logits</span>

<span class="sd">    Key Feature: Causal masking ensures position i only attends to positions ≤ i</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="k">def</span><span class="w"> </span><span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_layers</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">max_seq_len</span><span class="o">=</span><span class="mi">1024</span><span class="p">):</span>
        <span class="c1"># Embedding layers</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">token_embedding</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">position_embedding</span> <span class="o">=</span> <span class="n">Embedding</span><span class="p">(</span><span class="n">max_seq_len</span><span class="p">,</span> <span class="n">embed_dim</span><span class="p">)</span>
        <span class="c1"># Stack of transformer blocks</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">blocks</span> <span class="o">=</span> <span class="p">[</span><span class="n">TransformerBlock</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">)</span>
                       <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_layers</span><span class="p">)]</span>
        <span class="c1"># Output layers</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">ln_f</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">)</span>  <span class="c1"># Final layer norm</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">lm_head</span> <span class="o">=</span> <span class="n">Linear</span><span class="p">(</span><span class="n">embed_dim</span><span class="p">,</span> <span class="n">vocab_size</span><span class="p">)</span>  <span class="c1"># Vocab projection</span>

    <span class="k">def</span><span class="w"> </span><span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tokens</span><span class="p">):</span>
<span class="w">        </span><span class="sd">&quot;&quot;&quot;Forward pass through GPT decoder.</span>

<span class="sd">        Args:</span>
<span class="sd">            tokens: (batch, seq_len) token indices</span>

<span class="sd">        Returns:</span>
<span class="sd">            logits: (batch, seq_len, vocab_size) unnormalized predictions</span>
<span class="sd">        &quot;&quot;&quot;</span>
        <span class="n">batch_size</span><span class="p">,</span> <span class="n">seq_len</span> <span class="o">=</span> <span class="n">tokens</span><span class="o">.</span><span class="n">shape</span>
        <span class="c1"># Embeddings: tokens + positions</span>
        <span class="n">token_emb</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">token_embedding</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span>
        <span class="n">positions</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">seq_len</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">))</span>
        <span class="n">pos_emb</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">position_embedding</span><span class="p">(</span><span class="n">positions</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">token_emb</span> <span class="o">+</span> <span class="n">pos_emb</span>  <span class="c1"># (batch, seq_len, embed_dim)</span>
        <span class="c1"># Causal mask: prevent attending to future positions</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_create_causal_mask</span><span class="p">(</span><span class="n">seq_len</span><span class="p">)</span>
        <span class="c1"># Transformer blocks</span>
        <span class="k">for</span> <span class="n">block</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">blocks</span><span class="p">:</span>
            <span class="n">x</span> <span class="o">=</span> <span class="n">block</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">)</span>
        <span class="c1"># Output projection</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">ln_f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">lm_head</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># (batch, seq_len, vocab_size)</span>
        <span class="k">return</span> <span class="n">logits</span>

    <span class="k">def</span><span class="w"> </span><span class="nf">_create_causal_mask</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">):</span>
<span class="w">        </span><span class="sd">&quot;&quot;&quot;Create causal mask: upper triangular matrix with -inf.</span>

<span class="sd">        Mask ensures position i can only attend to positions j where j ≤ i.</span>
<span class="sd">        After softmax, -inf becomes probability 0.</span>
<span class="sd">        &quot;&quot;&quot;</span>
        <span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">triu</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">seq_len</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">))</span> <span class="o">*</span> <span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">inf</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">mask</span><span class="p">)</span>

    <span class="k">def</span><span class="w"> </span><span class="nf">generate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">prompt_tokens</span><span class="p">,</span> <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">1.0</span><span class="p">):</span>
<span class="w">        </span><span class="sd">&quot;&quot;&quot;Autoregressive text generation.</span>

<span class="sd">        Args:</span>
<span class="sd">            prompt_tokens: (batch, prompt_len) initial sequence</span>
<span class="sd">            max_new_tokens: Number of tokens to generate</span>
<span class="sd">            temperature: Sampling temperature (higher = more random)</span>

<span class="sd">        Returns:</span>
<span class="sd">            generated: (batch, prompt_len + max_new_tokens) full sequence</span>
<span class="sd">        &quot;&quot;&quot;</span>
        <span class="n">current</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">prompt_tokens</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">copy</span><span class="p">())</span>
        <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_new_tokens</span><span class="p">):</span>
            <span class="c1"># Forward pass</span>
            <span class="n">logits</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">current</span><span class="p">)</span>
            <span class="c1"># Last-position logits, scaled by temperature</span>
            <span class="n">next_logits</span> <span class="o">=</span> <span class="n">logits</span><span class="o">.</span><span class="n">data</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">float64</span><span class="p">)</span> <span class="o">/</span> <span class="n">temperature</span>
            <span class="c1"># Softmax (subtract the row max for numerical stability)</span>
            <span class="n">exp_logits</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">next_logits</span> <span class="o">-</span> <span class="n">next_logits</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
            <span class="n">probs</span> <span class="o">=</span> <span class="n">exp_logits</span> <span class="o">/</span> <span class="n">exp_logits</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
            <span class="c1"># Sample one token id per batch row, shape (batch, 1)</span>
            <span class="n">next_token</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">probs</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">p</span><span class="o">=</span><span class="n">row</span><span class="p">)]</span> <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">probs</span><span class="p">])</span>
            <span class="c1"># Append to sequence</span>
            <span class="n">current</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">current</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">next_token</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">current</span>
</pre></div>
</div>
<p><strong>Causal Masking Visualization:</strong></p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>Sequence:  [&quot;The&quot;, &quot;cat&quot;, &quot;sat&quot;, &quot;on&quot;]
Positions:    0      1      2     3

Attention Matrix (✓ = can attend, ✗ = masked):
     To:    0  1  2  3
From 0:  [  ✓  ✗  ✗  ✗  ]  ← &quot;The&quot; only sees itself
From 1:  [  ✓  ✓  ✗  ✗  ]  ← &quot;cat&quot; sees &quot;The&quot; + itself
From 2:  [  ✓  ✓  ✓  ✗  ]  ← &quot;sat&quot; sees all previous
From 3:  [  ✓  ✓  ✓  ✓  ]  ← &quot;on&quot; sees everything

Implementation: Upper triangular with -∞
[[  0, -∞, -∞, -∞],
 [  0,  0, -∞, -∞],
 [  0,  0,  0, -∞],
 [  0,  0,  0,  0]]

After softmax: -∞ → probability 0
</pre></div>
</div>
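<p>The visualization above can be reproduced in a few lines of NumPy. This is an illustrative sketch with hypothetical standalone helpers (<code class="docutils literal notranslate"><span class="pre">causal_mask</span></code>, <code class="docutils literal notranslate"><span class="pre">softmax</span></code>) and all-zero attention scores, so that the effect of the mask alone is visible.</p>

```python
import numpy as np

def causal_mask(seq_len):
    """Additive mask: -inf strictly above the diagonal, 0 elsewhere."""
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def softmax(scores):
    """Row-wise softmax with max subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # uniform raw scores (illustrative)
weights = softmax(scores + causal_mask(seq_len))

print(causal_mask(seq_len))
print(weights.round(3))
```

<p>Each row of <code class="docutils literal notranslate"><span class="pre">weights</span></code> spreads its probability evenly over the current and earlier positions, and every entry above the diagonal is exactly zero.</p>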
<p><strong>Temperature Sampling:</strong></p>
<ul class="simple">
<li><p><strong>Low temperature (0.1-0.5)</strong>: Conservative; sharpens the distribution toward the highest-probability tokens (approaches greedy decoding as temperature → 0)</p></li>
<li><p><strong>Medium temperature (1.0)</strong>: Samples from the model's unmodified probability distribution</p></li>
<li><p><strong>High temperature (1.5-2.0)</strong>: Creative, more random (flattens the distribution)</p></li>
</ul>
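<p>The effect of temperature is easiest to see on a fixed set of logits. The sketch below is a standalone NumPy illustration; the <code class="docutils literal notranslate"><span class="pre">sample_next_token</span></code> helper and the three example logits are assumptions made for the demonstration, not part of the module's API.</p>

```python
import numpy as np

def softmax(logits):
    """Softmax over a 1-D array, stabilized by subtracting the max."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_next_token(logits, temperature=1.0, rng=None):
    """Divide logits by temperature, then sample one token id."""
    rng = rng if rng is not None else np.random.default_rng()
    probs = softmax(np.asarray(logits, dtype=np.float64) / temperature)
    token = int(rng.choice(len(probs), p=probs))
    return token, probs

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.0]  # illustrative scores for a 3-token vocabulary

_, cold = sample_next_token(logits, temperature=0.5, rng=rng)
_, neutral = sample_next_token(logits, temperature=1.0, rng=rng)
token, hot = sample_next_token(logits, temperature=2.0, rng=rng)

print("T=0.5:", cold.round(3))
print("T=1.0:", neutral.round(3))
print("T=2.0:", hot.round(3))
```

<p>The probability of the top token shrinks as temperature rises, while the ranking of the tokens stays the same. Sampling remains random at any positive temperature; only the limit of temperature going to zero is fully deterministic.</p>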
</section>
<section id="decoder-only-architecture-choice">
<h3>Decoder-Only Architecture Choice<a class="headerlink" href="#decoder-only-architecture-choice" title="Link to this heading">#</a></h3>
<p>This module implements <strong>decoder-only GPT architecture</strong>. Here is why this choice dominates modern LLMs:</p>
<p><strong>Decoder-Only (GPT) - What We Build:</strong></p>
<ul class="simple">
<li><p><strong>Attention</strong>: Causal masking (position i only sees positions ≤ i)</p></li>
<li><p><strong>Training</strong>: Next-token prediction (autoregressive objective)</p></li>
<li><p><strong>Use cases</strong>: Text generation, code completion, dialogue, instruction following</p></li>
<li><p><strong>Examples</strong>: GPT-3/4, ChatGPT, Claude, LLaMA, PaLM, Gemini</p></li>
</ul>
<p><strong>Encoder-Only (BERT) - Not Implemented:</strong></p>
<ul class="simple">
<li><p><strong>Attention</strong>: Bidirectional (all positions see all positions)</p></li>
<li><p><strong>Training</strong>: Masked language modeling (predict masked tokens)</p></li>
<li><p><strong>Use cases</strong>: Classification, NER, question answering, search ranking</p></li>
<li><p><strong>Examples</strong>: BERT, RoBERTa (Google Search uses BERT for ranking)</p></li>
</ul>
<p><strong>Encoder-Decoder (T5) - Not Implemented:</strong></p>
<ul class="simple">
<li><p><strong>Attention</strong>: Encoder is bidirectional, decoder is causal</p></li>
<li><p><strong>Training</strong>: Sequence-to-sequence tasks</p></li>
<li><p><strong>Use cases</strong>: Translation, summarization</p></li>
<li><p><strong>Examples</strong>: T5, BART (Google Translate uses encoder-decoder)</p></li>
</ul>
<p><strong>Why Decoder-Only Won:</strong></p>
<ol class="arabic simple">
<li><p><strong>Simplicity</strong>: Single architecture type (no encoder-decoder coordination)</p></li>
<li><p><strong>Scalability</strong>: Predictable scaling laws with parameters and data</p></li>
<li><p><strong>Versatility</strong>: Handles both understanding and generation tasks by framing them as text continuation</p></li>
<li><p><strong>Efficiency</strong>: One stack and one training objective leave fewer components to implement and optimize than encoder-decoder</p></li>
</ol>
</section>
</section>
<section id="getting-started">
<h2>Getting Started<a class="headerlink" href="#getting-started" title="Link to this heading">#</a></h2>
<section id="prerequisites">
<h3>Prerequisites<a class="headerlink" href="#prerequisites" title="Link to this heading">#</a></h3>
<p>Ensure you understand the foundations from previous modules:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># Activate TinyTorch environment</span>
<span class="nb">source</span><span class="w"> </span>scripts/activate-tinytorch
<span class="c1"># Verify prerequisite modules</span>
tito<span class="w"> </span><span class="nb">test</span><span class="w"> </span>embeddings
tito<span class="w"> </span><span class="nb">test</span><span class="w"> </span>attention
</pre></div>
</div>
<p><strong>Required Background:</strong></p>
<ul class="simple">
<li><p><strong>Module 11 (Embeddings)</strong>: Token and positional embeddings for input representation</p></li>
<li><p><strong>Module 12 (Attention)</strong>: Multi-head attention mechanism for sequence modeling</p></li>
<li><p><strong>Module 05 (Autograd)</strong>: Automatic differentiation for training deep networks</p></li>
<li><p><strong>Module 02 (Activations)</strong>: GELU activation used in MLP layers</p></li>
</ul>
</section>
<section id="development-workflow">
<h3>Development Workflow<a class="headerlink" href="#development-workflow" title="Link to this heading">#</a></h3>
<ol class="arabic simple">
<li><p><strong>Open the development file</strong>: <code class="docutils literal notranslate"><span class="pre">modules/13_transformers/transformers.py</span></code></p></li>
<li><p><strong>Implement LayerNorm</strong>: Normalize across feature dimension with learnable scale/shift parameters (gamma, beta)</p></li>
<li><p><strong>Build MLP</strong>: Two linear layers with 4x expansion ratio and GELU activation (position-wise transformation)</p></li>
<li><p><strong>Create TransformerBlock</strong>: Combine attention and MLP with pre-norm residual connections (LayerNorm before sub-layers)</p></li>
<li><p><strong>Add GPT model</strong>: Stack transformer blocks with token+positional embeddings, causal masking, and generation</p></li>
<li><p><strong>Export and verify</strong>: <code class="docutils literal notranslate"><span class="pre">tito</span> <span class="pre">module</span> <span class="pre">complete</span> <span class="pre">13</span> <span class="pre">&amp;&amp;</span> <span class="pre">tito</span> <span class="pre">test</span> <span class="pre">transformers</span></code></p></li>
</ol>
</section>
</section>
<section id="testing">
<h2>Testing<a class="headerlink" href="#testing" title="Link to this heading">#</a></h2>
<section id="comprehensive-test-suite">
<h3>Comprehensive Test Suite<a class="headerlink" href="#comprehensive-test-suite" title="Link to this heading">#</a></h3>
<p>Run the full test suite to verify transformer functionality:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># TinyTorch CLI (recommended)</span>
tito<span class="w"> </span><span class="nb">test</span><span class="w"> </span>transformers
<span class="c1"># Direct pytest execution</span>
python<span class="w"> </span>-m<span class="w"> </span>pytest<span class="w"> </span>tests/<span class="w"> </span>-k<span class="w"> </span>transformers<span class="w"> </span>-v
</pre></div>
</div>
</section>
<section id="test-coverage-areas">
<h3>Test Coverage Areas<a class="headerlink" href="#test-coverage-areas" title="Link to this heading">#</a></h3>
<ul class="simple">
<li><p><strong>LayerNorm</strong>: Feature-wise normalization (mean≈0, std≈1), learnable gamma/beta parameters, numerical stability with epsilon</p></li>
<li><p><strong>MLP</strong>: 4x expansion ratio (embed_dim → 4*embed_dim → embed_dim), GELU activation, shape preservation</p></li>
<li><p><strong>TransformerBlock</strong>: Pre-norm architecture (LayerNorm before sub-layers), residual connections (x + sublayer), attention+MLP composition</p></li>
<li><p><strong>GPT Model</strong>: Forward pass shape correctness (batch, seq, vocab_size), causal masking preventing future leakage, autoregressive generation</p></li>
<li><p><strong>Generation</strong>: Temperature sampling (conservative vs creative), sequence extension, parameter counting validation</p></li>
</ul>
</section>
<section id="inline-testing-architecture-validation">
<h3>Inline Testing &amp; Architecture Validation<a class="headerlink" href="#inline-testing-architecture-validation" title="Link to this heading">#</a></h3>
<p>The module includes comprehensive architecture validation:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># Example inline test output</span>
<span class="err">🔬</span> <span class="n">Unit</span> <span class="n">Test</span><span class="p">:</span> <span class="n">LayerNorm</span><span class="o">...</span>
<span class="err"></span> <span class="n">Mean</span> <span class="err"></span> <span class="mi">0</span><span class="p">,</span> <span class="n">std</span> <span class="err"></span> <span class="mi">1</span> <span class="n">after</span> <span class="n">normalization</span>
<span class="err"></span> <span class="n">Learnable</span> <span class="n">gamma</span><span class="o">/</span><span class="n">beta</span> <span class="n">parameters</span> <span class="n">work</span>
<span class="err">📈</span> <span class="n">Progress</span><span class="p">:</span> <span class="n">LayerNorm</span> <span class="err"></span>
<span class="err">🔬</span> <span class="n">Unit</span> <span class="n">Test</span><span class="p">:</span> <span class="n">MLP</span><span class="o">...</span>
<span class="err"></span> <span class="mi">4</span><span class="n">x</span> <span class="n">expansion</span> <span class="n">ratio</span> <span class="n">correct</span> <span class="p">(</span><span class="n">embed_dim</span> <span class="err"></span> <span class="mi">4</span><span class="o">*</span><span class="n">embed_dim</span><span class="p">)</span>
<span class="err"></span> <span class="n">Shape</span> <span class="n">preserved</span> <span class="p">(</span><span class="nb">input</span><span class="p">:</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">64</span><span class="p">]</span> <span class="err"></span> <span class="n">output</span><span class="p">:</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">64</span><span class="p">])</span>
<span class="err"></span> <span class="n">GELU</span> <span class="n">activation</span> <span class="n">applied</span>
<span class="err">📈</span> <span class="n">Progress</span><span class="p">:</span> <span class="n">MLP</span> <span class="err"></span>
<span class="err">🔬</span> <span class="n">Unit</span> <span class="n">Test</span><span class="p">:</span> <span class="n">TransformerBlock</span><span class="o">...</span>
<span class="err"></span> <span class="n">Pre</span><span class="o">-</span><span class="n">norm</span> <span class="n">residual</span> <span class="n">connections</span> <span class="n">work</span>
<span class="err"></span> <span class="n">Attention</span> <span class="o">+</span> <span class="n">MLP</span> <span class="n">sub</span><span class="o">-</span><span class="n">layers</span> <span class="n">compose</span> <span class="n">correctly</span>
<span class="err"></span> <span class="n">Causal</span> <span class="n">mask</span> <span class="n">prevents</span> <span class="n">future</span> <span class="n">information</span> <span class="n">leak</span>
<span class="err">📈</span> <span class="n">Progress</span><span class="p">:</span> <span class="n">TransformerBlock</span> <span class="err"></span>
<span class="err">🔬</span> <span class="n">Unit</span> <span class="n">Test</span><span class="p">:</span> <span class="n">GPT</span> <span class="n">Model</span><span class="o">...</span>
<span class="err"></span> <span class="n">Forward</span> <span class="k">pass</span><span class="p">:</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">8</span><span class="p">]</span> <span class="n">tokens</span> <span class="err">→</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span><span class="mi">100</span><span class="p">]</span> <span class="n">logits</span>
<span class="err"></span> <span class="n">Generation</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">5</span><span class="p">]</span> <span class="n">prompt</span> <span class="o">+</span> <span class="mi">3</span> <span class="n">new</span> <span class="err">→</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">8</span><span class="p">]</span> <span class="n">sequence</span>
<span class="err"></span> <span class="n">Parameter</span> <span class="n">counting</span> <span class="n">validates</span> <span class="nb">all</span> <span class="n">components</span>
<span class="err">📈</span> <span class="n">Progress</span><span class="p">:</span> <span class="n">GPT</span> <span class="n">Model</span> <span class="err"></span>
</pre></div>
</div>
</section>
<section id="manual-testing-examples">
<h3>Manual Testing Examples<a class="headerlink" href="#manual-testing-examples" title="Link to this heading">#</a></h3>
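<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span><span class="w"> </span><span class="nn">numpy</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="nn">np</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tinytorch.core.tensor</span><span class="w"> </span><span class="kn">import</span> <span class="n">Tensor</span> <span class="c1"># Tensor from Module 01</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">transformers</span><span class="w"> </span><span class="kn">import</span> <span class="n">GPT</span><span class="p">,</span> <span class="n">TransformerBlock</span><span class="p">,</span> <span class="n">LayerNorm</span><span class="p">,</span> <span class="n">MLP</span>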
<span class="c1"># Test LayerNorm</span>
<span class="n">ln</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="mi">512</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">512</span><span class="p">))</span> <span class="c1"># (batch, seq, features)</span>
<span class="n">normalized</span> <span class="o">=</span> <span class="n">ln</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Mean: </span><span class="si">{</span><span class="n">normalized</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="si">:</span><span class="s2">.4f</span><span class="si">}</span><span class="s2">, Std: </span><span class="si">{</span><span class="n">normalized</span><span class="o">.</span><span class="n">std</span><span class="p">()</span><span class="si">:</span><span class="s2">.4f</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">)</span> <span class="c1"># ≈ 0, ≈ 1</span>
<span class="c1"># Test MLP</span>
<span class="n">mlp</span> <span class="o">=</span> <span class="n">MLP</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="mi">512</span><span class="p">)</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">mlp</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">output</span><span class="o">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">512</span><span class="p">)</span> <span class="c1"># Shape preserved</span>
<span class="c1"># Test TransformerBlock</span>
<span class="n">block</span> <span class="o">=</span> <span class="n">TransformerBlock</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">num_heads</span><span class="o">=</span><span class="mi">8</span><span class="p">)</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">triu</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span> <span class="o">*</span> <span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">inf</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span> <span class="c1"># Causal mask</span>
<span class="n">transformed</span> <span class="o">=</span> <span class="n">block</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">)</span>
<span class="c1"># Test GPT</span>
<span class="n">gpt</span> <span class="o">=</span> <span class="n">GPT</span><span class="p">(</span><span class="n">vocab_size</span><span class="o">=</span><span class="mi">50000</span><span class="p">,</span> <span class="n">embed_dim</span><span class="o">=</span><span class="mi">768</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="n">num_heads</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">50000</span><span class="p">,</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">512</span><span class="p">)))</span> <span class="c1"># Batch of sequences</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">gpt</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="c1"># (4, 512, 50000)</span>
<span class="c1"># Test generation</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">15496</span><span class="p">,</span> <span class="mi">1917</span><span class="p">]]))</span> <span class="c1"># &quot;Hello world&quot;</span>
<span class="n">generated</span> <span class="o">=</span> <span class="n">gpt</span><span class="o">.</span><span class="n">generate</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">&quot;Generated </span><span class="si">{</span><span class="n">generated</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">prompt</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2"> new tokens&quot;</span><span class="p">)</span>
</pre></div>
</div>
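<p>The causal mask in the TransformerBlock test above can be checked in isolation. The following is a minimal sketch in plain Python rather than TinyTorch code: <code class="docutils literal notranslate"><span class="pre">causal_mask</span></code> and <code class="docutils literal notranslate"><span class="pre">masked_softmax</span></code> are hypothetical helper names, and the mask is assumed to be additive (0 where attention is allowed, -inf where it is blocked), matching the <code class="docutils literal notranslate"><span class="pre">np.triu(...,</span> <span class="pre">k=1)</span></code> construction.</p>

```python
import math

def causal_mask(seq_len):
    """Additive mask: 0.0 on and below the diagonal, -inf above it."""
    return [[0.0 if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax of (scores + mask); masked entries get weight 0."""
    out = []
    for score_row, mask_row in zip(scores, mask):
        shifted = [s + m for s, m in zip(score_row, mask_row)]
        peak = max(shifted)  # finite, because the diagonal is never masked
        exps = [math.exp(v - peak) for v in shifted]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform scores: each position should spread attention evenly over
# itself and earlier positions, and put zero weight on later ones.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print(row)
```

<p>Each row sums to 1, and every entry above the diagonal is exactly 0, which is the "no future information leak" property the unit test checks.</p>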
</section>
</section>
<section id="where-this-code-lives-in-the-final-package">
<h2>Where This Code Lives in the Final Package<a class="headerlink" href="#where-this-code-lives-in-the-final-package" title="Link to this heading">#</a></h2>
<p><strong>Package Export:</strong> Code exports to <code class="docutils literal notranslate"><span class="pre">tinytorch.models.transformer</span></code></p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="c1"># When students install tinytorch, they import your work like this:</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tinytorch.models.transformer</span><span class="w"> </span><span class="kn">import</span> <span class="n">GPT</span><span class="p">,</span> <span class="n">TransformerBlock</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tinytorch.nn</span><span class="w"> </span><span class="kn">import</span> <span class="n">LayerNorm</span><span class="p">,</span> <span class="n">MLP</span> <span class="c1"># Your normalization and feed-forward implementations</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tinytorch.core.tensor</span><span class="w"> </span><span class="kn">import</span> <span class="n">Tensor</span> <span class="c1"># Foundation from Module 01</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tinytorch.core.attention</span><span class="w"> </span><span class="kn">import</span> <span class="n">MultiHeadAttention</span> <span class="c1"># From Module 12</span>
<span class="kn">from</span><span class="w"> </span><span class="nn">tinytorch.text.embeddings</span><span class="w"> </span><span class="kn">import</span> <span class="n">Embedding</span> <span class="c1"># From Module 11</span>
<span class="c1"># Example: Build a GPT-2 scale model</span>
<span class="n">gpt2</span> <span class="o">=</span> <span class="n">GPT</span><span class="p">(</span>
<span class="n">vocab_size</span><span class="o">=</span><span class="mi">50257</span><span class="p">,</span> <span class="c1"># GPT-2 BPE vocabulary</span>
<span class="n">embed_dim</span><span class="o">=</span><span class="mi">768</span><span class="p">,</span> <span class="c1"># GPT-2 Small dimension</span>
<span class="n">num_layers</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="c1"># 12 transformer blocks</span>
<span class="n">num_heads</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span> <span class="c1"># 12 attention heads</span>
<span class="n">max_seq_len</span><span class="o">=</span><span class="mi">1024</span> <span class="c1"># 1K token context</span>
<span class="p">)</span>
<span class="c1"># Forward pass</span>
<span class="n">tokens</span> <span class="o">=</span> <span class="n">Tensor</span><span class="p">([[</span><span class="mi">15496</span><span class="p">,</span> <span class="mi">1917</span><span class="p">,</span> <span class="mi">318</span><span class="p">,</span> <span class="mi">281</span><span class="p">]])</span> <span class="c1"># Four example token IDs, shape (1, 4)</span>
<span class="n">logits</span> <span class="o">=</span> <span class="n">gpt2</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="c1"># (1, 4, 50257)</span>
<span class="c1"># Autoregressive generation</span>
<span class="n">generated</span> <span class="o">=</span> <span class="n">gpt2</span><span class="o">.</span><span class="n">generate</span><span class="p">(</span>
<span class="n">prompt_tokens</span><span class="o">=</span><span class="n">tokens</span><span class="p">,</span>
<span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
<span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span> <span class="c1"># Balanced creativity</span>
<span class="p">)</span>
<span class="c1"># Example: Build transformer components directly</span>
<span class="n">block</span> <span class="o">=</span> <span class="n">TransformerBlock</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">num_heads</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">mlp_ratio</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
<span class="n">ln</span> <span class="o">=</span> <span class="n">LayerNorm</span><span class="p">(</span><span class="mi">512</span><span class="p">)</span>
<span class="n">mlp</span> <span class="o">=</span> <span class="n">MLP</span><span class="p">(</span><span class="n">embed_dim</span><span class="o">=</span><span class="mi">512</span><span class="p">,</span> <span class="n">hidden_dim</span><span class="o">=</span><span class="mi">2048</span><span class="p">)</span>
</pre></div>
</div>
<p><strong>Package Structure:</strong></p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>tinytorch/
├── models/
│ └── transformer.py # GPT, TransformerBlock
├── nn/
│ ├── feedforward.py # MLP implementation
│ └── normalization.py # LayerNorm implementation
├── core/
│ ├── attention.py # MultiHeadAttention (Module 12)
│ └── layers.py # Linear layers
└── text/
└── embeddings.py # Embedding, PositionalEncoding
</pre></div>
</div>
</section>
<section id="systems-thinking-questions">
<h2>Systems Thinking Questions<a class="headerlink" href="#systems-thinking-questions" title="Link to this heading">#</a></h2>
<section id="real-world-applications">
<h3>Real-World Applications<a class="headerlink" href="#real-world-applications" title="Link to this heading">#</a></h3>
<ul class="simple">
<li><p><strong>Large Language Models (OpenAI, Anthropic, Google)</strong>: GPT-3 stacks 96 decoder layers. GPT-4's architecture is unpublished, though unofficial reports suggest a decoder stack of roughly 120 layers trained on trillions of tokens. ChatGPT launched on GPT-3.5 with RLHF fine-tuning. Claude uses a decoder-only architecture with Constitutional AI training. Nearly all modern LLMs are transformer decoders because the decoder-only architecture scales predictably with parameters and data: loss falls as a smooth power law in model size, dataset size, and compute.</p></li>
<li><p><strong>Code Generation Systems (GitHub, Google, Meta)</strong>: Copilot was built on a GPT-based decoder trained on billions of lines of GitHub code. AlphaCode uses an encoder-decoder transformer for competitive programming. Code Llama specializes LLaMA decoders (up to 70B parameters) for code completion. All rely on causal attention for autoregressive generation, because code is produced left to right, one token at a time.</p></li>
<li><p><strong>Conversational AI (ChatGPT, Claude, Gemini)</strong>: Modern chatbots use decoder-only transformers fine-tuned with RLHF (reinforcement learning from human feedback). The architecture is identical to base GPT: a conversation is formatted as a single sequence with special tokens. Production systems serve enormous query volumes, which makes efficient KV caching essential to avoid recomputing past tokens.</p></li>
<li><p><strong>Production Scaling Challenges</strong>: Training GPT-3 (175B parameters) required about 3.14×10²³ FLOPs (floating-point operations) and an estimated ~1,300 MWh of electricity. Inference costs dominate at scale: ChatGPT serves millions of users, which takes thousands of GPUs. Memory is the primary bottleneck: 175B parameters × 2 bytes (FP16) = 350GB just for model weights, plus activation memory during inference.</p></li>
</ul>
</section>
<section id="architectural-foundations">
<h3>Architectural Foundations<a class="headerlink" href="#architectural-foundations" title="Link to this heading">#</a></h3>
<ul class="simple">
<li><p><strong>Residual Connections Enable Deep Networks</strong>: Without residuals, gradients can vanish exponentially with depth: if each layer scales the gradient by ~0.1, then in a 12-layer network the gradient at layer 1 is 0.1¹² = 10⁻¹² times the output gradient. Residuals create gradient highways: for output = x + F(x), ∂Loss/∂x = ∂Loss/∂output × (1 + ∂F(x)/∂x), so the identity term carries the output gradient back unchanged even when ∂F(x)/∂x is small. This makes transformers with around 100 layers trainable (GPT-3 has 96).</p></li>
<li><p><strong>Pre-Norm vs Post-Norm Architecture</strong>: Pre-norm (LayerNorm before sub-layers) provides better gradient flow for deep models. In post-norm, gradients must flow through LayerNorm's division operation, which can amplify small gradient differences. Pre-norm gives each sub-layer clean normalized inputs (mean=0, var=1) while residuals bypass the normalization during backprop. GPT-2, GPT-3, and LLaMA all use pre-norm.</p></li>
<li><p><strong>Layer Normalization vs Batch Normalization</strong>: LayerNorm normalizes across features per sample (batch-independent); BatchNorm normalizes across the batch per feature (batch-dependent). Transformers use LayerNorm because: (1) variable sequence lengths make batch statistics unstable, (2) inference requires batch=1 support, (3) it works better empirically for NLP. BatchNorm works for CNNs because spatial dimensions provide a consistent normalization axis.</p></li>
<li><p><strong>MLP Expansion Ratio Trade-offs</strong>: Standard 4× expansion (embed_dim=512 → hidden=2048) balances capacity with compute. MLP parameters dominate transformers: per layer, the MLP has 8×embed_dim² parameters vs attention's 4×embed_dim². A larger expansion (8×) increases capacity, but MLP parameters and FLOPs grow linearly with the ratio (and quadratically with embed_dim). Some models experiment with 2× (faster) or gated MLPs (SwiGLU in LLaMA uses a ≈2.67× hidden width with three weight matrices, which keeps roughly 8×embed_dim² parameters).</p></li>
</ul>
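<p>The residual-connection arithmetic above can be sketched with scalars. This is an illustration only, under the assumption that every layer has the same local gradient ∂F(x)/∂x = 0.1; <code class="docutils literal notranslate"><span class="pre">end_to_end_gradient</span></code> is a hypothetical helper, not part of TinyTorch.</p>

```python
def end_to_end_gradient(local_grads, residual):
    """Multiply per-layer gradient factors along a scalar chain of layers.

    local_grads holds dF/dx for each layer. A plain layer y = F(x)
    contributes dF/dx; a residual layer y = x + F(x) contributes 1 + dF/dx.
    """
    grad = 1.0
    for d in local_grads:
        grad *= (1.0 + d) if residual else d
    return grad

local = [0.1] * 12  # assumed: every layer scales gradients by 0.1
plain = end_to_end_gradient(local, residual=False)
with_skip = end_to_end_gradient(local, residual=True)
print(f"plain: {plain:.1e}, residual: {with_skip:.2f}")
```

<p>In a real network the per-layer factors are Jacobian matrices rather than scalars, so this only shows how the identity term keeps the product from collapsing toward zero.</p>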
</section>
<section id="performance-characteristics">
<h3>Performance Characteristics<a class="headerlink" href="#performance-characteristics" title="Link to this heading">#</a></h3>
<ul class="simple">
<li><p><strong>Quadratic Attention Memory Growth</strong>: Attention computes a (batch, heads, seq_len, seq_len) matrix requiring batch×heads×seq_len² elements. For GPT-3 with seq_len=2048, batch=4, heads=96: 4×96×2048² ≈ 1.6B elements × 4 bytes = 6.4GB per layer just for attention matrices. Doubling sequence length quadruples attention memory. This is why an 8K context requires 4× the attention-matrix memory of a 4K context.</p></li>
<li><p><strong>Parameter Scaling</strong>: Total parameters ≈ vocab_size×embed_dim (embeddings) + num_layers×[4×embed_dim² (attention) + 8×embed_dim² (MLP)] ≈ num_layers×12×embed_dim². GPT-3 has embed_dim=12,288, num_layers=96 → 96×12×12,288² ≈ 174B, which with embeddings gives roughly 175B parameters. Storage: 175B × 2 bytes (FP16) = 350GB. Training needs roughly 4× that once gradients and optimizer states are included, about 1.4TB in total, which must be sharded across many GPUs.</p></li>
<li><p><strong>Computational Complexity</strong>: Per layer: O(batch×seq_len²×embed_dim) for attention + O(batch×seq_len×embed_dim²) for MLP. For short sequences (seq_len &lt; embed_dim), MLP dominates. For long sequences (seq_len &gt; embed_dim), attention dominates. Ignoring constant factors, GPT-3 with seq_len=2048, embed_dim=12,288 gives 2048²×12,288 ≈ 51B FLOPs for attention vs 2048×12,288² ≈ 309B FLOPs for the MLP, so the MLP dominates even at 2K tokens.</p></li>
<li><p><strong>Generation Efficiency</strong>: Autoregressive generation requires one forward pass per token. For 100 tokens through a 96-layer network: 100×96 = 9,600 layer evaluations. KV caching optimizes this: it caches key-value pairs from previous positions, reducing attention cost per generated token from O(n²) to O(n). Without a KV cache, generation is several times slower, and the gap widens as the sequence grows. Production systems always use KV caching.</p></li>
<li><p><strong>Memory-Compute Trade-offs</strong>: Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. It typically saves around 50% of activation memory or more at the cost of roughly 20-30% longer training time. Mixed-precision training (FP16/BF16 compute with FP32 master weights) roughly halves activation memory and can increase throughput by 2-3× on modern GPUs with tensor cores.</p></li>
</ul>
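<p>The parameter, memory, and FLOP estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch using the approximations stated in the list (biases, LayerNorm parameters, and positional embeddings are ignored); the function names are hypothetical, and the vocabulary size of 50,257 is an assumption borrowed from the GPT-2 example earlier on this page.</p>

```python
def transformer_params(vocab_size, embed_dim, num_layers):
    """Approximate parameter count: embeddings + 12 * d^2 per layer."""
    embeddings = vocab_size * embed_dim
    attention = 4 * embed_dim ** 2   # Q, K, V, and output projections
    mlp = 8 * embed_dim ** 2         # two linear layers with 4x expansion
    return embeddings + num_layers * (attention + mlp)

def attention_matrix_bytes(batch, heads, seq_len, bytes_per_element=4):
    """Memory for one layer's (batch, heads, seq_len, seq_len) score matrix."""
    return batch * heads * seq_len ** 2 * bytes_per_element

def layer_flops_estimate(seq_len, embed_dim):
    """Order-of-magnitude (attention, MLP) FLOPs, constants dropped."""
    return seq_len ** 2 * embed_dim, seq_len * embed_dim ** 2

# GPT-3-like configuration from the text (vocab size is an assumption)
params = transformer_params(vocab_size=50257, embed_dim=12288, num_layers=96)
attn_bytes = attention_matrix_bytes(batch=4, heads=96, seq_len=2048)
attn_flops, mlp_flops = layer_flops_estimate(seq_len=2048, embed_dim=12288)

print(f"parameters:       {params / 1e9:.1f}B")
print(f"FP16 weights:     {params * 2 / 1e9:.0f} GB")
print(f"attention matrix: {attn_bytes / 1e9:.2f} GB per layer")
print(f"attention vs MLP: {attn_flops / 1e9:.0f}B vs {mlp_flops / 1e9:.0f}B FLOPs")
```

<p>Doubling <code class="docutils literal notranslate"><span class="pre">seq_len</span></code> in <code class="docutils literal notranslate"><span class="pre">attention_matrix_bytes</span></code> multiplies the result by four, which is the quadratic growth described in the first bullet.</p>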
</section>
</section>
<section id="reflection-questions">
<h2>Reflection Questions<a class="headerlink" href="#reflection-questions" title="Link to this heading">#</a></h2>
<ol class="arabic simple">
<li><p><strong>Residual Connection Necessity</strong>: Remove residual connections from a 12-layer transformer. What happens during training? Calculate gradient flow: if each layer multiplies gradients by 0.5, what's the gradient at layer 1 after 12 layers? (0.5¹² ≈ 0.0002). How do residuals solve this by providing gradient highways that bypass layer computations?</p></li>
<li><p><strong>Pre-Norm vs Post-Norm Trade-offs</strong>: The original Transformer paper used post-norm (LayerNorm after sub-layers). Modern transformers use pre-norm (LayerNorm before). Why? Consider gradient flow: in post-norm, gradients pass through LayerNorm's division, which can amplify noise. In pre-norm, residuals bypass normalization. When does pre-norm become critical (how many layers)?</p></li>
<li><p><strong>Attention Memory Quadratic Growth</strong>: For seq_len=1024, batch=4, heads=8, attention matrix is 4×8×1024×1024 = 33.5M elements × 4 bytes = 134MB per layer. What happens at seq_len=4096? (×16 memory = 2.1GB per layer). Why is this quadratic growth the primary bottleneck for long-context models? How does FlashAttention address this?</p></li>
<li><p><strong>Parameter Scaling Analysis</strong>: GPT-3 has embed_dim=12,288, num_layers=96. Calculate approximate parameters: embeddings ≈ 50K vocab × 12,288 = 614M. Per layer: attention ≈ 4×12,288² = 604M, MLP ≈ 8×12,288² = 1.2B. Total per layer ≈ 1.8B. 96 layers × 1.8B ≈ 174B. Compare to the reported 175B. What's the parameter distribution?</p></li>
<li><p><strong>Decoder-Only vs Encoder-Decoder</strong>: Why did decoder-only (GPT) dominate over encoder-decoder (T5) for LLMs? Consider: (1) Simplicity of single architecture, (2) Scaling laws holding predictably, (3) Versatility handling both understanding and generation. When would you still choose encoder-decoder (translation, summarization)?</p></li>
<li><p><strong>Generation Efficiency</strong>: Generating 100 tokens through 96-layer GPT-3 takes 100 forward passes through all 96 layers = 9,600 layer evaluations, with or without KV caching. The difference is the work inside each pass. Without caching, every pass reprocesses the whole sequence so far, so total work grows quadratically: for a one-token prompt, 96 × (1 + 2 + … + 100) = 484,800 token-layer evaluations. With caching, only the new token is processed in each pass, for 96 × 100 = 9,600. The price is memory: the KV cache stores keys and values for all positions. Calculate memory for seq_len=2048: 2×(num_layers×batch×heads×seq_len×head_dim) elements. What's the memory-compute trade-off?</p></li>
</ol>
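<p>The KV-cache question above can be explored numerically. The sketch below uses the cache-size formula from the question and counts work in "token-layer evaluations" (one token passing through one layer). Both helper names are hypothetical, and the GPT-3-like values of heads=96 and head_dim=128 are assumed from embed_dim=12,288.</p>

```python
def kv_cache_bytes(num_layers, batch, heads, seq_len, head_dim,
                   bytes_per_element=2):
    """Keys + values for every layer and position (FP16 by default)."""
    elements = 2 * num_layers * batch * heads * seq_len * head_dim
    return elements * bytes_per_element

def token_layer_evals(prompt_len, new_tokens, num_layers, use_cache):
    """Count how many times a token is pushed through a layer."""
    if use_cache:
        # Prompt is processed once, then one new token per later step.
        tokens_processed = prompt_len + new_tokens - 1
    else:
        # Every step reprocesses the full sequence generated so far.
        tokens_processed = sum(prompt_len + t for t in range(new_tokens))
    return tokens_processed * num_layers

cache = kv_cache_bytes(num_layers=96, batch=1, heads=96,
                       seq_len=2048, head_dim=128)
no_cache_work = token_layer_evals(1, 100, 96, use_cache=False)
cached_work = token_layer_evals(1, 100, 96, use_cache=True)

print(f"KV cache: {cache / 1e9:.2f} GB per sequence")
print(f"work without cache: {no_cache_work:,}, with cache: {cached_work:,}")
```

<p>Under these assumptions the cache costs several gigabytes per sequence in exchange for roughly a 50× reduction in token-layer work over 100 generated tokens, and the saving grows with sequence length.</p>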
</section>
<section id="ready-to-build">
<h2>Ready to Build?<a class="headerlink" href="#ready-to-build" title="Link to this heading">#</a></h2>
<p>You're about to implement the transformer architecture that powers virtually all modern AI systems! The decoder-only GPT architecture you'll build follows the same core design used in ChatGPT, GPT-4, Claude, and other major language models. The building blocks are not simplified for teaching: they are the same layer normalization, attention, MLP, and residual structure that production models scale up.</p>
<p>Understanding transformers from first principles (implementing layer normalization, feed-forward networks, residual connections, and causal attention yourself) will give you deep insight into how production ML systems work. You'll understand why models like GPT-3 can stack 96 layers, why residual connections prevent gradient vanishing in deep networks, why pre-norm architecture enables training very deep models, and how attention memory scales quadratically with sequence length.</p>
<p>This module is the culmination of your Architecture Tier journey. You've built tensors (Module 01), activations (Module 02), layers (Module 03), embeddings (Module 11), and attention (Module 12). Now you'll compose them into a complete transformer model that parallels PyTorch's <code class="docutils literal notranslate"><span class="pre">nn.TransformerDecoder</span></code> and powers billion-dollar AI systems. Take your time, test thoroughly, and enjoy building the architecture behind ChatGPT, Claude, and the AI revolution!</p>
<p>Choose your preferred way to engage with this module:</p>
<div class="sd-container-fluid sd-sphinx-override sd-mb-4 docutils">
<div class="sd-row sd-row-cols-1 sd-row-cols-xs-1 sd-row-cols-sm-2 sd-row-cols-md-3 sd-row-cols-lg-3 docutils">
<div class="sd-col sd-d-flex-row docutils">
<div class="sd-card sd-sphinx-override sd-w-100 sd-shadow-sm sd-card-hover docutils">
<div class="sd-card-body docutils">
<div class="sd-card-title sd-font-weight-bold docutils">
🚀 Launch Binder</div>
<p class="sd-card-text">Run this module interactively in your browser. No installation required!</p>
</div>
<a class="sd-stretched-link sd-hide-link-text reference external" href="https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/13_transformers/transformers.ipynb"><span>https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/13_transformers/transformers.ipynb</span></a></div>
</div>
<div class="sd-col sd-d-flex-row docutils">
<div class="sd-card sd-sphinx-override sd-w-100 sd-shadow-sm sd-card-hover docutils">
<div class="sd-card-body docutils">
<div class="sd-card-title sd-font-weight-bold docutils">
⚡ Open in Colab</div>
<p class="sd-card-text">Use Google Colab for GPU access and cloud compute power.</p>
</div>
<a class="sd-stretched-link sd-hide-link-text reference external" href="https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/13_transformers/transformers.ipynb"><span>https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/13_transformers/transformers.ipynb</span></a></div>
</div>
<div class="sd-col sd-d-flex-row docutils">
<div class="sd-card sd-sphinx-override sd-w-100 sd-shadow-sm sd-card-hover docutils">
<div class="sd-card-body docutils">
<div class="sd-card-title sd-font-weight-bold docutils">
📖 View Source</div>
<p class="sd-card-text">Browse the Python source code and understand the implementation.</p>
</div>
<a class="sd-stretched-link sd-hide-link-text reference external" href="https://github.com/mlsysbook/TinyTorch/blob/main/modules/13_transformers/transformers.py"><span>https://github.com/mlsysbook/TinyTorch/blob/main/modules/13_transformers/transformers.py</span></a></div>
</div>
</div>
</div>
<div class="tip admonition">
<p class="admonition-title">💾 Save Your Progress</p>
<p><strong>Binder sessions are temporary!</strong> Download your completed notebook when done, or switch to local development for persistent work.</p>
</div>
<hr class="docutils" />
<div class="prev-next-area">
<a class="left-prev" href="../modules/12_attention/ABOUT.html" title="previous page">← Previous Module</a>
<a class="right-next" href="../modules/14_profiling/ABOUT.html" title="next page">Next Module →</a>
</div>
</section>
</section>
<script type="text/x-thebe-config">
{
requestKernel: true,
binderOptions: {
repo: "binder-examples/jupyter-stacks-datascience",
ref: "master",
},
codeMirrorConfig: {
theme: "abcdef",
mode: "python"
},
kernelOptions: {
name: "python3",
path: "./modules"
},
predefinedOutput: true
}
</script>
<script>kernelName = 'python3'</script>
</article>
<footer class="prev-next-footer d-print-none">
<div class="prev-next-area">
<a class="left-prev"
href="12_attention_ABOUT.html"
title="previous page">
<i class="fa-solid fa-angle-left"></i>
<div class="prev-next-info">
<p class="prev-next-subtitle">previous</p>
<p class="prev-next-title">12. Attention - The Mechanism That Powers Modern AI</p>
</div>
</a>
<a class="right-next"
href="../tiers/optimization.html"
title="next page">
<div class="prev-next-info">
<p class="prev-next-subtitle">next</p>
<p class="prev-next-title">⏱️ Optimization Tier (Modules 14-19)</p>
</div>
<i class="fa-solid fa-angle-right"></i>
</a>
</div>
</footer>
</div>
<div class="bd-sidebar-secondary bd-toc"><div class="sidebar-secondary-items sidebar-secondary__inner">
<div class="sidebar-secondary-item">
<div class="page-toc tocsection onthispage">
<i class="fa-solid fa-list"></i> Contents
</div>
<nav class="bd-toc-nav page-toc">
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#overview">Overview</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#learning-objectives">Learning Objectives</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#build-use-reflect">Build → Use → Reflect</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#implementation-guide">Implementation Guide</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#layernorm-training-stability-for-deep-networks">LayerNorm - Training Stability for Deep Networks</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#mlp-position-wise-feed-forward-network">MLP - Position-Wise Feed-Forward Network</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#transformerblock-complete-layer-with-attention-and-mlp">TransformerBlock - Complete Layer with Attention and MLP</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#gpt-complete-decoder-only-architecture">GPT - Complete Decoder-Only Architecture</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#decoder-only-architecture-choice">Decoder-Only Architecture Choice</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#getting-started">Getting Started</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#prerequisites">Prerequisites</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#development-workflow">Development Workflow</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#testing">Testing</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#comprehensive-test-suite">Comprehensive Test Suite</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#test-coverage-areas">Test Coverage Areas</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#inline-testing-architecture-validation">Inline Testing &amp; Architecture Validation</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#manual-testing-examples">Manual Testing Examples</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#where-this-code-lives-in-the-final-package">Where This Code Lives in the Final Package</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#systems-thinking-questions">Systems Thinking Questions</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#real-world-applications">Real-World Applications</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#architectural-foundations">Architectural Foundations</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#performance-characteristics">Performance Characteristics</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#reflection-questions">Reflection Questions</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#ready-to-build">Ready to Build?</a></li>
</ul>
</nav></div>
</div></div>
</div>
<footer class="bd-footer-content">
<div class="bd-footer-content__inner container">
<div class="footer-item">
<p class="component-author">
By Prof. Vijay Janapa Reddi (Harvard University)
</p>
</div>
<div class="footer-item">
<p class="copyright">
© Copyright 2025.
<br/>
</p>
</div>
<div class="footer-item">
</div>
<div class="footer-item">
</div>
</div>
</footer>
</main>
</div>
</div>
<!-- Scripts loaded after <body> so the DOM is not blocked -->
<script src="../_static/scripts/bootstrap.js?digest=dfe6caa3a7d634c4db9b"></script>
<script src="../_static/scripts/pydata-sphinx-theme.js?digest=dfe6caa3a7d634c4db9b"></script>
<footer class="bd-footer">
</footer>
</body>
</html>