TinyTorch/paper/paper.tex

\documentclass[10pt,twocolumn]{article}

% Adjust line spacing for better readability
\renewcommand{\baselinestretch}{1.05}

% Essential packages
\usepackage{fontspec}
\setmainfont{Palatino}[
  Ligatures=TeX,
  Numbers=OldStyle,
  UprightFont=*,
  ItalicFont=* Italic,
  BoldFont=* Bold,
  BoldItalicFont=* Bold Italic
]
\setsansfont{Helvetica Neue}[
  Scale=MatchLowercase,
  UprightFont=*,
  BoldFont=* Bold
]
\setmonofont{Courier New}[
  Scale=MatchLowercase
]
\usepackage{microtype}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{booktabs}
\usepackage{tabularx}
\usepackage{xcolor}
\usepackage{listings}
\usepackage[round]{natbib}
% Note: natbib with plainnat style shows full author lists
% To truncate, would need custom .bst file or biblatex instead
\usepackage{hyperref}
\usepackage{cleveref}
\usepackage{emoji}
\usepackage{tikz}
\usetikzlibrary{shapes,arrows,positioning}
\usepackage{subcaption}
\usepackage{enumitem}
\usepackage{titlesec}
\usepackage{fancyhdr}

% Caption styling - bold labels, small font, proper spacing
\captionsetup{
  font=small,
  labelfont=bf,
  labelsep=period,
  justification=justified,
  singlelinecheck=false,
  skip=8pt
}
\captionsetup[table]{position=top}
\captionsetup[figure]{position=bottom}
\captionsetup[lstlisting]{
  font=small,
  labelfont=bf,
  labelsep=period,
  skip=8pt
}

% Section spacing - tighter for two-column format
\titlespacing*{\section}{0pt}{0.8\baselineskip}{0.5\baselineskip}
\titlespacing*{\subsection}{0pt}{0.6\baselineskip}{0.4\baselineskip}
\titlespacing*{\subsubsection}{0pt}{0.5\baselineskip}{0.3\baselineskip}

% Page geometry
\usepackage[
  letterpaper,
  top=0.75in,
  bottom=1in,
  left=0.75in,
  right=0.75in,
  columnsep=0.25in
]{geometry}

% Header configuration - simple page numbers only
\pagestyle{fancy}
\fancyhf{} % Clear all headers and footers

% Define accent color (orange-red to match fire theme)
\definecolor{accentcolor}{RGB}{255,87,34} % Vibrant orange-red

% No header rule
\renewcommand{\headrulewidth}{0pt}

% Footer with centered page number - clean style
\fancyfoot[C]{%
  \fontsize{9}{11}\selectfont%
  \thepage%
}

% Adjust header height and spacing
\setlength{\headheight}{12pt}
\setlength{\headsep}{18pt}

% First page style - completely clean, no header or footer
\fancypagestyle{plain}{%
  \fancyhf{}%
  \renewcommand{\headrulewidth}{0pt}%
  \fancyhead{}%
  \fancyfoot{}%
}

% Python code highlighting
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.97,0.97,0.97}

\lstdefinestyle{pythonstyle}{
    backgroundcolor=\color{backcolour},
    commentstyle=\color{codegreen},
    keywordstyle=\color{blue},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\scriptsize,
    breakatwhitespace=false,
    breaklines=true,
    captionpos=b,
    keepspaces=true,
    numbers=left,
    numbersep=5pt,
    showspaces=false,
    showstringspaces=false,
    showtabs=false,
    tabsize=2,
    language=Python,
    frame=single,
    rulecolor=\color{black!30},
    xleftmargin=10pt,
    xrightmargin=5pt,
    aboveskip=8pt,
    belowskip=8pt
}

\lstset{style=pythonstyle}

% Hyperref setup
\hypersetup{
    colorlinks=true,
    linkcolor=blue,
    citecolor=blue,
    urlcolor=blue,
    pdftitle={TinyTorch: A Framework for Building ML Systems from Scratch},
    pdfauthor={Vijay Janapa Reddi}
}

% Title formatting - modern, clean design
\usepackage{titling}

% Remove the horizontal rules for a cleaner look
\pretitle{\begin{center}\vspace*{-1em}}
\posttitle{\vspace{0.8em}\end{center}}

% Title and authors - improved typography
\title{
  \fontsize{26}{32}\selectfont\bfseries
  Tiny\emoji{fire}Torch\\[0.3em]
  \fontsize{14}{17}\selectfont\normalfont
  Build Your Own Machine Learning Framework\\[0.2em]
  From Tensors to Systems
}
\author{
  \fontsize{12}{12}\selectfont
  Vijay Janapa Reddi\\
  \fontsize{12}{12}\selectfont
  Harvard University\\[1.5em]
  \fontsize{10}{12}\selectfont
  \fontsize{12}{12}\selectfont
  \textcolor{gray!70}{\href{https://www.tinytorch.ai}{tinytorch.ai}}
}

\date{}

\begin{document}

% Title page - no header
\thispagestyle{plain}
\maketitle

% Abstract - REVISED: Curriculum design focus
\begin{abstract}
Machine learning education typically teaches framework usage without exposing internals, leaving students unable to debug gradient flows, profile memory bottlenecks, or understand optimization tradeoffs. TinyTorch addresses this gap through a build-from-scratch curriculum where students implement PyTorch's core components—tensors, autograd, optimizers, and neural networks—to gain framework transparency.

We present the design and implementation of three pedagogical design patterns for teaching ML as systems engineering. \textbf{Progressive disclosure} activates dormant tensor features across modules through monkey-patching, modeling how frameworks evolve from separate abstractions to unified interfaces. \textbf{Systems-first curriculum} embeds memory profiling and complexity analysis from the start rather than treating them as advanced topics. \textbf{Historical milestone validation} recreates nearly 70 years of ML breakthroughs (1958 Perceptron through modern transformers) using exclusively student-implemented code to validate correctness. These patterns are grounded in established learning theory (situated cognition, cognitive load theory, cognitive apprenticeship) but represent testable design hypotheses whose learning outcomes require empirical validation.

The 20-module curriculum (estimated 60--80 hours) provides complete open-source infrastructure for institutional adoption or self-paced learning at \texttt{tinytorch.ai}.
\end{abstract}


% Main content
\section{Introduction}
\label{sec:intro}

Machine learning deployment faces a critical workforce bottleneck: industry surveys indicate significant demand-supply imbalances for ML systems engineers~\citep{roberthalf2024talent,keller2025ai}, with surveys suggesting that a substantial portion of executives cite talent shortage as their primary barrier to AI adoption~\citep{keller2025ai}.

Unlike algorithmic ML—where automated tools increasingly handle model architecture search and hyperparameter tuning—systems engineering remains bottlenecked by tacit knowledge that resists automation: understanding \emph{why} Adam requires 2$\times$ optimizer state memory, \emph{when} attention's $O(N^2)$ scaling becomes prohibitive, \emph{how} to navigate accuracy-latency-memory tradeoffs in production systems. These engineering judgment calls depend on mental models of framework internals~\citep{meadows2008thinking}, traditionally acquired through years of debugging PyTorch or TensorFlow rather than formal instruction.

Current ML education creates this gap by separating algorithms from systems. Students learn to implement gradient descent without measuring memory consumption, build attention mechanisms without profiling $O(N^2)$ costs, and train models without understanding optimizer state overhead. Introductory courses use high-level APIs (PyTorch, Keras) that abstract away implementation details, while advanced electives teach systems concepts (memory management, performance optimization) in isolation from ML frameworks. This pedagogical divide produces graduates who can \emph{use} \texttt{loss.backward()} but cannot explain how computational graphs enable reverse-mode differentiation, or who understand transformers mathematically but miss that KV caching trades $O(N^2)$ memory for $O(N)$ recomputation.

We present TinyTorch, a 20-module curriculum where students build PyTorch's core components from scratch using only NumPy: tensors, automatic differentiation, optimizers, CNNs, transformers, and production optimization techniques. Students transition from framework \emph{users} to framework \emph{engineers} by implementing the internals that high-level APIs deliberately hide. As a hands-on companion to the \emph{Machine Learning Systems} textbook~\citep{reddi2024mlsysbook}, TinyTorch transforms tacit systems knowledge into explicit pedagogy—students don't just learn \emph{that} Adam requires 4$\times$ training memory, they \emph{implement} momentum and variance buffers and \emph{measure} the footprint directly through profiling code they wrote. \Cref{fig:code-comparison} contrasts this bottom-up approach with traditional top-down API usage.

\begin{figure*}[t]
\centering
\begin{minipage}[b]{0.48\textwidth}
\begin{subfigure}[b]{\textwidth}
\centering
\begin{lstlisting}[basicstyle=\ttfamily\scriptsize,frame=single]
import torch.nn as nn
import torch.optim as optim

# How much memory?
model = nn.Linear(784, 10)
# Why does Adam need more memory
# than SGD?
optimizer = optim.Adam(
    model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for x, y in dataloader:
        pred = model(x)
        loss = loss_fn(pred, y)
        loss.backward()   # Magic?
        optimizer.step()  # How?
        # What cost? How fast?
\end{lstlisting}
\subcaption{PyTorch: Using frameworks as black boxes}
\label{lst:pytorch-usage}
\end{subfigure}
\vspace{0.5em}
\begin{subfigure}[b]{\textwidth}
\centering
\begin{lstlisting}[basicstyle=\ttfamily\scriptsize,frame=single]
import tensorflow as tf

# What's happening inside?
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10,
        input_shape=(784,))
])
# Why Adam over SGD?
# Memory cost?
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy')

model.fit(dataloader, epochs=10)
# How does it work?
# What's the complexity?
\end{lstlisting}
\subcaption{TensorFlow: High-level abstractions}
\label{lst:tensorflow-usage}
\end{subfigure}
\end{minipage}
\hfill
\begin{subfigure}[b]{0.48\textwidth}
\centering
\begin{lstlisting}[basicstyle=\ttfamily\scriptsize,frame=single]
class Linear:
    def __init__(self, in_features,
                 out):
        # Memory: out × in_features × 4B
        self.weight = Tensor.randn(
            out, in_features)
        self.bias = Tensor.zeros(out)

    def forward(self, x):
        # O(batch × in × out) FLOPs
        return (x @ self.weight.T +
                self.bias)

class Adam:
    def __init__(self, params,
                 lr=0.001):
        self.params = params
        self.lr = lr
        # 2× optimizer state:
        # momentum + variance
        # Why 2× memory vs SGD?
        self.m = [Tensor.zeros_like(p)
                  for p in params]
        self.v = [Tensor.zeros_like(p)
                  for p in params]

    def step(self):
        for i, p in enumerate(
                self.params):
            # Exponential moving avg
            self.m[i] = (0.9*self.m[i] +
                        0.1*p.grad)
            self.v[i] = (0.999*self.v[i] +
                        0.001*p.grad**2)
            # Per-parameter adaptive lr
            p.data -= (self.lr *
                self.m[i] /
                (self.v[i].sqrt() + 1e-8))
\end{lstlisting}
\subcaption{TinyTorch: Understanding internals}
\label{lst:tinytorch-build}
\end{subfigure}
\caption{Learning progression from framework users to engineers. (a-b) PyTorch/TensorFlow: high-level API usage. (c) TinyTorch: building internals reveals optimizer memory costs, computational complexity, and systems constraints.}
\label{fig:code-comparison}
\end{figure*}

The curriculum addresses three fundamental pedagogical challenges: teaching systems thinking \emph{alongside} ML fundamentals rather than in separate electives (\Cref{sec:systems}), managing cognitive load when teaching both algorithms and implementation (\Cref{sec:progressive}), and validating that bottom-up implementation produces working systems (\Cref{subsec:milestones}). The following sections detail how TinyTorch's design addresses each challenge.

The curriculum follows the compiler course model~\citep{aho2006compilers}: students build a complete system module-by-module, experiencing how components integrate through direct implementation. \Cref{fig:module-flow} illustrates the dependency structure—tensors (Module 01) enable activations (02) and layers (03), which feed into autograd (05), which powers optimizers (06) and training (07). This incremental construction mirrors how compiler courses connect lexical analysis to parsing to code generation, creating systems thinking through component integration. Each completed module becomes immediately usable: after Module 03, students can build neural networks; after Module 05, automatic differentiation enables training; after Module 13, transformers support language modeling.

\begin{figure}[t]
\centering
\resizebox{\columnwidth}{!}{%
\begin{tikzpicture}[node distance=1.0cm and 1.8cm, every node/.style={font=\scriptsize}]
% Foundation tier
\node[draw,rectangle,fill=blue!20,minimum width=1.8cm] (M01) {01 Tensor};
\node[draw,rectangle,fill=blue!20,below=of M01,minimum width=1.8cm] (M02) {02 Activations};
\node[draw,rectangle,fill=blue!20,below=of M02,minimum width=1.8cm] (M03) {03 Layers};
\node[draw,rectangle,fill=blue!20,below=of M03,minimum width=1.8cm] (M04) {04 Losses};
\node[draw,rectangle,fill=orange!30,below=of M04,minimum width=1.8cm] (M05) {05 Autograd};
\node[draw,rectangle,fill=blue!20,below=of M05,minimum width=1.8cm] (M06) {06 Optimizers};
\node[draw,rectangle,fill=blue!20,below=of M06,minimum width=1.8cm] (M07) {07 Training};

% Architecture tier
\node[draw,rectangle,fill=purple!20,right=of M01,minimum width=1.8cm] (M08) {08 DataLoader};
\node[draw,rectangle,fill=purple!20,below=of M08,minimum width=1.8cm] (M09) {09 CNNs};
\node[draw,rectangle,fill=purple!20,below=of M09,minimum width=1.8cm] (M10) {10 Tokenization};
\node[draw,rectangle,fill=purple!20,below=of M10,minimum width=1.8cm] (M11) {11 Embeddings};
\node[draw,rectangle,fill=purple!20,below=of M11,minimum width=1.8cm] (M12) {12 Attention};
\node[draw,rectangle,fill=purple!20,below=of M12,minimum width=1.8cm] (M13) {13 Transformers};

% Optimization tier
\node[draw,rectangle,fill=green!20,right=of M08,minimum width=1.8cm] (M14) {14 Profiling};
\node[draw,rectangle,fill=green!20,below=of M14,minimum width=1.8cm] (M15) {15 Quantization};
\node[draw,rectangle,fill=green!20,below=of M15,minimum width=1.8cm] (M16) {16 Compression};
\node[draw,rectangle,fill=green!20,below=of M16,minimum width=1.8cm] (M17) {17 Memoization};
\node[draw,rectangle,fill=green!20,below=of M17,minimum width=1.8cm] (M18) {18 Acceleration};
\node[draw,rectangle,fill=green!20,below=of M18,minimum width=1.8cm] (M19) {19 Benchmarking};
\node[draw,rectangle,fill=red!30,below=of M19,minimum width=1.8cm] (M20) {20 Olympics};

% Arrows - Foundation connections (straight lines within column)
\draw[->] (M01) -- (M02);
\draw[->] (M02) -- (M03);
\draw[->] (M03) -- (M04);
\draw[->] (M04) -- (M05);
\draw[->] (M05) -- (M06);
\draw[->] (M06) -- (M07);

% Architecture connections (straight lines within column)
\draw[->] (M08) -- (M09);
\draw[->] (M09) -- (M10);
\draw[->] (M10) -- (M11);
\draw[->] (M11) -- (M12);
\draw[->] (M12) -- (M13);

% Optimization connections (straight lines within column)
\draw[->] (M14) -- (M15);
\draw[->] (M15) -- (M16);
\draw[->] (M16) -- (M17);
\draw[->] (M17) -- (M18);
\draw[->] (M18) -- (M19);
\draw[->] (M19) -- (M20);

% Cross-tier connections (use bent arrows to avoid overlaps)
\draw[->,bend left=10] (M01) to (M08);
\draw[->,bend left=15] (M01) to (M09);
\draw[->,bend left=10] (M03) to (M09);
\draw[->,bend left=15] (M05) to (M09);
\draw[->,bend left=15] (M01) to (M11);
\draw[->,bend left=10] (M03) to (M12);
\draw[->,bend left=15] (M05) to (M12);
\draw[->,bend left=10] (M02) to (M13);
\draw[->,bend left=15] (M11) to (M13);

% Training to architectures
\draw[->,dashed,bend left=10] (M07) to (M09);
\draw[->,dashed,bend left=15] (M07) to (M13);

% Architectures to optimization
\draw[->,dashed,bend left=10] (M09) to (M14);
\draw[->,dashed,bend left=15] (M13) to (M14);

\end{tikzpicture}%
}
\caption{Module dependency flow shows how students build a complete ML framework incrementally. Foundation modules (blue, M01-07) provide core tensor operations and training infrastructure. These enable architecture modules (purple, M08-13) where students implement CNNs and transformers using only their own code. Optimization modules (green, M14-19) teach production concerns: profiling, quantization, and deployment. Solid arrows show direct dependencies (e.g., autograd M05 requires tensors M01); dotted lines show cross-tier integration (e.g., benchmarking M19 requires all architectures). This structure mirrors compiler courses: each module builds on previous work, creating systems thinking through component integration.}
\label{fig:module-flow}
\end{figure}

TinyTorch serves students transitioning from framework \emph{users} to framework \emph{engineers}: those who have completed introductory ML courses (e.g., CS229, fast.ai) and want to understand PyTorch internals, those planning ML systems research or infrastructure careers, or practitioners debugging production deployment issues. The curriculum assumes NumPy proficiency and basic neural network familiarity but teaches framework architecture from first principles. Students needing immediate GPU/distributed training skills are better served by PyTorch tutorials; those preferring project-based application building will find high-level frameworks more appropriate. The 20-module structure supports flexible pacing: intensive completion (estimated 2-3 weeks at full-time pace), semester integration (parallel with lectures), or self-paced professional development.

The curriculum introduces three pedagogical design innovations. First, \textbf{progressive disclosure} is designed to manage cognitive load through runtime feature activation: \texttt{Tensor} gradient attributes exist from Module 01 but remain dormant until Module 05 activates automatic differentiation (\Cref{sec:progressive}). This monkey-patching technique maintains a unified mental model while revealing complexity gradually, teaching both current framework usage and historical evolution (PyTorch's Variable/Tensor merger). Second, \textbf{systems-first integration} embeds memory profiling, FLOPs analysis, and performance reasoning from Module 01 onwards rather than deferring to advanced electives (\Cref{sec:systems}). Students measure what they build: Conv2d's 109$\times$ parameter efficiency over Dense layers, attention's $O(N^2)$ memory scaling, quantization's 4$\times$ compression. Third, \textbf{historical milestone validation} provides correctness proof through replication: students recreate nearly 70 years of ML breakthroughs (1958 Perceptron through 2024 Llama-style transformers) using exclusively their own implementations, demonstrating that their code works on real tasks.

This paper makes three primary contributions:

\begin{enumerate}
\item \textbf{Systems-First Curriculum Architecture}: A 20-module learning path integrating memory profiling, computational complexity, and performance analysis from Module 01 onwards, replacing traditional algorithm-systems separation. Students discover systems constraints through direct measurement (Adam's 2$\times$ optimizer state overhead, Conv2d's 109$\times$ parameter efficiency, KV caching's $O(n^2) \rightarrow O(n)$ transformation) rather than abstract instruction (\Cref{sec:curriculum,sec:systems}). This architecture directly addresses the workforce gap by making tacit systems knowledge explicit through hands-on implementation. Grounded in situated cognition~\citep{lave1991situated} and constructionism~\citep{papert1980mindstorms}, with systems thinking pedagogy informed by established frameworks~\citep{meadows2008thinking}.

\item \textbf{Progressive Disclosure Pattern}: To make systems-first learning tractable, we introduce a pedagogical technique using monkey-patching (runtime method replacement) to reveal \texttt{Tensor} complexity gradually while maintaining a unified mental model (\Cref{sec:progressive}). This enables forward-compatible code (Module 01 implementations don't break when autograd activates) and teaches framework evolution (PyTorch's Variable/Tensor merger). Grounded in cognitive load theory~\citep{sweller1988cognitive} and cognitive apprenticeship~\citep{collins1989cognitive}.

\item \textbf{Open Educational Infrastructure}: Both innovations are validated through a complete open-source curriculum with NBGrader assessment infrastructure~\citep{blank2019nbgrader}, three integration models (self-paced learning, institutional courses, team onboarding), historical milestone validation (1958 Perceptron through 2024 optimized transformers), and PyTorch-inspired package architecture. This infrastructure enables community adoption, curricular adaptation, and empirical research into ML systems pedagogy effectiveness (\Cref{sec:curriculum,sec:deployment,sec:discussion}).
\end{enumerate}

\noindent\textbf{Scope:} These contributions represent demonstrated design patterns and complete educational infrastructure grounded in established learning theory. The curriculum's technical correctness is validated through historical milestone recreation (students train CNNs targeting 75\%+ CIFAR-10 accuracy using exclusively their own implementations). Learning outcome claims—that systems-first integration improves debugging skills, that progressive disclosure reduces cognitive load, that graduates achieve production readiness faster—remain testable hypotheses requiring empirical validation through controlled classroom studies. We detail specific research questions and measurement methodologies in \Cref{sec:discussion}.

\noindent\textbf{Paper Organization.} Before presenting TinyTorch's design, we position our contributions relative to existing educational frameworks and grounding learning theories (\Cref{sec:related}). We then present the systems-first curriculum architecture (\Cref{sec:curriculum}), its integration throughout modules (\Cref{sec:systems}), and the progressive disclosure pattern enabling cognitive load management (\Cref{sec:progressive}). Finally, we discuss limitations, empirical validation plans, and implications for ML education (\Cref{sec:discussion,sec:conclusion}).

\section{Related Work}
\label{sec:related}

TinyTorch builds upon decades of work in CS education research and recent innovations in ML framework pedagogy. We position our contributions relative to existing educational frameworks and grounding learning theories.

\subsection{Educational ML Frameworks}

Educational frameworks teaching ML internals occupy different points in the scope-simplicity tradeoff space. \textbf{micrograd}~\citep{karpathy2022micrograd} demonstrates autograd mechanics elegantly in approximately 200 lines of scalar-valued Python, making backpropagation transparent through decomposition into elementary operations. Its pedagogical clarity comes from intentional minimalism: scalar operations only, no tensor abstraction, focused solely on automatic differentiation fundamentals. This design illuminates gradient mechanics but necessarily omits systems concerns (memory profiling, computational complexity, production patterns) and modern architectures.

\textbf{MiniTorch}~\citep{schneider2020minitorch} extends beyond autograd to tensor operations, neural network modules, and optional GPU programming, originating from Cornell Tech's Machine Learning Engineering course. The curriculum progresses from foundational autodifferentiation through deep learning with assessment infrastructure (unit tests, visualization tools). While MiniTorch includes an optional GPU module exploring parallel programming concepts and covers efficiency considerations throughout, the core curriculum emphasizes mathematical rigor: students work through detailed exercises building tensor abstractions from first principles. TinyTorch differs through systems-first emphasis (memory profiling and complexity analysis embedded from Module 01), production-inspired package organization, and three integration models supporting diverse deployment contexts.

\textbf{tinygrad}~\citep{hotz2023tinygrad} positions itself between micrograd's simplicity and PyTorch's production capabilities, providing a complete framework (tensor library, IR, compiler, JIT) that emphasizes hackability and transparency. Unlike opaque production frameworks, tinygrad makes ``the entire compiler and IR visible,'' enabling students to understand deep learning compilation internals. While pedagogically valuable through its inspectable design, tinygrad assumes significant background: students must navigate compiler concepts, multiple hardware backends, and production-level architecture without scaffolded progression or automated assessment infrastructure.

\textbf{Stanford CS231n}~\citep{johnson2016cs231n}, \textbf{CMU Deep Learning Systems (CS 10-414)}~\citep{chen2022dlsyscourse}, and \textbf{Harvard TinyML}~\citep{banbury2021widening} represent university courses that include implementation components with different systems emphases. CS231n's assignments involve NumPy implementations of CNNs, backpropagation, and optimization algorithms, providing hands-on experience with neural network internals. However, assignments are isolated exercises rather than cumulative framework construction, and systems concerns (memory profiling, complexity analysis) are not embedded from the start. CMU's DL Systems course explicitly targets ML systems engineering, covering automatic differentiation, GPU programming, distributed training, and deployment---representing the production systems knowledge TinyTorch provides conceptual foundations for. Harvard's TinyML Professional Certificate focuses on deploying ML to resource-constrained embedded devices (microcontrollers with KB-scale memory), teaching TensorFlow Lite for Microcontrollers through Arduino-based projects. While TinyML emphasizes hardware constraints and embedded deployment (achieving systems thinking through resource limitations), TinyTorch focuses on framework internals and algorithmic understanding (achieving systems thinking through implementation transparency). TinyML students learn \emph{how to optimize for} hardware constraints; TinyTorch students learn \emph{why frameworks work} internally. These approaches complement rather than compete: TinyML prepares students for edge deployment, TinyTorch for framework engineering and infrastructure development.

\textbf{Dive into Deep Learning (d2l.ai)}~\citep{zhang2021dive} and \textbf{fast.ai}~\citep{howard2020fastai} represent comprehensive ML education but with different pedagogical emphases than framework construction. d2l.ai provides interactive implementations across multiple frameworks (PyTorch, JAX, TensorFlow, MXNet/NumPy) through executable notebooks, teaching algorithmic foundations alongside practical coding. The NumPy implementation track includes from-scratch implementations of key algorithms, though these are presented as educational demonstrations rather than components of a cumulative framework students build. With widespread adoption across hundreds of universities globally, it excels at algorithmic understanding through framework usage. fast.ai's distinctive top-down pedagogy starts with practical applications before foundations, using layered APIs that provide high-level abstractions while enabling deeper exploration through PyTorch. Both resources assume cloud computing access (AWS, Google Colab, SageMaker) for GPU-based training, though provide various deployment options.

\subsection{Learning Theory Foundations}

TinyTorch's pedagogical design draws on established learning theories validated across CS education.

\textbf{Constructionism}~\citep{papert1980mindstorms} argues learning occurs most effectively when students construct artifacts others can examine. TinyTorch realizes this through framework building---students create tangible ML systems enabling peer code review, portfolio demonstration, and conceptual debugging through concrete implementation.

\textbf{Cognitive Apprenticeship}~\citep{collins1989cognitive} emphasizes making expert thinking visible through modeling, coaching, and scaffolding. Progressive disclosure (\Cref{sec:progressive}) models expert framework evolution: students experience how PyTorch's Tensor class grew capabilities (matching PyTorch 0.4's Variable-Tensor merger~\citep{pytorch04release}). Module structure provides coaching through connection maps (showing prerequisites and unlocked capabilities) and scaffolding through integration tests validating cross-module composition.

\textbf{Productive Failure}~\citep{kapur2008productive} demonstrates that struggling with problems before instruction deepens understanding compared to direct teaching. TinyTorch applies this through minimal upfront explanation: students implement components first, encounter integration failures (``My Conv2d passes unit tests but crashes during backpropagation''), then discover why interface design matters. Historical milestones validate eventual success, providing delayed gratification after productive struggle.

\textbf{Threshold Concepts}~\citep{meyer2003threshold} identify transformative ideas (computational graphs, gradient flow, memory-compute tradeoffs) whose mastery fundamentally changes student thinking. Unlike traditional ML courses presenting these abstractly, TinyTorch makes thresholds concrete through implementation: students cannot complete Module 05 without understanding computational graphs because they must implement the graph data structure. Systems-first integration (\Cref{sec:systems}) addresses the ``memory reasoning'' threshold: calculating VRAM requirements before operations becomes reflexive through repeated practice starting Module 01.

\subsection{Positioning and Unique Contributions}

TinyTorch occupies a distinct pedagogical niche through its \textbf{bottom-up, systems-first approach}. Unlike top-down pedagogies (fast.ai: start with applications, descend to details) or algorithm-focused curricula (d2l.ai: master theory through framework usage), TinyTorch employs bottom-up framework construction: students build core abstractions first (tensors, autograd, layers), then compose them into architectures (CNNs, transformers), finally optimizing for production constraints (quantization, compression). This grounds systems thinking in direct implementation rather than abstract instruction. The curriculum serves students post-introductory-ML (ready to transition from framework users to engineers), pre-systems-research (providing foundations before production ML courses like CMU DL Systems), and complementary to algorithm courses (adding systems awareness to mathematical foundations).

\Cref{tab:framework-comparison} positions TinyTorch relative to both educational frameworks (micrograd, MiniTorch, tinygrad) and production frameworks (PyTorch, TensorFlow), clarifying that TinyTorch serves as pedagogical bridge between understanding frameworks and using them professionally.

\begin{table*}[tp]
\centering
\caption{Framework comparison positions TinyTorch's pedagogical role. Educational frameworks (micrograd, MiniTorch, tinygrad) prioritize learning over production use. Production frameworks (PyTorch, TensorFlow) prioritize performance and scalability. TinyTorch bridges both: students learn framework internals through implementation, then transfer that knowledge to production frameworks with deeper systems understanding.}
\label{tab:framework-comparison}
\small
\renewcommand{\arraystretch}{1.4}
\setlength{\tabcolsep}{6pt}
\begin{tabularx}{0.98\textwidth}{@{}>{\raggedright\arraybackslash}p{2.0cm}>{\raggedright\arraybackslash}p{1.8cm}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X@{}}
\toprule
\textbf{Framework} & \textbf{Purpose} & \textbf{Scope} & \textbf{Systems Focus} & \textbf{Target Outcome} \\
\midrule
\multicolumn{5}{@{}l}{\textbf{Educational Frameworks}} \\
\addlinespace[2pt]
micrograd & Teach autograd & Autograd only (scalar) & Minimal & Understand backprop \\
MiniTorch & Teach ML math & Tensors + autograd + optional GPU & Math foundations & Build from first principles \\
tinygrad & Inspectable production & Complete (compiler, IR, JIT) & Advanced (compiler) & Understand compilation \\
\addlinespace[1pt]
\textbf{TinyTorch} & \textbf{Teach systems} & \textbf{Complete (tensors $\rightarrow$ transformers $\rightarrow$ optimization)} & \textbf{Embedded from Module 01} & \textbf{Framework engineers} \\
\addlinespace[2pt]
\midrule
\multicolumn{5}{@{}l}{\textbf{Production Frameworks}} \\
\addlinespace[2pt]
PyTorch & Production ML & Complete (GPU, distributed, deployment) & Advanced (implicit) & Train models efficiently \\
TensorFlow & Production ML & Complete (GPU, distributed, deployment, mobile) & Advanced (implicit) & Deploy at scale \\
\bottomrule
\end{tabularx}
\end{table*}

TinyTorch differs from educational frameworks through systems-first integration and from production frameworks through pedagogical transparency:

\begin{itemize}
\item \textbf{vs. micrograd}: Complete framework scope beyond autograd (tensors, architectures, optimizers, transformers), systems integration (memory/performance from Module 01), automated assessment infrastructure (NBGrader)
\item \textbf{vs. MiniTorch}: TinyTorch inverts pedagogical priorities—where MiniTorch prioritizes mathematical rigor (students work through tensor abstractions and broadcasting semantics before encountering systems concerns, with GPU optimization as optional advanced module), TinyTorch embeds systems thinking from Module 01 (students calculate memory footprints before matrix multiplication, profile FLOPs during convolution). Progressive disclosure maintains unified API across modules (Tensor class evolves via monkey-patching) while MiniTorch introduces separate abstractions (ScalarTensor $\rightarrow$ Tensor $\rightarrow$ CudaTensor), modeling production framework evolution versus pedagogical scaffolding
\item \textbf{vs. tinygrad}: Scaffolded pedagogical progression with assessment versus inspectable production system, accessibility (CPU-only, no compiler background required) versus hackability (multiple backends, IR exploration)
\item \textbf{vs. d2l.ai}: Framework construction (build internals) versus algorithmic mastery (apply frameworks), systems-first integration versus algorithm-focused curriculum
\item \textbf{vs. fast.ai}: Bottom-up framework building versus top-down application focus, constructionist artifacts (students create importable framework) versus practitioner training (students use layered APIs)
\end{itemize}

Empirical validation of learning outcomes remains future work (\Cref{sec:discussion}), but design grounding in established theory (constructionism, cognitive apprenticeship, productive failure, threshold concepts) provides theoretical justification for pedagogical choices.

\section{TinyTorch Architecture}
\label{sec:curriculum}

This section presents the 20-module curriculum structure, organized into four tiers that progressively build a complete ML framework.

Traditional ML education presents algorithms sequentially without revealing how components integrate into working systems. TinyTorch addresses this through a 4-phase curriculum architecture where students build a complete ML framework progressively, with each module enforcing prerequisite mastery.

\subsection{Prerequisites}

As established in \Cref{sec:intro}, TinyTorch targets students transitioning from framework users to framework engineers. The curriculum assumes intermediate Python proficiency---comfort with classes, functions, and NumPy array operations---alongside mathematical foundations in linear algebra (matrix multiplication, vectors) and basic calculus (derivatives, chain rule). Students should understand complexity analysis (Big-O notation) and basic algorithms. While prior ML coursework (traditional machine learning or deep learning courses) and data structures courses are helpful, they are not strictly required; motivated students can acquire these foundations concurrently.

\subsection{The 3-Tier Learning Journey + Olympics}

TinyTorch organizes modules into three progressive tiers plus a capstone competition (\Cref{tab:objectives}). Students cannot skip tiers: architectures require foundation mastery, optimization demands training system understanding. The tiers mirror ML systems engineering practice: foundation (core ML mechanics), architectures (domain-specific models), optimization (production deployment), culminating in the AI Olympics (competitive systems engineering).

\begin{table*}[p]
\centering
\caption{Module progression integrates ML algorithms with systems concepts from the start. Each module teaches both "what" (ML technique) and "how much" (memory/compute costs). Foundation tier (M01-07) establishes core operations with explicit resource tracking. Architecture tier (M08-13) applies these foundations to CNNs and transformers. Optimization tier (M14-19) adds production concerns: profiling, quantization, deployment. This dual-concept approach ensures students never learn algorithms without understanding their systems implications.}
\label{tab:objectives}
\resizebox{\textwidth}{!}{%
\small
\renewcommand{\arraystretch}{1.4}
\setlength{\tabcolsep}{7pt}
\begin{tabularx}{\textwidth}{@{}cl>{\raggedright\arraybackslash}p{2.2cm}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X@{}}
\toprule
\textbf{Mod} & \textbf{Tier} & \textbf{Module Name} & \textbf{ML Concept} & \textbf{Systems Concept} \\
\midrule
\multicolumn{5}{@{}l}{\textbf{Foundation Tier (01--07)}} \\
\addlinespace[2pt]
01 & Fnd & Tensor & Multidimensional arrays, broadcasting & Memory footprint (nbytes), FP32 storage \\
02 & Fnd & Activations & ReLU, Sigmoid, Softmax & Numerical stability (exp overflow), vectorization \\
03 & Fnd & Layers & Linear, parameter initialization & Parameter memory vs activation memory \\
04 & Fnd & Losses & Cross-entropy, MSE & Stability (log(0) handling), gradient flow \\
05 & Fnd & Autograd & Computational graphs, backprop & Gradient memory, optimizer state (2$\times$ for Adam) \\
06 & Fnd & Optimizers & SGD, Momentum, Adam & Memory-speed tradeoffs, update rules \\
07 & Fnd & Training Loop & Epoch/batch iteration & Forward/backward memory lifecycle \\
\addlinespace[2pt]
\midrule
\multicolumn{5}{@{}l}{\textbf{Architecture Tier (08--13)}} \\
\addlinespace[2pt]
08 & Arch & DataLoader & Batching, shuffling, Dataset abstraction & Iterator protocol, batch collation, memory layout \\
09 & Arch & Spatial (CNNs) & Conv2d, kernels, strides, pooling & $O(B \!\times\! C_{\text{out}} \!\times\! H_{\text{out}} \!\times\! W_{\text{out}} \!\times\! C_{\text{in}} \!\times\! K_h \!\times\! K_w)$ complexity \\
10 & Arch & Tokenization & BPE (Byte Pair Encoding), vocabulary, encoding & Vocabulary management, OOV handling \\
11 & Arch & Embeddings & Token/position embeddings & Lookup tables, gradient through indices \\
12 & Arch & Attention & Scaled dot-product attention & $O(N^2)$ memory scaling, sequence length impact \\
13 & Arch & Transformers & Multi-head, encoder/decoder & Quadratic memory, KV caching strategies \\
\addlinespace[2pt]
\midrule
\multicolumn{5}{@{}l}{\textbf{Optimization Tier (14--19)}} \\
\addlinespace[2pt]
14 & Opt & Profiling & Time, memory, FLOPs analysis & Bottleneck identification, measurement overhead \\
15 & Opt & Quantization & INT8, dynamic/static quant & 4$\times$ model size reduction, accuracy-speed tradeoff \\
16 & Opt & Compression & Pruning, distillation & 10$\times$ model shrinkage, minimal accuracy loss \\
17 & Opt & Memoization & KV-cache for transformers & 10--100$\times$ inference speedup via caching \\
18 & Opt & Acceleration & Vectorization, parallelization & 10--100$\times$ speedup via NumPy optimization \\
19 & Opt & Benchmarking & Statistical testing, comparisons & Rigorous performance measurement \\
\addlinespace[2pt]
\midrule
\multicolumn{5}{@{}l}{\textbf{AI Olympics (20)}} \\
\addlinespace[2pt]
20 & Capstone & AI Olympics & Complete production system & MLPerf-style competition, leaderboard \\
\bottomrule
\end{tabularx}
}
\end{table*}

\begin{lstlisting}[caption={Tensor with memory profiling from Module 01.},label=lst:tensor-memory,float=t]
class Tensor:
    def __init__(self, data):
        self.data = np.array(data, dtype=np.float32)
        self.shape = self.data.shape

    def memory_footprint(self):
        """Calculate exact memory in bytes"""
        return self.data.nbytes

    def __matmul__(self, other):
        if self.shape[-1] != other.shape[0]:
            raise ValueError(
                f"Shape mismatch: {self.shape} @ {other.shape}"
            )
        return Tensor(self.data @ other.data)
\end{lstlisting}

\textbf{Tier 1: Foundation (Modules 01--07).}
Students build the mathematical core enabling neural networks to learn. Systems thinking begins immediately---Module 01 introduces \texttt{memory\_footprint()} before matrix multiplication (\Cref{lst:tensor-memory}), making memory a first-class concept. The tier progresses through tensors, activations, layers, and losses to automatic differentiation (Module 05)---where dormant gradient features activate through progressive disclosure (\Cref{sec:progressive}). Students implement optimizers (Module 06), discovering Adam's memory trade-offs through direct measurement (\Cref{sec:systems}). The training loop (Module 07) integrates all components. By tier completion, students recreate three historical milestones: \citet{rosenblatt1958perceptron}'s Perceptron, Minsky and Papert's XOR solution, and \citet{rumelhart1986learning}'s backpropagation targeting 95\%+ on MNIST. Students calculate memory before operations: ``Matrix multiplication A @ B where both are (1000, 1000) FP32 requires 12MB peak memory: 4MB for A, 4MB for B, and 4MB for the output.'' This reasoning becomes automatic.

\textbf{Tier 2: Architectures (Modules 08--13).}
Students apply foundation knowledge to modern architectures for vision and language. Module 08 introduces the Dataset abstraction pattern (implementing \texttt{\_\_len\_\_} and \texttt{\_\_getitem\_\_} protocols) and DataLoader with batch collation, teaching how PyTorch's data pipeline transforms individual samples into batched tensors through the iterator protocol. While Module 07 implements basic training loops with manual batching (simple iteration over pre-batched arrays), Module 08 refactors this into production-quality data loading---a pedagogical pattern of ``make it work, then make it right.'' Students first understand training mechanics (forward pass, loss, backward, update), then learn proper data pipeline engineering. TinyTorch ships with two custom educational datasets that install with the repository: \textbf{TinyDigits} (5,000 grayscale handwritten digits, curated from public digit datasets) and \textbf{TinyTalks} (3,000 synthetically-generated conversational Q\&A pairs). These datasets are deliberately small and offline-first: they require no network connectivity during training, consume minimal storage ($<$50MB combined), and train in minutes on CPU-only hardware. This design ensures accessibility for students in regions with limited internet infrastructure, institutional computer labs with restricted network access, and developing countries where cloud-based datasets create barriers to ML education.

The tier then branches into two paths. \textbf{Vision} implements Conv2d with seven explicit nested loops making $O(C_{out} \times H \times W \times C_{in} \times K^2)$ complexity visible before optimization. Students discover weight sharing's dramatic efficiency through direct comparison: Conv2d(3$\rightarrow$32, kernel=3) requires 896 parameters while an equivalent dense layer needs 98,336 parameters (3072 input features $\times$ 32 outputs + 32 bias terms)---a 109$\times$ reduction demonstrating how inductive biases enable CNNs to learn spatial patterns without brute-force parameterization. This enables Milestone 4 (1998 CNN Revolution) targeting 75\%+ CIFAR-10 accuracy~\citep{krizhevsky2009cifar,lecun1998gradient}.

\textbf{Language} progresses through tokenization (character-level and BPE), embeddings (both learned and sinusoidal positional encodings), attention ($O(N^2)$ memory), and complete transformers~\citep{vaswani2017attention}. Module 10 (Tokenization) teaches a fundamental NLP systems trade-off: vocabulary size controls model parameters (embedding matrix rows $\times$ dimensions), while sequence length determines transformer computation ($O(n^2)$ attention complexity). Students discover why later GPT models increased vocabulary from 50K tokens (GPT-2/GPT-3) to 100K tokens (GPT-3.5/GPT-4)---not for better language understanding, but to reduce sequence lengths for long documents, trading parameter memory for computational efficiency. Students experience quadratic scaling through direct measurement. Milestone 5 (2017 Transformer Era) validates through text generation on TinyTalks.

\textbf{Tier 3: Optimization (Modules 14--19).}
Students transition from ``models that train'' to ``systems that deploy.'' Profiling (14) teaches measuring time, memory, and FLOPs (floating-point operations), introducing Amdahl's Law: optimizing 70\% of runtime by 2$\times$ yields only 1.53$\times$ overall speedup because the remaining 30\% becomes the new bottleneck---teaching that optimization is iterative and measurement-driven. Quantization (15) achieves 4$\times$ compression (FP32$\rightarrow$INT8) with 1--2\% accuracy cost. Compression (16) applies pruning and distillation for 10$\times$ shrinkage. Memoization (17) implements KV caching (storing attention keys and values to avoid recomputation), a technique used in production LLM serving: students discover that naive autoregressive generation recomputes attention keys and values at every step---generating 100 tokens requires 5,050 redundant computations (1+2+...+100). By caching these values and reusing them, students transform $O(n^2)$ generation into $O(n)$, achieving 10--100$\times$ speedup and understanding why this optimization is essential in systems like ChatGPT and Claude for economically viable inference. Acceleration (18) vectorizes convolution for 10--100$\times$ gains. Benchmarking (19) teaches rigorous performance measurement.

\textbf{AI Olympics (Module 20).}
The capstone integrates all 19 modules into production-optimized systems. Inspired by MLPerf~\citep{reddi2020mlperf}, students optimize prior milestones (CIFAR-10 CNN, transformer generation, or custom architecture) for 10$\times$ faster inference, 4$\times$ smaller size, and sub-100ms latency while maintaining accuracy. Students compete on the TinyTorch Leaderboard across four tracks: Vision Excellence, Language Quality, Speed, and Compression. This teaches data-driven optimization mirroring real ML systems engineering.

\subsection{Module Structure}
\label{subsec:module-pedagogy}

Each module follows a consistent \textbf{Build $\rightarrow$ Use $\rightarrow$ Reflect} pedagogical cycle that integrates implementation, application, and systems reasoning. This structure addresses multiple learning objectives: students construct working components (Build), validate integration with prior modules (Use), and develop systems thinking~\citep{meadows2008thinking} through analysis (Reflect).

\noindent\textbf{Build: Implementation with Explicit Dependencies.} Students implement components in Jupyter notebooks (\texttt{*\_dev.py}) with scaffolded guidance. Each module begins with \emph{connection maps} visualizing prerequisites, current focus, and unlocked capabilities. For example, Module 05 (Autograd) shows prerequisites (Modules 01--04: Tensor, Activations, Layers, Losses), current implementation goal (computational graph + backward pass), and unlocked future modules (Modules 06--07: Optimizers, Training). These visual dependency chains address cognitive apprenticeship~\citep{collins1989cognitive} by making expert knowledge structures explicit. Students see ``why this module matters'' before implementation begins, reducing disengagement from seemingly isolated exercises.

\noindent\textbf{Use: Integration Testing Beyond Unit Tests.} Assessment validates both isolated correctness and cross-module integration. Unit tests verify individual component behavior (``Does \texttt{Tensor.reshape()} produce correct output?''), while integration tests validate that components compose into working systems (``Can Module 05 Autograd compute gradients through Module 03 Linear layers?''). Integration tests are critical for TinyTorch's pedagogical model because students may pass Module 03 unit tests but fail when autograd activates in Module 05---their layer implementation doesn't properly propagate \texttt{requires\_grad} through operations or construct computational graphs correctly.

A common failure pattern illustrates this: students implement \texttt{Linear.forward()} that passes unit tests (correct output values), but gradients don't flow during backpropagation because they used NumPy operations directly instead of \texttt{Tensor} operations. When \texttt{x.requires\_grad=True} flows into their layer, the computational graph breaks. Students encounter errors like ``\texttt{AttributeError: 'numpy.ndarray' object has no attribute 'backward'}'' and must debug interface contracts: operations must preserve \texttt{Tensor} types to maintain gradient connectivity. This teaches \emph{interface design}---components must satisfy contracts enabling composition, not just produce correct outputs in isolation.

Module 09 (Convolutions) integration exemplifies this: convolution must work with Module 05's autograd (gradient flow through kernels), Module 06's optimizers (parameter updates), and Module 07's training loop (forward-backward cycles) simultaneously. Students discover that ``passing unit tests'' $\neq$ ``works in the system'' when their Conv2d produces correct outputs but crashes during \texttt{loss.backward()} because they forgot to track intermediate activations for gradient computation. This debugging mirrors professional ML engineering: isolated correctness is insufficient; system integration reveals interface failures.

\noindent\textbf{Reflect: Systems Analysis Questions.}
Each module concludes with systems reasoning prompts measuring conceptual understanding beyond syntactic correctness. Memory analysis questions ask students to calculate footprints (``A (256, 256) Conv2d layer with 64 input and 128 output channels requires how much memory?''). Complexity analysis prompts probe asymptotic understanding (``Why is attention $O(N^2)$? Demonstrate by doubling sequence length and measuring memory growth.''). Design trade-off questions assess engineering judgment (``Adam requires 2$\times$ optimizer state memory (momentum and variance) but converges faster than SGD. When is the 4$\times$ total training memory trade-off worth it?''). These open-ended questions assess transfer~\citep{perkins1992transfer}---can students apply learned concepts to novel scenarios not seen in exercises?

\subsection{Milestone Arcs}
\label{subsec:milestones}

\noindent\textbf{Why Milestones Matter.} Milestones serve dual pedagogical and validation purposes that differentiate TinyTorch from traditional programming assignments. First, \textbf{pedagogical motivation through historical framing}: Rather than ``implement this function,'' students ``recreate the breakthrough that proved Minsky wrong about neural networks,'' connecting implementation work to historically significant results. This instantiates Bruner's spiral curriculum~\citep{bruner1960process}---students train neural networks 6 times with increasing sophistication, each iteration deepening understanding through historical progression from 1958 (Perceptron) through 2024 (production-optimized systems).

Second, \textbf{implementation validation beyond unit tests}: Milestones differ from modules pedagogically---modules teach components, milestones validate that components \emph{compose} into functional systems. Students who pass all Module 01--07 unit tests might still fail Milestone 3 (MLP Revival) if their training loop doesn't properly orchestrate forward passes, loss computation, and backpropagation. This mirrors professional ML engineering: individual functions may work, but the system fails due to integration bugs. If student-implemented CNNs successfully classify natural images, convolution, pooling, and backpropagation all work correctly together; if transformers generate coherent text, attention mechanisms integrate properly. Milestone success is measured by achieving performance in the ballpark of historical benchmarks (CNNs with reasonable CIFAR-10 accuracy, transformers generating coherent text), not matching exact published accuracies---the goal is demonstrating implementations work correctly on real tasks, validating framework correctness.

\noindent\textbf{The Six Historical Milestones.} The curriculum includes six milestones spanning 1958--2024, each requiring progressively more components from the growing framework:

\begin{enumerate}
\item \textbf{1958 Perceptron} (after Module 04): Train Rosenblatt's original single-layer perceptron on linearly separable classification. Students import \texttt{from tinytorch.core import Tensor; from tinytorch.nn import Linear, Sigmoid}---their framework now supports single-layer networks.

\item \textbf{1969 XOR Solution} (after Module 07): Solve Minsky's ``impossible'' XOR problem with multi-layer perceptrons, proving critics wrong. Validates that autograd enables non-linear learning.

\item \textbf{1986 MLP Revival} (after Module 07): Handwritten digit recognition demonstrating backpropagation's power. Requires Modules 01--07 working together (tensor operations, activations, layers, losses, autograd, optimizers, training). Students import \texttt{from tinytorch.optim import SGD; from tinytorch.nn import CrossEntropyLoss}---their framework trains multi-layer networks end-to-end on MNIST digits.

\item \textbf{1998 CNN Revolution} (after Module 09): Image classification demonstrating convolutional architectures' advantage~\citep{krizhevsky2009cifar,lecun1998gradient}. Students import \texttt{from tinytorch.nn import Conv2d, MaxPool2d}, training both MLP and CNN on CIFAR-10 to measure architectural improvements themselves through direct comparison.

\item \textbf{2017 Transformer Era} (after Module 13): Language generation with attention-based architecture. Validates that attention mechanisms, positional embeddings, and autoregressive sampling function correctly through coherent text generation.

\item \textbf{2018 MLPerf Benchmark Era} (after Module 20): Production-optimized system integrating all 20 modules, inspired by MLPerf~\citep{reddi2020mlperf}. Students import from every module: \texttt{from tinytorch.nn import Transformer; from tinytorch.optim import Adam; from tinytorch.profiling import profile\_memory}---demonstrating quantization, compression, and acceleration for 10$\times$ faster inference and 4$\times$ smaller models.
\end{enumerate}

Each milestone: (1) recreates actual breakthroughs using exclusively student code, (2) uses \emph{only} TinyTorch implementations (no PyTorch/TensorFlow), (3) validates success through task-appropriate performance, and (4) demonstrates architectural comparisons showing why new approaches improved over predecessors.

\noindent\textbf{Validation Approach:}
While milestones provide pedagogical motivation through historical framing, they simultaneously serve a technical validation purpose: demonstrating implementation correctness through real-world task performance. Success criteria for each milestone:

\begin{itemize}[leftmargin=*, itemsep=1pt, parsep=0pt]
    \item \textbf{M03 (1958 Perceptron)}: Solves linearly separable problems (e.g., 4-point OR/AND tasks), demonstrating basic gradient descent convergence.
    \item \textbf{M06 (1969 XOR Solution)}: Solves XOR classification, proving multi-layer networks handle non-linear problems that single layers cannot.
    \item \textbf{M07 (1986 MLP Revival)}: Achieves strong MNIST digit classification accuracy, validating backpropagation through all layers of deep networks.
    \item \textbf{M10 (1998 LeNet CNN)}: Demonstrates meaningful CIFAR-10 learning (substantially better than random 10\% baseline), showing convolutional feature extraction works correctly.
    \item \textbf{M13 (2017 Transformer)}: Generates coherent multi-token text continuations on TinyTalks dataset, demonstrating functional attention mechanisms and autoregressive generation.
    \item \textbf{M20 (2024 AI Olympics)}: Student-selected challenge across Vision/Language/Speed/Compression tracks with self-defined success metrics, demonstrating production systems integration.
\end{itemize}

Performance targets differ from published state-of-the-art due to pure-Python constraints (no GPU acceleration, simplified architectures). Correctness matters more than speed: if a student's CNN learns meaningful CIFAR-10 features, their convolution, pooling, and backpropagation implementations compose correctly into a functional vision system. This approach mirrors professional debugging where implementations prove correct by solving real tasks, not by passing synthetic unit tests alone.

\section{Progressive Disclosure}
\label{sec:progressive}

The curriculum architecture described above raises a pedagogical challenge: how do students learn complex framework features without cognitive overload? This section presents progressive disclosure, a pattern that manages complexity by revealing \texttt{Tensor} capabilities gradually while maintaining a unified mental model.

Traditional ML education faces a pedagogical dilemma: students need to understand complete systems, but introducing all concepts simultaneously overwhelms cognitive capacity. Educational frameworks employ various strategies: some introduce separate classes (fragmenting the conceptual model), others defer advanced features until later courses (leaving gaps). TinyTorch introduces a third approach: \textbf{progressive disclosure via monkey-patching}, where a single \texttt{Tensor} class reveals capabilities gradually while maintaining conceptual unity.

\subsection{Pattern Implementation}

TinyTorch's \texttt{Tensor} class includes gradient-related attributes from Module 01, but they remain dormant until Module 05 activates them through monkey-patching (\Cref{lst:dormant-tensor,lst:activation}). \Cref{fig:progressive-timeline} visualizes this activation timeline across the curriculum.

\begin{lstlisting}[caption={Module 01: Dormant gradient features.},label=lst:dormant-tensor,float=t]
# Module 01: Foundation Tensor
class Tensor:
    def __init__(self, data, requires_grad=False):
        self.data = np.array(data, dtype=np.float32)
        self.shape = self.data.shape
        # Gradient features - dormant
        self.requires_grad = requires_grad
        self.grad = None
        self._backward = None

    def backward(self, gradient=None):
        """No-op until Module 05"""
        pass

    def __mul__(self, other):
        return Tensor(self.data * other.data)
\end{lstlisting}

\begin{lstlisting}[caption={Module 05: Autograd activation.},label=lst:activation,float=t]
def enable_autograd():
    """Monkey-patch Tensor with gradients"""
    def backward(self, gradient=None):
        if gradient is None:
            gradient = np.ones_like(self.data)
        if self.grad is None:
            self.grad = gradient
        else:
            self.grad += gradient
        if self._backward is not None:
            self._backward(gradient)

    # Monkey-patch: replace methods
    Tensor.backward = backward
    print("Autograd activated!")

# Module 05 usage
enable_autograd()
x = Tensor([3.0], requires_grad=True)
y = x * x  # y = 9.0
y.backward()
print(x.grad)  # [6.0] - dy/dx = 2x
\end{lstlisting}

\begin{figure*}[t]
\centering
\begin{tikzpicture}[
    scale=0.9,
    every node/.style={font=\scriptsize},
    dormant/.style={rectangle, draw=gray!70, fill=gray!20, text=gray!70, minimum width=2.0cm, minimum height=0.5cm, anchor=east},
    active/.style={rectangle, draw=orange!80, fill=orange!30, text=black, font=\scriptsize\bfseries, minimum width=2.0cm, minimum height=0.5cm, anchor=west}
]

% Timeline axis
\draw[thick, ->] (0,0) -- (14,0) node[right, font=\scriptsize] {Modules};

% Module boundaries as vertical lines (darker)
\foreach \x/\label in {1/01, 3.5/03, 6/05, 8.5/09, 11/13, 13.5/20} {
    \draw[gray!60, dotted] (\x, 0) -- (\x, 5.5);
    \node[below, font=\tiny] at (\x, -0.3) {\texttt{M\label}};
}

% Module 05 activation boundary - thicker and highlighted
\draw[red!60, very thick] (6, 0) -- (6, 5.5);
\node[above, font=\scriptsize\bfseries, red!70] at (6, 5.7) {ACTIVATE};

% Feature layers - dormant boxes end AT M05, active boxes start AT M05
% Layer 1: Core features (always active - span both sides)
\node[active, minimum width=4.0cm, anchor=center] at (3.5, 1.0) {\texttt{.data}, \texttt{.shape}};
\node[left, font=\tiny] at (0.2, 1.0) {Core};

% Layer 2: Gradient features - boxes meet exactly at x=6 (Module 05 line)
% .requires_grad
\node[dormant] at (6, 2.2) {\texttt{.requires\_grad}};
\node[active] at (6, 2.2) {\texttt{.requires\_grad}};
\node[left, font=\tiny] at (0.2, 2.2) {Gradient};

% .grad
\node[dormant] at (6, 3.1) {\texttt{.grad}};
\node[active] at (6, 3.1) {\texttt{.grad}};

% .backward()
\node[dormant] at (6, 4.0) {\texttt{.backward()}};
\node[active] at (6, 4.0) {\texttt{.backward()}};

% Annotations - positioned at top
\node[align=center, font=\tiny, text width=4.5cm] at (3, 6.5) {
    \textbf{Modules 01--04:}\\
    Features visible but dormant\\
    \texttt{.backward()} is no-op
};

\node[align=center, font=\tiny, text width=4.5cm] at (10, 6.5) {
    \textbf{Modules 05--20:}\\
    Autograd fully active\\
    Gradients flow automatically
};

% Legend
\node[dormant, minimum width=1.0cm, minimum height=0.4cm, anchor=center] at (2.5, -1.2) {Dormant};
\node[active, minimum width=1.0cm, minimum height=0.4cm, anchor=center] at (5.5, -1.2) {Active};

\end{tikzpicture}
\caption{Progressive disclosure manages cognitive load through runtime feature activation. From Module 01, students see the complete Tensor API including gradient methods (\texttt{.backward()}, \texttt{.grad}, \texttt{.requires\_grad}), but these features remain dormant (gray, dashed)—they exist as placeholders that return gracefully. In Module 05, runtime method enhancement activates full autograd functionality (orange, solid) without breaking earlier code. This creates three learning benefits: (1) students learn the complete API early, avoiding interface surprise later; (2) Module 01 code continues working unchanged when autograd activates, demonstrating forward compatibility; (3) visible but inactive features create curiosity-driven questions ("Why does \texttt{.backward()} exist if we can't use it yet?") that motivate curriculum progression.}
\label{fig:progressive-timeline}
\end{figure*}

This design serves three pedagogical purposes: (1) \textbf{Early interface familiarity}---students learn complete \texttt{Tensor} API from start; (2) \textbf{Forward compatibility}---Module 01 code doesn't break when autograd activates; (3) \textbf{Curiosity-driven learning}---dormant features create questions motivating curriculum progression.

\subsection{Pedagogical Justification}

Progressive disclosure is grounded in cognitive load theory's principle of managing element interactivity through progressive complexity revelation \citep{sweller1988cognitive}. The pattern provides two established benefits: (1) \textbf{forward compatibility}---Module 01 code continues working when autograd activates, and (2) \textbf{unified mental model}---students work with one Tensor class throughout. The cognitive load hypothesis (early API familiarity reduces future load when features activate) competes with potential split-attention effects from visible but dormant features. Empirical measurement planned for Fall 2025 (\Cref{sec:future-work}) will quantify the net cognitive load impact across both mechanisms through dual-task methodology and self-report scales.

The pattern also instantiates threshold concept pedagogy \citep{meyer2003threshold}: autograd is transformative and troublesome. By making it visible early (dormant) but activatable later, students may cross this threshold when cognitively ready, though empirical evidence of this progression is needed.

\noindent\textbf{Implementation Choice: Monkey-Patching vs. Inheritance.} Alternative designs include inheritance (\texttt{TensorV1}/\texttt{TensorV2}) or composition. We chose monkey-patching for three pedagogical reasons: (1) \textbf{Unified mental model}---students work with one \texttt{Tensor} class throughout, not switching between versions; (2) \textbf{Historical accuracy}---models PyTorch 0.4's Variable-Tensor merger via runtime consolidation rather than design-time hierarchy; (3) \textbf{Forward compatibility demonstration}---Module 01 code continues working unchanged when features activate, teaching interface stability principles. The software engineering trade-off (global state modification) is explicitly discussed in Module 05's reflection questions, making students aware of both the pedagogical benefit and production code concerns.

\subsection{Production Framework Alignment}

Progressive disclosure demonstrates how real ML frameworks evolve. Early PyTorch (pre-0.4) separated data (\texttt{torch.Tensor}) from gradients (\texttt{torch.autograd.Variable}). PyTorch 0.4 (April 2018) \citep{pytorch04release} consolidated functionality into \texttt{Tensor}, matching TinyTorch's pattern. Students are exposed to the modern unified interface from Module 01, positioned to understand why PyTorch made this design evolution.

Similarly, TensorFlow 2.0 integrated eager execution by default \citep{tensorflow20}, making gradients work immediately---similar to TinyTorch's activation pattern. Students who understand progressive disclosure can grasp why TensorFlow eliminated \texttt{tf.Session()}: immediate execution with automatic graph construction aligns with unified API design principles.

\section{Systems-First Integration}
\label{sec:systems}

Having established TinyTorch's systems-first architecture (\Cref{sec:curriculum}), this section details how systems awareness manifests through a three-phase progression: (1) \textbf{understanding memory} through explicit profiling, (2) \textbf{analyzing complexity} through transparent implementations, and (3) \textbf{optimizing systems} through measurement-driven iteration. This progression applies situated cognition \citep{lave1991situated} by mirroring professional ML engineering workflow: measure resource requirements, understand computational costs, then optimize bottlenecks.

\subsection{Phase 1: Understanding Memory Through Profiling}

Where traditional frameworks abstract away memory concerns, TinyTorch makes memory footprint calculation explicit (\Cref{lst:tensor-memory}). Students' first assignment calculates memory for MNIST (60,000 $\times$ 784 $\times$ 4 bytes $\approx$ 180 MB) and ImageNet (1.2M $\times$ 224$\times$224$\times$3 $\times$ 4 bytes $\approx$ 670 GB).

This memory-first pedagogy transforms student questions:
\begin{itemize}
\item Module 01: ``Why does batch size affect memory?'' (activations scale with batch size)
\item Module 06: ``Why does Adam use 2$\times$ optimizer state memory?'' (momentum and variance buffers)
\item Module 13: ``How much VRAM for GPT-3 training?'' (175B parameters $\times$ 4 bytes $\times$ 4 $\approx$ 2.6 TB: weights + gradients + momentum + variance)
\end{itemize}

Students learn to distinguish parameter memory (model weights) from optimizer state memory (Adam's 2$\times$ for momentum and variance) from activation memory (often 10--100$\times$ larger than parameters). This decomposition enables accurate capacity planning for training runs.

\subsection{Phase 2: Analyzing Complexity Through Transparent Implementations}

Module 09 introduces convolution with seven explicit nested loops (\Cref{lst:conv-explicit}), making $O(B \times C_{\text{out}} \times H_{\text{out}} \times W_{\text{out}} \times C_{\text{in}} \times K_h \times K_w)$ complexity visible and countable.

\begin{lstlisting}[caption={Explicit convolution showing 7-nested complexity.},label=lst:conv-explicit,float=t]
def conv2d_explicit(input, weight):
    """7 nested loops - see the complexity!
    input: (B, C_in, H, W)
    weight: (C_out, C_in, K_h, K_w)"""
    B, C_in, H, W = input.shape
    C_out, _, K_h, K_w = weight.shape
    H_out, W_out = H - K_h + 1, W - K_w + 1
    output = np.zeros((B, C_out, H_out, W_out))

    # Count: 1,2,3,4,5,6,7 loops
    for b in range(B):
        for c_out in range(C_out):
            for h in range(H_out):
                for w in range(W_out):
                    for c_in in range(C_in):
                        for kh in range(K_h):
                            for kw in range(K_w):
                                output[b,c_out,h,w] += \
                                    input[b,c_in,h+kh,w+kw] * \
                                    weight[c_out,c_in,kh,kw]
    return output
\end{lstlisting}

This explicit implementation illustrates TinyTorch's pedagogical philosophy: \textbf{minimal NumPy reliance until concepts are established}. While the curriculum builds on NumPy as foundational infrastructure (array storage, broadcasting, element-wise operations), optimized operations like matrix multiplication appear only after students understand computational complexity through explicit loops. Module 03 introduces linear layers with manual weight-input multiplication loops before Module 08 introduces NumPy's \texttt{@} operator; Module 09 teaches convolution through seven nested loops before Module 18 vectorizes with NumPy operations. This progression ensures students understand \emph{what} operations do (and their complexity) before learning \emph{how} to optimize them. Pure Python transparency enables this pedagogical sequencing---students can inspect every operation without navigating compiled C extensions or CUDA kernels.

Students calculate: CIFAR-10 batch (128, 3, 32, 32) through 32-filter 5$\times$5 convolution: $128 \times 32 \times 28 \times 28 \times 3 \times 5 \times 5 = 241$M multiply-accumulate operations. This concrete measurement motivates Module 18's vectorization (10--100$\times$ speedup) and explains why CNNs require hardware acceleration.

\noindent\textbf{Experiencing Performance Reality.} \Cref{tab:performance} shows TinyTorch's deliberate slowness compared to PyTorch---100--10,000$\times$ slower through pure Python implementations. This slowness is pedagogically valuable (productive failure \citep{kapur2008productive}): students experience performance problems before learning optimizations, making vectorization meaningful rather than abstract. When students measure their Conv2d taking 97 seconds per CIFAR batch versus PyTorch's 10 milliseconds, they understand \emph{why} production frameworks obsess over implementation details.

\begin{table}[t]
\centering
\caption{Runtime comparison: TinyTorch vs PyTorch (CPU). Pure Python implementations with explicit loops are 100--10,000$\times$ slower, making performance costs visible and optimization meaningful. Benchmarks measured on Intel i7 CPU; PyTorch uses optimized BLAS libraries while TinyTorch uses pure Python for pedagogical transparency.}
\label{tab:performance}
\resizebox{\columnwidth}{!}{%
\small
\begin{tabular}{@{}lrrr@{}}
\toprule
Operation & TinyTorch & PyTorch & Ratio \\
\midrule
\texttt{matmul} (1K$\times$1K) & 1.0 s & 0.9 ms & 1,090$\times$ \\
\texttt{conv2d} (CIFAR batch) & 97 s & 10 ms & 10,017$\times$ \\
\texttt{softmax} (10K elem) & 6 ms & 0.05 ms & 134$\times$ \\
\bottomrule
\end{tabular}
}
\end{table}

\subsection{Phase 3: Optimizing Systems Through Measurement-Driven Iteration}

The Optimization Tier (Modules 14--19) transforms systems-first pedagogy from \emph{analysis} (``How much memory does this use?'') into \emph{optimization} (``How do I reduce memory by 4$\times$?''). Where foundation modules taught calculating footprints and counting FLOPs, optimization modules teach systematic improvement through profiling-driven iteration.

This tier introduces three fundamental optimization concepts that complete the systems-first integration:

\textbf{1. Measurement-Driven Optimization.} Students learn the ``measure first, optimize second'' methodology through systematic profiling. Rather than guessing bottlenecks, they measure time, memory, and FLOPs to identify where optimization efforts yield maximum impact. This mirrors production ML engineering: profiling reveals that convolution consumes 80\%+ training time, directing optimization focus appropriately.

\textbf{2. Trade-off Reasoning.} Optimization involves balancing competing objectives---accuracy, speed, memory, model size. Students measure these trade-offs empirically: quantization achieves 4$\times$ compression with 1--2\% accuracy cost; pruning removes 90\% of parameters with minimal accuracy impact; KV-caching achieves 10--100$\times$ speedup but increases memory. This reinforces that systems engineering requires navigating trade-offs, not absolutes.

\textbf{3. Implementation Matters.} Identical algorithms exhibit 100$\times$ performance differences based on implementation choices. Students experience this through vectorization: seven explicit loops (pedagogically transparent) versus NumPy matrix operations (production efficient). This teaches why production frameworks obsess over seemingly minor implementation details---performance differences compound across operations.

The Optimization Tier completes the systems-first integration arc: students who calculate memory in Module 01, count FLOPs in Module 09, and optimize deployment in Modules 14--19 are designed to develop reflexive systems thinking. When encountering new ML techniques, the curriculum aims to train them to automatically ask: ``How much memory? What's the computational complexity? What are the trade-offs?'' Whether this design successfully makes these questions automatic rather than afterthoughts requires empirical validation.

\section{Course Deployment}
\label{sec:deployment}

Translating curriculum design into effective classroom practice requires addressing integration models, infrastructure accessibility, and student support structures. This section presents deployment patterns validated through pilot implementations and institutional feedback.

\subsection{Integration Models}
\label{subsec:integration}

TinyTorch supports three deployment models for different institutional contexts, ranging from standalone courses to supplementary tracks in existing curricula.

\textbf{Model 1: Self-Paced Learning (Primary Use Case)} serves individual learners, professionals, and researchers wanting framework internals understanding. Students work through modules at their own pace, selecting depth based on goals: complete all 20 modules for comprehensive systems knowledge, focus on Foundation (01--07) for autograd understanding, or target specific topics (Module 12 for attention mechanisms, Module 15 for quantization). The curriculum provides immediate feedback through local NBGrader validation, historical milestones for correctness proof, and progressive complexity enabling both intensive study (weeks) and distributed learning (months). This model requires zero infrastructure beyond Python and 4GB RAM, making it accessible worldwide.

\textbf{Model 2: Institutional Integration} enables universities to incorporate TinyTorch into existing ML courses. Options include: standalone 4-credit course (all 20 modules, complete systems coverage), half-semester module (Modules 01--09, foundation + CNN architectures), or optional honors track (selected modules for extra credit). Institutional deployment provides NBGrader autograding infrastructure, connection maps showing prerequisite dependencies, and milestone validation scripts. Lecture materials remain future work; current release supports lab-based or flipped-classroom formats where students implement concepts from textbook readings.

\textbf{Model 3: Team Onboarding} addresses industry use cases where ML teams want members to understand PyTorch internals. Companies can use TinyTorch for: new hire bootcamps (2--3 week intensive), internal training programs (distributed over quarters), or debugging workshops (focused modules like 05 Autograd, 12 Attention). The framework's PyTorch-inspired package structure and systems-first approach prepare engineers for understanding production frameworks and optimization workflows.

\textbf{Available Resources}: Current release provides module notebooks, NBGrader test suites, milestone validation scripts, and connection maps. Lecture slides for institutional courses remain future work (\Cref{sec:future-work}), though self-paced learning requires no additional materials beyond the modules themselves.

\subsection{Tier-Based Curriculum Configurations}
\label{subsec:tier-configs}

TinyTorch's three-tier architecture (Foundation, Architecture, Optimization) enables flexible deployment matching diverse course objectives and time constraints. Instructors can deploy complete tiers or selectively focus on specific learning goals:

\textbf{Configuration 1: Foundation Only (Modules 01--07).} Students build core framework internals from scratch: tensors, activations, layers, losses, autograd, optimizers, and training loops. This 30--40 hour configuration suits introductory ML systems courses, undergraduate capstone projects, or bootcamp modules focusing on framework fundamentals. Students complete Milestones 1--3 (Perceptron, XOR, MLP Revival) demonstrating functional autograd and training infrastructure. Upon completion, students understand \texttt{loss.backward()} mechanics, can debug gradient flow, and profile memory usage. Ideal for courses prioritizing systems fundamentals over architectural breadth.

\textbf{Configuration 2: Foundation + Architecture (Modules 01--13).} Extends Configuration 1 with modern deep learning architectures: datasets/dataloaders, convolution, pooling, embeddings, attention, and transformers. This 50--65 hour configuration enables comprehensive ML systems courses or graduate-level deep learning seminars. Students complete Milestones 4--5 (CNN Revolution, Transformer Era) demonstrating working vision and language models. Upon completion, students implement production architectures from scratch, understand memory scaling ($O(N^2)$ attention), and recognize architectural tradeoffs (109$\times$ parameter efficiency from Conv2d weight sharing). Suitable for semester-long courses covering both internals and modern ML.

\textbf{Configuration 3: Optimization Focus (Modules 14--19 only).} Students import pre-built \texttt{tinytorch.nn} and \texttt{tinytorch.optim} packages from Configurations 1--2, implementing only production optimization techniques: profiling, quantization, compression, memoization, acceleration, and benchmarking. This 15--25 hour configuration targets production ML courses, TinyML workshops, or edge deployment seminars where students already understand framework basics but need systems optimization depth. Students complete Milestone 6 (MLPerf-inspired benchmark) demonstrating 10$\times$ speedup and 4$\times$ compression. Upon completion, students optimize existing models for deployment constraints. Addresses key pedagogical limitation: students interested in quantization shouldn't need to re-implement autograd first.

These configurations support "build what you're learning, import what you need" pedagogy. Configuration 3 students focus on optimization while treating Foundation/Architecture as trusted dependencies, mirroring professional practice where engineers specialize rather than rebuilding entire stacks. Discussion (\Cref{subsec:flexibility}) examines pedagogical rationale for these configurations.

\subsection{Infrastructure and Accessibility}
\label{subsec:infrastructure}

ML systems education faces an accessibility challenge: production ML courses typically require expensive GPU hardware (\$500+ gaming laptops or cloud credits), 16GB+ RAM, CUDA-compatible environments, and Linux/WSL systems. These requirements create barriers for community college students, international learners in regions with limited cloud access, K-12 educators exploring ML internals, and institutions with modest computing budgets. Widening access to ML systems education requires reducing infrastructure barriers while maintaining pedagogical effectiveness~\citep{banbury2021widening}.

TinyTorch addresses this through CPU-only, pure Python implementation. The curriculum requires only dual-core 2GHz+ CPUs (no GPU needed), 4GB RAM (sufficient for CIFAR-10 training with batch size 32), 2GB storage (modules plus datasets), and any operating system supporting Python 3.8+ (Windows, macOS, or Linux). This enables deployment on Chromebooks via Google Colab, five-year-old budget laptops, and institutional computer labs. Text-based ASCII connection maps enhance accessibility for visually impaired students using screen readers, while offline-first datasets (\Cref{sec:curriculum}) eliminate network dependencies during training.

\subsubsection{Jupyter Environment Options}

TinyTorch supports three deployment environments: \textbf{JupyterHub} (institutional server, 8-core/32GB supports 50 students), \textbf{Google Colab} (zero installation, best for MOOCs), and \textbf{local installation} (\texttt{pip install tinytorch}, best for self-paced learning).

\subsubsection{NBGrader Autograding Workflow}

\textbf{Student Submission Process}: (1) Student works in Jupyter notebook (local or cloud), (2) runs \texttt{nbgrader validate module\_01.ipynb} for local correctness checking, (3) submits via LMS (Canvas/Blackboard) or Git (GitHub Classroom), (4) instructor runs \texttt{nbgrader autograde} on submitted notebooks, (5) grades and feedback posted to LMS.

\textbf{NBGrader Module Structure Example}: Each module uses NBGrader markdown cells to define assessment points and structure. For example, Module 01's memory profiling exercise:

\begin{lstlisting}[caption={NBGrader cell metadata and solution structure.},label=lst:nbgrader-example,float=t]
# Cell metadata defines grading parameters:
# nbgrader = {
#   "grade": true,
#   "grade_id": "tensor_memory",
#   "points": 2,
#   "locked": false,
#   "solution": true
# }

def memory_footprint(self):
    """Calculate tensor memory in bytes"""
    ### BEGIN SOLUTION
    return self.data.nbytes
    ### END SOLUTION
\end{lstlisting}

This scaffolding~\citep{vygotsky1978mind} makes educational objectives explicit while enabling automated grading. The \texttt{name} field identifies the exercise, \texttt{points} assigns weight, and the description provides context before students see code cells.

\textbf{Handling Autograder Edge Cases}: Pure Python convolution (Module 09) may exceed default 30-second timeout on slower hardware; we set 5-minute timeouts and provide vectorized reference solutions for comparison. Critical modules (05 Autograd, 09 CNNs) include manual review of 20\% of submissions to catch conceptual errors missed by unit tests. All modules include \texttt{assert numpy.\_\_version\_\_ >= '1.20'} dependency validation.

\textbf{Projected Scalability}: Small courses (30 students) can grade in approximately 10 minutes per module on instructor laptop, medium courses (100 students) require approximately 30 minutes on dedicated grading server, while MOOCs (1000+ students) can achieve 2-hour turnaround via parallelized cloud autograding. These projections assume average grading time of 45 seconds per module submission on 4-core systems. Full-scale deployment validation planned for Fall 2025 (\Cref{sec:discussion}).

\subsection{Automated Assessment Infrastructure}

TinyTorch integrates NBGrader~\citep{blank2019nbgrader} for scalable automated assessment~\citep{thompson2008bloom}. Each module contains:

\begin{itemize}
\item \textbf{Solution cells}: Scaffolded implementations with grade metadata
\item \textbf{Test cells}: Locked autograded tests preventing modification
\item \textbf{Immediate feedback}: Students validate correctness locally before submission
\item \textbf{Point allocation}: Reflects pedagogical priorities (Module 05 Autograd: 100 points; Module 01 Tensor: 60 points)
\end{itemize}

This infrastructure enables deployment in MOOCs and large classrooms where manual grading proves infeasible. Instructors configure NBGrader to collect submissions, execute tests in sandboxed environments, and generate grade reports automatically.

\textbf{Important caveat}: NBGrader scaffolding exists but remains unvalidated at scale (\Cref{sec:discussion}). Automated assessment validity requires empirical investigation: Do tests measure conceptual understanding or syntax correctness? We scope this as ``curriculum with autograding infrastructure'' rather than ``validated assessment system.''

\subsection{Package Organization}
\label{subsec:package}

Unlike tutorial-style notebooks creating isolated code, TinyTorch modules export to a package structure inspired by PyTorch's API organization. Critically, \emph{each completed module becomes immediately usable}---students build a working framework progressively, not isolated exercises. Module 01 exports to \texttt{tinytorch.core.tensor}, Module 09 to \texttt{tinytorch.nn.conv}, enabling import patterns familiar to PyTorch users that grow with each module completed.

As students complete modules, their framework accumulates capabilities. After Module 03, students can import and use layers; after Module 05, autograd enables training; after Module 09, CNNs become available. This progressive accumulation creates tangible evidence of progress---students see their framework grow from basic tensors to a complete ML system. \Cref{lst:progressive-imports} illustrates how imports expand as modules are completed:

\begin{lstlisting}[caption={Progressive imports: Framework capabilities grow module-by-module.},label=lst:progressive-imports,float=t]
# After Module 01: Basic tensors
from tinytorch.core import Tensor

# After Module 09: CNNs available
from tinytorch.nn import Conv2d, MaxPool2d, Linear
# Autograd active - gradients flow!

# After Module 13: Complete framework
from tinytorch.nn import Transformer, Embedding, Attention
\end{lstlisting}

This design bridges educational and professional contexts. Students aren't ``solving exercises''---they're building a framework they could ship. The package structure reinforces systems thinking: understanding how \texttt{torch.nn.Conv2d} relates to \texttt{torch.Tensor} requires grasping module organization, not just individual algorithms. More importantly, students experience the satisfaction of watching their framework grow from a single \texttt{Tensor} class to a complete system capable of training transformers---each module completion adds new capabilities they can immediately use.

Export happens via nbdev \citep{howard2020fastai} directives (\texttt{\#| default\_exp core.tensor}) embedded in module notebooks. Students work in Jupyter's interactive environment while TinyTorch maintains source-of-truth in version-controlled Python files, enabling professional development workflows (Git, code review, CI/CD) within pedagogical context.

\subsection{Connection Maps and Knowledge Integration}

Every module begins with a \textbf{Connection Map} showing prerequisite modules, current module focus, and enabled future capabilities. This addresses Collins et al.'s cognitive apprenticeship \citep{collins1989cognitive} by making expert knowledge structures visible:

\begin{lstlisting}[caption={Module 05 connection map.},label=lst:connection-map,float=t]
## Prerequisites & Progress
You've Built: Tensor, activations, layers, losses
You'll Build: Autograd system
You'll Enable: Training loops, optimizers

Connection Map:
Modules 01-04 → Autograd → Training (06-07)
(forward pass)   (backward)  (learning loops)
\end{lstlisting}

Connection maps transform isolated modules into coherent curriculum. Students see \emph{why} each module matters before implementation begins, reducing ``I don't see the point'' disengagement. Early feedback suggests these maps help students maintain big-picture understanding while working through implementation details.

\subsection{Open Source Infrastructure}
\label{subsec:opensource}

TinyTorch is released as open source to enable community adoption and evolution.\footnote{Code released under MIT License, curriculum materials under Creative Commons Attribution-ShareAlike 4.0 (CC-BY-SA). Repository: \url{https://github.com/harvard-edge/TinyTorch}} The repository includes instructor resources: \texttt{CONTRIBUTING.md} (guidelines for bug reports and curriculum improvements), \texttt{INSTRUCTOR.md} (30-minute setup guide, grading rubrics, common student errors), and \texttt{MAINTENANCE.md} (support commitment through 2027, succession planning for community governance).

\textbf{Maintenance Commitment}: The author commits to bug fixes and dependency updates through 2027, community pull request review within 2 weeks, and annual releases incorporating educator feedback. Community governance transition (2026--2027) will establish an educator advisory board and document succession planning to ensure long-term sustainability beyond single-author maintenance.

\textbf{Customization Support}: TinyTorch's modular design enables institutional adaptation: replacing datasets with domain-specific data (medical images, time series), adding modules (diffusion models, graph neural networks), adjusting difficulty through scaffolding modifications, or changing assessment approaches. Forks should maintain attribution (CC-BY-SA requirement) and ideally contribute improvements upstream.

\subsection{Teaching Assistant Support}
\label{subsec:ta-support}

Effective deployment requires structured TA support beyond instructor guidance.

\textbf{TA Preparation}: TAs should develop deep familiarity with critical modules where students commonly struggle—Modules 05 (Autograd), 09 (CNNs), and 13 (Transformers)—by completing these modules themselves and intentionally introducing bugs to understand common error patterns. The repository provides \texttt{TA\_GUIDE.md} documenting frequent student errors (gradient shape mismatches, disconnected computational graphs, broadcasting failures) and debugging strategies.

\textbf{Office Hour Demand Patterns}: Student help requests are expected to cluster around conceptually challenging modules, with autograd (Module 05) likely generating higher office hour demand than foundation modules. Instructors should anticipate demand spikes by scheduling additional TA capacity during critical modules, providing pre-recorded debugging walkthroughs, and establishing async support channels (discussion forums with guaranteed response times).

\textbf{Grading Infrastructure}: While NBGrader automates 70-80\% of assessment, critical modules benefit from manual review of implementation quality and conceptual understanding. TAs should focus manual grading on: (1) code clarity and design choices, (2) edge case handling, (3) computational complexity analysis, and (4) memory profiling insights. Sample solutions and grading rubrics in \texttt{INSTRUCTOR.md} calibrate evaluation standards.

\textbf{Boundaries and Scaffolding}: TAs should guide students toward solutions through structured debugging questions rather than providing direct answers. When students reach unproductive frustration, TAs can suggest optional scaffolding modules (numerical gradient checking before autograd implementation, scalar autograd before tensor autograd) to build confidence through intermediate steps.

\subsection{Student Learning Support}
\label{subsec:student-support}

TinyTorch embraces productive failure \citep{kapur2008productive}---learning through struggle before instruction---while providing guardrails against unproductive frustration.

\textbf{Recognizing Productive vs Unproductive Struggle}: Productive struggle involves trying different approaches, making incremental progress (passing additional tests), and developing deeper understanding of error messages. Unproductive frustration manifests as repeated identical errors without new insights, random code changes hoping for success, or inability to articulate the problem. Students experiencing unproductive frustration should seek help rather than persisting solo.

\textbf{Structured Help-Seeking}: The repository provides debugging workflows: (1) self-debug using print statements and simple test cases, (2) consult common errors documentation for the module, (3) search discussion forums for similar issues, (4) post structured help requests with error messages and attempted solutions, (5) attend office hours with specific questions. This progression encourages independence while ensuring timely intervention.

\textbf{Flexible Pacing and Optional Scaffolding}: Students learn at different rates depending on background, learning style, and external commitments. TinyTorch supports multiple pacing modes---intensive (weeks), semester (distributed coursework), self-paced (professional development)---without prescriptive timelines. Students struggling with conceptual jumps can access optional intermediate modules providing additional scaffolding. No penalty attaches to slower pacing or scaffolding use; depth of understanding matters more than completion speed.

\textbf{Diverse Student Contexts}: The curriculum acknowledges students balance learning with work, caregiving, or health challenges. Flexible pacing enables participation from community college students, working professionals, international learners, and non-traditional students who might be excluded by rigid timelines or high-end hardware requirements. Pure Python deployment on modest hardware (4GB RAM, dual-core CPU) and screen-reader-compatible ASCII diagrams further broaden accessibility.

\section{Discussion and Limitations}
\label{sec:discussion}

This section reflects on TinyTorch's design through three lenses: pedagogical scope as deliberate design decision, flexible curriculum configurations enabling diverse institutional deployment, and honest assessment of limitations requiring future work.

\subsection{Pedagogical Scope as Design Decision}
\label{subsec:scope}

TinyTorch's CPU-only, framework-internals-focused scope represents deliberate pedagogical constraint, not technical limitation. This scoping embodies three design principles:

\textbf{Accessibility over performance}: Pure Python eliminates GPU dependency, prioritizing equitable access over execution speed (pedagogical transparency detailed in \Cref{sec:systems}). GPU access remains inequitably distributed---cloud credits favor well-funded institutions, personal GPUs favor affluent students. The 100--1000$\times$ slowdown versus PyTorch (\Cref{tab:performance}) is acceptable when pedagogical goal is understanding internals, not training production models.

\textbf{Incremental complexity management}: GPU programming introduces memory hierarchy (registers, shared memory, global memory), kernel launch semantics, race conditions, and hardware-specific tuning. Teaching GPU programming simultaneously with autograd would violate cognitive load constraints. TinyTorch enables "framework internals now, hardware optimization later" learning pathway. Students completing TinyTorch should pursue GPU acceleration through PyTorch tutorials, NVIDIA Deep Learning Institute courses, or advanced ML systems courses---building on internals understanding to comprehend optimization techniques.

Similarly, distributed training (data parallelism, model parallelism, gradient synchronization) and production deployment (model serving, compilation, MLOps) introduce substantial additional complexity orthogonal to framework understanding. These topics remain important but beyond current pedagogical scope. Future extensions could address distributed systems through simulation-based pedagogy (\Cref{sec:future-work}), maintaining accessibility while teaching concepts.

\subsection{Pedagogical Flexibility: Rationale and Design Principles}
\label{subsec:flexibility}

Beyond the deployment models described in \Cref{subsec:integration}, TinyTorch's modular structure enables pedagogical configurations addressing diverse learning objectives and institutional constraints. This subsection examines the pedagogical reasoning underlying flexible curriculum design.

\textbf{Tier-based partitioning enables distributed cognitive load}: The three-tier structure (Foundation, Architecture, Optimization) creates natural stopping points where students achieve mastery before advancing. Completing Foundation (Modules 01--07) develops core systems understanding (tensors, autograd, training loops) sufficient for simple network training. Architecture tier (Modules 08--13) builds on this foundation to teach modern deep learning without re-teaching basics. This vertical partitioning follows cognitive load theory~\citep{sweller1988cognitive}: students consolidate foundational knowledge before encountering architectural complexity, preventing the overwhelm that occurs when teaching CNNs and transformers simultaneously with basic autograd mechanics. Multi-semester deployments exploit this structure by aligning tiers with academic terms, enabling depth over breadth.

\textbf{Selective implementation leverages package accumulation}: TinyTorch's progressive package structure (\Cref{subsec:package}) enables "build what you're learning, import what you need" pedagogy, supporting three distinct curriculum configurations: (1) \textbf{Foundation only} (Modules 01--07): Build core systems (tensors, autograd, training loops) from scratch---ideal for introductory ML systems courses or undergraduate capstone projects focusing on framework internals. (2) \textbf{Foundation + Architecture} (Modules 01--13): Extend to modern deep learning by implementing CNNs and transformers---suitable for comprehensive ML systems courses or graduate-level deep learning seminars. (3) \textbf{Optimization focus} (Modules 14--19 only): Import pre-built \texttt{tinytorch.nn} and \texttt{tinytorch.optim} packages, implement only optimization techniques (quantization, pruning, distillation)---targets production ML courses, edge deployment seminars, or TinyML workshops where students already understand framework basics but need systems optimization depth. This addresses a key limitation of monolithic projects: students interested in quantization shouldn't need to re-implement autograd first. These configurations enable instructors to match curriculum scope to course objectives---systems-heavy courses build from Foundation through Architecture, while optimization-focused courses skip to production concerns using pre-built dependencies. The pedagogical tradeoff is intentional: complete implementation builds comprehensive understanding, while selective implementation enables targeted depth within semester constraints.

\textbf{Hybrid integration bridges application and internals}: TinyTorch modules can augment PyTorch-first courses by revealing framework internals alongside application. Students training CNNs with PyTorch might implement TinyTorch's convolution module to understand what \texttt{torch.nn.Conv2d} does internally, then return to PyTorch for projects. This parallel construction addresses a fundamental pedagogical tension: application-first courses teach ML usage quickly but risk treating frameworks as black boxes; implementation-first courses teach internals deeply but delay practical application. Hybrid approaches enable both: students learn PyTorch for projects while building TinyTorch for understanding. Critical integration points include Module 05 (autograd, demystifying \texttt{loss.backward()}), Module 09 (convolution, explaining kernel complexity), and Module 12 (attention, revealing transformer internals).

\textbf{Consolidation cycles validate understanding through application}: Rather than building continuously, students could alternate implementation and application phases. After completing Foundation tier, students might spend 2--3 weeks using their self-built framework for traditional ML tasks (MNIST classification, sentiment analysis) before advancing to Architecture tier. This consolidation reveals understanding gaps through productive struggle: debugging self-built autograd, profiling self-implemented training loops, fixing gradient flow in custom architectures. Application phases transform passive "I finished the exercises" into active "I can use what I built"---a crucial distinction for systems understanding. This rhythm mirrors professional practice: implement, validate through use, identify limitations, extend capabilities.

\textbf{Variable pacing addresses heterogeneous preparation}: Students enter ML systems courses with diverse backgrounds: CS majors with strong systems foundations, statistics students with ML theory but limited programming experience, working professionals with application knowledge but no implementation depth. Flexible pacing enables differentiation: advanced students accelerate through familiar content (Foundation tier as review) to reach novel material (Architecture/Optimization), while students building foundational skills invest time where needed. Critically, pedagogical design should validate understanding, not completion speed. Optional scaffolding (numerical gradient checking, scalar autograd prototypes) provides intermediate steps for students needing additional support without penalizing slower pacing.

\subsection{Limitations}

TinyTorch's current implementation contains gaps requiring future work. \textbf{Assessment infrastructure}: NBGrader scaffolding works in development but remains unvalidated for large-scale deployment. Grading validity requires investigation: Do tests measure conceptual understanding or syntax? Future work should validate through item analysis and transfer task correlation.

\textbf{Performance transparency tradeoff}: Pure Python executes 100--1000$\times$ slower than PyTorch (\Cref{tab:performance})---deliberate choice for pedagogical clarity. Seven explicit convolution loops reveal algorithmic complexity better than optimized C++ kernels, but slow execution limits practical experimentation. Students complete milestones (75\%+ CIFAR-10 accuracy, transformer text generation) but cannot iterate rapidly on architecture search.

\textbf{Energy consumption measurement}: While TinyTorch covers optimization techniques with significant energy implications (quantization achieving 4$\times$ compression, pruning enabling 10$\times$ model shrinkage), the curriculum does not explicitly measure or quantify energy consumption. Students understand that quantization reduces model size and pruning decreases computation, but may not connect these optimizations to concrete energy savings (joules/inference, watt-hours/training epoch). Future iterations could integrate energy profiling libraries to make sustainability an explicit learning objective alongside memory and latency optimization, particularly relevant for edge deployment.

\textbf{Language and accessibility}: Materials exist exclusively in English. Modular structure facilitates translation; community contributions welcome. Code examples omit type annotations (e.g., \texttt{def forward(self, x: Tensor) -> Tensor:}) to reduce visual complexity for students learning ML concepts simultaneously. While this prioritizes pedagogical clarity, it means students don't practice type-driven development increasingly standard in production ML codebases. Future iterations could introduce type hints progressively: omitting them in early modules (01--05), then adding them in optimization modules (14--18) where interface contracts become critical.

\section{Future Directions}
\label{sec:future-work}

TinyTorch's current implementation establishes a foundation for three extension directions: empirical validation to test pedagogical hypotheses, curriculum evolution to expand systems coverage beyond CPU-only scope, and community adoption to measure educational impact through deployment at scale.

\subsection{Empirical Validation Roadmap}

While TinyTorch's design is grounded in established learning theory (cognitive load~\citep{sweller1988cognitive}, progressive disclosure, cognitive apprenticeship~\citep{collins1989cognitive}), its pedagogical effectiveness requires empirical validation through controlled classroom studies. We commit to the following validation roadmap:

\textbf{Phase 1: Pilot Deployment (Fall 2025, $n=30$--50 students).} Deploy at 2--3 universities in introductory ML systems courses as primary hands-on framework alongside theory lectures. Cognitive load measurement uses Paas Mental Effort Rating Scale~\citep{paas1992training} administered after Modules 05 (autograd) and 09 (CNNs) to test progressive disclosure hypothesis: does dormant feature activation reduce cognitive load compared to introducing autograd as separate framework? Time-to-completion tracking instruments each module to measure actual versus estimated completion time (currently 60--80 hours total is unvalidated projection based on content density). Formative assessment identifies common struggle points, prerequisite gaps, and module pacing issues through instructor interviews, student feedback surveys, and learning analytics from NBGrader submissions.

\textbf{Phase 2: Comparative Study (Spring 2026, $n=100$--150 students).} Randomized controlled trial compares TinyTorch (systems-first, build-from-scratch) versus PyTorch-only (application-first, use-existing-frameworks) versus lecture-only (control) across 3 sections of same ML course with identical theory content. Conceptual understanding measured through ML systems concept inventory (adapted from program visualization assessment~\citep{sorva2012visual} for systems thinking) administered pre-course and post-course, assessing autograd mechanics, memory profiling, computational complexity, and optimization tradeoffs. Transfer performance evaluated through post-course debugging task requiring PyTorch profiling and optimization on novel CNN architecture (e.g., ``This training loop runs out of memory: identify bottlenecks and fix''). Does building TinyTorch improve debugging transfer to production frameworks? Code quality analysis evaluates student-written training loops for memory efficiency (batch size tuning, gradient accumulation awareness), vectorization (avoiding Python loops), and systems awareness (profiling-informed decisions versus trial-and-error).

\textbf{Phase 3: Longitudinal Tracking (2026--2027, $n=200$+ students).} Re-administer concept inventory at 6 months and 12 months post-course to measure long-term retention of systems thinking: does implementation-based learning persist better than lecture-based learning? Track TinyTorch cohort versus control group performance in subsequent ML systems courses (e.g., CMU Deep Learning Systems~\citep{chen2022dlsyscourse}, distributed training courses) to measure preparation effectiveness. Survey employment placement in ML engineering roles (requiring systems knowledge) versus ML application roles (using frameworks as black boxes) at 1--2 years post-graduation. Does systems-first education influence career trajectory?

\noindent\textbf{Open Science Commitment:}
All validation studies will be pre-registered on Open Science Framework (OSF) with hypotheses, instruments, and analysis plans published before data collection. Datasets (anonymized student performance, survey responses, code submissions) and analysis code will be released openly under CC-BY-4.0 license. Results---positive or negative---will be published regardless of outcome, avoiding publication bias. Validation data will inform iterative curriculum refinement through evidence-based design updates, ensuring continuous improvement grounded in empirical pedagogy research rather than assumption.

\subsection{Curriculum Evolution}

TinyTorch's CPU-only design prioritizes pedagogical transparency, but students benefit from understanding GPU acceleration and distributed training concepts without requiring expensive hardware. Future curriculum extensions would maintain TinyTorch's core principle---understanding through transparent implementation---while expanding systems coverage through complementary pedagogical approaches.

\noindent\textbf{Performance Analysis Through Analytical Models.} Future extensions could enable students to compare TinyTorch CPU implementations against PyTorch GPU equivalents through roofline modeling~\citep{williams2009roofline}. Rather than writing CUDA code, students would profile existing implementations to understand memory hierarchy differences, parallelism benefits, and compute versus memory bottlenecks. The roofline approach maintains TinyTorch's accessibility (no GPU hardware required) while preparing students for GPU programming by teaching first-principles performance analysis.

\noindent\textbf{Distributed Training Through Simulation.} Understanding distributed training communication patterns requires simulation-based pedagogy rather than multi-GPU clusters. Future extensions could integrate distributed training simulation enabling single-machine exploration of multi-device concepts: gradient synchronization overhead, scalability analysis across worker counts, network topology impact on communication patterns, and pipeline parallelism trade-offs. This simulation-based approach maintains pedagogical transparency---students understand distributed systems through measurement and analysis, not black-box hardware access.

\noindent\textbf{Energy and Power Profiling.} Edge deployment and sustainable ML~\citep{strubell2019energy,patterson2021carbon} require understanding energy consumption. Future extensions could integrate power profiling tools enabling students to measure energy costs (joules per inference, watt-hours per training epoch) alongside latency and memory. This connects existing optimization techniques (quantization, pruning) taught in Modules 15--18 to concrete sustainability metrics, particularly relevant for edge AI~\citep{banbury2021benchmarking} where battery life constrains deployment.

\noindent\textbf{Architecture Extensions.} Potential additions (graph neural networks, diffusion models, reinforcement learning) must justify inclusion through systems pedagogy rather than completeness. The question is not ``Can TinyTorch implement this?'' but ``Does implementing this teach fundamental systems concepts unavailable through existing modules?'' Graph convolutions might teach sparse tensor operations; diffusion models might illuminate iterative refinement trade-offs. However, extensions succeed only when maintaining TinyTorch's principle: \textbf{every line of code teaches a systems concept}. Community forks demonstrate this philosophy: quantum ML variants replace tensors with quantum state vectors (teaching circuit depth versus memory); robotics forks emphasize RL simulation overhead and real-time constraints. The curriculum remains intentionally incomplete as a production framework---completeness lies in foundational systems thinking applicable across all ML architectures.

\subsection{Community Adoption and Impact}

TinyTorch serves as the hands-on companion to the Machine Learning Systems textbook, providing practical implementation experience alongside theoretical foundations. Adoption will be measured through multiple channels: (1) \textbf{Educational adoption}: tracking course integrations, student enrollment, and instructor feedback across institutions; (2) \textbf{AI Olympics community}: inspired by MLPerf benchmarking, the AI Olympics leaderboard would create competitive systems engineering challenges where students submit optimized implementations competing across accuracy, speed, compression, and efficiency tracks---building community engagement and peer learning; (3) \textbf{Open-source metrics}: GitHub stars, forks, contributions, and community discussions indicating active use beyond formal coursework. This multi-faceted approach recognizes that educational impact extends beyond traditional classroom metrics to include community building, peer learning, and long-term skill development. The AI Olympics platform particularly enables students to see how their implementations compare globally, fostering systems thinking through competitive optimization while maintaining educational focus on understanding internals rather than achieving state-of-the-art performance.

\section{Conclusion}
\label{sec:conclusion}

Machine learning education faces a fundamental choice: teach students to \emph{use} frameworks as black boxes, or teach them to \emph{understand} what happens inside \texttt{loss.backward()}, why Adam requires 2$\times$ optimizer state memory, why attention scales $O(N^2)$. TinyTorch demonstrates that systems understanding---building autograd, profiling memory, debugging gradient flow---is accessible without requiring GPU clusters or distributed infrastructure. This accessibility matters: students worldwide can develop framework internals knowledge on modest hardware, transforming production debugging from trial-and-error into systematic engineering.

Three pedagogical contributions enable this transformation. \textbf{Progressive disclosure} manages complexity through gradual feature activation---students work with unified Tensor implementations that gain capabilities across modules rather than replacing code mid-semester. \textbf{Systems-first integration} embeds memory profiling from Module 01, preventing ``algorithms without costs'' learning where students optimize accuracy while ignoring deployment constraints. \textbf{Historical milestone validation} proves correctness through recreating 70 years of ML breakthroughs---from 1958 Perceptron through 2017 Transformers---making abstract implementations concrete through reproducing published results.

\textbf{For ML practitioners}: Building TinyTorch's 20 modules transforms how you debug production failures. When PyTorch training crashes with OOM errors, you understand memory allocation across parameters, optimizer states, and activation tensors. When gradient explosions occur, you recognize backpropagation numerical instability from implementing it yourself. When choosing between Adam and SGD under memory constraints, you know the 4$\times$ total memory multiplier from building both optimizers. This systems knowledge transfers directly to production framework usage---you become an engineer who understands \emph{why} frameworks behave as they do, not just \emph{what} they do.

\textbf{For CS education researchers}: TinyTorch provides replicable infrastructure for testing pedagogical hypotheses about ML systems education. Does progressive disclosure reduce cognitive load compared to teaching autograd as separate framework? Does systems-first integration improve production readiness versus algorithms-only instruction? Do historical milestones increase engagement and retention? The curriculum embodies design patterns amenable to controlled empirical investigation. Open-source release with detailed validation roadmap enables multi-institutional studies to establish evidence-based best practices for teaching framework internals.

\textbf{For educators and bootcamp instructors}: TinyTorch supports flexible integration---self-paced learning requiring zero infrastructure (students run locally on laptops), institutional courses with automated NBGrader assessment, or industry team onboarding for ML engineers transitioning from application development to systems work. The modular structure enables selective adoption: foundation tier only (Modules 01--07, teaching core concepts), architecture focus (adding CNNs and Transformers through Module 13), or complete systems coverage (all 20 modules including optimization and deployment). No GPU access required, no cloud credits needed, no infrastructure barriers.

The complete codebase, curriculum materials, and assessment infrastructure are openly available at \texttt{tinytorch.ai} under permissive open-source licensing. We invite the global ML education community to adopt TinyTorch in courses, contribute curriculum improvements, translate materials for international accessibility, fork for domain-specific variants (quantum ML, robotics, edge AI), and empirically evaluate whether implementation-based pedagogy achieves its promise. The difference between engineers who know \emph{what} ML systems do and engineers who understand \emph{why} they work begins with understanding what's inside \texttt{loss.backward()}---and TinyTorch makes that understanding accessible to everyone.

\section*{Acknowledgments}

Coming soon.

% Bibliography
\bibliographystyle{plainnat}
\bibliography{references}

\end{document}