mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-03-12 02:09:16 -05:00
Correct technical claims to align with implementation
- Fix CIFAR-10 accuracy: 75%+ → 65-75% (matches capstone.py target)
- Standardize Module 20: Olympics/AI Olympics → Capstone (canonical name)
- Clarify NBGrader: Integrated with markers, but unvalidated
- Correct milestone span: 70 years → 66 years (1958-2024)
- Verify Conv2d loops: 7 loops confirmed correct

All changes align paper with actual TinyTorch implementation. Paper compiles successfully (26 pages, no errors).
@@ -217,7 +217,7 @@

% Abstract - REVISED: Curriculum design focus

\begin{abstract}
-Machine learning systems engineering requires understanding framework internals: why optimizers consume memory, when computational complexity becomes prohibitive, how to navigate accuracy-latency-memory tradeoffs. Yet current ML education separates algorithms from systems—students learn gradient descent without measuring memory, attention mechanisms without profiling costs, training without understanding optimizer overhead. This divide leaves graduates unable to debug production failures or make informed engineering decisions. We present TinyTorch, a build-from-scratch curriculum where students implement PyTorch's core components (tensors, autograd, optimizers, neural networks) to gain framework transparency. Three pedagogical patterns address the gap: \textbf{progressive disclosure} gradually reveals complexity (gradient features exist from Module 01, activate in Module 05); \textbf{systems-first curriculum} embeds memory profiling from the start; \textbf{historical milestone validation} recreates 70 years of ML breakthroughs using exclusively student-implemented code. These patterns are grounded in learning theory (situated cognition, cognitive load theory) but represent testable hypotheses requiring empirical validation. The 20-module curriculum (60--80 hours) provides complete open-source infrastructure at \texttt{tinytorch.ai}.
+Machine learning systems engineering requires understanding framework internals: why optimizers consume memory, when computational complexity becomes prohibitive, how to navigate accuracy-latency-memory tradeoffs. Yet current ML education separates algorithms from systems—students learn gradient descent without measuring memory, attention mechanisms without profiling costs, training without understanding optimizer overhead. This divide leaves graduates unable to debug production failures or make informed engineering decisions. We present TinyTorch, a build-from-scratch curriculum where students implement PyTorch's core components (tensors, autograd, optimizers, neural networks) to gain framework transparency. Three pedagogical patterns address the gap: \textbf{progressive disclosure} gradually reveals complexity (gradient features exist from Module 01, activate in Module 05); \textbf{systems-first curriculum} embeds memory profiling from the start; \textbf{historical milestone validation} recreates 66 years of ML breakthroughs (1958--2024) using exclusively student-implemented code. These patterns are grounded in learning theory (situated cognition, cognitive load theory) but represent testable hypotheses requiring empirical validation. The 20-module curriculum (60--80 hours) provides complete open-source infrastructure at \texttt{tinytorch.ai}.
\end{abstract}
@@ -367,7 +367,7 @@ Building systems knowledge alongside ML fundamentals presents three pedagogical

\node[draw,rectangle,fill=green!20,below=of M16,minimum width=1.8cm] (M17) {17 Memoization};
\node[draw,rectangle,fill=green!20,below=of M17,minimum width=1.8cm] (M18) {18 Acceleration};
\node[draw,rectangle,fill=green!20,below=of M18,minimum width=1.8cm] (M19) {19 Benchmarking};
-\node[draw,rectangle,fill=red!30,below=of M19,minimum width=1.8cm] (M20) {20 Olympics};
+\node[draw,rectangle,fill=red!30,below=of M19,minimum width=1.8cm] (M20) {20 Capstone};

% Arrows - Foundation connections (straight lines within column)
\draw[->] (M01) -- (M02);
@@ -509,9 +509,9 @@ This section presents the 20-module curriculum structure, organized into four ti

As established in \Cref{sec:intro}, TinyTorch targets students transitioning from framework users to framework engineers. The curriculum assumes intermediate Python proficiency (comfort with classes, functions, and NumPy array operations) alongside mathematical foundations in linear algebra (matrix multiplication, vectors) and basic calculus (derivatives, chain rule). Students should understand complexity analysis (Big-O notation) and basic algorithms. While prior ML coursework (traditional machine learning or deep learning courses) and data structures courses are helpful, they are not strictly required; motivated students can acquire these foundations concurrently.

-\subsection{The 3-Tier Learning Journey + Olympics}
+\subsection{The 3-Tier Learning Journey + Capstone}

-TinyTorch organizes modules into three progressive tiers plus a capstone competition (\Cref{tab:objectives}). Students cannot skip tiers: architectures require foundation mastery, optimization demands training system understanding. The tiers mirror ML systems engineering practice: foundation (core ML mechanics), architectures (domain-specific models), optimization (production deployment), culminating in the AI Olympics (competitive systems engineering).
+TinyTorch organizes modules into three progressive tiers plus a capstone competition (\Cref{tab:objectives}). Students cannot skip tiers: architectures require foundation mastery, optimization demands training system understanding. The tiers mirror ML systems engineering practice: foundation (core ML mechanics), architectures (domain-specific models), optimization (production deployment), culminating in the Capstone (competitive systems engineering).

\begin{table*}[p]
\centering
@@ -556,9 +556,9 @@ TinyTorch organizes modules into three progressive tiers plus a capstone competi

19 & Opt & Benchmarking & Statistical comparison, multiple runs & Confidence intervals, warm-up protocols \\
\addlinespace[2pt]
\midrule
-\multicolumn{5}{@{}l}{\textbf{Olympics (20)}} \\
+\multicolumn{5}{@{}l}{\textbf{Capstone (20)}} \\
\addlinespace[2pt]
-20 & Cap & Olympics & End-to-end optimized system & MLPerf-style metrics, Pareto optimization \\
+20 & Cap & Capstone & End-to-end optimized system & MLPerf-style metrics, Pareto optimization \\
\bottomrule
\end{tabularx}
}
@@ -588,14 +588,14 @@ Students build the mathematical core enabling neural networks to learn. Systems

\textbf{Tier 2: Architectures (Modules 08--13).}
Students apply foundation knowledge to modern architectures for vision and language. Module 08 introduces the Dataset abstraction pattern (implementing \texttt{\_\_len\_\_} and \texttt{\_\_getitem\_\_} protocols) and DataLoader with batch collation, teaching how PyTorch's data pipeline transforms individual samples into batched tensors through the iterator protocol. While Module 07 implements basic training loops with manual batching (simple iteration over pre-batched arrays), Module 08 refactors this into production-quality data loading, a pedagogical pattern of ``make it work, then make it right.'' Students first understand training mechanics (forward pass, loss, backward, update), then learn proper data pipeline engineering. TinyTorch ships with two custom educational datasets that install with the repository: \textbf{TinyDigits} (5,000 grayscale handwritten digits, curated from public digit datasets) and \textbf{TinyTalks} (3,000 synthetically-generated conversational Q\&A pairs). These datasets are deliberately small and offline-first: they require no network connectivity during training, consume minimal storage ($<$50MB combined), and train in minutes on CPU-only hardware. This design ensures accessibility for students in regions with limited internet infrastructure, institutional computer labs with restricted network access, and developing countries where cloud-based datasets create barriers to ML education.

-The tier then branches into two paths. \textbf{Vision} implements Conv2d with seven explicit nested loops making $O(C_{out} \times H \times W \times C_{in} \times K^2)$ complexity visible before optimization. Students discover weight sharing's dramatic efficiency through direct comparison: Conv2d(3$\rightarrow$32, kernel=3) requires 896 parameters while an equivalent dense layer needs 98,336 parameters (3072 input features $\times$ 32 outputs + 32 bias terms), a 109$\times$ reduction demonstrating how inductive biases enable CNNs to learn spatial patterns without brute-force parameterization. This enables Milestone 4 (1998 CNN Revolution) targeting 75\%+ CIFAR-10 accuracy~\citep{krizhevsky2009cifar,lecun1998gradient}.
+The tier then branches into two paths. \textbf{Vision} implements Conv2d with seven explicit nested loops making $O(C_{out} \times H \times W \times C_{in} \times K^2)$ complexity visible before optimization. Students discover weight sharing's dramatic efficiency through direct comparison: Conv2d(3$\rightarrow$32, kernel=3) requires 896 parameters while an equivalent dense layer needs 98,336 parameters (3072 input features $\times$ 32 outputs + 32 bias terms), a 109$\times$ reduction demonstrating how inductive biases enable CNNs to learn spatial patterns without brute-force parameterization. This enables Milestone 4 (1998 CNN Revolution) targeting 65--75\% CIFAR-10 accuracy~\citep{krizhevsky2009cifar,lecun1998gradient}.
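The parameter counts quoted in the hunk above (896 vs. 98,336, a 109× reduction) can be verified with a few lines of arithmetic. A quick sketch; the function names are illustrative, not TinyTorch's API:

```python
def conv2d_params(c_in, c_out, k):
    # Each output channel owns one k x k filter per input channel, plus one bias.
    return c_out * (c_in * k * k) + c_out

def dense_params(n_in, n_out):
    # Fully connected: one weight per (input, output) pair, plus biases.
    return n_in * n_out + n_out

conv = conv2d_params(3, 32, 3)          # 896
dense = dense_params(32 * 32 * 3, 32)   # 3072 inputs -> 32 outputs: 98,336
print(conv, dense, dense // conv)       # 896 98336 109
```

The reduction comes entirely from weight sharing: the convolution reuses the same 27 weights per output channel at every spatial position, while the dense layer learns a separate weight for each pixel.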
\textbf{Language} progresses through tokenization (character-level and BPE), embeddings (both learned and sinusoidal positional encodings), attention ($O(N^2)$ memory), and complete transformers~\citep{vaswani2017attention}. Module 10 (Tokenization) teaches a fundamental NLP systems trade-off: vocabulary size controls model parameters (embedding matrix rows $\times$ dimensions), while sequence length determines transformer computation ($O(n^2)$ attention complexity). Students discover why later GPT models increased vocabulary from 50K tokens (GPT-2/GPT-3) to 100K tokens (GPT-3.5/GPT-4): not for better language understanding, but to reduce sequence lengths for long documents, trading parameter memory for computational efficiency. Students experience quadratic scaling through direct measurement. Milestone 5 (2017 Transformer Era) validates through text generation on TinyTalks.
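The quadratic attention-memory scaling mentioned above can be made concrete by counting bytes in the score matrix, which holds one value per (query, key) pair. A sketch under simple assumptions (FP32 scores, score matrix only, not the TinyTorch implementation):

```python
def attention_score_bytes(seq_len, n_heads=1, dtype_bytes=4):
    # One FP32 score per (query, key) pair, per head: seq_len x seq_len.
    return n_heads * seq_len * seq_len * dtype_bytes

for n in (128, 256, 512, 1024):
    print(n, attention_score_bytes(n))
# Doubling seq_len quadruples score memory: O(N^2).
```

This is the measurement students can reproduce directly: each doubling of sequence length multiplies score memory by four, which is why halving sequence lengths via a larger vocabulary pays off for long documents.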
\textbf{Tier 3: Optimization (Modules 14--19).}
Students transition from ``models that train'' to ``systems that deploy.'' Profiling (14) teaches measuring time, memory, and FLOPs (floating-point operations), introducing Amdahl's Law: optimizing 70\% of runtime by 2$\times$ yields only 1.53$\times$ overall speedup because the remaining 30\% becomes the new bottleneck. This teaches that optimization is iterative and measurement-driven. Quantization (15) achieves 4$\times$ compression (FP32$\rightarrow$INT8) with 1--2\% accuracy cost. Compression (16) applies pruning and distillation for 10$\times$ shrinkage. Memoization (17) implements KV caching (storing attention keys and values to avoid recomputation), a technique used in production LLM serving: students discover that naive autoregressive generation recomputes attention keys and values at every step, generating 100 tokens requires 5,050 redundant computations (1+2+...+100). By caching these values and reusing them, students transform $O(n^2)$ generation into $O(n)$, achieving 10--100$\times$ speedup and understanding why this optimization is essential in systems like ChatGPT and Claude for economically viable inference. Acceleration (18) vectorizes convolution for 10--100$\times$ gains. Benchmarking (19) teaches rigorous performance measurement.
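The 5,050 figure in the Memoization paragraph above is just the triangular sum 1+2+...+100, and the cached-vs-naive contrast can be counted directly. An illustrative counting sketch, not the TinyTorch KV-cache code:

```python
def kv_recomputations(n_tokens):
    # Naive autoregressive generation: step t recomputes K/V for all t
    # tokens seen so far, so total work is 1 + 2 + ... + n.
    return sum(range(1, n_tokens + 1))

def kv_with_cache(n_tokens):
    # With a KV cache, each token's keys/values are computed exactly once.
    return n_tokens

print(kv_recomputations(100))  # 5050 -> O(n^2) total K/V work
print(kv_with_cache(100))      # 100  -> O(n) with caching
```

The gap widens quadratically: at 1,000 generated tokens the naive count is 500,500 versus 1,000 with the cache, which is why production LLM serving treats KV caching as non-negotiable.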
-\textbf{AI Olympics (Module 20).}
+\textbf{Capstone (Module 20).}
The capstone integrates all 19 modules into production-optimized systems. Inspired by MLPerf~\citep{reddi2020mlperf}, students optimize prior milestones (CIFAR-10 CNN, transformer generation, or custom architecture) for 10$\times$ faster inference, 4$\times$ smaller size, and sub-100ms latency while maintaining accuracy. Students compete on the TinyTorch Leaderboard across four tracks: Vision Excellence, Language Quality, Speed, and Compression. This teaches data-driven optimization mirroring real ML systems engineering.

\subsection{Module Structure}
@@ -648,7 +648,7 @@ While milestones provide pedagogical motivation through historical framing, they

\item \textbf{M07 (1986 MLP Revival)}: Achieves strong MNIST digit classification accuracy, validating backpropagation through all layers of deep networks.
\item \textbf{M10 (1998 LeNet CNN)}: Demonstrates meaningful CIFAR-10 learning (substantially better than random 10\% baseline), showing convolutional feature extraction works correctly.
\item \textbf{M13 (2017 Transformer)}: Generates coherent multi-token text continuations on TinyTalks dataset, demonstrating functional attention mechanisms and autoregressive generation.
-\item \textbf{M20 (2024 AI Olympics)}: Student-selected challenge across Vision/Language/Speed/Compression tracks with self-defined success metrics, demonstrating production systems integration.
+\item \textbf{M20 (2024 Capstone)}: Student-selected challenge across Vision/Language/Speed/Compression tracks with self-defined success metrics, demonstrating production systems integration.
\end{itemize}

Performance targets differ from published state-of-the-art due to pure-Python constraints (no GPU acceleration, simplified architectures). Correctness matters more than speed: if a student's CNN learns meaningful CIFAR-10 features, their convolution, pooling, and backpropagation implementations compose correctly into a functional vision system. This approach mirrors professional debugging where implementations prove correct by solving real tasks, not by passing synthetic unit tests alone.
@@ -953,7 +953,7 @@ TinyTorch integrates NBGrader~\citep{blank2019nbgrader} for scalable automated a

This infrastructure enables deployment in MOOCs and large classrooms where manual grading proves infeasible. Instructors configure NBGrader to collect submissions, execute tests in sandboxed environments, and generate grade reports automatically.

-\textbf{Important caveat}: NBGrader scaffolding exists but remains unvalidated at scale (\Cref{sec:discussion}). Automated assessment validity requires empirical investigation: Do tests measure conceptual understanding or syntax correctness? We scope this as ``curriculum with autograding infrastructure'' rather than ``validated assessment system.''
+\textbf{Important caveat}: NBGrader is integrated for autograding using BEGIN SOLUTION/END SOLUTION markers and test cells, but remains unvalidated at scale (\Cref{sec:discussion}). Automated assessment validity requires empirical investigation: Do tests measure conceptual understanding or syntax correctness? We scope this as ``curriculum with autograding infrastructure'' rather than ``validated assessment system.''
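The BEGIN SOLUTION/END SOLUTION markers referenced in the hunk above follow NBGrader's standard convention: the region between the markers is stripped when generating the student version, and a separate read-only test cell grades the submission. A minimal sketch with a hypothetical exercise (not an actual TinyTorch module):

```python
# --- autograded answer cell (instructor version) ---
def relu(x):
    ### BEGIN SOLUTION
    return max(0.0, x)
    ### END SOLUTION

# --- read-only test cell: executed against the student's submission ---
assert relu(-1.0) == 0.0
assert relu(2.5) == 2.5
print("tests passed")
```

In the distributed student notebook, the body between the markers is replaced with a `raise NotImplementedError()` stub, so the test cell fails until the student supplies a working implementation.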
\subsection{Package Organization}
\label{subsec:package}
@@ -1048,9 +1048,9 @@ Similarly, distributed training (data parallelism, model parallelism, gradient s

\subsection{Limitations}

-TinyTorch's current implementation contains gaps requiring future work. \textbf{Assessment infrastructure}: NBGrader scaffolding works in development but remains unvalidated for large-scale deployment. Grading validity requires investigation: Do tests measure conceptual understanding or syntax? Future work should validate through item analysis and transfer task correlation.
+TinyTorch's current implementation contains gaps requiring future work. \textbf{Assessment infrastructure}: NBGrader is integrated for autograding with test cells and solution markers, but remains unvalidated for large-scale deployment. Grading validity requires investigation: Do tests measure conceptual understanding or syntax? Future work should validate through item analysis and transfer task correlation.

-\textbf{Performance transparency tradeoff}: Pure Python executes 100--1000$\times$ slower than PyTorch (\Cref{tab:performance}), a deliberate choice for pedagogical clarity. Seven explicit convolution loops reveal algorithmic complexity better than optimized C++ kernels, but slow execution limits practical experimentation. Students complete milestones (75\%+ CIFAR-10 accuracy, transformer text generation) but cannot iterate rapidly on architecture search.
+\textbf{Performance transparency tradeoff}: Pure Python executes 100--1000$\times$ slower than PyTorch (\Cref{tab:performance}), a deliberate choice for pedagogical clarity. Seven explicit convolution loops reveal algorithmic complexity better than optimized C++ kernels, but slow execution limits practical experimentation. Students complete milestones (65--75\% CIFAR-10 accuracy, transformer text generation) but cannot iterate rapidly on architecture search.

\textbf{Energy consumption measurement}: While TinyTorch covers optimization techniques with significant energy implications (quantization achieving 4$\times$ compression, pruning enabling 10$\times$ model shrinkage), the curriculum does not explicitly measure or quantify energy consumption. Students understand that quantization reduces model size and pruning decreases computation, but may not connect these optimizations to concrete energy savings (joules/inference, watt-hours/training epoch). Future iterations could integrate energy profiling libraries to make sustainability an explicit learning objective alongside memory and latency optimization, particularly relevant for edge deployment.
@@ -1090,14 +1090,14 @@ TinyTorch's CPU-only design prioritizes pedagogical transparency, but students b

\subsection{Community Adoption and Impact}

-TinyTorch serves as the hands-on companion to the Machine Learning Systems textbook, providing practical implementation experience alongside theoretical foundations. Adoption will be measured through multiple channels: (1) \textbf{Educational adoption}: tracking course integrations, student enrollment, and instructor feedback across institutions; (2) \textbf{AI Olympics community}: inspired by MLPerf benchmarking, the AI Olympics leaderboard would create competitive systems engineering challenges where students submit optimized implementations competing across accuracy, speed, compression, and efficiency tracks, building community engagement and peer learning; (3) \textbf{Open-source metrics}: GitHub stars, forks, contributions, and community discussions indicating active use beyond formal coursework. This multi-faceted approach recognizes that educational impact extends beyond traditional classroom metrics to include community building, peer learning, and long-term skill development. The AI Olympics platform particularly enables students to see how their implementations compare globally, fostering systems thinking through competitive optimization while maintaining educational focus on understanding internals rather than achieving state-of-the-art performance.
+TinyTorch serves as the hands-on companion to the Machine Learning Systems textbook, providing practical implementation experience alongside theoretical foundations. Adoption will be measured through multiple channels: (1) \textbf{Educational adoption}: tracking course integrations, student enrollment, and instructor feedback across institutions; (2) \textbf{Capstone community}: inspired by MLPerf benchmarking, the Capstone leaderboard would create competitive systems engineering challenges where students submit optimized implementations competing across accuracy, speed, compression, and efficiency tracks, building community engagement and peer learning; (3) \textbf{Open-source metrics}: GitHub stars, forks, contributions, and community discussions indicating active use beyond formal coursework. This multi-faceted approach recognizes that educational impact extends beyond traditional classroom metrics to include community building, peer learning, and long-term skill development. The Capstone platform particularly enables students to see how their implementations compare globally, fostering systems thinking through competitive optimization while maintaining educational focus on understanding internals rather than achieving state-of-the-art performance.

\section{Conclusion}
\label{sec:conclusion}

Machine learning education faces a fundamental choice: teach students to \emph{use} frameworks as black boxes, or teach them to \emph{understand} what happens inside \texttt{loss.backward()}, why Adam requires 2$\times$ optimizer state memory, why attention scales $O(N^2)$. TinyTorch demonstrates that systems understanding (building autograd, profiling memory, debugging gradient flow) is accessible without requiring GPU clusters or distributed infrastructure. This accessibility matters: students worldwide can develop framework internals knowledge on modest hardware, transforming production debugging from trial-and-error into systematic engineering.

-Three pedagogical contributions enable this transformation. \textbf{Progressive disclosure} manages complexity through gradual feature activation: students work with unified Tensor implementations that gain capabilities across modules rather than replacing code mid-semester. \textbf{Systems-first integration} embeds memory profiling from Module 01, preventing ``algorithms without costs'' learning where students optimize accuracy while ignoring deployment constraints. \textbf{Historical milestone validation} proves correctness through recreating 70 years of ML breakthroughs (from 1958 Perceptron through 2017 Transformers), making abstract implementations concrete through reproducing published results.
+Three pedagogical contributions enable this transformation. \textbf{Progressive disclosure} manages complexity through gradual feature activation: students work with unified Tensor implementations that gain capabilities across modules rather than replacing code mid-semester. \textbf{Systems-first integration} embeds memory profiling from Module 01, preventing ``algorithms without costs'' learning where students optimize accuracy while ignoring deployment constraints. \textbf{Historical milestone validation} proves correctness through recreating 66 years of ML breakthroughs (1958--2024, from Perceptron through Transformers), making abstract implementations concrete through reproducing published results.

\textbf{For ML practitioners}: Building TinyTorch's 20 modules transforms how you debug production failures. When PyTorch training crashes with OOM errors, you understand memory allocation across parameters, optimizer states, and activation tensors. When gradient explosions occur, you recognize backpropagation numerical instability from implementing it yourself. When choosing between Adam and SGD under memory constraints, you know the 4$\times$ total memory multiplier from building both optimizers. This systems knowledge transfers directly to production framework usage: you become an engineer who understands \emph{why} frameworks behave as they do, not just \emph{what} they do.
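The 2× optimizer-state and 4× total-memory figures in the conclusion above follow from simple FP32 accounting: Adam keeps two extra state tensors (first and second moments) per parameter, alongside the weights and gradients. A back-of-envelope sketch under those assumptions (function name is illustrative):

```python
def training_memory_bytes(n_params, optimizer="adam", dtype_bytes=4):
    weights = n_params * dtype_bytes
    grads = n_params * dtype_bytes
    # Plain SGD holds no extra state; Adam holds m and v per parameter.
    state = 2 * n_params * dtype_bytes if optimizer == "adam" else 0
    return weights + grads + state

n = 1_000_000
print(training_memory_bytes(n, "adam") // training_memory_bytes(n, "sgd"))  # 2
print(training_memory_bytes(n, "adam") // (n * 4))                          # 4: total is 4x weights alone
```

Adam's state alone equals 2× the parameter memory, and weights + gradients + state sum to the 4× multiplier practitioners hit when a model that fits for inference OOMs during training.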