Fix orphans and widows in paper text

- Rephrase 'optimization opportunities' sentence to avoid dangling word
- Adjust 'roles industry desperately needs' to 'roles that the industry desperately needs'
- Change 'over dense layers' to 'compared to dense layers' to improve line break
- Reorder 'construct mental models gradually' to 'gradually construct mental models'
- Change 'self-paced professional development' to 'independent professional development'
Vijay Janapa Reddi
2026-01-26 18:22:10 -05:00
parent 9bbab8f739
commit f8485a3323


@@ -256,11 +256,11 @@ This pedagogical gap creates a systems efficiency crisis. Modern ML models often
This crisis directly impacts the ML workforce. Industry surveys indicate significant demand-supply imbalances for ML systems engineers~\citep{roberthalf2024talent,keller2025ai}, with a substantial portion of surveyed executives citing talent shortage as their primary barrier to AI adoption~\citep{keller2025ai}. These are not the scientists who invent new architectures but the engineers who make existing architectures computationally viable: who understand when mixed precision training preserves accuracy, how gradient checkpointing trades compute for memory, and why distributed training introduces synchronization bottlenecks. They are often the bottleneck to realizing the promise of computational scaling that enables general methods to triumph.
The knowledge these engineers need is fundamentally about systems mental models. Understanding \emph{why} Adam requires $2\times$ optimizer state memory requires visualizing optimizer state buffers. Predicting \emph{when} batch sizes must shrink to fit GPU memory requires internalizing memory hierarchy latencies. Navigating accuracy-latency-memory tradeoffs in production systems requires understanding collective communication patterns and their overhead~\citep{meadows2008thinking}. This tacit systems knowledge (how frameworks manage memory, schedule operations, and optimize execution) cannot be developed through high-level API usage alone. It emerges from building these systems, from implementing tensor operations that reveal memory access patterns, from constructing computational graphs that expose optimization opportunities.
The knowledge these engineers need is fundamentally about systems mental models. Understanding \emph{why} Adam needs $2\times$ optimizer state memory requires visualizing its state buffers. Predicting \emph{when} batch sizes must shrink to fit GPU memory requires internalizing memory hierarchy latencies. Navigating accuracy-latency-memory tradeoffs in production systems requires understanding collective communication patterns and their overhead~\citep{meadows2008thinking}. This tacit systems knowledge (how frameworks manage memory, schedule operations, and optimize execution) cannot be developed through high-level API usage alone. It emerges from building these systems, from implementing tensor operations that reveal memory access patterns, from constructing computational graphs exposing optimization opportunities.
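The Adam memory claim follows directly from the algorithm: Adam keeps a first-moment and a second-moment buffer per parameter tensor, so optimizer state alone occupies twice the parameter memory. A minimal sketch (tensor shapes here are illustrative assumptions, not from the curriculum):

```python
import numpy as np

# Illustrative parameter tensors for one dense layer (shapes assumed)
params = [np.zeros((4096, 4096), dtype=np.float32),  # weights
          np.zeros((4096,), dtype=np.float32)]       # bias

param_bytes = sum(p.nbytes for p in params)

# Adam allocates one first-moment (m) and one second-moment (v)
# buffer per parameter tensor, each the same shape and dtype
adam_state_bytes = sum(2 * p.nbytes for p in params)

print(adam_state_bytes / param_bytes)  # prints: 2.0
```

This is exactly the kind of back-of-the-envelope accounting students perform when they implement the optimizer themselves.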
Current ML education often fails to develop these mental models through a strict separation: algorithms courses teach backpropagation mathematics while systems courses teach distributed computing, with limited bridges between them. Students learn gradient descent's convergence properties but not its memory footprint. They implement neural networks using framework APIs but never see how those APIs translate to memory allocations and kernel launches. They can mathematically derive the chain rule but cannot explain how \texttt{autograd} implements it efficiently through dynamic graph construction and topological traversal. This separation leaves students unprepared for the systems engineering roles industry desperately needs.
Current ML education often fails to develop these mental models because of a strict separation: algorithms courses teach backpropagation mathematics while systems courses teach distributed computing, with limited bridges between them. Students learn gradient descent's convergence properties but not its memory footprint. They implement neural networks using framework APIs but never see how those APIs translate to memory allocations and kernel launches. They can mathematically derive the chain rule but cannot explain how \texttt{autograd} implements it efficiently through dynamic graph construction and topological traversal. This separation leaves students unprepared for the systems engineering roles that the industry desperately needs.
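The dynamic-graph-plus-topological-traversal idea is compact enough to show in miniature. The following sketch (not TinyTorch's or PyTorch's actual code) records graph edges as operations execute, then applies each operation's local chain-rule step in reverse topological order:

```python
# Minimal reverse-mode autograd sketch: each operation records its
# inputs (dynamic graph construction); backward() visits the recorded
# graph in reverse topological order, accumulating gradients.
class Value:
    def __init__(self, data):
        self.data = data
        self.grad = 0.0
        self._parents = ()          # graph edges, recorded at runtime
        self._backward_fn = None    # local chain-rule step

    def __mul__(self, other):
        out = Value(self.data * other.data)
        out._parents = (self, other)
        def backward_fn():
            self.grad += other.data * out.grad   # d(out)/d(self)
            other.grad += self.data * out.grad   # d(out)/d(other)
        out._backward_fn = backward_fn
        return out

    def __add__(self, other):
        out = Value(self.data + other.data)
        out._parents = (self, other)
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topological sort of the dynamically constructed graph
        order, visited = [], set()
        def visit(node):
            if node not in visited:
                visited.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):         # reverse topological order
            if node._backward_fn:
                node._backward_fn()

x = Value(3.0)
y = Value(4.0)
z = x * y + x      # z = xy + x, so dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # prints: 5.0 3.0
```

Students who have built something like this can explain \texttt{autograd}'s behavior rather than merely invoke it.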
We present TinyTorch, a 20-module curriculum that teaches single node machine learning systems engineering by building a PyTorch-compatible framework from pure Python primitives. Designed as a hands-on companion to the \emph{Machine Learning Systems} textbook~\citep{mlsysbook2025}, TinyTorch makes systems efficiency tangible: students implement tensor operations while measuring memory consumption, build autograd while profiling computational graphs, create optimizers while tracking state overhead. Each module reinforces insights inspired by the systems imperative the Bitter Lesson reveals: computational efficiency, not algorithmic cleverness alone, drives ML progress. Students don't just learn \emph{that} Conv2d achieves 109$\times$ parameter efficiency over dense layers, they \emph{implement} sliding window convolution and \emph{measure} the difference directly through profiling code they wrote.
We present TinyTorch, a 20-module curriculum that teaches single-node machine learning systems engineering by building a PyTorch-compatible framework from pure Python primitives. Designed as a hands-on companion to the \emph{Machine Learning Systems} textbook~\citep{mlsysbook2025}, TinyTorch makes systems efficiency tangible: students implement tensor operations while measuring memory consumption, build autograd while profiling computational graphs, create optimizers while tracking state overhead. Each module reinforces insights inspired by the systems imperative the Bitter Lesson reveals: computational efficiency, not algorithmic cleverness alone, drives ML progress. Students don't just learn \emph{that} Conv2d achieves 109$\times$ parameter efficiency compared to dense layers; they \emph{implement} sliding window convolution and \emph{measure} the difference directly through profiling code they wrote.
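The source of convolution's parameter efficiency is weight sharing: a small kernel is reused at every spatial position, while a dense layer needs one weight per input-output pair. A back-of-the-envelope comparison (the layer shapes below are illustrative assumptions, not necessarily the configuration behind the curriculum's 109$\times$ figure):

```python
# Illustrative parameter counts: 3x3 convolution vs. a dense layer
# connecting the same input and output feature maps (shapes assumed)
in_h = in_w = 32          # spatial size of the input feature map
in_c, out_c, k = 3, 16, 3 # channels in/out, kernel size

# Conv2d: out_c kernels of shape (in_c, k, k), plus one bias per kernel
conv_params = out_c * in_c * k * k + out_c

# Dense: one weight per (input element, output element) pair, plus biases
dense_params = (in_h * in_w * in_c) * (in_h * in_w * out_c) + in_h * in_w * out_c

print(conv_params, dense_params)
```

The gap students measure depends on the shapes compared, but the mechanism (kernel reuse across spatial positions) is what implementing sliding-window convolution makes concrete.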
This pedagogical approach mirrors computer engineering education, where processor design progresses from transistors to logic gates to arithmetic units to complete CPUs, with each layer a working system that enables the next. TinyTorch applies the same principle: tensors enable operations, operations enable layers, layers enable networks, networks enable complete ML systems. \Cref{fig:code-comparison} illustrates this progression: from PyTorch's black-box APIs, through building internals like optimizers, to training transformers where every import is student-implemented code.
@@ -364,9 +364,9 @@ for batch in DataLoader(data):
\end{figure*}
Building systems knowledge alongside ML fundamentals presents three pedagogical challenges: teaching systems thinking early without overwhelming beginners (\Cref{sec:systems}), managing cognitive load when teaching both algorithms and implementation (\Cref{sec:progressive}), and validating student understanding through concrete milestones (\Cref{subsec:milestones}). TinyTorch addresses these through curriculum design inspired by compiler courses~\citep{aho2006compilers}: students build a complete system incrementally, with each module adding functionality while maintaining a working implementation. \Cref{fig:module-flow} illustrates this progression: tensors (Module 01) enable activations (02) and layers (03), which feed into dataloader (05) and autograd (06), powering optimizers (07) and training (08). Each completed module becomes immediately usable: after Module 03, students build neural networks; after Module 06, automatic differentiation enables training; after Module 13, transformers support language modeling. This structure enables students to construct mental models gradually while seeing immediate results.
Building systems knowledge alongside ML fundamentals presents three pedagogical challenges: teaching systems thinking early without overwhelming beginners (\Cref{sec:systems}), managing cognitive load when teaching both algorithms and implementation (\Cref{sec:progressive}), and validating student understanding through concrete milestones (\Cref{subsec:milestones}). TinyTorch addresses these through curriculum design inspired by compiler courses~\citep{aho2006compilers}: students build a complete system incrementally, with each module adding functionality while maintaining a working implementation. \Cref{fig:module-flow} illustrates this progression: tensors (Module 01) enable activations (02) and layers (03), which feed into dataloader (05) and autograd (06), powering optimizers (07) and training (08). Each completed module becomes immediately usable: after Module 03, students build neural networks; after Module 06, automatic differentiation enables training; after Module 13, transformers support language modeling. This structure enables students to gradually construct mental models while seeing immediate results.
TinyTorch serves students transitioning from framework \emph{users} to framework \emph{engineers}: those who have completed introductory ML courses (e.g., CS229, fast.ai) and want to understand PyTorch internals, those planning ML systems research or infrastructure careers, or practitioners debugging production deployment issues. The curriculum assumes NumPy proficiency and basic neural network familiarity but teaches framework architecture from first principles. Students needing immediate GPU/distributed training skills are better served by PyTorch tutorials; those preferring project-based application building will find high-level frameworks more appropriate. The 20-module structure supports flexible pacing: intensive completion, semester integration (parallel with lectures), or self-paced professional development.
TinyTorch serves students transitioning from framework \emph{users} to framework \emph{engineers}: those who have completed introductory ML courses (e.g., CS229, fast.ai) and want to understand PyTorch internals, those planning ML systems research or infrastructure careers, or practitioners debugging production deployment issues. The curriculum assumes NumPy proficiency and basic neural network familiarity but teaches framework architecture from first principles. Students needing immediate GPU/distributed training skills are better served by PyTorch tutorials; those preferring project-based application building will find high-level frameworks more appropriate. The 20-module structure supports flexible pacing: intensive completion, semester integration (parallel with lectures), or independent professional development.
This paper makes three contributions, each addressing the systems imperative the Bitter Lesson reveals: