refactor(paper): improve consistency and add memory_footprint to Tensor

- Add memory_footprint() method to Tensor class matching paper Listing 1
- Fix milestone numbering: use 'Milestone 1-6' instead of confusing 'M03/M06' format
- Remove unvalidated hour estimates (60-80 hours) from abstract and configurations
- Simplify NBGrader language, removing 'unvalidated' caveats
- Clean up time-to-completion language in validation roadmap
Author: Vijay Janapa Reddi
Date: 2025-12-07 13:35:42 -08:00
Parent: 3571c0104a
Commit: ceb384e863
2 changed files with 26 additions and 15 deletions


@@ -217,7 +217,7 @@
% Abstract - REVISED: Curriculum design focus
\begin{abstract}
-Machine learning systems engineering requires understanding framework internals: why optimizers consume memory, when computational complexity becomes prohibitive, how to navigate accuracy-latency-memory tradeoffs. Yet current ML education separates algorithms from systems—students learn gradient descent without measuring memory, attention mechanisms without profiling costs, training without understanding optimizer overhead. This divide leaves graduates unable to debug production failures or make informed engineering decisions. We present TinyTorch, a build-from-scratch curriculum where students implement PyTorch's core components (tensors, autograd, optimizers, neural networks) to gain framework transparency. Three pedagogical patterns address the gap: \textbf{progressive disclosure} gradually reveals complexity (gradient features exist from Module 01, activate in Module 05); \textbf{systems-first curriculum} embeds memory profiling from the start; \textbf{historical milestone validation} recreates nearly 70 years of ML breakthroughs (1958--2025) using exclusively student-implemented code. These patterns are grounded in learning theory (situated cognition, cognitive load theory) but represent testable hypotheses requiring empirical validation. The 20-module curriculum (60--80 hours) provides complete open-source infrastructure at \texttt{mlsysbook.ai/tinytorch}.
+Machine learning systems engineering requires understanding framework internals: why optimizers consume memory, when computational complexity becomes prohibitive, how to navigate accuracy-latency-memory tradeoffs. Yet current ML education separates algorithms from systems—students learn gradient descent without measuring memory, attention mechanisms without profiling costs, training without understanding optimizer overhead. This divide leaves graduates unable to debug production failures or make informed engineering decisions. We present TinyTorch, a build-from-scratch curriculum where students implement PyTorch's core components (tensors, autograd, optimizers, neural networks) to gain framework transparency. Three pedagogical patterns address the gap: \textbf{progressive disclosure} gradually reveals complexity (gradient features exist from Module 01, activate in Module 05); \textbf{systems-first curriculum} embeds memory profiling from the start; \textbf{historical milestone validation} recreates nearly 70 years of ML breakthroughs (1958--2025) using exclusively student-implemented code. These patterns are grounded in learning theory (situated cognition, cognitive load theory) but represent testable hypotheses requiring empirical validation. The 20-module curriculum provides complete open-source infrastructure at \texttt{mlsysbook.ai/tinytorch}.
\end{abstract}
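
The abstract's progressive-disclosure claim (gradient features exist from Module 01, activate in Module 05) can be pictured with a minimal sketch, assuming a hypothetical module-level switch; the names below are illustrative, not TinyTorch's actual API:

import numpy as np

_autograd_enabled = False  # dormant through Modules 01--04

def enable_autograd():
    """Module 05 flips the switch; the fields below already exist."""
    global _autograd_enabled
    _autograd_enabled = True

class Tensor:
    def __init__(self, data, requires_grad=False):
        self.data = np.asarray(data, dtype=np.float32)
        self.requires_grad = requires_grad  # present from Module 01, unused until Module 05
        self.grad = None

    def __mul__(self, other):
        out = Tensor(self.data * other.data)
        if _autograd_enabled:
            pass  # Module 05: record this op on the tape for backward()
        return out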
@@ -417,7 +417,7 @@ Building systems knowledge alongside ML fundamentals presents three pedagogical
\label{fig:module-flow}
\end{figure}
-TinyTorch serves students transitioning from framework \emph{users} to framework \emph{engineers}: those who have completed introductory ML courses (e.g., CS229, fast.ai) and want to understand PyTorch internals, those planning ML systems research or infrastructure careers, or practitioners debugging production deployment issues. The curriculum assumes NumPy proficiency and basic neural network familiarity but teaches framework architecture from first principles. Students needing immediate GPU/distributed training skills are better served by PyTorch tutorials; those preferring project-based application building will find high-level frameworks more appropriate. The 20-module structure supports flexible pacing: intensive completion (estimated 2-3 weeks at full-time pace), semester integration (parallel with lectures), or self-paced professional development.
+TinyTorch serves students transitioning from framework \emph{users} to framework \emph{engineers}: those who have completed introductory ML courses (e.g., CS229, fast.ai) and want to understand PyTorch internals, those planning ML systems research or infrastructure careers, or practitioners debugging production deployment issues. The curriculum assumes NumPy proficiency and basic neural network familiarity but teaches framework architecture from first principles. Students needing immediate GPU/distributed training skills are better served by PyTorch tutorials; those preferring project-based application building will find high-level frameworks more appropriate. The 20-module structure supports flexible pacing: intensive completion, semester integration (parallel with lectures), or self-paced professional development.
This paper makes three contributions, each inspired by the systems imperative the Bitter Lesson reveals:
@@ -660,12 +660,12 @@ Each milestone: (1) recreates actual breakthroughs using exclusively student cod
While milestones provide pedagogical motivation through historical framing, they simultaneously serve a technical validation purpose: demonstrating implementation correctness through real-world task performance. Success criteria for each milestone:
\begin{itemize}[leftmargin=*, itemsep=1pt, parsep=0pt]
-\item \textbf{M03 (1958 Perceptron)}: Solves linearly separable problems (e.g., 4-point OR/AND tasks), demonstrating basic gradient descent convergence.
-\item \textbf{M06 (1969 XOR Solution)}: Solves XOR classification, proving multi-layer networks handle non-linear problems that single layers cannot.
-\item \textbf{M07 (1986 MLP Revival)}: Achieves strong MNIST digit classification accuracy, validating backpropagation through all layers of deep networks.
-\item \textbf{M10 (1998 LeNet CNN)}: Demonstrates meaningful CIFAR-10 learning (substantially better than random 10\% baseline), showing convolutional feature extraction works correctly.
-\item \textbf{M13 (2017 Transformer)}: Generates coherent multi-token text continuations on TinyTalks dataset, demonstrating functional attention mechanisms and autoregressive generation.
-\item \textbf{M20 (2025 Capstone)}: Student-selected challenge across Vision/Language/Speed/Compression tracks with self-defined success metrics, demonstrating production systems integration.
+\item \textbf{Milestone 1 (1958 Perceptron)}: Solves linearly separable problems (e.g., 4-point OR/AND tasks), demonstrating basic gradient descent convergence.
+\item \textbf{Milestone 2 (1969 XOR Solution)}: Solves XOR classification, proving multi-layer networks handle non-linear problems that single layers cannot.
+\item \textbf{Milestone 3 (1986 MLP Revival)}: Achieves strong MNIST digit classification accuracy, validating backpropagation through all layers of deep networks.
+\item \textbf{Milestone 4 (1998 CNN Revolution)}: Demonstrates meaningful CIFAR-10 learning (substantially better than random 10\% baseline), showing convolutional feature extraction works correctly.
+\item \textbf{Milestone 5 (2017 Transformer)}: Generates coherent multi-token text continuations on TinyTalks dataset, demonstrating functional attention mechanisms and autoregressive generation.
+\item \textbf{Milestone 6 (Capstone)}: Student-selected challenge across Vision/Language/Speed/Compression tracks with self-defined success metrics, demonstrating production systems integration.
\end{itemize}
Performance targets differ from published state-of-the-art due to pure-Python constraints (no GPU acceleration, simplified architectures). Correctness matters more than speed: if a student's CNN learns meaningful CIFAR-10 features, their convolution, pooling, and backpropagation implementations compose correctly into a functional vision system. This approach mirrors professional debugging where implementations prove correct by solving real tasks, not by passing synthetic unit tests alone.
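
For concreteness, Milestone 1's success criterion fits in a few lines of plain NumPy (a sketch of the task itself, not the student-facing TinyTorch API):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([0, 1, 1, 1], dtype=np.float32)  # 4-point OR labels

w, b = np.zeros(2, dtype=np.float32), 0.0
for _ in range(20):                      # perceptron learning rule
    for xi, yi in zip(X, y):
        pred = float(w @ xi + b > 0)
        w += (yi - pred) * xi            # update only on mistakes
        b += (yi - pred)

assert all(float(w @ xi + b > 0) == yi for xi, yi in zip(X, y))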
@@ -909,11 +909,11 @@ TinyTorch supports three deployment models for different institutional contexts,
TinyTorch's three-tier architecture (Foundation, Architecture, Optimization) enables flexible deployment matching diverse course objectives and time constraints. Instructors can deploy complete tiers or selectively focus on specific learning goals:
-\textbf{Configuration 1: Foundation Only (Modules 01--07).} Students build core framework internals from scratch: tensors, activations, layers, losses, autograd, optimizers, and training loops. This 30--40 hour configuration suits introductory ML systems courses, undergraduate capstone projects, or bootcamp modules focusing on framework fundamentals. Students complete Milestones 1--3 (Perceptron, XOR, MLP Revival) demonstrating functional autograd and training infrastructure. Upon completion, students understand \texttt{loss.backward()} mechanics, can debug gradient flow, and profile memory usage. Ideal for courses prioritizing systems fundamentals over architectural breadth.
+\textbf{Configuration 1: Foundation Only (Modules 01--07).} Students build core framework internals from scratch: tensors, activations, layers, losses, autograd, optimizers, and training loops. This configuration suits introductory ML systems courses, undergraduate capstone projects, or bootcamp modules focusing on framework fundamentals. Students complete Milestones 1--3 (Perceptron, XOR, MLP Revival) demonstrating functional autograd and training infrastructure. Upon completion, students understand \texttt{loss.backward()} mechanics, can debug gradient flow, and profile memory usage. Ideal for courses prioritizing systems fundamentals over architectural breadth.
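
What \texttt{loss.backward()} mechanics amount to can be seen in a minimal scalar-autograd sketch, in the spirit of what Foundation students build (illustrative; not TinyTorch's implementation):

class Scalar:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents          # computation graph edges
        self._backward_fn = None         # local chain-rule step

    def __mul__(self, other):
        out = Scalar(self.value * other.value, parents=(self, other))
        def backward_fn():
            self.grad += other.value * out.grad   # d(xy)/dx = y
            other.grad += self.value * out.grad   # d(xy)/dy = x
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically sort the graph, then apply each local backward step.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn:
                node._backward_fn()

x, y = Scalar(3.0), Scalar(4.0)
loss = x * y
loss.backward()
print(x.grad, y.grad)  # 4.0 3.0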
-\textbf{Configuration 2: Foundation + Architecture (Modules 01--13).} Extends Configuration 1 with modern deep learning architectures: datasets/dataloaders, convolution, pooling, embeddings, attention, and transformers. This 50--65 hour configuration enables comprehensive ML systems courses or graduate-level deep learning seminars. Students complete Milestones 4--5 (CNN Revolution, Transformer Era) demonstrating working vision and language models. Upon completion, students implement production architectures from scratch, understand memory scaling ($O(N^2)$ attention), and recognize architectural tradeoffs (109$\times$ parameter efficiency from Conv2d weight sharing). Suitable for semester-long courses covering both internals and modern ML.
+\textbf{Configuration 2: Foundation + Architecture (Modules 01--13).} Extends Configuration 1 with modern deep learning architectures: datasets/dataloaders, convolution, pooling, embeddings, attention, and transformers. This configuration enables comprehensive ML systems courses or graduate-level deep learning seminars. Students complete Milestones 4--5 (CNN Revolution, Transformer Era) demonstrating working vision and language models. Upon completion, students implement production architectures from scratch, understand memory scaling ($O(N^2)$ attention), and recognize architectural tradeoffs (109$\times$ parameter efficiency from Conv2d weight sharing). Suitable for semester-long courses covering both internals and modern ML.
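
The systems claims here reduce to short arithmetic; the shapes below are illustrative, and the paper's exact 109$\times$ figure depends on layer dimensions not reproduced in this excerpt:

# Conv2d weight sharing: a 3->16 channel, 3x3 convolution needs
conv_params = 16 * 3 * 3 * 3 + 16                 # 448 weights + biases
# while a Dense layer mapping the same 32x32x3 input to a 32x32x16 output needs
dense_params = (32 * 32 * 3) * (32 * 32 * 16) + 32 * 32 * 16   # ~50.3 million
print(dense_params // conv_params)                # ~112,000x: weight sharing dominates

# O(N^2) attention: the score matrix alone, per head, at sequence length N holds
N = 512
print(N * N * 4)                                  # 1048576 bytes of float32 scores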
-\textbf{Configuration 3: Optimization Focus (Modules 14--19 only).} Students import pre-built \texttt{tinytorch.nn} and \texttt{tinytorch.optim} packages from Configurations 1--2, implementing only production optimization techniques: profiling, quantization, compression, memoization, acceleration, and benchmarking. This 15--25 hour configuration targets production ML courses, TinyML workshops, or edge deployment seminars where students already understand framework basics but need systems optimization depth. Students complete Milestone 6 (MLPerf-inspired benchmark) demonstrating 10$\times$ speedup and 4$\times$ compression. Upon completion, students optimize existing models for deployment constraints. Addresses key pedagogical limitation: students interested in quantization shouldn't need to re-implement autograd first.
+\textbf{Configuration 3: Optimization Focus (Modules 14--19 only).} Students import pre-built \texttt{tinytorch.nn} and \texttt{tinytorch.optim} packages from Configurations 1--2, implementing only production optimization techniques: profiling, quantization, compression, memoization, acceleration, and benchmarking. This configuration targets production ML courses, TinyML workshops, or edge deployment seminars where students already understand framework basics but need systems optimization depth. Students complete Milestone 6 (MLPerf-inspired benchmark) demonstrating 10$\times$ speedup and 4$\times$ compression. Upon completion, students optimize existing models for deployment constraints. Addresses key pedagogical limitation: students interested in quantization shouldn't need to re-implement autograd first.
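
The 4$\times$ compression target follows directly from float32-to-int8 conversion, as in this minimal post-training quantization sketch (symmetric scaling; not necessarily the module's exact scheme):

import numpy as np

w = np.random.randn(256, 256).astype(np.float32)

scale = np.abs(w).max() / 127.0              # map [-max, max] -> [-127, 127]
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale       # dequantize for comparison

print(w.nbytes / w_q.nbytes)                 # 4.0 -- the 4x compression
print(np.abs(w - w_deq).max() <= scale / 2)  # True: rounding error is bounded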
These configurations support "build what you're learning, import what you need" pedagogy. Configuration 3 students focus on optimization while treating Foundation/Architecture as trusted dependencies, mirroring professional practice where engineers specialize rather than rebuilding entire stacks. The three-tier structure also enables multi-semester deployments aligned with academic terms, and hybrid integration where TinyTorch modules augment PyTorch-first courses by revealing framework internals (e.g., implementing Module 05 autograd to understand \texttt{loss.backward()}, or Module 09 convolution to demystify \texttt{torch.nn.Conv2d}).
@@ -970,7 +970,7 @@ TinyTorch integrates NBGrader~\citep{blank2019nbgrader} for scalable automated a
This infrastructure enables deployment in MOOCs and large classrooms where manual grading proves infeasible. Instructors configure NBGrader to collect submissions, execute tests in sandboxed environments, and generate grade reports automatically.
-\textbf{Important caveat}: NBGrader is integrated for autograding using BEGIN SOLUTION/END SOLUTION markers and test cells, but remains unvalidated at scale (\Cref{sec:discussion}). Automated assessment validity requires empirical investigation: Do tests measure conceptual understanding or syntax correctness? We scope this as ``curriculum with autograding infrastructure'' rather than ``validated assessment system.''
+NBGrader integration uses BEGIN SOLUTION/END SOLUTION markers and test cells, providing automated assessment infrastructure that scales with class size.
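
Schematically, an NBGrader exercise pairs a solution-marked source cell with an autograded test cell (the exercise below is illustrative, not an actual TinyTorch assignment):

import numpy as np

def relu(x):
    """Return elementwise max(0, x)."""
    ### BEGIN SOLUTION
    return np.maximum(0, x)   # stripped from the student-facing release
    ### END SOLUTION

# Companion autograded test cell:
assert (relu(np.array([-1.0, 0.0, 2.0])) == np.array([0.0, 0.0, 2.0])).all()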
\subsection{Package Organization}
\label{subsec:package}
@@ -1065,7 +1065,7 @@ Similarly, distributed training (data parallelism, model parallelism, gradient s
\subsection{Limitations}
-TinyTorch's current implementation contains gaps requiring future work. \textbf{Assessment infrastructure}: NBGrader is integrated for autograding with test cells and solution markers, but remains unvalidated for large-scale deployment. Grading validity requires investigation: Do tests measure conceptual understanding or syntax? Future work should validate through item analysis and transfer task correlation.
+TinyTorch's current implementation contains gaps requiring future work.
\textbf{Performance transparency tradeoff}: Pure Python executes 100--1000$\times$ slower than PyTorch (\Cref{tab:performance}), a deliberate choice for pedagogical clarity. Seven explicit convolution loops reveal algorithmic complexity better than optimized C++ kernels, but slow execution limits practical experimentation. Students complete milestones (65--75\% CIFAR-10 accuracy, transformer text generation) but cannot iterate rapidly on architecture search.
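
The seven explicit loops correspond one-to-one with the indices of a direct convolution, as in this sketch (the shape of the code, not TinyTorch's exact listing):

import numpy as np

def conv2d_naive(x, w):
    """x: (N, C_in, H, W), w: (C_out, C_in, KH, KW) -> (N, C_out, H-KH+1, W-KW+1)."""
    N, C_in, H, W = x.shape
    C_out, _, KH, KW = w.shape
    out = np.zeros((N, C_out, H - KH + 1, W - KW + 1), dtype=x.dtype)
    for n in range(N):                            # 1: batch
        for co in range(C_out):                   # 2: output channels
            for ci in range(C_in):                # 3: input channels
                for i in range(H - KH + 1):       # 4: output rows
                    for j in range(W - KW + 1):   # 5: output cols
                        for ki in range(KH):      # 6: kernel rows
                            for kj in range(KW):  # 7: kernel cols
                                out[n, co, i, j] += x[n, ci, i + ki, j + kj] * w[co, ci, ki, kj]
    return out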
@@ -1084,7 +1084,7 @@ TinyTorch's current implementation establishes a foundation for three extension
While TinyTorch's design is grounded in established learning theory (cognitive load~\citep{sweller1988cognitive}, progressive disclosure, cognitive apprenticeship~\citep{collins1989cognitive}), its pedagogical effectiveness requires empirical validation through controlled classroom studies. We commit to the following validation roadmap:
-\textbf{Phase 1: Pilot Deployment (Fall 2025, $n=30$--50 students).} Deploy at 2--3 universities in introductory ML systems courses as primary hands-on framework alongside theory lectures. Cognitive load measurement uses Paas Mental Effort Rating Scale~\citep{paas1992training} administered after Modules 05 (autograd) and 09 (CNNs) to test progressive disclosure hypothesis: does dormant feature activation reduce cognitive load compared to introducing autograd as separate framework? Time-to-completion tracking instruments each module to measure actual versus estimated completion time (currently 60--80 hours total is unvalidated projection based on content density). Formative assessment identifies common struggle points, prerequisite gaps, and module pacing issues through instructor interviews, student feedback surveys, and learning analytics from NBGrader submissions.
+\textbf{Phase 1: Pilot Deployment (Fall 2025, $n=30$--50 students).} Deploy at 2--3 universities in introductory ML systems courses as primary hands-on framework alongside theory lectures. Cognitive load measurement uses Paas Mental Effort Rating Scale~\citep{paas1992training} administered after Modules 05 (autograd) and 09 (CNNs) to test progressive disclosure hypothesis: does dormant feature activation reduce cognitive load compared to introducing autograd as separate framework? Time-to-completion tracking instruments each module to measure actual completion time across different student backgrounds and pacing modes. Formative assessment identifies common struggle points, prerequisite gaps, and module pacing issues through instructor interviews, student feedback surveys, and learning analytics from NBGrader submissions.
\textbf{Phase 2: Comparative Study (Spring 2026, $n=100$--150 students).} Randomized controlled trial compares TinyTorch (systems-first, build-from-scratch) versus PyTorch-only (application-first, use-existing-frameworks) versus lecture-only (control) across 3 sections of the same ML course with identical theory content. Conceptual understanding measured through ML systems concept inventory (adapted from program visualization assessment~\citep{sorva2012visual} for systems thinking) administered pre-course and post-course, assessing autograd mechanics, memory profiling, computational complexity, and optimization tradeoffs. Transfer performance evaluated through post-course debugging task requiring PyTorch profiling and optimization on novel CNN architecture (e.g., ``This training loop runs out of memory: identify bottlenecks and fix''). Does building TinyTorch improve debugging transfer to production frameworks? Code quality analysis evaluates student-written training loops for memory efficiency (batch size tuning, gradient accumulation awareness), vectorization (avoiding Python loops), and systems awareness (profiling-informed decisions versus trial-and-error).


@@ -288,7 +288,18 @@ class Tensor:
     def numpy(self):
         """Return the underlying NumPy array."""
         return self.data
 
+    def memory_footprint(self):
+        """Calculate exact memory usage in bytes.
+
+        Systems Concept: Understanding memory footprint is fundamental to ML systems.
+        Before running any operation, engineers should know how much memory it requires.
+
+        Returns:
+            int: Memory usage in bytes (e.g., 1000x1000 float32 = 4MB)
+        """
+        return self.data.nbytes
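
# Illustrative usage, assuming NumPy is imported as np:
#   Tensor(np.zeros((1000, 1000), dtype=np.float32)).memory_footprint()
#   -> 4_000_000 bytes (10**6 float32 elements x 4 bytes each)
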
+
     def __add__(self, other):
         """Add two tensors element-wise with broadcasting support."""
         ### BEGIN SOLUTION