diff --git a/paper/paper.tex b/paper/paper.tex index ae388ea9..b1b6ed15 100644 --- a/paper/paper.tex +++ b/paper/paper.tex @@ -1003,30 +1003,44 @@ TinyTorch embraces productive failure \citep{kapur2008productive}---learning thr \section{Discussion and Limitations} \label{sec:discussion} -This section positions TinyTorch's pedagogical approach through three lenses: deliberate scope decisions that focus curriculum on framework internals, observations about spiral reinforcement across modules, and honest assessment of limitations requiring future work. +This section reflects on TinyTorch's design through four lenses: pedagogical scope as a deliberate design decision, transferable principles for systems education broadly, practical implications for adoption, and an honest assessment of limitations requiring future work. -\subsection{Scope: What's NOT Covered} +\subsection{Pedagogical Scope as Design Decision} \label{subsec:scope} -TinyTorch prioritizes framework internals understanding over production ML completeness. The curriculum explicitly omits several critical production skills that require substantial additional complexity orthogonal to framework internals. GPU programming and hardware acceleration---including CUDA kernel optimization, memory hierarchies, tensor cores, mixed precision training with FP16/BF16/INT8 and gradient scaling, and hardware-specific optimization for TPUs and Apple Neural Engine---represent substantial domains requiring parallel programming expertise that would overwhelm students still mastering tensor operations and automatic differentiation. +TinyTorch's CPU-only, framework-internals-focused scope represents a deliberate pedagogical constraint, not a technical limitation.
This scoping embodies three design principles: -Distributed training fundamentals including data parallelism (DistributedDataParallel, gradient synchronization), model parallelism (pipeline parallelism, tensor parallelism), and large-scale training systems (FSDP, DeepSpeed, Megatron-LM) similarly introduce communication complexity that extends beyond framework internals. Production deployment and serving skills---model compilation through TorchScript, ONNX, and TensorRT, serving infrastructure including batching, load balancing, and latency optimization, and MLOps tooling for experiment tracking, model versioning, and A/B testing---represent deployment infrastructure concerns distinct from framework understanding. +\textbf{Accessibility over performance}: Pure Python on modest hardware (4GB RAM, dual-core CPU) enables global participation. GPU access remains inequitably distributed---cloud credits favor well-funded institutions; personal GPUs favor affluent students. Eliminating GPU dependency prioritizes equitable access over execution speed. The 100--1000$\times$ slowdown versus PyTorch (\Cref{tab:performance}) is acceptable when the pedagogical goal is understanding internals, not training production models. +\textbf{Transparency over optimization}: Seven explicit convolution loops reveal algorithmic structure better than CUDA kernels.
Students learn \emph{why} convolution complexity is $O(W \times H \times C_{in} \times C_{out} \times K^2)$, not \emph{how} to optimize it for specific hardware. This separation---understand algorithms first, optimize later---mirrors professional practice: prototype for correctness, profile to identify bottlenecks, then optimize measured hotspots. -TinyTorch teaches framework internals as foundation for GPU and distributed work, not as replacement. Complete production ML engineer preparation requires TinyTorch (internals) followed by PyTorch Distributed (GPU/multi-node), deployment courses (serving), and on-the-job experience. Students completing TinyTorch should pursue GPU and distributed training through PyTorch tutorials, NVIDIA Deep Learning Institute courses, or advanced ML systems courses. The CPU-only design offers three pedagogical benefits: accessibility (students in regions with limited cloud computing access can complete curriculum on modest hardware), reproducibility (no GPU availability variability across institutions), and pedagogical focus (internals learning not confounded with hardware optimization). +\textbf{Incremental complexity management}: GPU programming introduces a memory hierarchy (registers, shared memory, global memory), kernel launch semantics, race conditions, and hardware-specific tuning. Teaching GPU programming simultaneously with autograd would violate cognitive load constraints. TinyTorch enables a ``framework internals now, hardware optimization later'' learning pathway. Students completing TinyTorch should pursue GPU acceleration through PyTorch tutorials, NVIDIA Deep Learning Institute courses, or advanced ML systems courses---building on internals understanding to comprehend optimization techniques.
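To make the transparency principle concrete, here is a minimal sketch of a direct convolution with all seven loops visible (batch, output channel, output row/column, input channel, kernel row/column). This is illustrative only: the function name `conv2d_naive`, the NCHW layout, stride 1, and no padding are our assumptions, not TinyTorch's actual module code.

```python
import numpy as np

def conv2d_naive(x, w):
    """Direct 2D convolution sketch (hypothetical, not TinyTorch's code).
    x: input of shape (N, C_in, H, W); w: filters of shape (C_out, C_in, K, K).
    The seven nested loops make the O(W x H x C_in x C_out x K^2) cost per
    image directly visible, which is the pedagogical point."""
    N, C_in, H, W = x.shape
    C_out, _, K, _ = w.shape
    H_out, W_out = H - K + 1, W - K + 1          # stride 1, no padding
    y = np.zeros((N, C_out, H_out, W_out))
    for n in range(N):                            # 1: batch
        for co in range(C_out):                   # 2: output channels
            for i in range(H_out):                # 3: output rows
                for j in range(W_out):            # 4: output columns
                    for ci in range(C_in):        # 5: input channels
                        for ki in range(K):       # 6: kernel rows
                            for kj in range(K):   # 7: kernel columns
                                y[n, co, i, j] += (
                                    x[n, ci, i + ki, j + kj] * w[co, ci, ki, kj]
                                )
    return y
```

Counting the loop bounds reproduces the complexity formula from the text: the inner multiply-accumulate executes $H_{out} \times W_{out} \times C_{in} \times C_{out} \times K^2$ times per image.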
-\subsection{Pedagogical Spiral: Concepts Revisited Across Tiers} +Similarly, distributed training (data parallelism, model parallelism, gradient synchronization) and production deployment (model serving, compilation, MLOps) introduce substantial additional complexity orthogonal to framework understanding. These topics remain important but lie beyond the current pedagogical scope. Future extensions could address distributed systems through simulation-based pedagogy (\Cref{sec:future-work}), maintaining accessibility while still teaching the underlying concepts. -TinyTorch's 20-module curriculum exhibits spiral design: core concepts introduced in foundation modules reappear with increasing sophistication across architecture and optimization tiers. This recursive reinforcement serves dual purposes---validating earlier understanding while deepening systems intuition through application at scale. +\subsection{Transferable Design Principles for Systems Education} -Memory reasoning spirals across all tiers. Module 01 students calculate tensor footprints (\texttt{tensor.nbytes}), Module 09 extends this to convolutional parameter counts and activation memory (output spatial dimensions $\times$ channels), Module 12 reveals attention's quadratic memory scaling ($O(N^2)$ for sequence length $N$), and Module 15 quantization demonstrates 4$\times$ compression through INT8 versus FP32 representation. Each encounter builds on prior knowledge: students who manually calculated Conv2d memory in Module 09 immediately recognize why quantization matters when profiling transformer memory in Module 15. +While TinyTorch targets ML framework education, five underlying principles generalize to systems courses broadly: -Computational complexity follows similar progression.
Module 03 introduces FLOPs counting for matrix multiplication ($O(N^3)$ for $N\times N$ matrices), Module 09 applies this to convolution (sliding window creates $O(W \times H \times C_{in} \times C_{out} \times K^2)$ complexity for image dimensions $W,H$, channels, and kernel size $K$), Module 12 analyzes attention's $O(N^2 \cdot d)$ complexity for embedding dimension $d$, and Module 18 acceleration targets these measured bottlenecks through algorithmic improvements. Students develop intuition for complexity analysis not through abstract notation but through profiling actual implementations across architectures of increasing sophistication. +\textbf{1. Delayed Abstraction Activation}: Rather than introducing features sequentially (Module 1 = tensors, Module 2 = autograd as a separate system), embed capabilities early but activate them later. \texttt{Tensor.backward()} exists from Module 01 but remains dormant until Module 05, when computational graphs make gradients meaningful. This maintains conceptual unity (students work with \emph{one} \texttt{Tensor} class) while managing cognitive load (early modules avoid gradient tracking overhead). \textbf{Applicability}: Compiler courses could expose semantic analysis infrastructure early (dormant) but activate it after parser completion. Database courses could show transaction mechanisms before the concurrency module. Operating systems could introduce virtual memory concepts before paging implementation. -The integration testing pattern (Build $\rightarrow$ Use $\rightarrow$ Reflect) creates backward connections: later modules illuminate earlier implementations. Students building Module 13 transformers discover why Module 03 layer abstractions matter---without clean interfaces, composing embeddings, attention, and MLPs becomes intractable. This backward reinforcement differs from forward prerequisites: students \emph{could} implement transformers without prior layer experience, but doing so reveals \emph{why} abstraction boundaries exist.
Spiral design makes implicit knowledge explicit through repeated encounter at increasing depth. +\textbf{2. Historical Validation as Correctness Proof}: Using historical benchmarks (1958 Perceptron $\rightarrow$ 2017 Transformer) provides non-synthetic validation that implementations compose correctly. If a student's autograd trains XOR successfully (1986 milestone), backpropagation likely works; if attention generates coherent text (2017 milestone), the transformer implementation succeeded. Historical framing also motivates learning: students ``prove Minsky wrong'' about neural networks, not just ``complete Exercise 3.'' \textbf{Applicability}: Compiler courses recreate C $\rightarrow$ C++ $\rightarrow$ Rust language features historically. Graphics courses progress from wireframe rendering (1960s) through Phong shading (1970s) to physically-based rendering (2000s). Network courses implement TCP variants chronologically (Tahoe, Reno, CUBIC). -\subsection{Limitations and Future Directions} +\textbf{3. Systems-First, Not Systems-After}: Embed profiling and measurement from Module 01 rather than deferring them to ``advanced topics.'' Students calculate \texttt{tensor.nbytes} before matrix multiplication, predict Conv2d memory before implementation, and profile attention complexity before optimization. This prevents ``correct but unusable'' implementations where students build functional systems without understanding resource constraints. \textbf{Applicability}: Database courses measure query execution time and index memory overhead from the first \texttt{SELECT} statement. Operating systems courses profile context switch latency and memory footprint before the threading module. Compiler courses track parse table size and optimization pass runtime from the initial implementation. + +\textbf{4. Unified Implementation Model}: Students maintain \emph{one} codebase that grows (not 20 separate assignments).
Module 13 transformers import Module 11 embeddings, Module 12 attention, and Module 03 layers---integration tests validate cross-module composition. This mirrors professional practice (multi-month projects, not throw-away exercises) and enables compound learning (later modules depend on earlier correctness). \textbf{Applicability}: Distributed systems courses build a consensus protocol (Raft/Paxos), then a key-value store using that protocol. Compiler courses implement frontend (lexer, parser), then midend (optimizations), then backend (code generation) as a unified pipeline. Database courses build a storage layer, then a query processor using that storage. + +\textbf{5. Accessibility Through Simplicity Trades Performance}: CPU-only, pure Python enables global access but sacrifices speed. This trade-off prioritizes pedagogical transparency---students read every line of TinyTorch without encountering C++ template metaprogramming or CUDA intrinsics. \textbf{Applicability}: Educational operating systems (xv6, Pintos) use simplified designs versus production complexity (Linux, Windows). Teaching compilers (MiniJava, Tiger) omit production optimizations (LLVM's 100+ passes). Pedagogical databases (SimpleDB, MiniBase) favor understandability over PostgreSQL's performance sophistication. + +\subsection{Implications for Practice} + +\textbf{For ML educators}: TinyTorch enables three adoption pathways: (1) \textbf{Standalone course}: Dedicate 3--4 credits (60--80 hours) to the complete systems curriculum, targeting juniors and seniors who have completed algorithms and introductory ML. (2) \textbf{Integrated track}: Pair TinyTorch modules with PyTorch usage---Module 05 autograd implementation alongside PyTorch \texttt{loss.backward()} usage, revealing internals through parallel construction. (3) \textbf{Selective modules}: Extract the foundation tier (Modules 01--07) as a half-semester unit, or the architecture tier (Modules 08--13) for advanced students.
The modular structure supports flexible integration based on institutional constraints and student backgrounds. + +\textbf{For curriculum designers}: TinyTorch positions ML systems education between CS fundamentals and specialized ML coursework. Prerequisites include data structures (tensor operations), algorithms (complexity analysis), and introductory ML (gradient descent, loss functions). Post-TinyTorch pathways include advanced ML systems (CMU Deep Learning Systems, distributed training), ML theory (statistical learning, optimization), or production deployment (MLOps, model serving). The 60--80 hour scope fits a 3-credit semester course, an intensive 2-week bootcamp, or self-paced professional development. + +\textbf{For students and learners}: Completing TinyTorch develops three transferable competencies distinguishing ML systems engineers from ML application developers: (1) \textbf{Framework internals knowledge} enabling production debugging (diagnosing gradient flow issues, profiling memory bottlenecks, understanding optimizer state management). (2) \textbf{Systems thinking} for resource-constrained deployment (calculating memory requirements before training, predicting inference latency, reasoning about hardware trade-offs). (3) \textbf{Implementation skills} for rapid prototyping (building custom layers, modifying optimizers, experimenting with novel architectures). Career pathways include ML infrastructure engineering, framework development, compiler optimization for ML accelerators, and edge deployment for TinyML systems. + +\subsection{Limitations} TinyTorch's current implementation contains gaps requiring future work. \textbf{Assessment infrastructure}: NBGrader scaffolding works in development but remains unvalidated for large-scale deployment. Grading validity requires investigation: Do tests measure conceptual understanding or syntax? Future work should validate through item analysis and transfer task correlation.
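The delayed-abstraction principle described above (\texttt{Tensor.backward()} present from Module 01 but dormant until Module 05) can be sketched as follows. This is a hedged illustration under our own assumptions, not TinyTorch's actual \texttt{Tensor} code: the class-level \texttt{autograd\_enabled} flag stands in for Module 05's computational-graph machinery, and a single \texttt{\_\_mul\_\_} operation suffices for the sketch.

```python
import numpy as np

class Tensor:
    """Sketch of 'delayed abstraction activation' (illustrative only, not
    TinyTorch's actual code): backward() exists from day one but stays a
    safe no-op until the autograd flag below is switched on."""
    autograd_enabled = False  # flipped on when the "autograd module" arrives

    def __init__(self, data, _parents=()):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self._parents = _parents
        self._backward = lambda: None  # leaves have nothing to propagate

    def __mul__(self, other):  # one op is enough for the sketch
        out = Tensor(self.data * other.data, _parents=(self, other))
        def _backward():  # product rule for z = x * y
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        if not Tensor.autograd_enabled:
            return  # dormant: early modules may call this safely
        self.grad = np.ones_like(self.data)  # seed d(out)/d(out) = 1
        visited, order = set(), []
        def topo(t):  # topological order over the recorded graph
            if id(t) not in visited:
                visited.add(id(t))
                for p in t._parents:
                    topo(p)
                order.append(t)
        topo(self)
        for t in reversed(order):  # reverse pass
            t._backward()

# Early-module behavior: backward() is a harmless no-op.
x, y = Tensor(3.0), Tensor(4.0)
(x * y).backward()

# "Module 05": activate autograd; the same API now computes gradients.
Tensor.autograd_enabled = True
z = x * y
z.backward()
```

The design choice the sketch highlights is that students keep one class and one API throughout; only the flag (and, in the real curriculum, the graph machinery behind it) changes between modules.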