| title | description | difficulty | time_estimate | prerequisites | next_steps | learning_objectives |
|---|---|---|---|---|---|---|
| Tensor | Build the fundamental N-dimensional array data structure that powers all machine learning | ⭐ | 4-6 hours | | | |
01. Tensor
FOUNDATION TIER | Difficulty: ⭐ (1/4) | Time: 4-6 hours
Overview
The Tensor class is the foundational data structure of machine learning - every neural network, from simple linear models to GPT and Stable Diffusion, operates on tensors. You'll build N-dimensional arrays from scratch with arithmetic operations, broadcasting, and shape manipulation. This module gives you deep insight into how PyTorch and TensorFlow work under the hood, understanding the memory and performance implications that matter in production ML systems.
Learning Objectives
By the end of this module, you will be able to:
- Understand memory and performance implications: Recognize how tensor operations dominate compute time and memory usage in ML systems - a single matrix multiplication can consume 90% of forward pass time in production frameworks like PyTorch
- Implement core tensor functionality: Build a complete Tensor class with arithmetic (+, -, *, /), matrix multiplication, shape manipulation (reshape, transpose), and reductions (sum, mean, max) with proper error handling and validation
- Master broadcasting semantics: Understand NumPy broadcasting rules that enable efficient computations across different tensor shapes without data copying - critical for batch processing and efficient neural network operations
- Connect to production frameworks: See how your implementation mirrors PyTorch's torch.Tensor and TensorFlow's tf.Tensor design patterns, understanding the architectural decisions that power real ML systems
- Analyze performance trade-offs: Understand computational complexity (O(n³) for matrix multiplication), memory usage patterns (contiguous vs. strided), and when to copy data vs. create views for optimization
Build → Use → Reflect
This module follows TinyTorch's Build → Use → Reflect framework:
- Build: Implement the Tensor class from scratch using NumPy as the underlying array library - creating __init__, operator overloading (__add__, __mul__, etc.), shape manipulation methods, and reduction operations
- Use: Apply your Tensor to real problems like matrix multiplication for neural network layers, data normalization with broadcasting, and statistical computations across various shapes and dimensions
- Reflect: Understand systems-level implications - why tensor operations dominate training time, how memory layout (row-major vs. column-major) affects cache performance, and how broadcasting eliminates redundant data copying
What You'll Build
By completing this module, you'll create a fully functional, well-tested Tensor class with:
Core Data Structure:
- N-dimensional array wrapper around NumPy with clean API
- Properties for shape, size, dtype, and data access
- Dormant gradient tracking attributes (activated in Module 05)
Arithmetic Operations:
- Element-wise operations: +, -, *, /, **
- Full broadcasting support for Tensor-Tensor and Tensor-scalar operations
- Automatic shape alignment following NumPy broadcasting rules
Matrix Operations:
- matmul() for matrix multiplication with shape validation
- Support for matrix-matrix, matrix-vector multiplication
- Clear error messages for dimension mismatches
Shape Manipulation:
- reshape() with -1 inference for automatic dimension calculation
- transpose() for dimension swapping
- View vs. copy semantics understanding
Reduction Operations:
- sum(), mean(), max(), min() with axis parameter
- Global reductions (entire tensor) and axis-specific reductions
- keepdims support for maintaining dimensionality
Real-World Usage Pattern:
Your Tensor enables the fundamental neural network forward pass: output = x.matmul(W) + b - the same pattern that PyTorch and TensorFlow layers compute internally.
Core Concepts
Tensors as Multidimensional Arrays
A tensor is a generalization of scalars (0D), vectors (1D), and matrices (2D) to N dimensions:
- Scalar: Tensor(5.0) - shape ()
- Vector: Tensor([1, 2, 3]) - shape (3,)
- Matrix: Tensor([[1, 2], [3, 4]]) - shape (2, 2)
- 3D Tensor: Image batch (batch, height, width) - shape (32, 224, 224)
- 4D Tensor: CNN features (batch, channels, height, width) - shape (32, 3, 224, 224)
Why tensors matter: They provide a unified interface for all ML data - images, text embeddings, audio spectrograms, and model parameters are all tensors with different shapes.
Broadcasting: Efficient Shape Alignment
Broadcasting automatically expands smaller tensors to match larger ones without copying data:
# Matrix (2,2) + Vector (2,) → broadcasts to (2,2)
matrix = Tensor([[1, 2], [3, 4]])
vector = Tensor([10, 20])
result = matrix + vector # [[11, 22], [13, 24]]
Broadcasting rules (NumPy-compatible):
- Align shapes from right to left
- Dimensions are compatible if they're equal or one is 1
- Missing dimensions are treated as size 1
Why broadcasting matters: Eliminates redundant data copying. Adding a bias vector to 1000 feature maps broadcasts once instead of copying the vector 1000 times - saving memory and enabling vectorization.
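The three rules above can be sketched as a small shape calculator. This is an illustrative, pure-Python sketch rather than the module's reference code; the function name broadcast_shape is an assumption made for this example.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    result = []
    # Rule 1: walk both shapes from the rightmost dimension.
    # Rule 3: a missing dimension is treated as size 1 (fillvalue=1).
    for dim_a, dim_b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        # Rule 2: sizes must be equal, or one of them must be 1.
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(
                f"Cannot broadcast {shape_a} with {shape_b}: {dim_a} vs {dim_b}"
            )
    return tuple(reversed(result))


print(broadcast_shape((2, 2), (2,)))            # (2, 2)
print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
```

Only shapes are computed here. No data is expanded, which mirrors why broadcasting itself needs no copies.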
Views vs. Copies: Memory Efficiency
Some operations return views (sharing memory), while others return copies (duplicating data):
- Views (O(1)): reshape(), transpose() when possible - no data movement
- Copies (O(n)): Arithmetic operations, explicit .copy() - duplicate storage
Why this matters: A view of a 1GB tensor is free (just metadata). A copy allocates another 1GB. Understanding view semantics prevents memory blowup in production systems.
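A quick way to see the difference is to ask NumPy whether two arrays share storage. The sketch below uses plain NumPy arrays, not the Tensor class. Whether your Tensor wrapper preserves view semantics depends on how you implement reshape.

```python
import numpy as np

base = np.arange(6, dtype=np.float32)   # contiguous storage for 6 elements

view = base.reshape(2, 3)               # contiguous input: NumPy returns a view
copy = base.reshape(2, 3).copy()        # explicit copy: separate storage
summed = base + 1                       # arithmetic allocates a new array

print(np.shares_memory(base, view))     # True
print(np.shares_memory(base, copy))     # False
print(np.shares_memory(base, summed))   # False

view[0, 0] = 99.0                       # writing through the view ...
print(base[0])                          # 99.0 ... changes the original
print(copy[0, 0])                       # 0.0, the copy is unaffected
```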
Computational Complexity
Different operations have vastly different costs:
- Element-wise (+, -, *): O(n) - linear in tensor size
- Reductions (sum, mean): O(n) - must visit every element
- Matrix multiply (matmul): O(n³) for square matrices - dominates training time
Why this matters: In a neural network forward pass, matrix multiplications consume 90%+ of compute time. Optimizing matmul is critical - hence specialized hardware (GPUs, TPUs) and libraries (cuBLAS, MKL).
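To make the cubic cost concrete, here is a naive triple-loop matrix multiply on nested Python lists that also counts its multiply-add steps. It is a teaching sketch only (naive_matmul is a hypothetical helper); NumPy delegates the same arithmetic to optimized BLAS routines.

```python
def naive_matmul(a, b):
    """Multiply nested-list matrices a (M x K) and b (K x N); also count multiply-adds."""
    m, k, n = len(a), len(b), len(b[0])
    if len(a[0]) != k:
        raise ValueError(f"Inner dimensions differ: {len(a[0])} vs {k}")
    out = [[0] * n for _ in range(m)]
    multiply_adds = 0
    for i in range(m):            # one pass per output row
        for j in range(n):        # one pass per output column
            for p in range(k):    # dot product over the shared dimension
                out[i][j] += a[i][p] * b[p][j]
                multiply_adds += 1
    return out, multiply_adds


product, ops = naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)  # [[19, 22], [43, 50]]
print(ops)      # 8 multiply-adds = 2 * 2 * 2
```

For square n×n inputs the innermost statement runs n³ times, which is where the O(n³) figure comes from.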
Architecture Overview
Tensor Class Design
┌─────────────────────────────────────────┐
│ Tensor Class │
├─────────────────────────────────────────┤
│ Properties: │
│ - data: np.ndarray (underlying storage)│
│ - shape: tuple (dimensions) │
│ - size: int (total elements) │
│ - dtype: np.dtype (data type) │
│ - requires_grad: bool (autograd flag) │
│ - grad: Tensor (gradient - Module 05) │
├─────────────────────────────────────────┤
│ Operator Overloading: │
│ - __add__, __sub__, __mul__, __truediv__│
│ - __pow__ (exponentiation) │
│ - Returns new Tensor instances │
├─────────────────────────────────────────┤
│ Methods: │
│ - matmul(other): Matrix multiplication │
│ - reshape(*shape): Shape manipulation │
│ - transpose(): Dimension swap │
│ - sum/mean/max/min(axis): Reductions │
└─────────────────────────────────────────┘
Data Flow Architecture
Python Interface (your code)
↓
Tensor Class
↓
NumPy Backend (vectorized operations)
↓
C/Fortran Libraries (BLAS, LAPACK)
↓
Hardware (CPU SIMD, cache)
Your implementation: Python wrapper → NumPy
PyTorch/TensorFlow: Python wrapper → C++ engine → GPU kernels
The architecture is identical in concept - you're learning the same design patterns used in production, just with NumPy instead of custom CUDA kernels.
Module Integration
Module 01: Tensor (THIS MODULE)
↓ provides foundation
Module 02: Activations (ReLU, Sigmoid operate on Tensors)
↓ uses tensors
Module 03: Layers (Linear, Conv2d store weights as Tensors)
↓ uses tensors
Module 05: Autograd (adds .grad attribute to Tensors)
↓ enhances tensors
Module 06: Optimizers (updates Tensor parameters)
Your Tensor is the universal foundation - every subsequent module builds on what you create here.
Prerequisites
This is the first module - no prerequisites! Verify your environment is ready:
# Activate TinyTorch environment
source bin/activate-tinytorch.sh
# Check system health
tito system doctor
All checks should pass (Python 3.8+, NumPy, pytest installed) before starting.
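If you want to confirm those three requirements from Python as well, a minimal check might look like the sketch below. It is not what tito system doctor runs internally (that is not shown here); it only tests the interpreter version and whether the two packages can be found.

```python
import importlib.util
import sys

# Each entry maps a requirement name to True (satisfied) or False (missing).
checks = {
    "python>=3.8": sys.version_info >= (3, 8),
    "numpy": importlib.util.find_spec("numpy") is not None,
    "pytest": importlib.util.find_spec("pytest") is not None,
}

for name, ok in checks.items():
    print(f"{'OK' if ok else 'MISSING':8} {name}")
```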
Getting Started
Development Workflow
- Open the development notebook: modules/01_tensor/tensor_dev.ipynb in Jupyter or your preferred editor
- Implement Tensor.__init__: Create a constructor that converts data to a NumPy array, stores shape/size/dtype, and initializes gradient attributes
- Build arithmetic operations: Implement __add__, __sub__, __mul__, __truediv__ with broadcasting support for both Tensor-Tensor and Tensor-scalar operations
- Add matrix multiplication: Implement matmul() with shape validation and clear error messages for dimension mismatches
- Create shape manipulation: Implement reshape() (with -1 support) and transpose() for dimension swapping
- Implement reductions: Build sum(), mean(), max() with axis parameter and keepdims support
- Export and verify: Run tito export 01 to export to the package, then tito test 01 to validate that all tests pass
Implementation Guide
Tensor Class Foundation
Your Tensor class wraps NumPy arrays and provides ML-specific functionality:
from tinytorch.core.tensor import Tensor
# Create tensors from Python lists or NumPy arrays
x = Tensor([[1.0, 2.0], [3.0, 4.0]])
y = Tensor([[0.5, 1.5], [2.5, 3.5]])
# Properties provide clean API access
print(x.shape) # (2, 2)
print(x.size) # 4
print(x.dtype) # float32
Implementation details: You'll implement __init__ to convert input data to NumPy arrays, store shape/size/dtype as properties, and initialize dormant gradient attributes (requires_grad, grad) that activate in Module 05.
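One possible shape for that constructor is sketched below. The class name MiniTensor and the choice to always cast to float32 are assumptions made for illustration; the notebook's Tensor may differ in both.

```python
import numpy as np


class MiniTensor:
    """Hypothetical minimal tensor; not the TinyTorch reference implementation."""

    def __init__(self, data, requires_grad=False):
        # Assumption: everything is stored as float32, matching the example above.
        self.data = np.array(data, dtype=np.float32)
        # Dormant autograd attributes: stored now, unused until a later module.
        self.requires_grad = requires_grad
        self.grad = None

    @property
    def shape(self):
        return self.data.shape

    @property
    def size(self):
        return self.data.size

    @property
    def dtype(self):
        return self.data.dtype

    def __repr__(self):
        return f"MiniTensor(shape={self.shape}, dtype={self.dtype})"


t = MiniTensor([[1.0, 2.0], [3.0, 4.0]])
print(t.shape)  # (2, 2)
print(t.size)   # 4
print(t.dtype)  # float32
```

Exposing shape, size and dtype as read-only properties keeps them in sync with the underlying array instead of storing values that could go stale.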
Arithmetic Operations
Implement operator overloading for element-wise operations with broadcasting:
# Element-wise operations via operator overloading
z = x + y # Addition: [[1.5, 3.5], [5.5, 7.5]]
w = x * y # Element-wise multiplication
p = x ** 2 # Exponentiation
s = x - y # Subtraction
d = x / y # Division
# Broadcasting: scalar operations automatically expand
scaled = x * 2 # [[2.0, 4.0], [6.0, 8.0]]
shifted = x + 10 # [[11.0, 12.0], [13.0, 14.0]]
# Broadcasting: vector + matrix
matrix = Tensor([[1, 2], [3, 4]])
vector = Tensor([10, 20])
result = matrix + vector # [[11, 22], [13, 24]]
Systems insight: These operations vectorize automatically via NumPy, achieving ~100x speedup over Python loops. This is why all ML frameworks use tensors - the performance difference between for i in range(n): result[i] = a[i] + b[i] and result = a + b is dramatic at scale.
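A hedged sketch of how operator overloading can delegate to NumPy follows. MiniTensor and its _unwrap helper are hypothetical names for this example; only addition and multiplication are shown, and the other operators follow the same pattern.

```python
import numpy as np


class MiniTensor:
    """Hypothetical sketch of operator overloading on top of NumPy."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)

    @staticmethod
    def _unwrap(other):
        # Accept either another MiniTensor or a plain scalar/list.
        return other.data if isinstance(other, MiniTensor) else other

    def __add__(self, other):
        # NumPy applies the broadcasting rules; we only wrap the result.
        return MiniTensor(self.data + self._unwrap(other))

    def __mul__(self, other):
        return MiniTensor(self.data * self._unwrap(other))

    # Reflected versions make `2 * x` and `10 + x` work as well.
    __radd__ = __add__
    __rmul__ = __mul__


x = MiniTensor([[1.0, 2.0], [3.0, 4.0]])
print((x + 10).data.tolist())                        # [[11.0, 12.0], [13.0, 14.0]]
print((2 * x).data.tolist())                         # [[2.0, 4.0], [6.0, 8.0]]
print((x + MiniTensor([10.0, 20.0])).data.tolist())  # [[11.0, 22.0], [13.0, 24.0]]
```

Each operator returns a new object rather than modifying its operands, which matches the "Returns new Tensor instances" design noted earlier.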
Matrix Multiplication
Matrix multiplication is the heart of neural networks - every layer performs it:
# Matrix multiplication via matmul() (what the @ operator computes in PyTorch)
a = Tensor([[1, 2], [3, 4]]) # 2×2
b = Tensor([[5, 6], [7, 8]]) # 2×2
c = a.matmul(b) # 2×2 result: [[19, 22], [43, 50]]
# Neural network forward pass pattern: y = xW + b
x = Tensor([[1, 2, 3], [4, 5, 6]]) # Input: (batch=2, features=3)
W = Tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]) # Weights: (3, 2)
b = Tensor([0.1, 0.2]) # Bias: (2,)
output = x.matmul(W) + b # (2, 2)
Computational complexity: For matrices (M,K) @ (K,N), the cost is O(M×K×N) floating-point operations. A 1000×1000 matrix multiplication requires 2 billion FLOPs - this dominates training time in production systems.
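The FLOP figure can be reproduced with a one-line estimate: each of the M×N outputs needs K multiplies and roughly K adds. The helper name matmul_flops is an assumption for this sketch, and the count ignores everything except the arithmetic itself.

```python
def matmul_flops(m, k, n):
    """Approximate FLOPs for (m, k) @ (k, n): one multiply and one add per inner step."""
    return 2 * m * k * n


print(matmul_flops(2, 3, 2))           # 24, the (2, 3) @ (3, 2) layer above
print(matmul_flops(1000, 1000, 1000))  # 2000000000

# Doubling every dimension multiplies the work by 8.
print(matmul_flops(2000, 2000, 2000) // matmul_flops(1000, 1000, 1000))  # 8
```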
Shape Manipulation
Neural networks constantly reshape tensors to match layer requirements:
import numpy as np
# Reshape: change interpretation of same data (O(1) when a view is possible)
tensor = Tensor([1, 2, 3, 4, 5, 6])
reshaped = tensor.reshape(2, 3) # [[1, 2, 3], [4, 5, 6]]
flat = reshaped.reshape(-1) # [1, 2, 3, 4, 5, 6]
# Transpose: swap dimensions (NumPy does this by swapping strides, not moving data)
matrix = Tensor([[1, 2, 3], [4, 5, 6]]) # (2, 3)
transposed = matrix.transpose() # (3, 2): [[1, 4], [2, 5], [3, 6]]
# CNN data flow example
images = Tensor(np.random.rand(32, 3, 224, 224)) # (batch, channels, H, W)
features = images.reshape(32, -1) # (batch, 3*224*224) - flatten for MLP
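Memory consideration: With NumPy as the backend, reshape returns a view (an O(1) operation with no data copying) when the memory layout allows it and falls back to a copy otherwise, for example when reshaping the result of a transpose. transpose itself returns a view with swapped strides, so it moves no data, but the result is no longer contiguous. Understanding views vs. copies is crucial: views share memory (efficient), copies duplicate data (expensive for large tensors).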
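The -1 inference used in the examples above is plain arithmetic: the unknown dimension is the total element count divided by the product of the known ones. The sketch below shows that logic in isolation. The name infer_shape is hypothetical, and NumPy performs this check for you when you call its reshape.

```python
from math import prod


def infer_shape(total_size, shape):
    """Resolve a single -1 entry in `shape` so the element count matches `total_size`."""
    if shape.count(-1) > 1:
        raise ValueError("Only one dimension can be -1")
    if -1 not in shape:
        if prod(shape) != total_size:
            raise ValueError(f"Cannot reshape {total_size} elements into {shape}")
        return tuple(shape)
    known = prod(d for d in shape if d != -1)
    if known == 0 or total_size % known != 0:
        raise ValueError(f"Cannot reshape {total_size} elements into {shape}")
    # Replace the -1 placeholder with the only size that keeps the count unchanged.
    return tuple(total_size // known if d == -1 else d for d in shape)


print(infer_shape(6, (-1,)))                      # (6,)
print(infer_shape(6, (2, -1)))                    # (2, 3)
print(infer_shape(32 * 3 * 224 * 224, (32, -1)))  # (32, 150528)
```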
Reduction Operations
Aggregation operations collapse dimensions for statistics and loss computation:
# Reduce along different axes
total = x.sum() # Scalar: sum all elements
col_sums = x.sum(axis=0) # Sum columns: [4, 6]
row_sums = x.sum(axis=1) # Sum rows: [3, 7]
# Statistical reductions
means = x.mean(axis=0) # Column-wise mean
minimums = x.min(axis=1) # Row-wise minimum
maximums = x.max() # Global maximum
# Batch loss averaging (common pattern)
losses = Tensor([0.5, 0.3, 0.8, 0.2]) # Per-sample losses
avg_loss = losses.mean() # 0.45 - batch average
Production pattern: Every loss function uses reductions. Cross-entropy loss computes per-sample losses then averages: loss = -log(predictions[correct_class]).mean(). Understanding axis semantics prevents bugs in multi-dimensional operations.
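The cross-entropy pattern can be worked through by hand on a tiny batch. The probabilities and labels below are made-up toy values, and the sketch uses plain Python lists so that the per-sample step and the reduction step stay visible.

```python
import math

# Assumed toy data: predicted class probabilities for 3 samples, 2 classes.
predictions = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
correct_class = [0, 1, 0]

# Per-sample loss: negative log of the probability given to the correct class.
per_sample = [-math.log(row[label]) for row, label in zip(predictions, correct_class)]

# Reduction: average over the batch dimension to get one scalar loss.
loss = sum(per_sample) / len(per_sample)
print(round(loss, 4))  # 0.2798
```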
Testing
Comprehensive Test Suite
Run the full test suite to verify tensor functionality:
# TinyTorch CLI (recommended - runs all 01_tensor tests)
tito test 01
# Direct pytest execution (more verbose output)
python -m pytest tests/01_tensor/ -v
# Run specific test class
python -m pytest tests/01_tensor/test_tensor_core.py::TestTensorCreation -v
Expected output: All tests pass with green checkmarks showing your Tensor implementation works correctly.
Test Coverage Areas
Your implementation is validated across these dimensions:
- Initialization (test_tensor_from_list, test_tensor_from_numpy, test_tensor_shapes): Creating tensors from Python lists, NumPy arrays, and nested structures with correct shape/dtype handling
- Arithmetic Operations (test_tensor_addition, test_tensor_multiplication): Element-wise addition, subtraction, multiplication, division with both Tensor-Tensor and Tensor-scalar combinations
- Broadcasting (test_scalar_broadcasting, test_vector_broadcasting): Automatic shape alignment for different tensor shapes, scalar expansion, matrix-vector broadcasting
- Matrix Multiplication (test_matrix_multiplication): Matrix-matrix, matrix-vector multiplication with shape validation and error handling for incompatible dimensions
- Shape Manipulation (test_tensor_reshape, test_tensor_transpose, test_tensor_flatten): Reshape with -1 inference, transpose with dimension swapping, validation for incompatible sizes
- Reductions (test_sum, test_mean, test_max): Aggregation along various axes (None, 0, 1, multiple), keepdims behavior, global vs. axis-specific reduction
- Memory Management (test_tensor_data_access, test_tensor_copy_semantics, test_tensor_memory_efficiency): Data access patterns, copy vs. view semantics, memory usage validation
Inline Testing & Validation
The development notebook includes comprehensive inline tests with immediate feedback:
# Example inline test output
🧪 Unit Test: Tensor Creation...
✅ Tensor created from list
✅ Shape property correct: (2, 2)
✅ Size property correct: 4
✅ dtype is float32
📈 Progress: Tensor initialization ✓
🧪 Unit Test: Arithmetic Operations...
✅ Addition: [[6, 8], [10, 12]]
✅ Multiplication works element-wise
✅ Broadcasting: scalar + tensor
✅ Broadcasting: matrix + vector
📈 Progress: Arithmetic operations ✓
🧪 Unit Test: Matrix Multiplication...
✅ 2×2 @ 2×2 = [[19, 22], [43, 50]]
✅ Shape validation catches 2×2 @ 3×1 error
✅ Error message shows: "2 ≠ 3"
📈 Progress: Matrix operations ✓
Manual Testing Examples
Validate your implementation interactively:
from tinytorch.core.tensor import Tensor
import numpy as np
# Test basic operations
x = Tensor([[1, 2], [3, 4]])
y = Tensor([[5, 6], [7, 8]])
assert x.shape == (2, 2)
assert (x + y).data.tolist() == [[6, 8], [10, 12]]
assert x.sum().data == 10
print("✓ Basic operations working")
# Test broadcasting
small = Tensor([1, 2])
result = x + small
assert result.data.tolist() == [[2, 4], [4, 6]]
print("✓ Broadcasting functional")
# Test reductions
col_means = x.mean(axis=0)
assert np.allclose(col_means.data, [2.0, 3.0])
print("✓ Reductions working")
# Test neural network pattern: y = xW + b
batch = Tensor([[1, 2, 3], [4, 5, 6]]) # (2, 3)
weights = Tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]) # (3, 2)
bias = Tensor([0.1, 0.2])
output = batch.matmul(weights) + bias
assert output.shape == (2, 2)
print("✓ Neural network forward pass pattern works!")
Production Context
Your Implementation vs. Production Frameworks
Understanding what you're building vs. what production frameworks provide:
| Feature | Your Tensor (Module 01) | PyTorch torch.Tensor | TensorFlow tf.Tensor |
|---|---|---|---|
| Backend | NumPy (CPU-only) | C++/CUDA (CPU/GPU/TPU) | C++/CUDA/XLA |
| Dtype Support | float32 (primary) | float16/bfloat16/32/64, int8/16/32/64, bool, complex | Comparable set, including bfloat16 |
| Operations | Arithmetic, matmul, reshape, transpose, reductions | 1000+ operations | 1000+ operations |
| Broadcasting | ✅ Full NumPy rules | ✅ Same rules | ✅ Same rules |
| Autograd | Dormant (activates Module 05) | ✅ Full computation graph | ✅ GradientTape |
| GPU Support | ❌ CPU-only | ✅ CUDA, Metal, ROCm | ✅ CUDA, TPU |
| Memory Pooling | ❌ Python GC | ✅ Caching allocator | ✅ Memory pools |
| JIT Compilation | ❌ Interpreted | ✅ TorchScript, torch.compile | ✅ XLA, TF Graph |
| Distributed | ❌ Single process | ✅ DDP, FSDP | ✅ tf.distribute |
Educational focus: Your implementation prioritizes clarity and understanding over performance. The core concepts (broadcasting, shape manipulation, reductions) are identical - you're learning the same patterns used in production, just with simpler infrastructure.
Line count: Your implementation is ~1927 lines in the notebook (including tests and documentation). PyTorch's tensor implementation spans 50,000+ lines across multiple C++ files - your simplified version captures the essential concepts.
Side-by-Side Code Comparison
Your implementation:
from tinytorch.core.tensor import Tensor
# Create tensors
x = Tensor([[1, 2], [3, 4]])
w = Tensor([[0.5, 0.6], [0.7, 0.8]])
# Forward pass
output = x.matmul(w) # (2,2) @ (2,2) → (2,2)
loss = output.mean() # Scalar loss
Equivalent PyTorch (production):
import torch
# Create tensors on the GPU (requires a CUDA-capable device)
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32, device="cuda")
w = torch.tensor([[0.5, 0.6], [0.7, 0.8]], dtype=torch.float32, device="cuda", requires_grad=True)
# Forward pass (gradients are tracked because w has requires_grad=True)
output = x @ w # Uses cuBLAS for GPU acceleration
loss = output.mean() # Builds computation graph for backprop
loss.backward() # Automatic differentiation: fills w.grad
Key differences:
- GPU Support: PyTorch tensors can move to GPU (.cuda()) for 10-100x speedup via parallel processing
- Autograd: PyTorch automatically tracks operations and computes gradients - you'll build this in Module 05
- Memory Pooling: PyTorch reuses GPU memory via caching allocator - avoids expensive malloc/free calls
- Optimized Kernels: PyTorch uses cuBLAS/cuDNN (GPU) and Intel MKL (CPU) - hand-tuned assembly for max performance
Real-World Production Usage
Meta (Facebook AI): PyTorch was developed at Meta and powers their recommendation systems, computer vision models, and LLaMA language models. Their production infrastructure processes billions of tensor operations per second.
Tesla: Uses PyTorch tensors for Autopilot neural networks. Each camera frame (6-9 cameras) is converted to tensors, processed through vision models (millions of parameters stored as tensors), and outputs driving decisions in real-time at 36 FPS.
OpenAI: GPT-4 training involved tensors with billions of parameters distributed across thousands of GPUs. Each training step performs matrix multiplications on tensors larger than single GPU memory.
Google: TensorFlow powers Google Search, Translate, Photos, and Assistant. Google's TPUs (Tensor Processing Units) are custom hardware designed specifically for accelerating tensor operations.
Performance Characteristics at Scale
Memory usage: GPT-3 scale models (175B parameters) require ~350GB memory just for weights stored as float16 tensors (175B × 2 bytes), and twice that in float32. Mixed precision training (float16 compute alongside float32 master weights) can roughly halve activation memory while typically maintaining accuracy.
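The weight-memory arithmetic is easy to reproduce. The sketch below treats all parameters as one flat dense tensor and ignores optimizer state, activations and framework overhead; tensor_bytes is a hypothetical helper name.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element):
    """Storage needed for a dense tensor, ignoring framework overhead."""
    return prod(shape) * bytes_per_element


# 175 billion parameters, treated here as one flat tensor for simplicity.
fp16 = tensor_bytes((175_000_000_000,), 2)
fp32 = tensor_bytes((175_000_000_000,), 4)
print(fp16 / 1e9)  # 350.0 (decimal GB)
print(fp32 / 1e9)  # 700.0
```

The same helper works for any shape, so it is a handy sanity check before allocating a large batch.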
Computational bottlenecks: In production training, tensor operations consume 95%+ of runtime. A single linear layer's matrix multiplication might take 100ms of a 110ms forward pass - optimizing tensor operations is critical.
Cache efficiency: Modern CPUs have ~32KB L1 cache, ~256KB L2, ~8MB L3. Accessing memory in the order it is laid out (for example, walking along the rows of a contiguous, row-major array) can be 10-100x faster than cache-unfriendly patterns (large strides, such as walking down the columns of that same array).
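Strides make the layout argument concrete: they say how many bytes to skip to move one step along each axis. The sketch below computes row-major strides from a shape. It is an illustration of the idea (row_major_strides is a hypothetical helper), and NumPy exposes the real values through an array's strides attribute.

```python
def row_major_strides(shape, itemsize):
    """Byte strides for a contiguous row-major (C-order) array."""
    strides = []
    step = itemsize
    for dim in reversed(shape):   # the last axis is the fastest-changing one
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


print(row_major_strides((2, 3), 4))              # (12, 4)
print(row_major_strides((32, 3, 224, 224), 4))   # (602112, 200704, 896, 4)
```

Stepping along the last axis moves 4 bytes at a time, so neighbouring elements share a cache line. Stepping along the first axis of the image batch jumps about 600 KB, which is larger than typical L1 and L2 caches.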
Package Integration
After export, your Tensor implementation becomes the foundation of TinyTorch:
Package Export: Code exports to tinytorch.core.tensor
# When students install tinytorch, they import YOUR work:
from tinytorch.core.tensor import Tensor # Your implementation!
# Future modules build on YOUR tensor:
from tinytorch.core.activations import ReLU # Module 02 - operates on your Tensors
from tinytorch.core.layers import Linear # Module 03 - uses your Tensor for weights
from tinytorch.core.autograd import backward # Module 05 - adds gradients to your Tensor
from tinytorch.core.optimizers import SGD # Module 06 - updates your Tensor parameters
Package structure:
tinytorch/
├── core/
│ ├── tensor.py ← YOUR implementation exports here
│ ├── activations.py ← Module 02 builds on your Tensor
│ ├── layers.py ← Module 03 builds on your Tensor
│ ├── losses.py ← Module 04 builds on your Tensor
│ ├── autograd.py ← Module 05 adds gradients to your Tensor
│ ├── optimizers.py ← Module 06 updates your Tensor weights
│ └── ...
Your Tensor class is the universal foundation - every subsequent module depends on what you build here.
How Your Implementation Maps to PyTorch
What you just built:
# Your TinyTorch Tensor implementation
from tinytorch.core.tensor import Tensor
# Create a tensor
x = Tensor([[1, 2], [3, 4]])
other = Tensor([[5, 6], [7, 8]])
# Core operations you implemented
y = x + 2 # Broadcasting
z = x.matmul(other) # Matrix multiplication
mean = x.mean(axis=0) # Reductions
reshaped = x.reshape(-1) # Shape manipulation
How PyTorch does it:
# PyTorch equivalent
import torch
# Create a tensor
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
other = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)
# Same operations, identical semantics
y = x + 2 # Broadcasting (same rules)
z = x @ other # Matrix multiplication (@ operator)
mean = x.mean(dim=0) # Reductions (dim instead of axis)
reshaped = x.reshape(-1) # Shape manipulation (same API)
Key Insight: Your implementation uses the same mathematical operations and design patterns that PyTorch uses internally. The @ operator is syntactic sugar for matrix multiplication—the actual computation is identical. Broadcasting rules, shape semantics, and reduction operations all follow the same NumPy conventions.
What's the SAME?
- Tensor abstraction and API design
- Broadcasting rules and memory layout principles
- Shape manipulation semantics (reshape, transpose)
- Reduction operation behavior (sum, mean, max)
- Conceptual architecture: data + operations + metadata
What's different in production PyTorch?
- Backend: C++/CUDA for 10-100× speed vs. NumPy
- GPU support: .cuda() moves tensors to GPU for parallel processing
- Autograd integration: requires_grad=True enables automatic differentiation (you'll build this in Module 05)
- Memory optimization: Caching allocator reuses GPU memory, avoiding expensive malloc/free
Why this matters: When you debug PyTorch code, you'll understand what's happening under tensor operations because you implemented them yourself. Shape mismatch errors, broadcasting bugs, memory issues—you know exactly how they work internally, not just how to call the API.
Production usage example:
# PyTorch production code (after TinyTorch)
import torch.nn as nn
class MLPLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features) # Uses torch.Tensor internally

    def forward(self, x):
        return self.linear(x) # Matrix multiply + bias (same as your Tensor.matmul)
After building your own Tensor class, you understand that nn.Linear(in_features, out_features) is essentially creating weight and bias tensors, then performing x @ weights + bias with the same broadcasting and matmul operations you implemented, just optimized in C++/CUDA.
Common Pitfalls
Shape Mismatch Errors
Problem: Matrix multiplication fails with cryptic errors like "shapes (2,3) and (2,2) not aligned"
Solution: Always verify inner dimensions match: (M,K) @ (K,N) requires K to be equal. Add shape validation with clear error messages:
if a.shape[1] != b.shape[0]:
    raise ValueError(f"Cannot multiply ({a.shape[0]},{a.shape[1]}) @ ({b.shape[0]},{b.shape[1]}): {a.shape[1]} ≠ {b.shape[0]}")
Broadcasting Confusion
Problem: Expected (2,3) + (2,) to broadcast but got an error
Solution: Broadcasting aligns shapes from the right. (2,3) + (3,) works (broadcasts to (2,3)), but (2,3) + (2,) fails because the trailing dimensions 3 and 2 differ. Add a dimension with reshape if needed: tensor.reshape(2,1) gives a (2,1) tensor that broadcasts with (2,3).
View vs Copy Confusion
Problem: Modified a reshaped tensor and original changed unexpectedly
Solution: reshape() returns a view when possible - they share memory. Changes to the view affect the original. Use .copy() if you need independent data:
view = tensor.reshape(2, 3) # Shares memory
copy = tensor.reshape(2, 3).copy() # Independent storage
Axis Parameter Mistakes
Problem: sum(axis=1) on (batch, features) returned wrong shape
Solution: Axis semantics: axis=0 reduces over first dimension (batch), axis=1 reduces over second (features). For (32, 128) tensor, sum(axis=0) gives (128,), sum(axis=1) gives (32,). Visualize which dimension you're collapsing.
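A small shape calculator can help when visualizing which dimension collapses. The helper below is a hypothetical sketch that handles a single integer axis or None; it mirrors the NumPy convention for sum, mean, max and min.

```python
def reduced_shape(shape, axis=None, keepdims=False):
    """Shape produced by reducing `shape` over `axis` (None means all axes)."""
    # Negative axes count from the end, as in NumPy.
    axes = range(len(shape)) if axis is None else [axis % len(shape)]
    if keepdims:
        return tuple(1 if i in axes else d for i, d in enumerate(shape))
    return tuple(d for i, d in enumerate(shape) if i not in axes)


print(reduced_shape((32, 128), axis=0))                 # (128,)
print(reduced_shape((32, 128), axis=1))                 # (32,)
print(reduced_shape((32, 128), axis=1, keepdims=True))  # (32, 1)
print(reduced_shape((32, 128)))                         # ()
```

With keepdims=True the reduced axis stays as size 1, so the result still broadcasts against the original tensor, which is useful for normalizing each row by its own sum.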
Dtype Issues
Problem: Lost precision after operations, or got truncated integer results where floats were expected
Solution: NumPy keeps the dtype you give it. On integer arrays, / is true division and promotes the result to float64 (5 / 2 = 2.5), while // floors (5 // 2 = 2), and converting floats to an integer dtype truncates them. Always create tensors with float dtype explicitly: Tensor([[1, 2]], dtype=np.float32) or convert: tensor.astype(np.float32).
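A short NumPy check shows how the two division operators and an integer cast treat the same values. It uses raw NumPy arrays rather than the Tensor class, so it only demonstrates the backend behavior your wrapper will inherit.

```python
import numpy as np

ints = np.array([5, 7])

true_div = ints / 2     # true division: result is promoted to float64
floor_div = ints // 2   # floor division: result stays an integer array

print(true_div, true_div.dtype)  # [2.5 3.5] float64
print(floor_div)                 # [2 3]

# Casting floats to an integer dtype drops the fractional part.
truncated = np.array([2.9, -2.9]).astype(np.int64)
print(truncated)                 # [ 2 -2]
```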
Memory Leaks with Large Tensors
Problem: Memory usage grows unbounded during training loop
Solution: Clear intermediate results in loops. Don't accumulate tensors in lists unnecessarily. Use in-place operations when safe. Example:
# Bad: accumulates memory
losses = []
for batch in data:
    loss = model(batch)
    losses.append(loss) # Keeps all tensors in memory
# Good: extract values
losses = []
for batch in data:
    loss = model(batch)
    losses.append(loss.data.item()) # Store scalar, release tensor
Systems Thinking Questions
Real-World Applications
- Deep Learning Training: All neural network layers operate on tensors - Linear layers perform matrix multiplication, Conv2d applies tensor convolutions, Attention mechanisms compute tensor dot products. How would doubling model size affect memory and compute requirements?
- Computer Vision: Images are 3D tensors (height × width × channels), and every transformation (resize, crop, normalize) is a tensor operation. What's the memory footprint of a batch of 32 images at 224×224 resolution with 3 color channels in float32?
- Natural Language Processing: Text embeddings are 2D tensors (sequence_length × embedding_dim), and Transformer models manipulate these through attention. For BERT with 512 sequence length and 768 hidden dimension, how many elements per sample?
- Scientific Computing: Tensors represent multidimensional data in climate models, molecular simulations, physics engines. What makes tensors more efficient than nested Python lists for these applications?
Mathematical Foundations
- Linear Algebra: Tensors generalize matrices to arbitrary dimensions. How does broadcasting relate to outer products? When is (M,K) @ (K,N) more efficient than (K,M).T @ (K,N)?
- Numerical Stability: Operations like softmax require careful implementation to avoid overflow/underflow. Why does exp(x - max(x)) prevent overflow in softmax computation?
- Broadcasting Semantics: NumPy's broadcasting rules enable elegant code but require understanding shape compatibility. Can you predict the output shape of (32, 1, 10) + (1, 5, 10)?
- Computational Complexity: Matrix multiplication is O(n³) while element-wise operations are O(n). For large models, which dominates training time and why?
Performance Characteristics
- Memory Contiguity: Contiguous memory enables SIMD vectorization and cache efficiency. How much can non-contiguous tensors slow down operations (10x? 100x?)?
- View vs Copy: Views are O(1) with shared memory, copies are O(n) with duplicated storage. When might a view cause unexpected behavior (e.g., in-place operations)?
- Operation Fusion: Frameworks optimize (a + b) * c by fusing operations to reduce memory reads. How many memory passes does unfused require vs. fused?
- Batch Processing: Processing 32 images at once is much faster than 32 sequential passes. Why? (Hint: GPU parallelism, cache reuse, reduced Python overhead)
What's Next
After mastering tensors, you're ready to build the computational layers of neural networks:
Module 02: Activations - Implement ReLU, Sigmoid, Tanh, and Softmax activation functions that introduce non-linearity. You'll operate on your Tensor class and understand why activation functions are essential for learning complex patterns.
Module 03: Layers - Build Linear (fully-connected) and convolutional layers using tensor operations. See how weight matrices and bias vectors (stored as Tensors) transform inputs through matrix multiplication and broadcasting.
Module 05: Autograd - Add automatic differentiation to your Tensor class, enabling gradient computation for training. Your tensors will track operations and compute gradients automatically - the magic behind loss.backward().
Preview of tensor usage ahead:
- Activations: output = ReLU()(input_tensor) - element-wise operations on tensors
- Layers: output = Linear(in_features=128, out_features=64)(input_tensor) - matmul with weight tensors
- Loss: loss = MSELoss()(predictions, targets) - tensor reductions for error measurement
- Training: optimizer.step() updates parameter tensors using gradients
Every module builds on your Tensor foundation - understanding tensors deeply means understanding how neural networks actually compute.
Ready to Build?
You're about to implement the foundation of all machine learning systems! The Tensor class you'll build is the universal data structure that powers everything from simple neural networks to GPT, Stable Diffusion, and AlphaFold.
This is where mathematical abstraction meets practical implementation. You'll see how N-dimensional arrays enable elegant representations of complex data, how operator overloading makes tensor math feel natural like z = x + y, and how careful memory management (views vs. copies) enables working with massive models. Every decision you make - from how to handle broadcasting to when to validate shapes - reflects trade-offs that production ML engineers face daily.
Take your time with this module. Understand each operation deeply. Test your implementations thoroughly. The Tensor foundation you build here will support every subsequent module - if you understand tensors from first principles, you'll understand how neural networks actually work, not just how to use them.
Every neural network you've ever used - ResNet, BERT, GPT, Stable Diffusion - is fundamentally built on tensor operations. Understanding tensors means understanding the computational substrate of modern AI.
Choose your preferred way to engage with this module:
```{grid-item-card} 🚀 Launch Binder
:link: https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/01_tensor/tensor_dev.ipynb
:class-header: bg-light
Run this module interactively in your browser. No installation required!
```
```{grid-item-card} ⚡ Open in Colab
:link: https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/01_tensor/tensor_dev.ipynb
:class-header: bg-light
Use Google Colab for GPU access and cloud compute power.
```
```{grid-item-card} 📖 View Source
:link: https://github.com/mlsysbook/TinyTorch/blob/main/modules/01_tensor/tensor_dev.ipynb
:class-header: bg-light
Browse the Jupyter notebook and understand the implementation.
```
**Binder sessions are temporary!** Download your completed notebook when done, or switch to local development for persistent work.