mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-02 20:57:44 -05:00
- Removed 30 debugging and development artifact files - Kept core system, documentation, and demo files - tests/milestones: 9 clean files (system + docs) - milestones/05_2017_transformer: 5 clean files (demos) - Clear, focused directory structure - Ready for students and developers
1793 lines
79 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"id": "ccca71b2",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"# Module 01: Tensor Foundation - Building Blocks of ML\n",
"\n",
"Welcome to Module 01! You're about to build the foundational Tensor class that powers all machine learning operations.\n",
"\n",
"## 🔗 Prerequisites & Progress\n",
"**You've Built**: Nothing - this is our foundation!\n",
"**You'll Build**: A complete Tensor class with arithmetic, matrix operations, and shape manipulation\n",
"**You'll Enable**: Foundation for activations, layers, and all future neural network components\n",
"\n",
"**Connection Map**:\n",
"```\n",
"NumPy Arrays → Tensor  → Activations (Module 02)\n",
"(raw data)     (ML ops)   (intelligence)\n",
"```\n",
"\n",
"## Learning Objectives\n",
"By the end of this module, you will:\n",
"1. Implement a complete Tensor class with fundamental operations\n",
"2. Understand tensors as the universal data structure in ML\n",
"3. Test tensor operations with immediate validation\n",
"4. Prepare for gradient computation in Module 05\n",
"\n",
"Let's get started!\n",
"\n",
"## 📦 Where This Code Lives in the Final Package\n",
"\n",
"**Learning Side:** You work in `modules/01_tensor/tensor_dev.py`\n",
"**Building Side:** Code exports to `tinytorch.core.tensor`\n",
"\n",
"```python\n",
"# Final package structure:\n",
"# Future modules will import and extend this Tensor\n",
"```\n",
"\n",
"**Why this matters:**\n",
"- **Learning:** Complete tensor system in one focused module for deep understanding\n",
"- **Production:** Proper organization like PyTorch's torch.Tensor with all core operations together\n",
"- **Consistency:** All tensor operations and data manipulation in core.tensor\n",
"- **Integration:** Foundation that every other module will build upon"
]
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "e797b7f9",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "imports",
"solution": true
}
},
"outputs": [],
"source": [
"#| default_exp core.tensor\n",
"#| export\n",
"\n",
"import numpy as np\n",
"\n",
"# Constants for memory calculations\n",
"BYTES_PER_FLOAT32 = 4  # Standard float32 size in bytes\n",
"KB_TO_BYTES = 1024  # Kilobytes to bytes conversion\n",
"MB_TO_BYTES = 1024 * 1024  # Megabytes to bytes conversion"
]
},
|
||
{
"cell_type": "markdown",
"id": "0def48bb",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 📋 Module Dependencies\n",
"\n",
"**Prerequisites**: NONE - This is the foundation module\n",
"\n",
"**External Dependencies**:\n",
"- `numpy` (for array operations and numerical computing)\n",
"\n",
"**TinyTorch Dependencies**: NONE\n",
"\n",
"**Important**: This module has NO TinyTorch dependencies.\n",
"All future modules will import FROM this module.\n",
"\n",
"**Dependency Flow**:\n",
"```\n",
"Module 01 (Tensor) → All Future Modules\n",
"        ↓\n",
"  Foundation for entire TinyTorch system\n",
"```\n",
"\n",
"Students completing this module will have built the foundation\n",
"that every other TinyTorch component depends on."
]
},
|
||
{
"cell_type": "markdown",
"id": "8b7d805c",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 1. Introduction: What is a Tensor?\n",
"\n",
"A tensor is a multi-dimensional array that serves as the fundamental data structure in machine learning. Think of it as a universal container that can hold data in different dimensions:\n",
"\n",
"```\n",
"Tensor Dimensions:\n",
"┌─────────────┐\n",
"│ 0D: Scalar  │  5.0         (just a number)\n",
"│ 1D: Vector  │  [1, 2, 3]   (list of numbers)\n",
"│ 2D: Matrix  │  [[1, 2]     (grid of numbers)\n",
"│             │   [3, 4]]\n",
"│ 3D: Cube    │  [[[...      (stack of matrices)\n",
"└─────────────┘\n",
"```\n",
"\n",
"In machine learning, tensors flow through operations like water through pipes:\n",
"\n",
"```\n",
"Neural Network Data Flow:\n",
"Input Tensor → Layer 1 → Activation → Layer 2 → ... → Output Tensor\n",
"  [batch,      [batch,    [batch,      [batch,         [batch,\n",
"   features]    hidden]    hidden]      hidden2]        classes]\n",
"```\n",
"\n",
"Every neural network, from simple linear regression to modern transformers, processes tensors. Understanding tensors means understanding the foundation of all ML computations.\n",
"\n",
"### Why Tensors Matter in ML Systems\n",
"\n",
"In production ML systems, tensors carry more than just data - they carry the computational graph, memory layout information, and execution context:\n",
"\n",
"```\n",
"Real ML Pipeline:\n",
"Raw Data → Preprocessing → Tensor Creation → Model Forward Pass → Loss Computation\n",
"    ↓            ↓               ↓                  ↓                   ↓\n",
"  Files     NumPy Arrays      Tensors          GPU Tensors        Scalar Loss\n",
"```\n",
"\n",
"**Key Insight**: Tensors bridge the gap between mathematical concepts and efficient computation on modern hardware."
]
},
|
||
{
"cell_type": "markdown",
"id": "9a466b8d",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 2. Foundations: Mathematical Background\n",
"\n",
"### Core Operations We'll Implement\n",
"\n",
"Our Tensor class will support all fundamental operations that neural networks need:\n",
"\n",
"```\n",
"Operation Types:\n",
"┌─────────────────┬─────────────────┬─────────────────┐\n",
"│ Element-wise    │ Matrix Ops      │ Shape Ops       │\n",
"├─────────────────┼─────────────────┼─────────────────┤\n",
"│ + Addition      │ @ Matrix Mult   │ .reshape()      │\n",
"│ - Subtraction   │ .transpose()    │ .sum()          │\n",
"│ * Multiplication│                 │ .mean()         │\n",
"│ / Division      │                 │ .max()          │\n",
"└─────────────────┴─────────────────┴─────────────────┘\n",
"```\n",
"\n",
"### Broadcasting: Making Tensors Work Together\n",
"\n",
"Broadcasting automatically aligns tensors of different shapes for operations:\n",
"\n",
"```\n",
"Broadcasting Examples:\n",
"┌─────────────────────────────────────────────────────────┐\n",
"│ Scalar + Vector:                                        │\n",
"│   5 + [1, 2, 3] → [5, 5, 5] + [1, 2, 3] = [6, 7, 8]     │\n",
"│                                                         │\n",
"│ Matrix + Vector (row-wise):                             │\n",
"│ [[1, 2]]   [10]   [[1, 2]]   [[10, 10]]   [[11, 12]]    │\n",
"│ [[3, 4]] + [10] = [[3, 4]] + [[10, 10]] = [[13, 14]]    │\n",
"└─────────────────────────────────────────────────────────┘\n",
"```\n",
"\n",
"**Memory Layout**: NumPy uses row-major (C-style) storage where elements are stored row by row in memory for cache efficiency:\n",
"\n",
"```\n",
"Memory Layout (2×3 matrix):\n",
"Matrix:        Memory:\n",
"[[1, 2, 3]     [1][2][3][4][5][6]\n",
" [4, 5, 6]]     ↑ Row 1  ↑ Row 2\n",
"\n",
"Cache Behavior:\n",
"Sequential Access: Fast (uses cache lines efficiently)\n",
"  Row access: [1][2][3] → cache hit, hit, hit\n",
"Random Access: Slow (cache misses)\n",
"  Column access: [1][4] → cache hit, miss\n",
"```\n",
"\n",
"This memory layout affects performance in real ML workloads - algorithms that access data sequentially run faster than those that access randomly."
]
},
|
||
{
"cell_type": "markdown",
"id": "90192fb0",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 3. Implementation: Building Tensor Foundation\n",
"\n",
"Let's build our Tensor class step by step, testing each component as we go.\n",
"\n",
"**Key Design Decision**: We'll include gradient-related attributes from the start, but they'll remain dormant until Module 05. This ensures a consistent interface throughout the course while keeping the cognitive load manageable.\n",
"\n",
"### Tensor Class Architecture\n",
"\n",
"```\n",
"Tensor Class Structure:\n",
"┌─────────────────────────────────┐\n",
"│ Core Attributes:                │\n",
"│ • data: np.array (the numbers)  │\n",
"│ • shape: tuple (dimensions)     │\n",
"│ • size: int (total elements)    │\n",
"│ • dtype: type (float32, int64)  │\n",
"├─────────────────────────────────┤\n",
"│ Gradient Attributes (dormant):  │\n",
"│ • requires_grad: bool           │\n",
"│ • grad: None (until Module 05)  │\n",
"├─────────────────────────────────┤\n",
"│ Operations:                     │\n",
"│ • __add__, __sub__, __mul__     │\n",
"│ • matmul(), reshape()           │\n",
"│ • sum(), mean(), max()          │\n",
"│ • __repr__(), __str__()         │\n",
"└─────────────────────────────────┘\n",
"```\n",
"\n",
"The beauty of this design: **all methods are defined inside the class from day one**. No monkey-patching, no dynamic attribute addition. Clean, consistent, debugger-friendly."
]
},
|
||
{
"cell_type": "markdown",
"id": "ab0d2ee2",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### Tensor Creation and Initialization\n",
"\n",
"Before we implement operations, let's understand how tensors store data and manage their attributes. This initialization is the foundation that everything else builds upon.\n",
"\n",
"```\n",
"Tensor Initialization Process:\n",
"Input Data → Validation → NumPy Array → Tensor Wrapper → Ready for Operations\n",
"  [1,2,3]  →   types   →  np.array   →  shape=(3,)    →  + - * / @ ...\n",
"     ↓           ↓            ↓              ↓\n",
" List/Array  Type Check    Memory      Attributes Set\n",
"             (optional)   Allocation\n",
"\n",
"Memory Allocation Example:\n",
"Input: [[1, 2, 3], [4, 5, 6]]\n",
"         ↓\n",
"NumPy allocates: [1][2][3][4][5][6] in contiguous memory\n",
"         ↓\n",
"Tensor wraps with: shape=(2,3), size=6, dtype=float32 (our Tensor casts all input to float32)\n",
"```\n",
"\n",
"**Key Design Principle**: Our Tensor is a wrapper around NumPy arrays that adds ML-specific functionality. We leverage NumPy's battle-tested memory management and computation kernels while adding the gradient tracking and operation chaining needed for deep learning.\n",
"\n",
"**Why This Approach?**\n",
"- **Performance**: NumPy's C implementations are highly optimized\n",
"- **Compatibility**: Easy integration with scientific Python ecosystem\n",
"- **Memory Efficiency**: No unnecessary data copying\n",
"- **Future-Proof**: Easy transition to GPU tensors in advanced modules"
]
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "a2ab12fe",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "tensor-class",
"solution": true
}
},
"outputs": [],
"source": [
"#| export\n",
"class Tensor:\n",
"    \"\"\"Educational tensor that grows with student knowledge.\n",
"\n",
"    This class starts simple but includes dormant features for future modules:\n",
"    - requires_grad: Will be used for automatic differentiation (Module 05)\n",
"    - grad: Will store computed gradients (Module 05)\n",
"    - backward(): Will compute gradients (Module 05)\n",
"\n",
"    For now, focus on: data, shape, and basic operations.\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, data, requires_grad=False):\n",
"        \"\"\"Create a new tensor from data.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        self.data = np.array(data, dtype=np.float32)\n",
"        self.shape = self.data.shape\n",
"        self.size = self.data.size\n",
"        self.dtype = self.data.dtype\n",
"        self.requires_grad = requires_grad\n",
"        self.grad = None\n",
"        ### END SOLUTION\n",
"\n",
"    def __repr__(self):\n",
"        \"\"\"String representation of tensor for debugging.\"\"\"\n",
"        grad_info = f\", requires_grad={self.requires_grad}\" if self.requires_grad else \"\"\n",
"        return f\"Tensor(data={self.data}, shape={self.shape}{grad_info})\"\n",
"\n",
"    def __str__(self):\n",
"        \"\"\"Human-readable string representation.\"\"\"\n",
"        return f\"Tensor({self.data})\"\n",
"\n",
"    def numpy(self):\n",
"        \"\"\"Return the underlying NumPy array.\"\"\"\n",
"        return self.data\n",
"\n",
"    def __add__(self, other):\n",
"        \"\"\"Add two tensors element-wise with broadcasting support.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if isinstance(other, Tensor):\n",
"            return Tensor(self.data + other.data)\n",
"        else:\n",
"            return Tensor(self.data + other)\n",
"        ### END SOLUTION\n",
"\n",
"    def __sub__(self, other):\n",
"        \"\"\"Subtract two tensors element-wise.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if isinstance(other, Tensor):\n",
"            return Tensor(self.data - other.data)\n",
"        else:\n",
"            return Tensor(self.data - other)\n",
"        ### END SOLUTION\n",
"\n",
"    def __mul__(self, other):\n",
"        \"\"\"Multiply two tensors element-wise (NOT matrix multiplication).\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if isinstance(other, Tensor):\n",
"            return Tensor(self.data * other.data)\n",
"        else:\n",
"            return Tensor(self.data * other)\n",
"        ### END SOLUTION\n",
"\n",
"    def __truediv__(self, other):\n",
"        \"\"\"Divide two tensors element-wise.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if isinstance(other, Tensor):\n",
"            return Tensor(self.data / other.data)\n",
"        else:\n",
"            return Tensor(self.data / other)\n",
"        ### END SOLUTION\n",
"\n",
"    def matmul(self, other):\n",
"        \"\"\"Matrix multiplication of two tensors.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if not isinstance(other, Tensor):\n",
"            raise TypeError(f\"Expected Tensor for matrix multiplication, got {type(other)}\")\n",
"        if self.shape == () or other.shape == ():\n",
"            # Scalar operands fall back to element-wise multiplication\n",
"            return Tensor(self.data * other.data)\n",
"        if len(self.shape) >= 2 and len(other.shape) >= 2:\n",
"            if self.shape[-1] != other.shape[-2]:\n",
"                raise ValueError(\n",
"                    f\"Cannot perform matrix multiplication: {self.shape} @ {other.shape}. \"\n",
"                    f\"Inner dimensions must match: {self.shape[-1]} ≠ {other.shape[-2]}\"\n",
"                )\n",
"        result_data = np.matmul(self.data, other.data)\n",
"        return Tensor(result_data)\n",
"        ### END SOLUTION\n",
"\n",
"    def __getitem__(self, key):\n",
"        \"\"\"Enable indexing and slicing operations on Tensors.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        result_data = self.data[key]\n",
"        if not isinstance(result_data, np.ndarray):\n",
"            result_data = np.array(result_data)\n",
"        result = Tensor(result_data, requires_grad=self.requires_grad)\n",
"        return result\n",
"        ### END SOLUTION\n",
"\n",
"    def reshape(self, *shape):\n",
"        \"\"\"Reshape tensor to new dimensions.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if len(shape) == 1 and isinstance(shape[0], (tuple, list)):\n",
"            new_shape = tuple(shape[0])\n",
"        else:\n",
"            new_shape = shape\n",
"        if -1 in new_shape:\n",
"            if new_shape.count(-1) > 1:\n",
"                raise ValueError(\"Can only specify one unknown dimension with -1\")\n",
"            known_size = 1\n",
"            unknown_idx = new_shape.index(-1)\n",
"            for i, dim in enumerate(new_shape):\n",
"                if i != unknown_idx:\n",
"                    known_size *= dim\n",
"            unknown_dim = self.size // known_size\n",
"            new_shape = list(new_shape)\n",
"            new_shape[unknown_idx] = unknown_dim\n",
"            new_shape = tuple(new_shape)\n",
"        if np.prod(new_shape) != self.size:\n",
"            raise ValueError(\n",
"                f\"Cannot reshape tensor of size {self.size} to shape {new_shape}\"\n",
"            )\n",
"        reshaped_data = np.reshape(self.data, new_shape)\n",
"        result = Tensor(reshaped_data, requires_grad=self.requires_grad)\n",
"        return result\n",
"        ### END SOLUTION\n",
"\n",
"    def transpose(self, dim0=None, dim1=None):\n",
"        \"\"\"Transpose tensor dimensions.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        if dim0 is None and dim1 is None:\n",
"            if len(self.shape) < 2:\n",
"                # 0-D and 1-D tensors are unchanged by transposition\n",
"                return Tensor(self.data.copy(), requires_grad=self.requires_grad)\n",
"            else:\n",
"                axes = list(range(len(self.shape)))\n",
"                axes[-2], axes[-1] = axes[-1], axes[-2]\n",
"                transposed_data = np.transpose(self.data, axes)\n",
"        else:\n",
"            if dim0 is None or dim1 is None:\n",
"                raise ValueError(\"Both dim0 and dim1 must be specified\")\n",
"            axes = list(range(len(self.shape)))\n",
"            axes[dim0], axes[dim1] = axes[dim1], axes[dim0]\n",
"            transposed_data = np.transpose(self.data, axes)\n",
"        result = Tensor(transposed_data, requires_grad=self.requires_grad)\n",
"        return result\n",
"        ### END SOLUTION\n",
"\n",
"    def sum(self, axis=None, keepdims=False):\n",
"        \"\"\"Sum tensor along specified axis.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        result = np.sum(self.data, axis=axis, keepdims=keepdims)\n",
"        return Tensor(result)\n",
"        ### END SOLUTION\n",
"\n",
"    def mean(self, axis=None, keepdims=False):\n",
"        \"\"\"Compute mean of tensor along specified axis.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        result = np.mean(self.data, axis=axis, keepdims=keepdims)\n",
"        return Tensor(result)\n",
"        ### END SOLUTION\n",
"\n",
"    def max(self, axis=None, keepdims=False):\n",
"        \"\"\"Find maximum values along specified axis.\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        result = np.max(self.data, axis=axis, keepdims=keepdims)\n",
"        return Tensor(result)\n",
"        ### END SOLUTION\n",
"\n",
"    def backward(self):\n",
"        \"\"\"Compute gradients (implemented in Module 05: Autograd).\"\"\"\n",
"        ### BEGIN SOLUTION\n",
"        pass\n",
"        ### END SOLUTION"
]
},
|
||
{
"cell_type": "markdown",
"id": "7ca1bb75",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Tensor Creation\n",
"\n",
"This test validates that our Tensor constructor works correctly with various data types and properly initializes all attributes.\n",
"\n",
"**What we're testing**: Basic tensor creation and attribute setting\n",
"**Why it matters**: Foundation for all other operations - if creation fails, nothing works\n",
"**Expected**: Tensor wraps data correctly with proper attributes and consistent dtype"
]
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "3199f1ec",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-tensor-creation",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_tensor_creation():\n",
"    \"\"\"🧪 Test Tensor creation with various data types.\"\"\"\n",
"    print(\"🧪 Unit Test: Tensor Creation...\")\n",
"\n",
"    # Test scalar creation\n",
"    scalar = Tensor(5.0)\n",
"    assert scalar.data == 5.0\n",
"    assert scalar.shape == ()\n",
"    assert scalar.size == 1\n",
"    assert scalar.requires_grad == False\n",
"    assert scalar.grad is None\n",
"    assert scalar.dtype == np.float32\n",
"\n",
"    # Test vector creation\n",
"    vector = Tensor([1, 2, 3])\n",
"    assert np.array_equal(vector.data, np.array([1, 2, 3], dtype=np.float32))\n",
"    assert vector.shape == (3,)\n",
"    assert vector.size == 3\n",
"\n",
"    # Test matrix creation\n",
"    matrix = Tensor([[1, 2], [3, 4]])\n",
"    assert np.array_equal(matrix.data, np.array([[1, 2], [3, 4]], dtype=np.float32))\n",
"    assert matrix.shape == (2, 2)\n",
"    assert matrix.size == 4\n",
"\n",
"    # Test gradient flag (dormant feature)\n",
"    grad_tensor = Tensor([1, 2], requires_grad=True)\n",
"    assert grad_tensor.requires_grad == True\n",
"    assert grad_tensor.grad is None  # Still None until Module 05\n",
"\n",
"    print(\"✅ Tensor creation works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_tensor_creation()"
]
},
|
||
{
"cell_type": "markdown",
"id": "0704e8bc",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Element-wise Arithmetic Operations\n",
"\n",
"Element-wise operations are the workhorses of neural network computation. They apply the same operation to corresponding elements in tensors, often with broadcasting to handle different shapes elegantly.\n",
"\n",
"### Why Element-wise Operations Matter\n",
"\n",
"In neural networks, element-wise operations appear everywhere:\n",
"- **Activation functions**: Apply ReLU, sigmoid to every element\n",
"- **Batch normalization**: Subtract mean, divide by std per element\n",
"- **Loss computation**: Compare predictions vs. targets element-wise\n",
"- **Gradient updates**: Add scaled gradients to parameters element-wise\n",
"\n",
"### Element-wise Addition: The Foundation\n",
"\n",
"Addition is the simplest and most fundamental operation. Understanding it deeply helps with all others.\n",
"\n",
"```\n",
"Element-wise Addition Visual:\n",
"[1, 2, 3] + [4, 5, 6] = [1+4, 2+5, 3+6] = [5, 7, 9]\n",
"\n",
"Matrix Addition:\n",
"[[1, 2]]   [[5, 6]]   [[1+5, 2+6]]   [[6,  8]]\n",
"[[3, 4]] + [[7, 8]] = [[3+7, 4+8]] = [[10, 12]]\n",
"\n",
"Broadcasting Addition (Matrix + Vector):\n",
"[[1, 2]]   [10]   [[1, 2]]   [[10, 10]]   [[11, 12]]\n",
"[[3, 4]] + [20] = [[3, 4]] + [[20, 20]] = [[23, 24]]\n",
"   ↑        ↑        ↑           ↑            ↑\n",
" (2,2)    (2,1)    (2,2)     broadcast     result\n",
"\n",
"Broadcasting Rules:\n",
"1. Start from rightmost dimension\n",
"2. Dimensions must be equal OR one must be 1 OR one must be missing\n",
"3. Missing dimensions are assumed to be 1\n",
"```\n",
"\n",
"**Key Insight**: Broadcasting makes tensors of different shapes compatible by automatically expanding dimensions. This is crucial for batch processing where you often add a single bias vector to an entire batch of data.\n",
"\n",
"**Memory Efficiency**: Broadcasting doesn't actually create expanded copies in memory - NumPy computes results on-the-fly, saving memory."
]
},
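{
"cell_type": "code",
"execution_count": null,
"id": "bcast-demo-01",
"metadata": {},
"outputs": [],
"source": [
"# Editor's sketch (not part of the graded module): the broadcasting rules above,\n",
"# exercised with the Tensor class defined earlier.\n",
"m = Tensor([[1, 2], [3, 4]])  # shape (2, 2)\n",
"v = Tensor([10, 20])          # shape (2,)\n",
"print(m + v)                  # vector broadcast across each row: [[11, 22], [13, 24]]\n",
"print(m + 5)                  # scalar broadcast to every element: [[6, 7], [8, 9]]\n",
"print((m + v).shape)          # (2, 2) - result keeps the larger shape"
]
},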
|
||
{
"cell_type": "markdown",
"id": "0d876834",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 2
},
"source": [
"### Subtraction, Multiplication, and Division\n",
"\n",
"These operations follow the same pattern as addition, working element-wise with broadcasting support. Each serves specific purposes in neural networks:\n",
"\n",
"```\n",
"Element-wise Operations in Neural Networks:\n",
"\n",
"┌─────────────────┬─────────────────┬─────────────────┬─────────────────┐\n",
"│ Subtraction     │ Multiplication  │ Division        │ Use Cases       │\n",
"├─────────────────┼─────────────────┼─────────────────┼─────────────────┤\n",
"│ [6,8] - [1,2]   │ [2,3] * [4,5]   │ [8,9] / [2,3]   │ • Gradient      │\n",
"│ = [5,6]         │ = [8,15]        │ = [4.0, 3.0]    │   computation   │\n",
"│                 │                 │                 │ • Normalization │\n",
"│ Center data:    │ Gate values:    │ Scale features: │ • Loss functions│\n",
"│ x - mean        │ x * mask        │ x / std         │ • Attention     │\n",
"└─────────────────┴─────────────────┴─────────────────┴─────────────────┘\n",
"\n",
"Broadcasting with Scalars (very common in ML):\n",
"[1, 2, 3] * 2 = [2, 4, 6]   (scale all values)\n",
"[1, 2, 3] - 1 = [0, 1, 2]   (shift all values)\n",
"[2, 4, 6] / 2 = [1, 2, 3]   (normalize all values)\n",
"\n",
"Real ML Example - Batch Normalization:\n",
"batch_data = [[1, 2], [3, 4], [5, 6]]  # Shape: (3, 2)\n",
"mean = [3, 4]                          # Shape: (2,)\n",
"std = [2, 2]                           # Shape: (2,)\n",
"\n",
"# Normalize: (x - mean) / std\n",
"normalized = (batch_data - mean) / std\n",
"# Broadcasting: (3,2) - (2,) = (3,2), then (3,2) / (2,) = (3,2)\n",
"```\n",
"\n",
"**Performance Note**: Element-wise operations are highly optimized in NumPy and run efficiently on modern CPUs with vectorization (SIMD instructions)."
]
},
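{
"cell_type": "code",
"execution_count": null,
"id": "norm-demo-01",
"metadata": {},
"outputs": [],
"source": [
"# Editor's sketch (not part of the graded module): the batch-normalization\n",
"# example above, run with real Tensor operations.\n",
"batch_data = Tensor([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)\n",
"mean = Tensor([3, 4])                          # per-feature mean, shape (2,)\n",
"std = Tensor([2, 2])                           # per-feature std, shape (2,)\n",
"normalized = (batch_data - mean) / std         # broadcasting: (3,2) op (2,) -> (3,2)\n",
"print(normalized)                              # [[-1, -1], [0, 0], [1, 1]]"
]
},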
|
||
{
"cell_type": "markdown",
"id": "17044e9d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Arithmetic Operations\n",
"\n",
"This test validates that our arithmetic operations work correctly with both tensor-tensor and tensor-scalar operations, including broadcasting behavior.\n",
"\n",
"**What we're testing**: Addition, subtraction, multiplication, division with broadcasting\n",
"**Why it matters**: Foundation for neural network forward passes, batch processing, normalization\n",
"**Expected**: Operations work with both tensors and scalars, proper broadcasting alignment"
]
},
|
||
{
"cell_type": "code",
"execution_count": null,
"id": "4a00b5c8",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-arithmetic",
"locked": true,
"points": 15
}
},
"outputs": [],
"source": [
"def test_unit_arithmetic_operations():\n",
"    \"\"\"🧪 Test arithmetic operations with broadcasting.\"\"\"\n",
"    print(\"🧪 Unit Test: Arithmetic Operations...\")\n",
"\n",
"    # Test tensor + tensor\n",
"    a = Tensor([1, 2, 3])\n",
"    b = Tensor([4, 5, 6])\n",
"    result = a + b\n",
"    assert np.array_equal(result.data, np.array([5, 7, 9], dtype=np.float32))\n",
"\n",
"    # Test tensor + scalar (very common in ML)\n",
"    result = a + 10\n",
"    assert np.array_equal(result.data, np.array([11, 12, 13], dtype=np.float32))\n",
"\n",
"    # Test broadcasting with different shapes (matrix + vector)\n",
"    matrix = Tensor([[1, 2], [3, 4]])\n",
"    vector = Tensor([10, 20])\n",
"    result = matrix + vector\n",
"    expected = np.array([[11, 22], [13, 24]], dtype=np.float32)\n",
"    assert np.array_equal(result.data, expected)\n",
"\n",
"    # Test subtraction (data centering)\n",
"    result = b - a\n",
"    assert np.array_equal(result.data, np.array([3, 3, 3], dtype=np.float32))\n",
"\n",
"    # Test multiplication (scaling)\n",
"    result = a * 2\n",
"    assert np.array_equal(result.data, np.array([2, 4, 6], dtype=np.float32))\n",
"\n",
"    # Test division (normalization)\n",
"    result = b / 2\n",
"    assert np.array_equal(result.data, np.array([2.0, 2.5, 3.0], dtype=np.float32))\n",
"\n",
"    # Test chaining operations (common in ML pipelines)\n",
"    normalized = (a - 2) / 2  # Center and scale\n",
"    expected = np.array([-0.5, 0.0, 0.5], dtype=np.float32)\n",
"    assert np.allclose(normalized.data, expected)\n",
"\n",
"    print(\"✅ Arithmetic operations work correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_arithmetic_operations()"
]
},
|
||
{
"cell_type": "markdown",
"id": "4f335a26",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 2
},
"source": [
"## Matrix Multiplication: The Heart of Neural Networks\n",
"\n",
"Matrix multiplication is fundamentally different from element-wise multiplication. It's the operation that gives neural networks their power to transform and combine information across features.\n",
"\n",
"### Why Matrix Multiplication is Central to ML\n",
"\n",
"Every neural network layer essentially performs matrix multiplication:\n",
"\n",
"```\n",
"Linear Layer (the building block of neural networks):\n",
"Input Features × Weight Matrix = Output Features\n",
"   (N, D_in)   ×  (D_in, D_out) =  (N, D_out)\n",
"\n",
"Real Example - Image Classification:\n",
"Flattened Image × Hidden Weights = Hidden Features\n",
"   (32, 784)    ×   (784, 256)   =   (32, 256)\n",
"       ↑               ↑                 ↑\n",
"   32 images    784→256 transform   32 feature vectors\n",
"```\n",
"\n",
"### Matrix Multiplication Visualization\n",
"\n",
"```\n",
"Matrix Multiplication Process:\n",
"    A (2×3)       B (3×2)              C (2×2)\n",
"  ┌         ┐   ┌       ┐\n",
"  │ 1  2  3 │   │ 7  8 │   ┌ 1×7+2×9+3×1  1×8+2×1+3×2 ┐   ┌ 28  16 ┐\n",
"  │ 4  5  6 │ × │ 9  1 │ = │ 4×7+5×9+6×1  4×8+5×1+6×2 │ = │ 79  49 │\n",
"  └         ┘   │ 1  2 │   └                          ┘   └        ┘\n",
"                └       ┘\n",
"\n",
"Computation Breakdown:\n",
"C[0,0] = A[0,:] · B[:,0] = [1,2,3] · [7,9,1] = 1×7 + 2×9 + 3×1 = 28\n",
"C[0,1] = A[0,:] · B[:,1] = [1,2,3] · [8,1,2] = 1×8 + 2×1 + 3×2 = 16\n",
"C[1,0] = A[1,:] · B[:,0] = [4,5,6] · [7,9,1] = 4×7 + 5×9 + 6×1 = 79\n",
"C[1,1] = A[1,:] · B[:,1] = [4,5,6] · [8,1,2] = 4×8 + 5×1 + 6×2 = 49\n",
"\n",
"Key Rule: Inner dimensions must match!\n",
"A(m,n) @ B(n,p) = C(m,p)\n",
"     ↑     ↑\n",
"   these must be equal\n",
"```\n",
"\n",
"### Computational Complexity and Performance\n",
"\n",
"```\n",
"Computational Cost:\n",
"For C = A @ B where A is (M×K), B is (K×N):\n",
"- Multiplications: M × N × K\n",
"- Additions: M × N × (K-1) ≈ M × N × K\n",
"- Total FLOPs: ≈ 2 × M × N × K\n",
"\n",
"Example: (1000×1000) @ (1000×1000)\n",
"- FLOPs: 2 × 1000³ = 2 billion operations\n",
"- On 1 GHz CPU: ~2 seconds if no optimization\n",
"- With optimized BLAS: ~0.1 seconds (20× speedup!)\n",
"\n",
"Memory Access Pattern:\n",
"A: M×K (row-wise access)  ✓ Good cache locality\n",
"B: K×N (column-wise)      ✗ Poor cache locality\n",
"C: M×N (row-wise write)   ✓ Good cache locality\n",
"\n",
"This is why optimized libraries like OpenBLAS, Intel MKL use:\n",
"- Blocking algorithms (process in cache-sized chunks)\n",
"- Vectorization (SIMD instructions)\n",
"- Parallelization (multiple cores)\n",
"```\n",
"\n",
"### Neural Network Context\n",
"\n",
"```\n",
"Multi-layer Neural Network:\n",
"Input (batch=32, features=784)\n",
"    ↓  W1: (784, 256)\n",
"Hidden1 (batch=32, features=256)\n",
"    ↓  W2: (256, 128)\n",
"Hidden2 (batch=32, features=128)\n",
"    ↓  W3: (128, 10)\n",
"Output (batch=32, classes=10)\n",
"\n",
"Each arrow represents a matrix multiplication:\n",
"- Forward pass: 3 matrix multiplications\n",
"- Backward pass: 3 more matrix multiplications (with transposes)\n",
"- Total: 6 matrix mults per forward+backward pass\n",
"\n",
"For one training-batch forward pass: 2 × 32 × (784×256 + 256×128 + 128×10) FLOPs\n",
"= 2 × 32 × 234,752 ≈ 15M FLOPs per batch\n",
"```\n",
"\n",
"This is why GPU acceleration matters - modern GPUs can perform thousands of these operations in parallel!"
]
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4800670d",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 1
|
||
},
|
||
"source": [
|
||
"### 🧪 Unit Test: Matrix Multiplication\n",
|
||
"\n",
|
||
"This test validates matrix multiplication works correctly with proper shape checking and error handling.\n",
|
||
"\n",
|
||
"**What we're testing**: Matrix multiplication with shape validation and edge cases\n",
|
||
"**Why it matters**: Core operation in neural networks (linear layers, attention mechanisms)\n",
|
||
"**Expected**: Correct results for valid shapes, clear error messages for invalid shapes"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"id": "5ee13d0d",
|
||
"metadata": {
|
||
"nbgrader": {
|
||
"grade": true,
|
||
"grade_id": "test-matmul",
|
||
"locked": true,
|
||
"points": 15
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def test_unit_matrix_multiplication():\n",
|
||
" \"\"\"🧪 Test matrix multiplication operations.\"\"\"\n",
|
||
" print(\"🧪 Unit Test: Matrix Multiplication...\")\n",
|
||
"\n",
|
||
" # Test 2×2 matrix multiplication (basic case)\n",
|
||
" a = Tensor([[1, 2], [3, 4]]) # 2×2\n",
|
||
" b = Tensor([[5, 6], [7, 8]]) # 2×2\n",
|
||
" result = a.matmul(b)\n",
|
||
" # Expected: [[1×5+2×7, 1×6+2×8], [3×5+4×7, 3×6+4×8]] = [[19, 22], [43, 50]]\n",
|
||
" expected = np.array([[19, 22], [43, 50]], dtype=np.float32)\n",
|
||
" assert np.array_equal(result.data, expected)\n",
|
||
"\n",
|
||
" # Test rectangular matrices (common in neural networks)\n",
|
||
" c = Tensor([[1, 2, 3], [4, 5, 6]]) # 2×3 (like batch_size=2, features=3)\n",
|
||
" d = Tensor([[7, 8], [9, 10], [11, 12]]) # 3×2 (like features=3, outputs=2)\n",
|
||
" result = c.matmul(d)\n",
|
||
" # Expected: [[1×7+2×9+3×11, 1×8+2×10+3×12], [4×7+5×9+6×11, 4×8+5×10+6×12]]\n",
|
||
" expected = np.array([[58, 64], [139, 154]], dtype=np.float32)\n",
|
||
" assert np.array_equal(result.data, expected)\n",
|
||
"\n",
|
||
" # Test matrix-vector multiplication (common in forward pass)\n",
|
||
" matrix = Tensor([[1, 2, 3], [4, 5, 6]]) # 2×3\n",
|
||
" vector = Tensor([1, 2, 3]) # 3×1 (conceptually)\n",
|
||
" result = matrix.matmul(vector)\n",
|
||
" # Expected: [1×1+2×2+3×3, 4×1+5×2+6×3] = [14, 32]\n",
|
||
" expected = np.array([14, 32], dtype=np.float32)\n",
|
||
" assert np.array_equal(result.data, expected)\n",
|
||
"\n",
|
||
" # Test shape validation - should raise clear error\n",
|
||
" try:\n",
|
||
" incompatible_a = Tensor([[1, 2]]) # 1×2\n",
|
||
" incompatible_b = Tensor([[1], [2], [3]]) # 3×1\n",
|
||
" incompatible_a.matmul(incompatible_b) # 1×2 @ 3×1 should fail (2 ≠ 3)\n",
|
||
" assert False, \"Should have raised ValueError for incompatible shapes\"\n",
|
||
" except ValueError as e:\n",
|
||
" assert \"Inner dimensions must match\" in str(e)\n",
|
||
" assert \"2 ≠ 3\" in str(e) # Should show specific dimensions\n",
|
||
"\n",
|
||
" print(\"✅ Matrix multiplication works correctly!\")\n",
|
||
"\n",
|
||
"if __name__ == \"__main__\":\n",
|
||
" test_unit_matrix_multiplication()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "efecf714",
|
||
"metadata": {
|
||
"cell_marker": "\"\"\"",
|
||
"lines_to_next_cell": 2
|
||
},
|
||
"source": [
|
||
"## Shape Manipulation: Reshape and Transpose\n",
|
||
"\n",
|
||
"Neural networks constantly change tensor shapes to match layer requirements. Understanding these operations is crucial for data flow through networks.\n",
|
||
"\n",
|
||
"### Why Shape Manipulation Matters\n",
|
||
"\n",
|
||
"Real neural networks require constant shape changes:\n",
|
||
"\n",
|
||
"```\n",
|
||
"CNN Data Flow Example:\n",
|
||
"Input Image: (32, 3, 224, 224) # batch, channels, height, width\n",
|
||
" ↓ Convolutional layers\n",
|
||
"Feature Maps: (32, 512, 7, 7) # batch, features, spatial\n",
|
||
" ↓ Global Average Pool\n",
|
||
"Pooled: (32, 512, 1, 1) # batch, features, 1, 1\n",
|
||
" ↓ Flatten for classifier\n",
|
||
"Flattened: (32, 512) # batch, features\n",
|
||
" ↓ Linear classifier\n",
|
||
"Output: (32, 1000) # batch, classes\n",
|
||
"\n",
|
||
"Each ↓ involves reshape or view operations!\n",
|
||
"```\n",
|
||
"\n",
|
||
"### Reshape: Changing Interpretation of the Same Data\n",
|
||
"\n",
|
||
"```\n",
|
||
"Reshaping (changing dimensions without changing data):\n",
|
||
"Original: [1, 2, 3, 4, 5, 6] (shape: (6,))\n",
|
||
" ↓ reshape(2, 3)\n",
|
||
"Result: [[1, 2, 3], (shape: (2, 3))\n",
|
||
" [4, 5, 6]]\n",
|
||
"\n",
|
||
"Memory Layout (unchanged):\n",
|
||
"Before: [1][2][3][4][5][6]\n",
|
||
"After: [1][2][3][4][5][6] ← Same memory, different interpretation\n",
|
||
"\n",
|
||
"Key Insight: Reshape is O(1) operation - no data copying!\n",
|
||
"Just changes how we interpret the memory layout.\n",
|
||
"\n",
|
||
"Common ML Reshapes:\n",
|
||
"┌─────────────────────┬─────────────────────┬─────────────────────┐\n",
|
||
"│ Flatten for MLP │ Unflatten for CNN │ Batch Dimension │\n",
|
||
"├─────────────────────┼─────────────────────┼─────────────────────┤\n",
|
||
"│ (N,H,W,C) → (N,H×W×C) │ (N,D) → (N,H,W,C) │ (H,W) → (1,H,W) │\n",
|
||
"│ Images to vectors │ Vectors to images │ Add batch dimension │\n",
|
||
"└─────────────────────┴─────────────────────┴─────────────────────┘\n",
|
||
"```\n",
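"\n",
"NumPy makes the \"same memory, different interpretation\" point easy to verify (a small sketch):\n",
"\n",
"```python\n",
"import numpy as np\n",
"x = np.arange(6)  # [0 1 2 3 4 5]\n",
"m = x.reshape(2, 3)  # O(1): no data is copied\n",
"m[0, 0] = 99\n",
"print(x[0])  # 99: the reshape returned a view of the same buffer\n",
"print(np.shares_memory(x, m))  # True\n",
"```\n",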
"\n",
"### Transpose: Swapping Dimensions\n",
"\n",
"```\n",
"Transposing (swapping dimensions - data rearrangement):\n",
"Original: [[1, 2, 3],   (shape: (2, 3))\n",
"           [4, 5, 6]]\n",
"  ↓ transpose()\n",
"Result:   [[1, 4],      (shape: (3, 2))\n",
"           [2, 5],\n",
"           [3, 6]]\n",
"\n",
"Memory Layout (rearranged):\n",
"Before: [1][2][3][4][5][6]\n",
"After:  [1][4][2][5][3][6]  ← Data actually moves in memory\n",
"\n",
"Key Insight: Transpose involves data movement - more expensive than reshape.\n",
"\n",
"Neural Network Usage:\n",
"┌─────────────────────┬─────────────────────┬─────────────────────┐\n",
"│ Weight Matrices     │ Attention Mechanism │ Gradient Computation│\n",
"├─────────────────────┼─────────────────────┼─────────────────────┤\n",
"│ Forward: X @ W      │ Q @ K^T attention   │ ∂L/∂W = X^T @ ∂L/∂Y│\n",
"│ Backward: X @ W^T   │ scores              │                     │\n",
"└─────────────────────┴─────────────────────┴─────────────────────┘\n",
"```\n",
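"\n",
"NumPy itself defers the movement: `arr.T` only swaps strides, and the copy happens when contiguous memory is actually required (a small sketch):\n",
"\n",
"```python\n",
"import numpy as np\n",
"a = np.arange(6, dtype=np.float32).reshape(2, 3)\n",
"t = a.T  # lazy: just swaps strides, no data moves yet\n",
"print(np.shares_memory(a, t))  # True\n",
"c = np.ascontiguousarray(t)  # forces the real rearrangement\n",
"print(np.shares_memory(a, c))  # False: data was copied\n",
"```\n",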
"\n",
"### Performance Implications\n",
"\n",
"```\n",
"Operation Performance (for 1000×1000 matrix):\n",
"┌─────────────────┬──────────────┬─────────────────┬─────────────────┐\n",
"│ Operation       │ Time         │ Memory Access   │ Cache Behavior  │\n",
"├─────────────────┼──────────────┼─────────────────┼─────────────────┤\n",
"│ reshape()       │ ~0.001 ms    │ No data copy    │ No cache impact │\n",
"│ transpose()     │ ~10 ms       │ Full data copy  │ Poor locality   │\n",
"│ view() (future) │ ~0.001 ms    │ No data copy    │ No cache impact │\n",
"└─────────────────┴──────────────┴─────────────────┴─────────────────┘\n",
"\n",
"Why transpose() is slower:\n",
"- Must rearrange data in memory\n",
"- Poor cache locality (accessing columns)\n",
"- Strided access defeats SIMD vectorization\n",
"```\n",
"\n",
"This is why frameworks like PyTorch often use \"lazy\" transpose operations that defer the actual data movement until necessary."
]
},
{
"cell_type": "markdown",
"id": "3224ad9c",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Shape Manipulation\n",
"\n",
"This test validates reshape and transpose operations work correctly with validation and edge cases.\n",
"\n",
"**What we're testing**: Reshape and transpose operations with proper error handling\n",
"**Why it matters**: Essential for data flow in neural networks, CNN/RNN architectures\n",
"**Expected**: Correct shape changes, proper error handling for invalid operations"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8eea43d4",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-shape-ops",
"locked": true,
"points": 15
}
},
"outputs": [],
"source": [
"def test_unit_shape_manipulation():\n",
"    \"\"\"🧪 Test reshape and transpose operations.\"\"\"\n",
"    print(\"🧪 Unit Test: Shape Manipulation...\")\n",
"\n",
"    # Test basic reshape (flatten → matrix)\n",
"    tensor = Tensor([1, 2, 3, 4, 5, 6])  # Shape: (6,)\n",
"    reshaped = tensor.reshape(2, 3)  # Shape: (2, 3)\n",
"    assert reshaped.shape == (2, 3)\n",
"    expected = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)\n",
"    assert np.array_equal(reshaped.data, expected)\n",
"\n",
"    # Test reshape with tuple (alternative calling style)\n",
"    reshaped2 = tensor.reshape((3, 2))  # Shape: (3, 2)\n",
"    assert reshaped2.shape == (3, 2)\n",
"    expected2 = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)\n",
"    assert np.array_equal(reshaped2.data, expected2)\n",
"\n",
"    # Test reshape with -1 (automatic dimension inference)\n",
"    auto_reshaped = tensor.reshape(2, -1)  # Should infer -1 as 3\n",
"    assert auto_reshaped.shape == (2, 3)\n",
"\n",
"    # Test reshape validation - should raise error for incompatible sizes\n",
"    try:\n",
"        tensor.reshape(2, 2)  # 6 elements can't fit in 2×2=4\n",
"        assert False, \"Should have raised ValueError\"\n",
"    except ValueError as e:\n",
"        assert \"Total elements must match\" in str(e)\n",
"        assert \"6 ≠ 4\" in str(e)\n",
"\n",
"    # Test matrix transpose (most common case)\n",
"    matrix = Tensor([[1, 2, 3], [4, 5, 6]])  # (2, 3)\n",
"    transposed = matrix.transpose()  # (3, 2)\n",
"    assert transposed.shape == (3, 2)\n",
"    expected = np.array([[1, 4], [2, 5], [3, 6]], dtype=np.float32)\n",
"    assert np.array_equal(transposed.data, expected)\n",
"\n",
"    # Test 1D transpose (should be identity)\n",
"    vector = Tensor([1, 2, 3])\n",
"    vector_t = vector.transpose()\n",
"    assert np.array_equal(vector.data, vector_t.data)\n",
"\n",
"    # Test specific dimension transpose\n",
"    tensor_3d = Tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # (2, 2, 2)\n",
"    swapped = tensor_3d.transpose(0, 2)  # Swap first and last dimensions\n",
"    assert swapped.shape == (2, 2, 2)  # Same shape but data rearranged\n",
"\n",
"    # Test neural network reshape pattern (flatten for MLP)\n",
"    batch_images = Tensor(np.random.rand(2, 3, 4))  # (batch=2, height=3, width=4)\n",
"    flattened = batch_images.reshape(2, -1)  # (batch=2, features=12)\n",
"    assert flattened.shape == (2, 12)\n",
"\n",
"    print(\"✅ Shape manipulation works correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_shape_manipulation()"
]
},
{
"cell_type": "markdown",
"id": "15a0ab06",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 2
},
"source": [
"## Reduction Operations: Aggregating Information\n",
"\n",
"Reduction operations collapse dimensions by aggregating data, which is essential for computing statistics, losses, and preparing data for different layers.\n",
"\n",
"### Why Reductions are Crucial in ML\n",
"\n",
"Reduction operations appear throughout neural networks:\n",
"\n",
"```\n",
"Common ML Reduction Patterns:\n",
"\n",
"┌─────────────────────┬─────────────────────┬─────────────────────┐\n",
"│ Loss Computation    │ Batch Normalization │ Global Pooling      │\n",
"├─────────────────────┼─────────────────────┼─────────────────────┤\n",
"│ Per-sample losses → │ Batch statistics →  │ Feature maps →      │\n",
"│ Single batch loss   │ Normalization       │ Single features     │\n",
"│                     │                     │                     │\n",
"│ losses.mean()       │ batch.mean(axis=0)  │ fmaps.mean(axis=(2,3))│\n",
"│ (N,) → scalar       │ (N,D) → (D,)        │ (N,C,H,W) → (N,C)   │\n",
"└─────────────────────┴─────────────────────┴─────────────────────┘\n",
"\n",
"Real Examples:\n",
"• Cross-entropy loss: -log(predictions).mean()  [average over batch]\n",
"• Batch norm: (x - x.mean()) / x.std()  [normalize each feature]\n",
"• Global avg pool: features.mean(dim=(2,3))  [spatial → scalar per channel]\n",
"```\n",
"\n",
"### Understanding Axis Operations\n",
"\n",
"```\n",
"Visual Axis Understanding:\n",
"Matrix: [[1, 2, 3],     All reductions operate on this data\n",
"         [4, 5, 6]]     Shape: (2, 3)\n",
"\n",
"        axis=0 (↓)\n",
"        ┌─────────┐\n",
"axis=1  │ 1  2  3 │  → axis=1 reduces across columns (→)\n",
"  (→)   │ 4  5  6 │  → Result shape: (2,) [one value per row]\n",
"        └─────────┘\n",
"          ↓  ↓  ↓\n",
"        axis=0 reduces down rows (↓)\n",
"        Result shape: (3,) [one value per column]\n",
"\n",
"Reduction Results:\n",
"├─ .sum()        → 21               (sum all: 1+2+3+4+5+6)\n",
"├─ .sum(axis=0)  → [5, 7, 9]        (sum columns: [1+4, 2+5, 3+6])\n",
"├─ .sum(axis=1)  → [6, 15]          (sum rows: [1+2+3, 4+5+6])\n",
"├─ .mean()       → 3.5              (average all: 21/6)\n",
"├─ .mean(axis=0) → [2.5, 3.5, 4.5]  (average columns)\n",
"└─ .max()        → 6                (maximum element)\n",
"\n",
"3D Tensor Example (batch, height, width):\n",
"data.shape = (2, 3, 4)  # 2 samples, 3×4 images\n",
"│\n",
"├─ .sum(axis=0)     → (3, 4)  # Sum across batch dimension\n",
"├─ .sum(axis=1)     → (2, 4)  # Sum across height dimension\n",
"├─ .sum(axis=2)     → (2, 3)  # Sum across width dimension\n",
"└─ .sum(axis=(1,2)) → (2,)    # Sum across both spatial dims (global pool)\n",
"```\n",
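"\n",
"The axis arithmetic above maps one-to-one onto NumPy (a quick check):\n",
"\n",
"```python\n",
"import numpy as np\n",
"m = np.array([[1, 2, 3], [4, 5, 6]])\n",
"print(m.sum())  # 21\n",
"print(m.sum(axis=0))  # [5 7 9]\n",
"print(m.sum(axis=1))  # [ 6 15]\n",
"print(m.sum(axis=1, keepdims=True).shape)  # (2, 1)\n",
"```\n",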
"\n",
"### Memory and Performance Considerations\n",
"\n",
"```\n",
"Reduction Performance:\n",
"┌─────────────────┬──────────────┬─────────────────┬─────────────────┐\n",
"│ Operation       │ Time Complex │ Memory Access   │ Cache Behavior  │\n",
"├─────────────────┼──────────────┼─────────────────┼─────────────────┤\n",
"│ .sum()          │ O(N)         │ Sequential read │ Excellent       │\n",
"│ .sum(axis=0)    │ O(N)         │ Column access   │ Poor (strided)  │\n",
"│ .sum(axis=1)    │ O(N)         │ Row access      │ Excellent       │\n",
"│ .mean()         │ O(N)         │ Sequential read │ Excellent       │\n",
"│ .max()          │ O(N)         │ Sequential read │ Excellent       │\n",
"└─────────────────┴──────────────┴─────────────────┴─────────────────┘\n",
"\n",
"Why axis=0 is slower:\n",
"- Accesses elements with large strides\n",
"- Poor cache locality (jumping rows)\n",
"- Less vectorization-friendly\n",
"\n",
"Optimization strategies:\n",
"- Prefer axis=-1 operations when possible\n",
"- Use keepdims=True to maintain shape for broadcasting\n",
"- Consider reshaping before reduction for better cache behavior\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "65f33648",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"### 🧪 Unit Test: Reduction Operations\n",
"\n",
"This test validates reduction operations work correctly with axis control and maintain proper shapes.\n",
"\n",
"**What we're testing**: Sum, mean, max operations with axis parameter and keepdims\n",
"**Why it matters**: Essential for loss computation, batch processing, and pooling operations\n",
"**Expected**: Correct reduction along specified axes with proper shape handling"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "61ff9e7a",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-reductions",
"locked": true,
"points": 10
}
},
"outputs": [],
"source": [
"def test_unit_reduction_operations():\n",
"    \"\"\"🧪 Test reduction operations.\"\"\"\n",
"    print(\"🧪 Unit Test: Reduction Operations...\")\n",
"\n",
"    matrix = Tensor([[1, 2, 3], [4, 5, 6]])  # Shape: (2, 3)\n",
"\n",
"    # Test sum all elements (common for loss computation)\n",
"    total = matrix.sum()\n",
"    assert total.data == 21.0  # 1+2+3+4+5+6\n",
"    assert total.shape == ()  # Scalar result\n",
"\n",
"    # Test sum along axis 0 (columns) - batch dimension reduction\n",
"    col_sum = matrix.sum(axis=0)\n",
"    expected_col = np.array([5, 7, 9], dtype=np.float32)  # [1+4, 2+5, 3+6]\n",
"    assert np.array_equal(col_sum.data, expected_col)\n",
"    assert col_sum.shape == (3,)\n",
"\n",
"    # Test sum along axis 1 (rows) - feature dimension reduction\n",
"    row_sum = matrix.sum(axis=1)\n",
"    expected_row = np.array([6, 15], dtype=np.float32)  # [1+2+3, 4+5+6]\n",
"    assert np.array_equal(row_sum.data, expected_row)\n",
"    assert row_sum.shape == (2,)\n",
"\n",
"    # Test mean (average loss computation)\n",
"    avg = matrix.mean()\n",
"    assert np.isclose(avg.data, 3.5)  # 21/6\n",
"    assert avg.shape == ()\n",
"\n",
"    # Test mean along axis (batch normalization pattern)\n",
"    col_mean = matrix.mean(axis=0)\n",
"    expected_mean = np.array([2.5, 3.5, 4.5], dtype=np.float32)  # [5/2, 7/2, 9/2]\n",
"    assert np.allclose(col_mean.data, expected_mean)\n",
"\n",
"    # Test max (finding best predictions)\n",
"    maximum = matrix.max()\n",
"    assert maximum.data == 6.0\n",
"    assert maximum.shape == ()\n",
"\n",
"    # Test max along axis (argmax-like operation)\n",
"    row_max = matrix.max(axis=1)\n",
"    expected_max = np.array([3, 6], dtype=np.float32)  # [max(1,2,3), max(4,5,6)]\n",
"    assert np.array_equal(row_max.data, expected_max)\n",
"\n",
"    # Test keepdims (important for broadcasting)\n",
"    sum_keepdims = matrix.sum(axis=1, keepdims=True)\n",
"    assert sum_keepdims.shape == (2, 1)  # Maintains 2D shape\n",
"    expected_keepdims = np.array([[6], [15]], dtype=np.float32)\n",
"    assert np.array_equal(sum_keepdims.data, expected_keepdims)\n",
"\n",
"    # Test 3D reduction (simulating global average pooling)\n",
"    tensor_3d = Tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # (2, 2, 2)\n",
"    spatial_mean = tensor_3d.mean(axis=(1, 2))  # Average across spatial dimensions\n",
"    assert spatial_mean.shape == (2,)  # One value per batch item\n",
"\n",
"    print(\"✅ Reduction operations work correctly!\")\n",
"\n",
"if __name__ == \"__main__\":\n",
"    test_unit_reduction_operations()"
]
},
{
"cell_type": "markdown",
"id": "e8f898c3",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 2
},
"source": [
"## Gradient Features: Preparing for Module 05\n",
"\n",
"Our Tensor includes dormant gradient features that will spring to life in Module 05. For now, they exist but do nothing - this design choice ensures a consistent interface throughout the course.\n",
"\n",
"### Why Include Gradient Features Now?\n",
"\n",
"```\n",
"Gradient System Evolution:\n",
"Module 01: Tensor with dormant gradients\n",
"  ┌─────────────────────────────────┐\n",
"  │ Tensor                          │\n",
"  │ • data: actual values           │\n",
"  │ • requires_grad: False          │ ← Present but unused\n",
"  │ • grad: None                    │ ← Present but stays None\n",
"  │ • backward(): pass              │ ← Present but does nothing\n",
"  └─────────────────────────────────┘\n",
"          ↓ Module 05 activates these\n",
"Module 05: Tensor with active gradients\n",
"  ┌─────────────────────────────────┐\n",
"  │ Tensor                          │\n",
"  │ • data: actual values           │\n",
"  │ • requires_grad: True           │ ← Now controls gradient tracking\n",
"  │ • grad: computed gradients      │ ← Now accumulates gradients\n",
"  │ • backward(): computes grads    │ ← Now implements chain rule\n",
"  └─────────────────────────────────┘\n",
"```\n",
"\n",
"### Design Benefits\n",
"\n",
"**Consistency**: Same Tensor class interface throughout all modules\n",
"- No confusing Variable vs. Tensor distinction (unlike early PyTorch)\n",
"- Students never need to learn a \"new\" Tensor class\n",
"- IDE autocomplete works from day one\n",
"\n",
"**Gradual Complexity**: Features activate when students are ready\n",
"- Module 01-04: Ignore gradient features, focus on operations\n",
"- Module 05: Gradient features \"turn on\" magically\n",
"- No cognitive overload in early modules\n",
"\n",
"**Future-Proof**: Easy to extend without breaking changes\n",
"- Additional features can be added as dormant initially\n",
"- No monkey-patching or dynamic class modification\n",
"- Clean evolution path\n",
"\n",
"### Current State (Module 01)\n",
"\n",
"```\n",
"Gradient Features - Current Behavior:\n",
"┌─────────────────────────────────────────────────────────┐\n",
"│ Feature           │ Current State │ Module 05 State     │\n",
"├─────────────────────────────────────────────────────────┤\n",
"│ requires_grad     │ False         │ True (when needed)  │\n",
"│ grad              │ None          │ np.array(...)       │\n",
"│ backward()        │ pass (no-op)  │ Chain rule impl     │\n",
"│ Operation chaining│ Not tracked   │ Computation graph   │\n",
"└─────────────────────────────────────────────────────────┘\n",
"\n",
"Student Experience:\n",
"• Can call .backward() without errors (just does nothing)\n",
"• Can set requires_grad=True (just gets stored)\n",
"• Focus on understanding tensor operations first\n",
"• Gradients remain \"mysterious\" until Module 05 reveals them\n",
"```\n",
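"\n",
"The dormant design boils down to a few lines. A hedged sketch (illustrative only, not the module's actual class):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"class TensorSketch:\n",
"    def __init__(self, data, requires_grad=False):\n",
"        self.data = np.asarray(data, dtype=np.float32)\n",
"        self.requires_grad = requires_grad  # stored now, used in Module 05\n",
"        self.grad = None  # stays None until autograd arrives\n",
"\n",
"    def backward(self):\n",
"        pass  # no-op placeholder; Module 05 replaces this with the chain rule\n",
"\n",
"t = TensorSketch([1, 2, 3], requires_grad=True)\n",
"t.backward()  # safe to call, does nothing\n",
"print(t.grad)  # None\n",
"```\n",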
"\n",
"This approach matches the pedagogical principle of \"progressive disclosure\" - reveal complexity only when students are ready to handle it."
]
},
{
"cell_type": "markdown",
"id": "03456dd8",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Systems Analysis: Memory Layout and Performance\n",
"\n",
"Even as a foundation module, let's understand ONE key systems concept that will inform every design decision in future modules: **memory layout and cache behavior**.\n",
"\n",
"This single analysis reveals why certain operations are fast while others are slow, and why framework designers make specific architectural choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a805194",
"metadata": {
"lines_to_next_cell": 2
},
"outputs": [],
"source": [
"def analyze_memory_layout():\n",
"    \"\"\"📊 Demonstrate cache effects with row vs column access patterns.\"\"\"\n",
"    print(\"📊 Analyzing Memory Access Patterns...\")\n",
"    print(\"=\" * 60)\n",
"\n",
"    # Create a moderately-sized matrix (large enough to show cache effects)\n",
"    size = 2000\n",
"    matrix = Tensor(np.random.rand(size, size))\n",
"\n",
"    import time\n",
"\n",
"    print(f\"\\nTesting with {size}×{size} matrix ({matrix.size * BYTES_PER_FLOAT32 / MB_TO_BYTES:.1f} MB)\")\n",
"    print(\"-\" * 60)\n",
"\n",
"    # Test 1: Row-wise access (cache-friendly)\n",
"    # Memory layout: [row0][row1][row2]... stored contiguously\n",
"    print(\"\\n🔬 Test 1: Row-wise Access (Cache-Friendly)\")\n",
"    start = time.time()\n",
"    row_sums = []\n",
"    for i in range(size):\n",
"        row_sum = matrix.data[i, :].sum()  # Access entire row sequentially\n",
"        row_sums.append(row_sum)\n",
"    row_time = time.time() - start\n",
"    print(f\"   Time: {row_time*1000:.1f}ms\")\n",
"    print(\"   Access pattern: Sequential (follows memory layout)\")\n",
"\n",
"    # Test 2: Column-wise access (cache-unfriendly)\n",
"    # Must jump between rows, poor spatial locality\n",
"    print(\"\\n🔬 Test 2: Column-wise Access (Cache-Unfriendly)\")\n",
"    start = time.time()\n",
"    col_sums = []\n",
"    for j in range(size):\n",
"        col_sum = matrix.data[:, j].sum()  # Access entire column with large strides\n",
"        col_sums.append(col_sum)\n",
"    col_time = time.time() - start\n",
"    print(f\"   Time: {col_time*1000:.1f}ms\")\n",
"    print(f\"   Access pattern: Strided (jumps {size * BYTES_PER_FLOAT32} bytes per element)\")\n",
"\n",
"    # Calculate slowdown\n",
"    slowdown = col_time / row_time\n",
"    print(\"\\n\" + \"=\" * 60)\n",
"    print(\"📊 PERFORMANCE IMPACT:\")\n",
"    print(f\"   Slowdown factor: {slowdown:.1f}× slower\")\n",
"    print(f\"   Cache misses cause {(slowdown-1)*100:.0f}% performance loss\")\n",
"\n",
"    # Educational insights\n",
"    print(\"\\n💡 KEY INSIGHTS:\")\n",
"    print(\"   1. Memory layout matters: Row-major (C-style) storage is sequential\")\n",
"    print(\"   2. Cache lines are ~64 bytes: Row access loads nearby elements \\\"for free\\\"\")\n",
"    print(\"   3. Column access misses cache: Must reload from DRAM every time\")\n",
"    print(f\"   4. Both loops are O(n), yet wall-clock time differs by {slowdown:.1f}×!\")\n",
"\n",
"    print(\"\\n🚀 REAL-WORLD IMPLICATIONS:\")\n",
"    print(\"   • CNN frameworks choose memory formats (NCHW vs NHWC) for cache efficiency\")\n",
"    print(\"   • Matrix multiplication optimized with blocking (tile into cache-sized chunks)\")\n",
"    print(f\"   • Transpose is expensive (~{slowdown:.1f}×) because it changes memory layout\")\n",
"    print(\"   • This is why GPU frameworks obsess over memory coalescing\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 60)\n",
"\n",
"# Run the systems analysis\n",
"if __name__ == \"__main__\":\n",
"    analyze_memory_layout()"
]
},
{
"cell_type": "markdown",
"id": "3b24da26",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 2
},
"source": [
"## 4. Integration: Bringing It Together\n",
"\n",
"Let's test how our Tensor operations work together in realistic scenarios that mirror neural network computations. This integration demonstrates that our individual operations combine correctly for complex ML workflows.\n",
"\n",
"### Neural Network Layer Simulation\n",
"\n",
"The fundamental building block of neural networks is the linear transformation: **y = xW + b**\n",
"\n",
"```\n",
"Linear Layer Forward Pass: y = xW + b\n",
"\n",
"Input Features → Weight Matrix → Matrix Multiply → Add Bias → Output Features\n",
"  (batch, in)      (in, out)      (batch, out)    (batch, out)   (batch, out)\n",
"\n",
"Step-by-Step Breakdown:\n",
"1. Input:  X shape (batch_size, input_features)\n",
"2. Weight: W shape (input_features, output_features)\n",
"3. Matmul: XW shape (batch_size, output_features)\n",
"4. Bias:   b shape (output_features,)\n",
"5. Result: XW + b shape (batch_size, output_features)\n",
"\n",
"Example Flow:\n",
"Input: [[1, 2, 3],    Weight: [[0.1, 0.2],   Bias: [0.1, 0.2]\n",
"        [4, 5, 6]]             [0.3, 0.4],\n",
"        (2, 3)                 [0.5, 0.6]]\n",
"                               (3, 2)\n",
"\n",
"Step 1: Matrix Multiply\n",
"[[1, 2, 3]] @ [[0.1, 0.2]] = [[1×0.1+2×0.3+3×0.5, 1×0.2+2×0.4+3×0.6]]\n",
"[[4, 5, 6]]   [[0.3, 0.4]]   [[4×0.1+5×0.3+6×0.5, 4×0.2+5×0.4+6×0.6]]\n",
"              [[0.5, 0.6]]\n",
"                           = [[2.2, 2.8],\n",
"                              [4.9, 6.4]]\n",
"\n",
"Step 2: Add Bias (Broadcasting)\n",
"[[2.2, 2.8],  + [0.1, 0.2] = [[2.3, 3.0],\n",
" [4.9, 6.4]]                  [5.0, 6.6]]\n",
"\n",
"This is the foundation of every neural network layer!\n",
"```\n",
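"\n",
"The worked example can be checked with plain NumPy (a small sketch; broadcasting handles the bias row):\n",
"\n",
"```python\n",
"import numpy as np\n",
"X = np.array([[1., 2., 3.], [4., 5., 6.]])  # (2, 3)\n",
"W = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])  # (3, 2)\n",
"b = np.array([0.1, 0.2])  # (2,), broadcast across the batch\n",
"Y = X @ W + b  # (2, 2)\n",
"print(np.allclose(Y, [[2.3, 3.0], [5.0, 6.6]]))  # True\n",
"```\n",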
"\n",
"### Why This Integration Matters\n",
"\n",
"This simulation shows how our basic operations combine to create the computational building blocks of neural networks:\n",
"\n",
"- **Matrix Multiplication**: Transforms input features into new feature space\n",
"- **Broadcasting Addition**: Applies learned biases efficiently across batches\n",
"- **Shape Handling**: Ensures data flows correctly through layers\n",
"- **Memory Management**: Creates new tensors without corrupting inputs\n",
"\n",
"Every layer in a neural network - from simple MLPs to complex transformers - uses this same pattern."
]
},
{
"cell_type": "markdown",
"id": "6fb37dc0",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## 🧪 Module Integration Test\n",
"\n",
"Final validation that everything works together correctly before module completion."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "461b98b5",
"metadata": {
"lines_to_next_cell": 2,
"nbgrader": {
"grade": true,
"grade_id": "module-integration",
"locked": true,
"points": 20
}
},
"outputs": [],
"source": [
"def test_module():\n",
"    \"\"\"\n",
"    Comprehensive test of entire module functionality.\n",
"\n",
"    This final test runs before module summary to ensure:\n",
"    - All unit tests pass\n",
"    - Functions work together correctly\n",
"    - Module is ready for integration with TinyTorch\n",
"    \"\"\"\n",
"    print(\"🧪 RUNNING MODULE INTEGRATION TEST\")\n",
"    print(\"=\" * 50)\n",
"\n",
"    # Run all unit tests\n",
"    print(\"Running unit tests...\")\n",
"    test_unit_tensor_creation()\n",
"    test_unit_arithmetic_operations()\n",
"    test_unit_matrix_multiplication()\n",
"    test_unit_shape_manipulation()\n",
"    test_unit_reduction_operations()\n",
"\n",
"    print(\"\\nRunning integration scenarios...\")\n",
"\n",
"    # Test realistic neural network computation\n",
"    print(\"🧪 Integration Test: Two-Layer Neural Network...\")\n",
"\n",
"    # Create input data (2 samples, 3 features)\n",
"    x = Tensor([[1, 2, 3], [4, 5, 6]])\n",
"\n",
"    # First layer: 3 inputs → 4 hidden units\n",
"    W1 = Tensor([[0.1, 0.2, 0.3, 0.4],\n",
"                 [0.5, 0.6, 0.7, 0.8],\n",
"                 [0.9, 1.0, 1.1, 1.2]])\n",
"    b1 = Tensor([0.1, 0.2, 0.3, 0.4])\n",
"\n",
"    # Forward pass: hidden = x @ W1 + b1\n",
"    hidden = x.matmul(W1) + b1\n",
"    assert hidden.shape == (2, 4), f\"Expected (2, 4), got {hidden.shape}\"\n",
"\n",
"    # Second layer: 4 hidden → 2 outputs\n",
"    W2 = Tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])\n",
"    b2 = Tensor([0.1, 0.2])\n",
"\n",
"    # Output layer: output = hidden @ W2 + b2\n",
"    output = hidden.matmul(W2) + b2\n",
"    assert output.shape == (2, 2), f\"Expected (2, 2), got {output.shape}\"\n",
"\n",
"    # Verify data flows correctly (no NaN, reasonable values)\n",
"    assert not np.isnan(output.data).any(), \"Output contains NaN values\"\n",
"    assert np.isfinite(output.data).all(), \"Output contains infinite values\"\n",
"\n",
"    print(\"✅ Two-layer neural network computation works!\")\n",
"\n",
"    # Test gradient attributes are preserved and functional\n",
"    print(\"🧪 Integration Test: Gradient System Readiness...\")\n",
"    grad_tensor = Tensor([1, 2, 3], requires_grad=True)\n",
"    result = grad_tensor + 5\n",
"    assert grad_tensor.requires_grad == True, \"requires_grad not preserved\"\n",
"    assert grad_tensor.grad is None, \"grad should still be None\"\n",
"\n",
"    # Test backward() doesn't crash (even though it does nothing)\n",
"    grad_tensor.backward()  # Should not raise any exception\n",
"\n",
"    print(\"✅ Gradient system ready for Module 05!\")\n",
"\n",
"    # Test complex shape manipulations\n",
"    print(\"🧪 Integration Test: Complex Shape Operations...\")\n",
"    data = Tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])\n",
"\n",
"    # Reshape to 3D tensor (simulating batch processing)\n",
"    tensor_3d = data.reshape(2, 2, 3)  # (batch=2, height=2, width=3)\n",
"    assert tensor_3d.shape == (2, 2, 3)\n",
"\n",
"    # Global average pooling simulation\n",
"    pooled = tensor_3d.mean(axis=(1, 2))  # Average across spatial dimensions\n",
"    assert pooled.shape == (2,), f\"Expected (2,), got {pooled.shape}\"\n",
"\n",
"    # Flatten for MLP\n",
"    flattened = tensor_3d.reshape(2, -1)  # (batch, features)\n",
"    assert flattened.shape == (2, 6)\n",
"\n",
"    # Transpose for different operations\n",
"    transposed = tensor_3d.transpose()  # Should transpose last two dims\n",
"    assert transposed.shape == (2, 3, 2)\n",
"\n",
"    print(\"✅ Complex shape operations work!\")\n",
"\n",
"    # Test broadcasting edge cases\n",
"    print(\"🧪 Integration Test: Broadcasting Edge Cases...\")\n",
"\n",
"    # Scalar broadcasting\n",
"    scalar = Tensor(5.0)\n",
"    vector = Tensor([1, 2, 3])\n",
"    result = scalar + vector  # Should broadcast scalar to vector shape\n",
"    expected = np.array([6, 7, 8], dtype=np.float32)\n",
"    assert np.array_equal(result.data, expected)\n",
"\n",
"    # Matrix + vector broadcasting\n",
"    matrix = Tensor([[1, 2], [3, 4]])\n",
"    vec = Tensor([10, 20])\n",
"    result = matrix + vec\n",
"    expected = np.array([[11, 22], [13, 24]], dtype=np.float32)\n",
"    assert np.array_equal(result.data, expected)\n",
"\n",
"    print(\"✅ Broadcasting edge cases work!\")\n",
"\n",
"    print(\"\\n\" + \"=\" * 50)\n",
"    print(\"🎉 ALL TESTS PASSED! Module ready for export.\")\n",
"    print(\"Run: tito module complete 01_tensor\")\n",
"\n",
"# Run comprehensive module test\n",
"if __name__ == \"__main__\":\n",
|
||
" test_module()"
]
},
{
"cell_type": "markdown",
"id": "0f104aba",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🤔 ML Systems Reflection Questions\n",
"\n",
"Answer these to deepen your understanding of tensor operations and their systems implications:\n",
"\n",
"### 1. Memory Layout and Cache Performance\n",
"**Question**: How does row-major vs column-major storage affect cache performance in tensor operations?\n",
"\n",
"**Consider**:\n",
"- What happens when you access matrix elements sequentially vs. with large strides?\n",
"- Why did our analysis show column-wise access being ~2-3× slower than row-wise?\n",
"- How would this affect the design of a convolutional neural network's memory layout?\n",
"\n",
"**Real-world context**: PyTorch defaults to the NCHW (batch, channels, height, width) layout, which stores each channel's spatial plane contiguously, a pattern many cuDNN convolution kernels are tuned for. The NHWC (channels-last) layout instead keeps all channels of one pixel adjacent, and is often faster on modern tensor-core hardware, which is why PyTorch also offers a channels-last memory format.\n",
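"\n",
"To see the effect yourself, here is a rough timing sketch (absolute numbers depend on your CPU and cache sizes, so treat them as illustrative):\n",
"\n",
"```python\n",
"import time\n",
"import numpy as np\n",
"\n",
"# C-order (row-major) array: elements within a row are adjacent in memory\n",
"A = np.ones((2000, 2000), dtype=np.float32)\n",
"\n",
"start = time.perf_counter()\n",
"row_total = sum(float(A[i, :].sum()) for i in range(A.shape[0]))  # contiguous slices\n",
"row_time = time.perf_counter() - start\n",
"\n",
"start = time.perf_counter()\n",
"col_total = sum(float(A[:, j].sum()) for j in range(A.shape[1]))  # strided slices\n",
"col_time = time.perf_counter() - start\n",
"\n",
"print(f\"row-wise: {row_time*1e3:.1f} ms, column-wise: {col_time*1e3:.1f} ms\")\n",
"```\n",
"\n",
"Both loops compute the same total; only the memory access pattern differs.\n",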
"\n",
"---\n",
"\n",
"### 2. Batch Processing and Scaling\n",
"**Question**: If you double the batch size in a neural network, what happens to memory usage? What about computation time?\n",
"\n",
"**Consider**:\n",
"- A linear layer with input (batch, features): y = xW + b\n",
"- Memory for: input tensor, weight matrix, output tensor, intermediate results\n",
"- How does matrix multiplication time scale with batch size?\n",
"\n",
"**Think about**:\n",
"- If (32, 784) @ (784, 256) takes 10ms, how long does (64, 784) @ (784, 256) take?\n",
"- Does doubling batch size double memory usage? Why or why not?\n",
"- What are the trade-offs between large and small batch sizes?\n",
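"\n",
"The memory half of the question can be checked directly with NumPy's `nbytes` (a sketch that only measures sizes, not timing):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"W = np.zeros((784, 256), dtype=np.float32)  # parameters are batch-independent\n",
"b = np.zeros(256, dtype=np.float32)\n",
"\n",
"for batch in (32, 64):\n",
"    x = np.zeros((batch, 784), dtype=np.float32)\n",
"    y = x @ W + b\n",
"    print(batch, \"activation bytes:\", x.nbytes + y.nbytes, \"parameter bytes:\", W.nbytes + b.nbytes)\n",
"```\n",
"\n",
"Doubling the batch doubles the activation memory while the parameter memory stays fixed, which is why activations, not weights, tend to dominate memory at large batch sizes.\n",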
"\n",
"---\n",
"\n",
"### 3. Data Type Precision and Memory\n",
"**Question**: What's the memory difference between float64 and float32 for a (1000, 1000) tensor? When would you choose each?\n",
"\n",
"**Calculate**:\n",
"- float64: 8 bytes per element\n",
"- float32: 4 bytes per element\n",
"- Total elements in (1000, 1000): ___________\n",
"- Memory difference: ___________\n",
"\n",
"**Trade-offs to consider**:\n",
"- Training accuracy vs. memory consumption\n",
"- GPU memory limits (often 8-16GB for consumer GPUs)\n",
"- Numerical stability in gradient computation\n",
"- Inference speed on mobile devices\n",
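"\n",
"After filling in the blanks, you can confirm the arithmetic with `nbytes` (10^6 elements, so the totals come out in whole megabytes):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"a64 = np.zeros((1000, 1000), dtype=np.float64)\n",
"a32 = np.zeros((1000, 1000), dtype=np.float32)\n",
"print(a64.nbytes, a32.nbytes, a64.nbytes - a32.nbytes)  # 8000000 4000000 4000000\n",
"```\n",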
"\n",
"---\n",
"\n",
"### 4. Production Scale: Memory Requirements\n",
"**Question**: A GPT-3-scale model has 175 billion parameters. How much RAM is needed just to store the weights in float32? What about with an optimizer like Adam?\n",
"\n",
"**Calculate**:\n",
"- Parameters: 175 × 10^9\n",
"- Bytes per float32: 4\n",
"- Weight memory: ___________GB\n",
"\n",
"**Additional memory for Adam optimizer**:\n",
"- Adam stores: parameters, gradients, first moment (m), second moment (v)\n",
"- Total multiplier: 4× the parameter count\n",
"- Total with Adam: ___________GB\n",
"\n",
"**Real-world implications**:\n",
"- Why does training at this scale require a cluster of many GPUs (a single A100 has only 40-80GB)?\n",
"- What is mixed-precision training (float16/bfloat16)?\n",
"- How does gradient checkpointing help?\n",
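"\n",
"A quick sanity check for the blanks above (using decimal GB, 1 GB = 10^9 bytes):\n",
"\n",
"```python\n",
"params = 175e9                  # 175 billion parameters\n",
"weight_gb = params * 4 / 1e9    # float32 = 4 bytes per parameter\n",
"adam_gb = weight_gb * 4         # parameters + gradients + m + v\n",
"print(weight_gb, adam_gb)       # 700.0 2800.0\n",
"```\n",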
"\n",
"---\n",
"\n",
"### 5. Hardware Awareness: GPU Efficiency\n",
"**Question**: Why do GPUs strongly prefer operations on large tensors over many small ones?\n",
"\n",
"**Consider these scenarios**:\n",
"- **Scenario A**: 1000 separate (10, 10) matrix multiplications\n",
"- **Scenario B**: 1 batched (1000, 10, 10) matrix multiplication\n",
"\n",
"**Think about**:\n",
"- GPU kernel launch overhead (~5-10 microseconds per launch)\n",
"- Thread parallelism utilization (GPUs have 1000s of cores)\n",
"- Memory transfer costs (CPU→GPU has ~10GB/s bandwidth, GPU memory has ~900GB/s)\n",
"- When is the GPU actually doing computation vs. waiting?\n",
"\n",
"**Design principle**: Batch operations together to amortize overhead and maximize parallelism.\n",
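"\n",
"You can approximate the two scenarios on the CPU with NumPy, where per-call Python dispatch stands in for kernel-launch overhead (on a real GPU the gap is typically far larger):\n",
"\n",
"```python\n",
"import time\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"A = rng.standard_normal((1000, 10, 10)).astype(np.float32)\n",
"B = rng.standard_normal((1000, 10, 10)).astype(np.float32)\n",
"\n",
"start = time.perf_counter()\n",
"looped = np.stack([A[i] @ B[i] for i in range(1000)])  # Scenario A: 1000 small calls\n",
"loop_time = time.perf_counter() - start\n",
"\n",
"start = time.perf_counter()\n",
"batched = A @ B                                        # Scenario B: one batched call\n",
"batch_time = time.perf_counter() - start\n",
"\n",
"print(f\"looped: {loop_time*1e3:.2f} ms, batched: {batch_time*1e3:.2f} ms\")\n",
"```\n",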
"\n",
"---\n",
"\n",
"### Bonus Challenge: Optimization Analysis\n",
"\n",
"**Scenario**: You're implementing a custom activation function that will be applied to every element in a tensor. You have two implementation choices:\n",
"\n",
"**Option A**: Python loop over each element\n",
"```python\n",
"def custom_activation(tensor):\n",
"    result = np.empty_like(tensor.data)\n",
"    for i in range(tensor.data.size):\n",
"        result.flat[i] = complex_math_function(tensor.data.flat[i])\n",
"    return Tensor(result)\n",
"```\n",
"\n",
"**Option B**: NumPy vectorized operation\n",
"```python\n",
"def custom_activation(tensor):\n",
"    return Tensor(complex_math_function(tensor.data))\n",
"```\n",
"\n",
"**Questions**:\n",
"1. For a (1000, 1000) tensor, estimate the speedup of Option B vs Option A\n",
"2. Why is vectorization faster even though both are O(n) operations?\n",
"3. What if the tensor is tiny (10, 10) - does the answer change?\n",
"4. How would this change if we move to GPU computation?\n",
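"\n",
"To test your estimates, here is one possible harness; `complex_math_function` above is a placeholder, so `np.tanh` stands in for it, and a smaller (300, 300) tensor keeps Option A's runtime tolerable:\n",
"\n",
"```python\n",
"import time\n",
"import numpy as np\n",
"\n",
"data = np.random.default_rng(0).standard_normal((300, 300)).astype(np.float32)\n",
"\n",
"start = time.perf_counter()\n",
"loop_result = np.empty_like(data)\n",
"for i in range(data.size):                 # Option A: Python-level loop\n",
"    loop_result.flat[i] = np.tanh(data.flat[i])\n",
"loop_time = time.perf_counter() - start\n",
"\n",
"start = time.perf_counter()\n",
"vec_result = np.tanh(data)                 # Option B: one vectorized call\n",
"vec_time = time.perf_counter() - start\n",
"\n",
"print(f\"Option A: {loop_time:.3f}s  Option B: {vec_time:.5f}s  speedup: {loop_time/vec_time:.0f}x\")\n",
"```\n",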
"\n",
"**Key insight**: Algorithmic complexity (Big-O) doesn't tell the whole performance story. Constant factors from vectorization, cache behavior, and parallelism dominate in practice."
]
},
{
"cell_type": "markdown",
"id": "c8195b08",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🎯 MODULE SUMMARY: Tensor Foundation\n",
"\n",
"Congratulations! You've built the foundational Tensor class that powers all machine learning operations!\n",
"\n",
"### Key Accomplishments\n",
"- **Built a complete Tensor class** with arithmetic operations, matrix multiplication, and shape manipulation\n",
"- **Implemented broadcasting semantics** that match NumPy for automatic shape alignment\n",
"- **Created dormant gradient features** that will activate in Module 05 (autograd)\n",
"- **Added comprehensive ASCII diagrams** showing tensor operations visually\n",
"- **All methods defined INSIDE the class** (no monkey-patching) for clean, maintainable code\n",
"- **All tests pass ✅** (validated by `test_module()`)\n",
"\n",
"### Systems Insights Discovered\n",
"- **Memory scaling**: Matrix operations create new tensors (3× memory during computation)\n",
"- **Broadcasting efficiency**: NumPy's automatic shape alignment vs. explicit operations\n",
"- **Shape validation trade-offs**: Clear errors vs. performance in tight loops\n",
"- **Architecture decisions**: Dormant features vs. inheritance for clean evolution\n",
"\n",
"### Ready for Next Steps\n",
"Your Tensor implementation enables all future modules! The dormant gradient features will spring to life in Module 05, and every neural network component will build on this foundation.\n",
"\n",
"Export with: `tito module complete 01_tensor`\n",
"\n",
"**Next**: Module 02 will add activation functions (ReLU, Sigmoid, GELU) that bring intelligence to neural networks by introducing nonlinearity!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}