TinyTorch/modules/dataloader/dataloader_dev.ipynb
Vijay Janapa Reddi 8331f93438 refactor: rename data module to dataloader
- Rename modules/data/ → modules/dataloader/
- Rename data_dev.py → dataloader_dev.py
- Update NBDev export target: core.data → core.dataloader
- Rename test files: test_data.py → test_dataloader.py
- Update package exports to tinytorch.core.dataloader
- Update module imports and internal references

This makes the module name more descriptive and aligned with ML industry standards.
2025-07-11 18:59:09 -04:00

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"jupyter:\n",
" jupytext:\n",
" text_representation:\n",
" extension: .py\n",
" format_name: percent\n",
" format_version: '1.3'\n",
" jupytext_version: 1.17.1\n",
"---\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"# Module 4: Data - Data Loading and Preprocessing\n",
"\n",
"Welcome to the Data module! This is where you'll learn how to efficiently load, process, and manage data for machine learning systems.\n",
"\n",
"## Learning Goals\n",
"- Understand data pipelines as the foundation of ML systems\n",
"- Implement efficient data loading with memory management\n",
"- Build reusable dataset abstractions for different data types\n",
"- Master batching strategies and I/O optimization\n",
"- Learn systems thinking for data engineering\n",
"\n",
"## Build \u2192 Use \u2192 Understand\n",
"1. **Build**: Create dataset classes and data loaders\n",
"2. **Use**: Load real datasets and train models\n",
"3. **Understand**: How data engineering affects system performance\n",
"\n",
"## Module Dependencies\n",
"This module builds on previous modules:\n",
"- **tensor** \u2192 **activations** \u2192 **layers** \u2192 **networks** \u2192 **data**\n",
"- Data feeds into training: data \u2192 autograd \u2192 training\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"## \ud83d\udce6 Where This Code Lives in the Final Package\n",
"\n",
"**Learning Side:** You work in `modules/dataloader/dataloader_dev.py` \n",
"**Building Side:** Code exports to `tinytorch.core.dataloader`\n",
"\n",
"```python\n",
"# Final package structure:\n",
"from tinytorch.core.dataloader import Dataset, DataLoader, CIFAR10Dataset\n",
"from tinytorch.core.tensor import Tensor\n",
"from tinytorch.core.networks import Sequential\n",
"```\n",
"\n",
"**Why this matters:**\n",
"- **Learning:** Focused modules for deep understanding\n",
"- **Production:** Proper organization like PyTorch's `torch.utils.data`\n",
"- **Consistency:** All data loading utilities live together in `core.data`\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| default_exp core.dataloader\n",
"\n",
"# Setup and imports\n",
"import numpy as np\n",
"import sys\n",
"import os\n",
"import pickle\n",
"import struct\n",
"from typing import List, Tuple, Optional, Union, Iterator\n",
"import matplotlib.pyplot as plt\n",
"import urllib.request\n",
"import tarfile\n",
"\n",
"# Import our building blocks\n",
"from tinytorch.core.tensor import Tensor\n",
"\n",
"print(\"\ud83d\udd25 TinyTorch Data Module\")\n",
"print(f\"NumPy version: {np.__version__}\")\n",
"print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
"print(\"Ready to build data pipelines!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"import numpy as np\n",
"import sys\n",
"import os\n",
"import pickle\n",
"import struct\n",
"from typing import List, Tuple, Optional, Union, Iterator\n",
"import matplotlib.pyplot as plt\n",
"import urllib.request\n",
"import tarfile\n",
"\n",
"# Import our building blocks\n",
"from tinytorch.core.tensor import Tensor"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"#| export\n",
"def _should_show_plots():\n",
" \"\"\"Check if we should show plots (disable during testing)\"\"\"\n",
" return 'pytest' not in sys.modules and 'test' not in sys.argv"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"## Step 1: What is Data Engineering?\n",
"\n",
"### Definition\n",
"**Data engineering** is the foundation of all machine learning systems. It involves loading, processing, and managing data efficiently so that models can learn from it.\n",
"\n",
"### Why Data Engineering Matters\n",
"- **Data is the fuel**: Without proper data pipelines, nothing else works\n",
"- **I/O bottlenecks**: Data loading is often the biggest performance bottleneck\n",
"- **Memory management**: How you handle data affects everything else\n",
"- **Production reality**: Data pipelines are critical in real ML systems\n",
"\n",
"### The Fundamental Insight\n",
"**Data engineering is about managing the flow of information through your system:**\n",
"```\n",
"Raw Data \u2192 Load \u2192 Preprocess \u2192 Batch \u2192 Feed to Model\n",
"```\n",
"\n",
"### Real-World Examples\n",
"- **Image datasets**: CIFAR-10, ImageNet, MNIST\n",
"- **Text datasets**: Wikipedia, books, social media\n",
"- **Tabular data**: CSV files, databases, spreadsheets\n",
"- **Audio data**: Speech recordings, music files\n",
"\n",
"### Systems Thinking\n",
"- **Memory efficiency**: Handle datasets larger than RAM\n",
"- **I/O optimization**: Read from disk efficiently\n",
"- **Batching strategies**: Trade-offs between memory and speed\n",
"- **Caching**: When to cache vs recompute\n",
"\n",
"### Visual Intuition\n",
"```\n",
"Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...]\n",
"Load: [Tensor(32x32x3), Tensor(32x32x3), Tensor(32x32x3), ...]\n",
"Batch: [Tensor(32, 32, 32, 3)] # 32 images at once\n",
"Model: Process batch efficiently\n",
"```\n",
"\n",
"Let's start by building the most fundamental component: **Dataset**.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class Dataset:\n",
" \"\"\"\n",
" Base Dataset class: Abstract interface for all datasets.\n",
" \n",
" The fundamental abstraction for data loading in TinyTorch.\n",
" Students implement concrete datasets by inheriting from this class.\n",
" \n",
" TODO: Implement the base Dataset class with required methods.\n",
" \n",
" APPROACH:\n",
" 1. Define the interface that all datasets must implement\n",
" 2. Include methods for getting individual samples and dataset size\n",
" 3. Make it easy to extend for different data types\n",
" \n",
" EXAMPLE:\n",
" dataset = CIFAR10Dataset(\"data/cifar10/\")\n",
" sample, label = dataset[0] # Get first sample\n",
" size = len(dataset) # Get dataset size\n",
" \n",
" HINTS:\n",
" - Use abstract methods that subclasses must implement\n",
" - Include __getitem__ for indexing and __len__ for size\n",
" - Add helper methods for getting sample shapes and number of classes\n",
" \"\"\"\n",
" \n",
" def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
" \"\"\"\n",
" Get a single sample and label by index.\n",
" \n",
" Args:\n",
" index: Index of the sample to retrieve\n",
" \n",
" Returns:\n",
" Tuple of (data, label) tensors\n",
" \n",
" TODO: Implement abstract method for getting samples.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return a tuple of (data, label) tensors\n",
" 3. Data should be the input features, label should be the target\n",
" \n",
" EXAMPLE:\n",
" dataset[0] should return (Tensor(image_data), Tensor(label))\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the total number of samples in the dataset.\n",
" \n",
" TODO: Implement abstract method for getting dataset size.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return the total number of samples in the dataset\n",
" \n",
" EXAMPLE:\n",
" len(dataset) should return 50000 for CIFAR-10 training set\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")\n",
" \n",
" def get_sample_shape(self) -> Tuple[int, ...]:\n",
" \"\"\"\n",
" Get the shape of a single data sample.\n",
" \n",
" TODO: Implement method to get sample shape.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. Get the first sample using self[0]\n",
" 2. Extract the data part (first element of tuple)\n",
" 3. Return the shape of the data tensor\n",
" \n",
" EXAMPLE:\n",
" For CIFAR-10: returns (3, 32, 32) for RGB images\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")\n",
" \n",
" def get_num_classes(self) -> int:\n",
" \"\"\"\n",
" Get the number of classes in the dataset.\n",
" \n",
" TODO: Implement abstract method for getting number of classes.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return the total number of classes in the dataset\n",
" \n",
" EXAMPLE:\n",
" For CIFAR-10: returns 10 (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"#| export\n",
"class Dataset:\n",
" \"\"\"Base Dataset class: Abstract interface for all datasets.\"\"\"\n",
" \n",
" def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
" \"\"\"Get a single sample and label by index.\"\"\"\n",
" raise NotImplementedError(\"Subclasses must implement __getitem__\")\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"Get the total number of samples in the dataset.\"\"\"\n",
" raise NotImplementedError(\"Subclasses must implement __len__\")\n",
" \n",
" def get_sample_shape(self) -> Tuple[int, ...]:\n",
" \"\"\"Get the shape of a single data sample.\"\"\"\n",
" sample, _ = self[0]\n",
" return sample.shape\n",
" \n",
" def get_num_classes(self) -> int:\n",
" \"\"\"Get the number of classes in the dataset.\"\"\"\n",
" raise NotImplementedError(\"Subclasses must implement get_num_classes\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"### \ud83e\uddea Test Your Base Dataset\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test the base Dataset class\n",
"print(\"Testing base Dataset class...\")\n",
"\n",
"try:\n",
" # Create a simple test dataset\n",
" class TestDataset(Dataset):\n",
" def __init__(self, num_samples=10):\n",
" self.num_samples = num_samples\n",
" self.data = [Tensor(np.random.randn(3, 32, 32)) for _ in range(num_samples)]\n",
" self.labels = [Tensor(np.array(i % 3)) for i in range(num_samples)]\n",
" \n",
" def __getitem__(self, index):\n",
" return self.data[index], self.labels[index]\n",
" \n",
" def __len__(self):\n",
" return self.num_samples\n",
" \n",
" def get_num_classes(self):\n",
" return 3\n",
" \n",
" # Test the dataset\n",
" dataset = TestDataset(5)\n",
" print(f\"\u2705 Dataset created with {len(dataset)} samples\")\n",
" \n",
" # Test indexing\n",
" sample, label = dataset[0]\n",
" print(f\"\u2705 Sample shape: {sample.shape}\")\n",
" print(f\"\u2705 Label: {label}\")\n",
" \n",
" # Test helper methods\n",
" print(f\"\u2705 Sample shape: {dataset.get_sample_shape()}\")\n",
" print(f\"\u2705 Number of classes: {dataset.get_num_classes()}\")\n",
" \n",
" print(\"\ud83c\udf89 Base Dataset class works!\")\n",
" \n",
"except Exception as e:\n",
" print(f\"\u274c Error: {e}\")\n",
" print(\"Make sure to implement the base Dataset class above!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"## Step 2: Understanding CIFAR-10 Dataset\n",
"\n",
"Now let's build a real dataset! We'll focus on **CIFAR-10** - the perfect dataset for learning data loading.\n",
"\n",
"### Why CIFAR-10?\n",
"- **Perfect size**: 170MB - large enough for optimization, small enough to manage\n",
"- **Real data**: 32x32 color images, 10 classes\n",
"- **Classic dataset**: Every ML student should know it\n",
"- **Good complexity**: Requires proper data loading techniques\n",
"\n",
"### The CIFAR-10 Format\n",
"```\n",
"File structure:\n",
"- data_batch_1: 10,000 images + labels\n",
"- data_batch_2: 10,000 images + labels\n",
"- ...\n",
"- test_batch: 10,000 test images\n",
"- batches.meta: Class names and metadata\n",
"\n",
"Binary format:\n",
"- Each image: 3073 bytes (3072 for RGB + 1 for label)\n",
"- Images stored as: [label, R, G, B, R, G, B, ...]\n",
"- 32x32x3 = 3072 bytes per image\n",
"```\n",
"\n",
"### Data Loading Challenges\n",
"- **Binary file parsing**: CIFAR-10 uses a custom binary format\n",
"- **Memory management**: 60,000 images need efficient handling\n",
"- **Batching**: Grouping samples for efficient processing\n",
"- **Preprocessing**: Normalization, augmentation, etc.\n",
"\n",
"Let's implement CIFAR-10 loading step by step!\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class CIFAR10Dataset(Dataset):\n",
" \"\"\"\n",
" CIFAR-10 Dataset: Load and manage CIFAR-10 image data.\n",
" \n",
" CIFAR-10 contains 60,000 32x32 color images in 10 classes.\n",
" Perfect for learning data loading and image processing.\n",
" \n",
" Args:\n",
" root_dir: Directory containing CIFAR-10 files\n",
" train: If True, load training data. If False, load test data.\n",
" download: If True, download dataset if not present\n",
" \n",
" TODO: Implement CIFAR-10 dataset loading.\n",
" \n",
" APPROACH:\n",
" 1. Handle dataset download if needed (with progress bar!)\n",
" 2. Parse binary files to extract images and labels\n",
" 3. Store data efficiently in memory\n",
" 4. Implement indexing and size methods\n",
" \n",
" EXAMPLE:\n",
" dataset = CIFAR10Dataset(\"data/cifar10/\", train=True)\n",
" image, label = dataset[0] # Get first image\n",
" print(f\"Image shape: {image.shape}\") # (3, 32, 32)\n",
" print(f\"Label: {label}\") # Tensor with class index\n",
" \n",
" HINTS:\n",
" - Use pickle to load binary files\n",
" - Each batch file contains 'data' and 'labels' keys\n",
" - Reshape data to (3, 32, 32) format\n",
" - Store images and labels as separate lists\n",
" - Add progress bar with urllib.request.urlretrieve(url, filename, reporthook=progress_function)\n",
" - Progress function receives (block_num, block_size, total_size) parameters\n",
" \"\"\"\n",
" \n",
" def __init__(self, root_dir: str, train: bool = True, download: bool = True):\n",
" \"\"\"\n",
" Initialize CIFAR-10 dataset.\n",
" \n",
" Args:\n",
" root_dir: Directory to store/load dataset\n",
" train: If True, load training data. If False, load test data.\n",
" download: If True, download dataset if not present\n",
" \n",
" TODO: Implement CIFAR-10 initialization.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. Create root directory if it doesn't exist\n",
" 2. Download dataset if needed and not present (with progress bar!)\n",
" 3. Load binary files and parse data\n",
" 4. Store images and labels in memory\n",
" 5. Set up class names\n",
" \n",
" EXAMPLE:\n",
" CIFAR10Dataset(\"data/cifar10/\", train=True)\n",
" creates a dataset with 50,000 training images\n",
" \n",
" PROGRESS BAR HINT:\n",
" def show_progress(block_num, block_size, total_size):\n",
" downloaded = block_num * block_size\n",
" percent = (downloaded * 100) // total_size\n",
" print(f\"\\\\rDownloading: {percent}%\", end='', flush=True)\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")\n",
" \n",
" def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
" \"\"\"\n",
" Get a single image and label by index.\n",
" \n",
" Args:\n",
" index: Index of the sample to retrieve\n",
" \n",
" Returns:\n",
" Tuple of (image, label) tensors\n",
" \n",
" TODO: Implement sample retrieval.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. Get image from self.images[index]\n",
" 2. Get label from self.labels[index]\n",
" 3. Return (Tensor(image), Tensor(label))\n",
" \n",
" EXAMPLE:\n",
" image, label = dataset[0]\n",
" image.shape should be (3, 32, 32)\n",
" label should be integer 0-9\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the total number of samples in the dataset.\n",
" \n",
" TODO: Return the length of the dataset.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. Return len(self.images)\n",
" \n",
" EXAMPLE:\n",
" Training set: 50,000 samples\n",
" Test set: 10,000 samples\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")\n",
" \n",
" def get_num_classes(self) -> int:\n",
" \"\"\"\n",
" Get the number of classes in CIFAR-10.\n",
" \n",
" TODO: Return the number of classes.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. CIFAR-10 has 10 classes\n",
" 2. Return 10\n",
" \n",
" EXAMPLE:\n",
" Returns 10 for CIFAR-10\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"#| export\n",
"class CIFAR10Dataset(Dataset):\n",
" \"\"\"CIFAR-10 Dataset: Load and manage CIFAR-10 image data.\"\"\"\n",
" \n",
" def __init__(self, root_dir: str, train: bool = True, download: bool = True):\n",
" self.root_dir = root_dir\n",
" self.train = train\n",
" self.class_names = ['airplane', 'car', 'bird', 'cat', 'deer', \n",
" 'dog', 'frog', 'horse', 'ship', 'truck']\n",
" \n",
" # Create directory if it doesn't exist\n",
" os.makedirs(root_dir, exist_ok=True)\n",
" \n",
" # Download if needed\n",
" if download:\n",
" self._download_if_needed()\n",
" \n",
" # Load data\n",
" self._load_data()\n",
" \n",
" def _download_if_needed(self):\n",
" \"\"\"Download CIFAR-10 if not present.\"\"\"\n",
" cifar_path = os.path.join(self.root_dir, \"cifar-10-batches-py\")\n",
" if not os.path.exists(cifar_path):\n",
" print(\"\ud83d\udd04 Downloading CIFAR-10 dataset...\")\n",
" print(\"\ud83d\udce6 Size: ~170MB (this may take a few minutes)\")\n",
" url = \"https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\"\n",
" filename = os.path.join(self.root_dir, \"cifar-10-python.tar.gz\")\n",
" \n",
" try:\n",
" # Download with progress bar\n",
" def show_progress(block_num, block_size, total_size):\n",
" \"\"\"Show download progress bar.\"\"\"\n",
" downloaded = block_num * block_size\n",
" if total_size > 0:\n",
" percent = min(100, (downloaded * 100) // total_size)\n",
" bar_length = 50\n",
" filled_length = (percent * bar_length) // 100\n",
" bar = '\u2588' * filled_length + '\u2591' * (bar_length - filled_length)\n",
" \n",
" # Convert bytes to MB\n",
" downloaded_mb = downloaded / (1024 * 1024)\n",
" total_mb = total_size / (1024 * 1024)\n",
" \n",
" print(f\"\\r\ud83d\udce5 [{bar}] {percent}% ({downloaded_mb:.1f}/{total_mb:.1f} MB)\", end='', flush=True)\n",
" else:\n",
" # Fallback if total size unknown\n",
" downloaded_mb = downloaded / (1024 * 1024)\n",
" print(f\"\\r\ud83d\udce5 Downloaded: {downloaded_mb:.1f} MB\", end='', flush=True)\n",
" \n",
" urllib.request.urlretrieve(url, filename, reporthook=show_progress)\n",
" print() # New line after progress bar\n",
" \n",
" # Extract\n",
" print(\"\ud83d\udcc2 Extracting CIFAR-10 files...\")\n",
" with tarfile.open(filename, 'r:gz') as tar:\n",
" tar.extractall(self.root_dir, filter='data')\n",
" \n",
" # Clean up\n",
" os.remove(filename)\n",
" print(\"\u2705 CIFAR-10 downloaded and extracted successfully!\")\n",
" \n",
" except Exception as e:\n",
" print(f\"\\n\u274c Download failed: {e}\")\n",
" print(\"Please download CIFAR-10 manually from https://www.cs.toronto.edu/~kriz/cifar.html\")\n",
" \n",
" def _load_data(self):\n",
" \"\"\"Load CIFAR-10 data from binary files.\"\"\"\n",
" cifar_path = os.path.join(self.root_dir, \"cifar-10-batches-py\")\n",
" \n",
" self.images = []\n",
" self.labels = []\n",
" \n",
" if self.train:\n",
" # Load training batches\n",
" for i in range(1, 6):\n",
" batch_file = os.path.join(cifar_path, f\"data_batch_{i}\")\n",
" if os.path.exists(batch_file):\n",
" with open(batch_file, 'rb') as f:\n",
" batch = pickle.load(f, encoding='bytes')\n",
" # Convert bytes keys to strings\n",
" batch = {k.decode('utf-8') if isinstance(k, bytes) else k: v for k, v in batch.items()}\n",
" \n",
" # Extract images and labels\n",
" images = batch['data'].reshape(-1, 3, 32, 32).astype(np.float32)\n",
" labels = batch['labels']\n",
" \n",
" self.images.extend(images)\n",
" self.labels.extend(labels)\n",
" else:\n",
" # Load test batch\n",
" test_file = os.path.join(cifar_path, \"test_batch\")\n",
" if os.path.exists(test_file):\n",
" with open(test_file, 'rb') as f:\n",
" batch = pickle.load(f, encoding='bytes')\n",
" # Convert bytes keys to strings\n",
" batch = {k.decode('utf-8') if isinstance(k, bytes) else k: v for k, v in batch.items()}\n",
" \n",
" # Extract images and labels\n",
" self.images = batch['data'].reshape(-1, 3, 32, 32).astype(np.float32)\n",
" self.labels = batch['labels']\n",
" \n",
" print(f\"\u2705 Loaded {len(self.images)} {'training' if self.train else 'test'} samples\")\n",
" \n",
" def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
" \"\"\"Get a single image and label by index.\"\"\"\n",
" image = Tensor(self.images[index])\n",
" label = Tensor(np.array(self.labels[index]))\n",
" return image, label\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"Get the total number of samples in the dataset.\"\"\"\n",
" return len(self.images)\n",
" \n",
" def get_num_classes(self) -> int:\n",
" \"\"\"Get the number of classes in CIFAR-10.\"\"\"\n",
" return 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"### \ud83e\uddea Test Your CIFAR-10 Dataset\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test CIFAR-10 dataset (skip download for now)\n",
"print(\"Testing CIFAR-10 dataset...\")\n",
"\n",
"try:\n",
" # Create a mock dataset for testing without download\n",
" class MockCIFAR10Dataset(Dataset):\n",
" def __init__(self, size, train=True):\n",
" self.size = size\n",
" self.train = train\n",
" self.data = [np.random.randint(0, 255, (3, 32, 32), dtype=np.uint8) for _ in range(size)]\n",
" self.labels = [np.random.randint(0, 10) for _ in range(size)]\n",
" \n",
" def __getitem__(self, index):\n",
" return Tensor(self.data[index].astype(np.float32)), Tensor(np.array(self.labels[index]))\n",
" \n",
" def __len__(self):\n",
" return self.size\n",
" \n",
" def get_num_classes(self):\n",
" return 10\n",
" \n",
" # Test the dataset\n",
" dataset = MockCIFAR10Dataset(50)\n",
" print(f\"\u2705 Dataset created with {len(dataset)} samples\")\n",
" \n",
" # Test indexing\n",
" image, label = dataset[0]\n",
" print(f\"\u2705 Image shape: {image.shape}\")\n",
" print(f\"\u2705 Label: {label}\")\n",
" print(f\"\u2705 Number of classes: {dataset.get_num_classes()}\")\n",
" \n",
" # Test multiple samples\n",
" for i in range(3):\n",
" img, lbl = dataset[i]\n",
" print(f\"\u2705 Sample {i}: {img.shape}, class {int(lbl.data)}\")\n",
" \n",
" print(\"\ud83c\udf89 CIFAR-10 dataset structure works!\")\n",
" \n",
"except Exception as e:\n",
" print(f\"\u274c Error: {e}\")\n",
" print(\"Make sure to implement the CIFAR-10 dataset above!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"### \ud83d\udc41\ufe0f Visual Feedback: See Your Data!\n",
"\n",
"Let's add a visualization function to actually **see** the CIFAR-10 images we're loading. \n",
"This provides immediate visual feedback and builds intuition about the data.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def show_cifar10_samples(dataset, num_samples=8, title=\"CIFAR-10 Samples\"):\n",
" \"\"\"\n",
" Display a grid of CIFAR-10 images with their labels.\n",
" \n",
" Args:\n",
" dataset: CIFAR-10 dataset\n",
" num_samples: Number of samples to display\n",
" title: Title for the plot\n",
" \n",
" TODO: Implement visualization function.\n",
" \n",
" APPROACH:\n",
" 1. Create a matplotlib subplot grid\n",
" 2. Get random samples from dataset\n",
" 3. Display each image with its class label\n",
" 4. Handle the image format (CHW -> HWC, normalize to 0-1)\n",
" \n",
" EXAMPLE:\n",
" show_cifar10_samples(dataset, num_samples=8)\n",
" # Shows 8 CIFAR-10 images in a 2x4 grid\n",
" \n",
" HINTS:\n",
" - Use plt.subplots() to create grid\n",
" - Convert image from (C, H, W) to (H, W, C) for display\n",
" - Normalize pixel values to [0, 1] range\n",
" - Use dataset.class_names for labels\n",
" \n",
" NOTE: This is a development/learning tool, not part of the core package.\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| hide\n",
"def show_cifar10_samples(dataset, num_samples=8, title=\"CIFAR-10 Samples\"):\n",
" \"\"\"Display a grid of CIFAR-10 images with their labels.\"\"\"\n",
" if not _should_show_plots():\n",
" return\n",
" \n",
" # Create subplot grid\n",
" rows = 2\n",
" cols = num_samples // rows\n",
" fig, axes = plt.subplots(rows, cols, figsize=(12, 6))\n",
" fig.suptitle(title, fontsize=16)\n",
" \n",
" # Get random samples\n",
" indices = np.random.choice(len(dataset), num_samples, replace=False)\n",
" \n",
" for i, idx in enumerate(indices):\n",
" row = i // cols\n",
" col = i % cols\n",
" \n",
" # Get image and label\n",
" image, label = dataset[idx]\n",
" \n",
" # Convert from (C, H, W) to (H, W, C) and normalize to [0, 1]\n",
" if hasattr(image, 'data'):\n",
" img_data = image.data\n",
" else:\n",
" img_data = image\n",
" \n",
" # Handle different tensor formats\n",
" if img_data.shape[0] == 3: # (C, H, W)\n",
" img_display = np.transpose(img_data, (1, 2, 0))\n",
" else:\n",
" img_display = img_data\n",
" \n",
" # Normalize to [0, 1] range\n",
" img_display = img_display.astype(np.float32)\n",
" if img_display.max() > 1.0:\n",
" img_display = img_display / 255.0\n",
" \n",
" # Ensure values are in [0, 1]\n",
" img_display = np.clip(img_display, 0, 1)\n",
" \n",
" # Display image\n",
" if rows == 1:\n",
" ax = axes[col]\n",
" else:\n",
" ax = axes[row, col]\n",
" \n",
" ax.imshow(img_display)\n",
" ax.axis('off')\n",
" \n",
" # Add label\n",
" if hasattr(label, 'data'):\n",
" label_idx = int(label.data)\n",
" else:\n",
" label_idx = int(label)\n",
" \n",
" if hasattr(dataset, 'class_names'):\n",
" class_name = dataset.class_names[label_idx]\n",
" ax.set_title(f'{class_name} ({label_idx})', fontsize=10)\n",
" else:\n",
" ax.set_title(f'Class {label_idx}', fontsize=10)\n",
" \n",
" plt.tight_layout()\n",
" if _should_show_plots():\n",
" plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test visual feedback with real CIFAR-10 data\n",
"print(\"\ud83c\udfa8 Testing visual feedback with real CIFAR-10...\")\n",
"\n",
"try:\n",
" # Create real CIFAR-10 dataset for visualization\n",
" import tempfile\n",
" import os\n",
" \n",
" with tempfile.TemporaryDirectory() as temp_dir:\n",
" # Load real CIFAR-10 dataset\n",
" cifar_dataset = CIFAR10Dataset(temp_dir, train=True, download=True)\n",
" \n",
" print(f\"\u2705 Loaded {len(cifar_dataset)} real CIFAR-10 samples\")\n",
" print(f\"\u2705 Class names: {cifar_dataset.class_names}\")\n",
" \n",
" # Show sample images\n",
" if _should_show_plots():\n",
" print(\"\ud83d\uddbc\ufe0f Displaying sample images...\")\n",
" show_cifar10_samples(cifar_dataset, num_samples=8, title=\"Real CIFAR-10 Training Samples\")\n",
" \n",
" # Show some statistics\n",
" sample_images = [cifar_dataset[i][0] for i in range(100)]\n",
" pixel_values = [img.data for img in sample_images]\n",
" all_pixels = np.concatenate([pixels.flatten() for pixels in pixel_values])\n",
" \n",
" print(f\"\u2705 Pixel value range: [{all_pixels.min():.1f}, {all_pixels.max():.1f}]\")\n",
" print(f\"\u2705 Mean pixel value: {all_pixels.mean():.1f}\")\n",
" print(f\"\u2705 Std pixel value: {all_pixels.std():.1f}\")\n",
" \n",
" print(\"\ud83c\udf89 Visual feedback works! You can see your data!\")\n",
" \n",
"except Exception as e:\n",
" print(f\"\u274c Error: {e}\")\n",
" print(\"Make sure CIFAR-10 dataset is implemented correctly!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"## Step 3: Understanding Data Loading\n",
"\n",
"Now let's build a **DataLoader** to efficiently batch and iterate through our dataset.\n",
"\n",
"### Why DataLoaders Matter\n",
"- **Batching**: Process multiple samples at once (GPU efficiency)\n",
"- **Shuffling**: Randomize order for better training\n",
"- **Memory management**: Handle large datasets efficiently\n",
"- **I/O optimization**: Load data in parallel with training\n",
"\n",
"### The DataLoader Pattern\n",
"```\n",
"Dataset: [sample1, sample2, sample3, ...]\n",
"DataLoader: [[batch1], [batch2], [batch3], ...]\n",
"```\n",
"\n",
"### Systems Thinking\n",
"- **Batch size**: Trade-off between memory and speed\n",
"- **Shuffling**: Prevents overfitting to data order\n",
"- **Iteration**: Efficient looping through data\n",
"- **Memory**: Manage large datasets that don't fit in RAM\n",
"\n",
"Let's implement a DataLoader!\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#| export\n",
"class DataLoader:\n",
" \"\"\"\n",
" DataLoader: Efficiently batch and iterate through datasets.\n",
" \n",
" Provides batching, shuffling, and efficient iteration over datasets.\n",
" Essential for training neural networks efficiently.\n",
" \n",
" Args:\n",
" dataset: Dataset to load from\n",
" batch_size: Number of samples per batch\n",
" shuffle: Whether to shuffle data each epoch\n",
" \n",
" TODO: Implement DataLoader with batching and shuffling.\n",
" \n",
" APPROACH:\n",
" 1. Store dataset and configuration\n",
" 2. Implement __iter__ to yield batches\n",
" 3. Handle shuffling and batching logic\n",
" 4. Stack individual samples into batches\n",
" \n",
" EXAMPLE:\n",
" dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
" for batch_images, batch_labels in dataloader:\n",
" print(f\"Batch shape: {batch_images.shape}\") # (32, 3, 32, 32)\n",
" \n",
" HINTS:\n",
" - Use np.random.permutation for shuffling\n",
" - Stack samples using np.stack\n",
" - Yield batches as (batch_data, batch_labels)\n",
" - Handle last batch that might be smaller\n",
" \"\"\"\n",
" \n",
" def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):\n",
" \"\"\"\n",
" Initialize DataLoader.\n",
" \n",
" Args:\n",
" dataset: Dataset to load from\n",
" batch_size: Number of samples per batch\n",
" shuffle: Whether to shuffle data each epoch\n",
" \n",
" TODO: Store configuration and dataset.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. Store dataset as self.dataset\n",
" 2. Store batch_size as self.batch_size\n",
" 3. Store shuffle as self.shuffle\n",
" \n",
" EXAMPLE:\n",
" DataLoader(dataset, batch_size=32, shuffle=True)\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")\n",
" \n",
" def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:\n",
" \"\"\"\n",
" Iterate through dataset in batches.\n",
" \n",
" Returns:\n",
" Iterator yielding (batch_data, batch_labels) tuples\n",
" \n",
" TODO: Implement batching and shuffling logic.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. Create indices list: list(range(len(dataset)))\n",
" 2. Shuffle indices if self.shuffle is True\n",
" 3. Loop through indices in batch_size chunks\n",
" 4. For each batch: collect samples, stack them, yield batch\n",
" \n",
" EXAMPLE:\n",
" for batch_data, batch_labels in dataloader:\n",
" # batch_data.shape: (batch_size, ...)\n",
" # batch_labels.shape: (batch_size,)\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the number of batches per epoch.\n",
" \n",
" TODO: Calculate number of batches.\n",
" \n",
" STEP-BY-STEP:\n",
" 1. Get dataset size: len(self.dataset)\n",
" 2. Calculate: (dataset_size + batch_size - 1) // batch_size\n",
" 3. This handles the last partial batch correctly\n",
" \n",
" EXAMPLE:\n",
" Dataset size: 100, batch_size: 32\n",
" Number of batches: 4 (32, 32, 32, 4)\n",
" \"\"\"\n",
" raise NotImplementedError(\"Student implementation required\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| hide\n",
    "#| export\n",
    "class DataLoader:\n",
    "    \"\"\"DataLoader: Efficiently batch and iterate through datasets.\"\"\"\n",
    "\n",
    "    def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):\n",
    "        self.dataset = dataset\n",
    "        self.batch_size = batch_size\n",
    "        self.shuffle = shuffle\n",
    "\n",
    "    def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:\n",
    "        \"\"\"Iterate through dataset in batches.\"\"\"\n",
    "        # Create indices\n",
    "        indices = list(range(len(self.dataset)))\n",
    "\n",
    "        # Shuffle in place if requested (a fresh order on every iteration)\n",
    "        if self.shuffle:\n",
    "            np.random.shuffle(indices)\n",
    "\n",
    "        # Generate batches; the final slice may be shorter than batch_size\n",
    "        for i in range(0, len(indices), self.batch_size):\n",
    "            batch_indices = indices[i:i + self.batch_size]\n",
    "\n",
    "            # Collect samples for this batch\n",
    "            batch_data = []\n",
    "            batch_labels = []\n",
    "\n",
    "            for idx in batch_indices:\n",
    "                data, label = self.dataset[idx]\n",
    "                batch_data.append(data.data)\n",
    "                batch_labels.append(label.data)\n",
    "\n",
    "            # Stack into batches along a new leading axis\n",
    "            batch_data = np.stack(batch_data, axis=0)\n",
    "            batch_labels = np.stack(batch_labels, axis=0)\n",
    "\n",
    "            yield Tensor(batch_data), Tensor(batch_labels)\n",
    "\n",
    "    def __len__(self) -> int:\n",
    "        \"\"\"Get the number of batches per epoch.\"\"\"\n",
    "        return (len(self.dataset) + self.batch_size - 1) // self.batch_size"
]
},
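  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A minimal sketch of the index-chunking idea behind DataLoader, using plain\n",
    "# NumPy arrays instead of the Tensor and Dataset classes from this module.\n",
    "# The names iterate_batches, features, and targets are illustrative\n",
    "# assumptions, not part of the TinyTorch API.\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def iterate_batches(features, targets, batch_size=32, shuffle=True, seed=None):\n",
    "    '''Yield (batch_features, batch_targets) pairs covering every sample once.'''\n",
    "    rng = np.random.default_rng(seed)\n",
    "    indices = np.arange(len(features))\n",
    "    if shuffle:\n",
    "        indices = rng.permutation(indices)\n",
    "    for start in range(0, len(indices), batch_size):\n",
    "        chunk = indices[start:start + batch_size]\n",
    "        # Fancy indexing gathers the chosen rows into a new batch array\n",
    "        yield features[chunk], targets[chunk]\n",
    "\n",
    "\n",
    "features = np.zeros((100, 3, 32, 32))\n",
    "targets = np.arange(100) % 10\n",
    "batch_sizes = [len(t) for _, t in iterate_batches(features, targets, batch_size=32, seed=0)]\n",
    "print(batch_sizes)  # [32, 32, 32, 4]\n",
    "print((len(features) + 32 - 1) // 32)  # 4, the ceiling division used by __len__"
   ]
  },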
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"### \ud83e\uddea Test Your DataLoader\n",
"\"\"\""
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test DataLoader\n",
    "print(\"Testing DataLoader...\")\n",
    "\n",
    "try:\n",
    "    # Create a test dataset\n",
    "    class SimpleDataset(Dataset):\n",
    "        def __init__(self, size=100):\n",
    "            self.size = size\n",
    "            self.data = [np.random.randn(3, 32, 32) for _ in range(size)]\n",
    "            self.labels = [i % 10 for i in range(size)]\n",
    "\n",
    "        def __getitem__(self, index):\n",
    "            return Tensor(self.data[index]), Tensor(np.array(self.labels[index]))\n",
    "\n",
    "        def __len__(self):\n",
    "            return self.size\n",
    "\n",
    "        def get_num_classes(self):\n",
    "            return 10\n",
    "\n",
    "    # Test DataLoader\n",
    "    dataset = SimpleDataset(100)\n",
    "    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
    "\n",
    "    print(f\"\u2705 Dataset size: {len(dataset)}\")\n",
    "    print(f\"\u2705 Number of batches: {len(dataloader)}\")\n",
    "\n",
    "    # Test iteration\n",
    "    batch_count = 0\n",
    "    for batch_data, batch_labels in dataloader:\n",
    "        batch_count += 1\n",
    "        print(f\"\u2705 Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}\")\n",
    "        if batch_count >= 3:  # Only show first 3 batches\n",
    "            break\n",
    "\n",
    "    print(\"\ud83c\udf89 DataLoader works!\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"\u274c Error: {e}\")\n",
    "    print(\"Make sure to implement the DataLoader above!\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test DataLoader with visual feedback using real CIFAR-10\n",
    "print(\"\ud83c\udfa8 Testing DataLoader with visual feedback...\")\n",
    "\n",
    "try:\n",
    "    import tempfile\n",
    "\n",
    "    with tempfile.TemporaryDirectory() as temp_dir:\n",
    "        # Load real CIFAR-10 dataset\n",
    "        cifar_dataset = CIFAR10Dataset(temp_dir, train=True, download=True)\n",
    "\n",
    "        # Create DataLoader\n",
    "        dataloader = DataLoader(cifar_dataset, batch_size=16, shuffle=True)\n",
    "\n",
    "        print(f\"\u2705 Created DataLoader with {len(dataloader)} batches\")\n",
    "\n",
    "        # Get first batch\n",
    "        batch_data, batch_labels = next(iter(dataloader))\n",
    "        print(f\"\u2705 First batch shape: {batch_data.shape}\")\n",
    "        print(f\"\u2705 First batch labels: {batch_labels.shape}\")\n",
    "\n",
    "        # Show first few images from the batch\n",
    "        print(\"\ud83d\uddbc\ufe0f Displaying first batch images...\")\n",
    "\n",
    "        # Create a temporary dataset-like object for visualization\n",
    "        class BatchDataset:\n",
    "            def __init__(self, batch_data, batch_labels, class_names):\n",
    "                self.batch_data = batch_data\n",
    "                self.batch_labels = batch_labels\n",
    "                self.class_names = class_names\n",
    "\n",
    "            def __getitem__(self, index):\n",
    "                return Tensor(self.batch_data.data[index]), Tensor(self.batch_labels.data[index])\n",
    "\n",
    "            def __len__(self):\n",
    "                return self.batch_data.shape[0]\n",
    "\n",
    "        batch_dataset = BatchDataset(batch_data, batch_labels, cifar_dataset.class_names)\n",
    "        show_cifar10_samples(batch_dataset, num_samples=8, title=\"DataLoader Batch - Real CIFAR-10\")\n",
    "\n",
    "        print(\"\ud83c\udf89 DataLoader visual feedback works!\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"\u274c Error: {e}\")\n",
    "    print(\"Make sure DataLoader and visualization are implemented correctly!\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "## Step 4: Understanding Data Preprocessing\n",
    "\n",
    "Finally, let's build a **Normalizer** to preprocess our data for better training.\n",
    "\n",
    "### Why Normalization Matters\n",
    "- **Gradient stability**: Reduces the risk of exploding or vanishing gradients\n",
    "- **Training speed**: Well-scaled inputs usually converge faster\n",
    "- **Numerical stability**: Keeps values away from overflow/underflow ranges\n",
    "- **Consistent scales**: All features have similar ranges\n",
    "\n",
    "### Common Normalization Techniques\n",
    "- **Min-Max**: Scale to [0, 1] range\n",
    "- **Z-score**: Zero mean, unit variance\n",
    "- **Dataset statistics**: Fixed per-channel mean/std, such as the ImageNet values used with pretrained models\n",
    "\n",
    "### The Normalization Process\n",
    "```\n",
    "Raw Data:  [0, 255] pixel values\n",
    "Min-Max:   [0, 1] range\n",
    "Z-score:   zero mean, unit variance (values are not bounded)\n",
    "```\n",
    "\n",
    "Let's implement a flexible normalizer!\n",
    "\"\"\""
]
},
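  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A small sketch comparing the min-max and z-score techniques listed above\n",
    "# on made-up pixel values. It uses plain NumPy; the array name pixels is an\n",
    "# illustrative assumption, not part of this module.\n",
    "import numpy as np\n",
    "\n",
    "pixels = np.array([0.0, 64.0, 128.0, 255.0])\n",
    "\n",
    "# Min-max: map the smallest value to 0 and the largest to 1\n",
    "min_max = (pixels - pixels.min()) / (pixels.max() - pixels.min())\n",
    "\n",
    "# Z-score: subtract the mean and divide by the standard deviation\n",
    "epsilon = 1e-8  # guards against division by zero for constant inputs\n",
    "z_score = (pixels - pixels.mean()) / (pixels.std() + epsilon)\n",
    "\n",
    "print(min_max.min(), min_max.max())  # 0.0 1.0\n",
    "print(abs(z_score.mean()) < 1e-6, abs(z_score.std() - 1.0) < 1e-6)  # True True"
   ]
  },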
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "class Normalizer:\n",
    "    \"\"\"\n",
    "    Data Normalizer: Standardize data for better training.\n",
    "\n",
    "    Computes mean and standard deviation from training data,\n",
    "    then applies normalization to new data.\n",
    "\n",
    "    TODO: Implement data normalization.\n",
    "\n",
    "    APPROACH:\n",
    "    1. Fit: Compute mean and std from training data\n",
    "    2. Transform: Apply normalization using computed stats\n",
    "    3. Handle both single tensors and lists of tensors\n",
    "\n",
    "    EXAMPLE:\n",
    "        normalizer = Normalizer()\n",
    "        normalizer.fit(training_data)  # Compute stats\n",
    "        normalized = normalizer.transform(new_data)  # Apply normalization\n",
    "\n",
    "    HINTS:\n",
    "    - Store mean and std as instance variables\n",
    "    - Use np.mean and np.std for statistics\n",
    "    - Apply: (data - mean) / (std + epsilon)\n",
    "    - The small epsilon guards against division by zero\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self):\n",
    "        \"\"\"\n",
    "        Initialize normalizer.\n",
    "\n",
    "        TODO: Initialize mean and std to None.\n",
    "\n",
    "        STEP-BY-STEP:\n",
    "        1. Set self.mean = None\n",
    "        2. Set self.std = None\n",
    "        3. Set self.epsilon = 1e-8 (for numerical stability)\n",
    "\n",
    "        EXAMPLE:\n",
    "            normalizer = Normalizer()\n",
    "        \"\"\"\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "\n",
    "    def fit(self, data: List[Tensor]):\n",
    "        \"\"\"\n",
    "        Compute normalization statistics from training data.\n",
    "\n",
    "        Args:\n",
    "            data: List of tensors to compute statistics from\n",
    "\n",
    "        TODO: Compute mean and standard deviation.\n",
    "\n",
    "        STEP-BY-STEP:\n",
    "        1. Stack all tensors: np.stack([t.data for t in data])\n",
    "        2. Compute mean: np.mean(stacked_data)\n",
    "        3. Compute std: np.std(stacked_data)\n",
    "        4. Store as self.mean and self.std\n",
    "\n",
    "        EXAMPLE:\n",
    "            normalizer.fit([tensor1, tensor2, tensor3])\n",
    "        \"\"\"\n",
    "        raise NotImplementedError(\"Student implementation required\")\n",
    "\n",
    "    def transform(self, data: Union[Tensor, List[Tensor]]) -> Union[Tensor, List[Tensor]]:\n",
    "        \"\"\"\n",
    "        Apply normalization to data.\n",
    "\n",
    "        Args:\n",
    "            data: Tensor or list of tensors to normalize\n",
    "\n",
    "        Returns:\n",
    "            Normalized tensor(s)\n",
    "\n",
    "        TODO: Apply normalization using computed statistics.\n",
    "\n",
    "        STEP-BY-STEP:\n",
    "        1. Check that mean and std are computed (not None); raise an error otherwise\n",
    "        2. If single tensor: apply (data - mean) / (std + epsilon)\n",
    "        3. If list: apply to each tensor in the list\n",
    "        4. Return normalized data\n",
    "\n",
    "        EXAMPLE:\n",
    "            normalized = normalizer.transform(tensor)\n",
    "        \"\"\"\n",
    "        raise NotImplementedError(\"Student implementation required\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| hide\n",
    "#| export\n",
    "class Normalizer:\n",
    "    \"\"\"Data Normalizer: Standardize data for better training.\"\"\"\n",
    "\n",
    "    def __init__(self):\n",
    "        self.mean = None\n",
    "        self.std = None\n",
    "        self.epsilon = 1e-8\n",
    "\n",
    "    def fit(self, data: List[Tensor]):\n",
    "        \"\"\"Compute normalization statistics from training data.\"\"\"\n",
    "        # Stack all data\n",
    "        all_data = np.stack([t.data for t in data])\n",
    "\n",
    "        # Compute global (scalar) statistics over every element\n",
    "        self.mean = np.mean(all_data)\n",
    "        self.std = np.std(all_data)\n",
    "\n",
    "        print(f\"\u2705 Computed normalization stats: mean={self.mean:.4f}, std={self.std:.4f}\")\n",
    "\n",
    "    def transform(self, data: Union[Tensor, List[Tensor]]) -> Union[Tensor, List[Tensor]]:\n",
    "        \"\"\"Apply normalization to data.\"\"\"\n",
    "        if self.mean is None or self.std is None:\n",
    "            raise ValueError(\"Must call fit() before transform()\")\n",
    "\n",
    "        if isinstance(data, list):\n",
    "            # Transform list of tensors\n",
    "            return [Tensor((t.data - self.mean) / (self.std + self.epsilon)) for t in data]\n",
    "        else:\n",
    "            # Transform single tensor\n",
    "            return Tensor((data.data - self.mean) / (self.std + self.epsilon))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"### \ud83e\uddea Test Your Normalizer\n",
"\"\"\""
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test Normalizer\n",
    "print(\"Testing Normalizer...\")\n",
    "\n",
    "try:\n",
    "    # Create test data\n",
    "    data = [\n",
    "        Tensor(np.random.randn(3, 32, 32) * 50 + 100),  # Mean ~100, std ~50\n",
    "        Tensor(np.random.randn(3, 32, 32) * 50 + 100),\n",
    "        Tensor(np.random.randn(3, 32, 32) * 50 + 100)\n",
    "    ]\n",
    "\n",
    "    # Test normalizer\n",
    "    normalizer = Normalizer()\n",
    "\n",
    "    # Fit to data\n",
    "    normalizer.fit(data)\n",
    "\n",
    "    # Transform data\n",
    "    normalized = normalizer.transform(data)\n",
    "\n",
    "    # Check results\n",
    "    print(f\"\u2705 Original data mean: {np.mean([t.data for t in data]):.4f}\")\n",
    "    print(f\"\u2705 Original data std: {np.std([t.data for t in data]):.4f}\")\n",
    "    print(f\"\u2705 Normalized data mean: {np.mean([t.data for t in normalized]):.4f}\")\n",
    "    print(f\"\u2705 Normalized data std: {np.std([t.data for t in normalized]):.4f}\")\n",
    "\n",
    "    # Test single tensor\n",
    "    single_tensor = Tensor(np.random.randn(3, 32, 32) * 50 + 100)\n",
    "    normalized_single = normalizer.transform(single_tensor)\n",
    "    print(f\"\u2705 Single tensor normalized: {normalized_single.shape}\")\n",
    "\n",
    "    print(\"\ud83c\udf89 Normalizer works!\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"\u274c Error: {e}\")\n",
    "    print(\"Make sure to implement the Normalizer above!\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "## Step 5: Building a Complete Data Pipeline\n",
    "\n",
    "Now let's put everything together into a complete data pipeline!\n",
    "\n",
    "### The Complete Pipeline\n",
    "```\n",
    "Raw Data \u2192 Dataset \u2192 DataLoader \u2192 Normalizer \u2192 Model\n",
    "```\n",
    "\n",
    "Most machine learning training systems follow this pattern!\n",
    "\"\"\""
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| export\n",
    "def create_data_pipeline(dataset_path: str = \"data/cifar10/\",\n",
    "                         batch_size: int = 32,\n",
    "                         normalize: bool = True,\n",
    "                         shuffle: bool = True):\n",
    "    \"\"\"\n",
    "    Create a complete data pipeline for training.\n",
    "\n",
    "    Args:\n",
    "        dataset_path: Path to dataset\n",
    "        batch_size: Batch size for training\n",
    "        normalize: Whether to fit a normalizer on the training data\n",
    "        shuffle: Whether to shuffle the training data\n",
    "\n",
    "    Returns:\n",
    "        Tuple of (train_loader, test_loader)\n",
    "\n",
    "    TODO: Implement complete data pipeline.\n",
    "\n",
    "    APPROACH:\n",
    "    1. Create train and test datasets\n",
    "    2. Create data loaders\n",
    "    3. Fit normalizer on training data (if normalize is True)\n",
    "    4. Return the train and test loaders\n",
    "\n",
    "    EXAMPLE:\n",
    "        train_loader, test_loader = create_data_pipeline()\n",
    "        for batch_data, batch_labels in train_loader:\n",
    "            # Ready for training!\n",
    "    \"\"\"\n",
    "    raise NotImplementedError(\"Student implementation required\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#| hide\n",
    "#| export\n",
    "def create_data_pipeline(dataset_path: str = \"data/cifar10/\",\n",
    "                         batch_size: int = 32,\n",
    "                         normalize: bool = True,\n",
    "                         shuffle: bool = True):\n",
    "    \"\"\"Create a complete data pipeline for training.\"\"\"\n",
    "\n",
    "    print(\"\ud83d\udd27 Creating data pipeline...\")\n",
    "\n",
    "    # Create datasets with real CIFAR-10 data\n",
    "    train_dataset = CIFAR10Dataset(dataset_path, train=True, download=True)\n",
    "    test_dataset = CIFAR10Dataset(dataset_path, train=False, download=True)\n",
    "\n",
    "    # Create data loaders (the test loader is never shuffled)\n",
    "    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)\n",
    "    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)\n",
    "\n",
    "    # Fit a normalizer on the training data.\n",
    "    # Note: the statistics are computed (and printed by fit()) here, but they\n",
    "    # are not applied to the batches that the loaders yield.\n",
    "    normalizer = None\n",
    "    if normalize:\n",
    "        normalizer = Normalizer()\n",
    "        # Fit on a subset of training data for efficiency\n",
    "        sample_data = [train_dataset[i][0] for i in range(min(1000, len(train_dataset)))]\n",
    "        normalizer.fit(sample_data)\n",
    "\n",
    "    print(\"\u2705 Pipeline created:\")\n",
    "    print(f\"   - Training batches: {len(train_loader)}\")\n",
    "    print(f\"   - Test batches: {len(test_loader)}\")\n",
    "    print(f\"   - Batch size: {batch_size}\")\n",
    "    print(f\"   - Normalization: {normalize}\")\n",
    "\n",
    "    return train_loader, test_loader"
]
},
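  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create_data_pipeline above fits a Normalizer but returns only the loaders,\n",
    "# so batches arrive unnormalized. One possible way to apply fitted statistics\n",
    "# inside a training loop is sketched below with plain NumPy stand-ins. The\n",
    "# names fit_stats, normalize_batch, and fake_batches are illustrative\n",
    "# assumptions, not part of this module.\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "def fit_stats(samples):\n",
    "    '''Return the global (mean, std) of a sequence of equally shaped arrays.'''\n",
    "    stacked = np.stack(samples)\n",
    "    return stacked.mean(), stacked.std()\n",
    "\n",
    "\n",
    "def normalize_batch(batch, mean, std, epsilon=1e-8):\n",
    "    '''Apply z-score normalization with statistics fitted on training data.'''\n",
    "    return (batch - mean) / (std + epsilon)\n",
    "\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "train_samples = [rng.normal(100.0, 50.0, size=(3, 32, 32)) for _ in range(64)]\n",
    "mean, std = fit_stats(train_samples)\n",
    "\n",
    "# Stand-in for iterating a DataLoader: four batches of 16 samples each\n",
    "fake_batches = [np.stack(train_samples[i:i + 16]) for i in range(0, 64, 16)]\n",
    "for batch in fake_batches:\n",
    "    normalized = normalize_batch(batch, mean, std)\n",
    "    print(normalized.shape, round(float(normalized.mean()), 2))"
   ]
  },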
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"### \ud83e\uddea Test Your Complete Data Pipeline\n",
"\"\"\""
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test complete data pipeline\n",
    "print(\"Testing complete data pipeline...\")\n",
    "\n",
    "try:\n",
    "    # Create pipeline\n",
    "    train_loader, test_loader = create_data_pipeline(\n",
    "        batch_size=16, normalize=True, shuffle=True\n",
    "    )\n",
    "\n",
    "    # Test training loop\n",
    "    print(\"\\n\ud83d\udd25 Testing training loop:\")\n",
    "    for i, (batch_data, batch_labels) in enumerate(train_loader):\n",
    "        print(f\"   Batch {i+1}: data {batch_data.shape}, labels {batch_labels.shape}\")\n",
    "\n",
    "        # Note: create_data_pipeline fits a Normalizer when normalize=True but\n",
    "        # does not apply it, so these batches hold unnormalized values\n",
    "\n",
    "        if i >= 2:  # Only show first 3 batches\n",
    "            break\n",
    "\n",
    "    # Test test loop\n",
    "    print(\"\\n\ud83e\uddea Testing test loop:\")\n",
    "    for i, (batch_data, batch_labels) in enumerate(test_loader):\n",
    "        print(f\"   Test batch {i+1}: data {batch_data.shape}, labels {batch_labels.shape}\")\n",
    "        if i >= 1:  # Only show first 2 batches\n",
    "            break\n",
    "\n",
    "    print(\"\\n\ud83c\udf89 Complete data pipeline works!\")\n",
    "    print(\"Ready for training neural networks!\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"\u274c Error: {e}\")\n",
    "    print(\"Make sure to implement the data pipeline above!\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test complete pipeline with visual feedback\n",
    "print(\"\ud83c\udfa8 Testing complete pipeline with visual feedback...\")\n",
    "\n",
    "try:\n",
    "    import tempfile\n",
    "\n",
    "    with tempfile.TemporaryDirectory() as temp_dir:\n",
    "        # Create complete pipeline\n",
    "        train_loader, test_loader = create_data_pipeline(\n",
    "            dataset_path=temp_dir,\n",
    "            batch_size=16,\n",
    "            normalize=True,\n",
    "            shuffle=True\n",
    "        )\n",
    "\n",
    "        # Get a batch from training data\n",
    "        train_batch_data, train_batch_labels = next(iter(train_loader))\n",
    "        print(f\"\u2705 Training batch shape: {train_batch_data.shape}\")\n",
    "\n",
    "        # Get a batch from test data\n",
    "        test_batch_data, test_batch_labels = next(iter(test_loader))\n",
    "        print(f\"\u2705 Test batch shape: {test_batch_data.shape}\")\n",
    "\n",
    "        # Show training batch images\n",
    "        print(\"\ud83d\uddbc\ufe0f Displaying training batch...\")\n",
    "        class PipelineBatchDataset:\n",
    "            def __init__(self, batch_data, batch_labels):\n",
    "                self.batch_data = batch_data\n",
    "                self.batch_labels = batch_labels\n",
    "                # CIFAR-10 class names in label order (0-9)\n",
    "                self.class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',\n",
    "                                    'dog', 'frog', 'horse', 'ship', 'truck']\n",
    "\n",
    "            def __getitem__(self, index):\n",
    "                return Tensor(self.batch_data.data[index]), Tensor(self.batch_labels.data[index])\n",
    "\n",
    "            def __len__(self):\n",
    "                return self.batch_data.shape[0]\n",
    "\n",
    "        train_batch_dataset = PipelineBatchDataset(train_batch_data, train_batch_labels)\n",
    "        show_cifar10_samples(train_batch_dataset, num_samples=8, title=\"Complete Pipeline - Training Batch\")\n",
    "\n",
    "        # Show test batch images\n",
    "        print(\"\ud83d\uddbc\ufe0f Displaying test batch...\")\n",
    "        test_batch_dataset = PipelineBatchDataset(test_batch_data, test_batch_labels)\n",
    "        show_cifar10_samples(test_batch_dataset, num_samples=8, title=\"Complete Pipeline - Test Batch\")\n",
    "\n",
    "        # Show data statistics\n",
    "        print(f\"\u2705 Training data range: [{train_batch_data.data.min():.3f}, {train_batch_data.data.max():.3f}]\")\n",
    "        print(f\"\u2705 Training data mean: {train_batch_data.data.mean():.3f}\")\n",
    "        print(f\"\u2705 Training data std: {train_batch_data.data.std():.3f}\")\n",
    "\n",
    "        print(\"\ud83c\udf89 Complete pipeline visual feedback works!\")\n",
    "        print(\"\ud83d\ude80 You can see your entire data pipeline in action!\")\n",
    "\n",
    "except Exception as e:\n",
    "    print(f\"\u274c Error: {e}\")\n",
    "    print(\"Make sure complete pipeline and visualization work correctly!\")"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "## \ud83c\udfaf Summary\n",
    "\n",
    "Congratulations! You've built a complete data loading system:\n",
    "\n",
    "### What You Built\n",
    "1. **Dataset**: Abstract interface for data loading\n",
    "2. **CIFAR10Dataset**: Real dataset implementation\n",
    "3. **DataLoader**: Efficient batching and iteration\n",
    "4. **Normalizer**: Data preprocessing for better training\n",
    "5. **Data Pipeline**: Complete system integration\n",
    "\n",
    "### Key Concepts Learned\n",
    "- **Data engineering**: The foundation of ML systems\n",
    "- **Batching**: Efficient processing of multiple samples\n",
    "- **Normalization**: Preprocessing for stable training\n",
    "- **Systems thinking**: Memory, I/O, and performance considerations\n",
    "\n",
    "### Next Steps\n",
    "- **Autograd**: Automatic differentiation for training\n",
    "- **Training**: Optimization loops and loss functions\n",
    "- **Advanced data**: Augmentation, distributed loading, etc.\n",
    "\n",
    "### Real-World Impact\n",
    "Data loading systems like this one underpin most ML pipelines:\n",
    "- **Production systems**: Handle millions of samples efficiently\n",
    "- **Research**: Enable experimentation with different datasets\n",
    "- **MLOps**: Integrate with training and deployment pipelines\n",
    "\n",
    "You now understand how data flows through ML systems! \ud83d\ude80\n",
    "\"\"\""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}