mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-06-03 18:50:52 -05:00
- Rename modules/data/ → modules/dataloader/
- Rename data_dev.py → dataloader_dev.py
- Update NBDev export target: core.data → core.dataloader
- Rename test files: test_data.py → test_dataloader.py
- Update package exports to tinytorch.core.dataloader
- Update module imports and internal references

This makes the module name more descriptive and aligned with ML industry standards.
1677 lines
70 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"---\n",
|
|
"jupyter:\n",
|
|
" jupytext:\n",
|
|
" text_representation:\n",
|
|
" extension: .py\n",
|
|
" format_name: percent\n",
|
|
" format_version: '1.3'\n",
|
|
" jupytext_version: 1.17.1\n",
|
|
"---\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"# Module 4: Data - Data Loading and Preprocessing\n",
|
|
"\n",
|
|
"Welcome to the Data module! This is where you'll learn how to efficiently load, process, and manage data for machine learning systems.\n",
|
|
"\n",
|
|
"## Learning Goals\n",
|
|
"- Understand data pipelines as the foundation of ML systems\n",
|
|
"- Implement efficient data loading with memory management\n",
|
|
"- Build reusable dataset abstractions for different data types\n",
|
|
"- Master batching strategies and I/O optimization\n",
|
|
"- Learn systems thinking for data engineering\n",
|
|
"\n",
|
|
"## Build \u2192 Use \u2192 Understand\n",
|
|
"1. **Build**: Create dataset classes and data loaders\n",
|
|
"2. **Use**: Load real datasets and train models\n",
|
|
"3. **Understand**: How data engineering affects system performance\n",
|
|
"\n",
|
|
"## Module Dependencies\n",
|
|
"This module builds on previous modules:\n",
|
|
"- **tensor** \u2192 **activations** \u2192 **layers** \u2192 **networks** \u2192 **data**\n",
|
|
"- Data feeds into training: data \u2192 autograd \u2192 training\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"## \ud83d\udce6 Where This Code Lives in the Final Package\n",
|
|
"\n",
|
|
"**Learning Side:** You work in `modules/dataloader/dataloader_dev.py` \n",
|
|
"**Building Side:** Code exports to `tinytorch.core.dataloader`\n",
|
|
"\n",
|
|
"```python\n",
|
|
"# Final package structure:\n",
|
|
"from tinytorch.core.dataloader import Dataset, DataLoader, CIFAR10Dataset\n",
|
|
"from tinytorch.core.tensor import Tensor\n",
|
|
"from tinytorch.core.networks import Sequential\n",
|
|
"```\n",
|
|
"\n",
|
|
"**Why this matters:**\n",
|
|
"- **Learning:** Focused modules for deep understanding\n",
|
|
"- **Production:** Proper organization like PyTorch's `torch.utils.data`\n",
"- **Consistency:** All data loading utilities live together in `core.dataloader`\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| default_exp core.dataloader\n",
|
|
"\n",
|
|
"# Setup and imports\n",
|
|
"import numpy as np\n",
|
|
"import sys\n",
|
|
"import os\n",
|
|
"import pickle\n",
|
|
"import struct\n",
|
|
"from typing import List, Tuple, Optional, Union, Iterator\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import urllib.request\n",
|
|
"import tarfile\n",
|
|
"\n",
|
|
"# Import our building blocks\n",
|
|
"from tinytorch.core.tensor import Tensor\n",
|
|
"\n",
|
|
"print(\"\ud83d\udd25 TinyTorch Data Module\")\n",
|
|
"print(f\"NumPy version: {np.__version__}\")\n",
|
|
"print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
|
|
"print(\"Ready to build data pipelines!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| export\n",
|
|
"import numpy as np\n",
|
|
"import sys\n",
|
|
"import os\n",
|
|
"import pickle\n",
|
|
"import struct\n",
|
|
"from typing import List, Tuple, Optional, Union, Iterator\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"import urllib.request\n",
|
|
"import tarfile\n",
|
|
"\n",
|
|
"# Import our building blocks\n",
|
|
"from tinytorch.core.tensor import Tensor"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"#| hide\n",
"#| export\n",
"def _should_show_plots():\n",
"    \"\"\"Check if we should show plots (disable during testing)\"\"\"\n",
"    return 'pytest' not in sys.modules and 'test' not in sys.argv"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"## Step 1: What is Data Engineering?\n",
|
|
"\n",
|
|
"### Definition\n",
|
|
"**Data engineering** is the foundation of all machine learning systems. It involves loading, processing, and managing data efficiently so that models can learn from it.\n",
|
|
"\n",
|
|
"### Why Data Engineering Matters\n",
|
|
"- **Data is the fuel**: Without proper data pipelines, nothing else works\n",
|
|
"- **I/O bottlenecks**: Data loading is often the biggest performance bottleneck\n",
|
|
"- **Memory management**: How you handle data affects everything else\n",
|
|
"- **Production reality**: Data pipelines are critical in real ML systems\n",
|
|
"\n",
|
|
"### The Fundamental Insight\n",
|
|
"**Data engineering is about managing the flow of information through your system:**\n",
|
|
"```\n",
|
|
"Raw Data \u2192 Load \u2192 Preprocess \u2192 Batch \u2192 Feed to Model\n",
|
|
"```\n",
|
|
"\n",
|
|
"### Real-World Examples\n",
|
|
"- **Image datasets**: CIFAR-10, ImageNet, MNIST\n",
|
|
"- **Text datasets**: Wikipedia, books, social media\n",
|
|
"- **Tabular data**: CSV files, databases, spreadsheets\n",
|
|
"- **Audio data**: Speech recordings, music files\n",
|
|
"\n",
|
|
"### Systems Thinking\n",
|
|
"- **Memory efficiency**: Handle datasets larger than RAM\n",
|
|
"- **I/O optimization**: Read from disk efficiently\n",
|
|
"- **Batching strategies**: Trade-offs between memory and speed\n",
|
|
"- **Caching**: When to cache vs recompute\n",
|
|
"\n",
|
|
"### Visual Intuition\n",
|
|
"```\n",
|
|
"Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...]\n",
"Load: [Tensor(3x32x32), Tensor(3x32x32), Tensor(3x32x32), ...]\n",
"Batch: [Tensor(32, 3, 32, 32)] # 32 images at once, channel-first like dataset[i]\n",
|
|
"Model: Process batch efficiently\n",
|
|
"```\n",
|
|
"\n",
|
|
"Let's start by building the most fundamental component: **Dataset**.\n",
|
|
"\"\"\""
|
|
]
|
|
},
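The load, preprocess, and batch stages described above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the helper names `preprocess` and `make_batch` are hypothetical and not part of TinyTorch, the images are random stand-ins rather than real files, and the channel-first `(3, 32, 32)` layout matches what the dataset classes in this module return.

```python
import numpy as np


def preprocess(image):
    """Scale uint8 pixel values to float32 in the range [0, 1]."""
    return image.astype(np.float32) / 255.0


def make_batch(images):
    """Stack equally shaped samples along a new leading batch axis."""
    return np.stack([preprocess(img) for img in images], axis=0)


# "Load": 32 random stand-in images in channel-first layout (C, H, W).
rng = np.random.default_rng(0)
raw = [rng.integers(0, 256, size=(3, 32, 32), dtype=np.uint8) for _ in range(32)]

# "Preprocess" + "Batch": one array holding all 32 images.
batch = make_batch(raw)
print(batch.shape, batch.dtype)
```

Stacking only works when every sample has the same shape, which is one reason image pipelines resize or crop before batching.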
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"#| export\n",
"class Dataset:\n",
"    \"\"\"\n",
"    Base Dataset class: Abstract interface for all datasets.\n",
"\n",
"    The fundamental abstraction for data loading in TinyTorch.\n",
"    Students implement concrete datasets by inheriting from this class.\n",
"\n",
"    TODO: Implement the base Dataset class with required methods.\n",
"\n",
"    APPROACH:\n",
"    1. Define the interface that all datasets must implement\n",
"    2. Include methods for getting individual samples and dataset size\n",
"    3. Make it easy to extend for different data types\n",
"\n",
"    EXAMPLE:\n",
"    dataset = CIFAR10Dataset(\"data/cifar10/\")\n",
"    sample, label = dataset[0]  # Get first sample\n",
"    size = len(dataset)  # Get dataset size\n",
"\n",
"    HINTS:\n",
"    - Use abstract methods that subclasses must implement\n",
"    - Include __getitem__ for indexing and __len__ for size\n",
"    - Add helper methods for getting sample shapes and number of classes\n",
"    \"\"\"\n",
"\n",
"    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
"        \"\"\"\n",
"        Get a single sample and label by index.\n",
"\n",
"        Args:\n",
"            index: Index of the sample to retrieve\n",
"\n",
"        Returns:\n",
"            Tuple of (data, label) tensors\n",
"\n",
"        TODO: Implement abstract method for getting samples.\n",
"\n",
"        STEP-BY-STEP:\n",
"        1. This is an abstract method - subclasses will implement it\n",
"        2. Return a tuple of (data, label) tensors\n",
"        3. Data should be the input features, label should be the target\n",
"\n",
"        EXAMPLE:\n",
"        dataset[0] should return (Tensor(image_data), Tensor(label))\n",
"        \"\"\"\n",
"        raise NotImplementedError(\"Student implementation required\")\n",
"\n",
"    def __len__(self) -> int:\n",
"        \"\"\"\n",
"        Get the total number of samples in the dataset.\n",
"\n",
"        TODO: Implement abstract method for getting dataset size.\n",
"\n",
"        STEP-BY-STEP:\n",
"        1. This is an abstract method - subclasses will implement it\n",
"        2. Return the total number of samples in the dataset\n",
"\n",
"        EXAMPLE:\n",
"        len(dataset) should return 50000 for CIFAR-10 training set\n",
"        \"\"\"\n",
"        raise NotImplementedError(\"Student implementation required\")\n",
"\n",
"    def get_sample_shape(self) -> Tuple[int, ...]:\n",
"        \"\"\"\n",
"        Get the shape of a single data sample.\n",
"\n",
"        TODO: Implement method to get sample shape.\n",
"\n",
"        STEP-BY-STEP:\n",
"        1. Get the first sample using self[0]\n",
"        2. Extract the data part (first element of tuple)\n",
"        3. Return the shape of the data tensor\n",
"\n",
"        EXAMPLE:\n",
"        For CIFAR-10: returns (3, 32, 32) for RGB images\n",
"        \"\"\"\n",
"        raise NotImplementedError(\"Student implementation required\")\n",
"\n",
"    def get_num_classes(self) -> int:\n",
"        \"\"\"\n",
"        Get the number of classes in the dataset.\n",
"\n",
"        TODO: Implement abstract method for getting number of classes.\n",
"\n",
"        STEP-BY-STEP:\n",
"        1. This is an abstract method - subclasses will implement it\n",
"        2. Return the total number of classes in the dataset\n",
"\n",
"        EXAMPLE:\n",
"        For CIFAR-10: returns 10 (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck)\n",
"        \"\"\"\n",
"        raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"#| hide\n",
"#| export\n",
"class Dataset:\n",
"    \"\"\"Base Dataset class: Abstract interface for all datasets.\"\"\"\n",
"\n",
"    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
"        \"\"\"Get a single sample and label by index.\"\"\"\n",
"        raise NotImplementedError(\"Subclasses must implement __getitem__\")\n",
"\n",
"    def __len__(self) -> int:\n",
"        \"\"\"Get the total number of samples in the dataset.\"\"\"\n",
"        raise NotImplementedError(\"Subclasses must implement __len__\")\n",
"\n",
"    def get_sample_shape(self) -> Tuple[int, ...]:\n",
"        \"\"\"Get the shape of a single data sample.\"\"\"\n",
"        sample, _ = self[0]\n",
"        return sample.shape\n",
"\n",
"    def get_num_classes(self) -> int:\n",
"        \"\"\"Get the number of classes in the dataset.\"\"\"\n",
"        raise NotImplementedError(\"Subclasses must implement get_num_classes\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"### \ud83e\uddea Test Your Base Dataset\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"# Test the base Dataset class\n",
"print(\"Testing base Dataset class...\")\n",
"\n",
"try:\n",
"    # Create a simple test dataset\n",
"    class TestDataset(Dataset):\n",
"        def __init__(self, num_samples=10):\n",
"            self.num_samples = num_samples\n",
"            self.data = [Tensor(np.random.randn(3, 32, 32)) for _ in range(num_samples)]\n",
"            self.labels = [Tensor(np.array(i % 3)) for i in range(num_samples)]\n",
"\n",
"        def __getitem__(self, index):\n",
"            return self.data[index], self.labels[index]\n",
"\n",
"        def __len__(self):\n",
"            return self.num_samples\n",
"\n",
"        def get_num_classes(self):\n",
"            return 3\n",
"\n",
"    # Test the dataset\n",
"    dataset = TestDataset(5)\n",
"    print(f\"\u2705 Dataset created with {len(dataset)} samples\")\n",
"\n",
"    # Test indexing\n",
"    sample, label = dataset[0]\n",
"    print(f\"\u2705 Sample shape: {sample.shape}\")\n",
"    print(f\"\u2705 Label: {label}\")\n",
"\n",
"    # Test helper methods\n",
"    print(f\"\u2705 Sample shape: {dataset.get_sample_shape()}\")\n",
"    print(f\"\u2705 Number of classes: {dataset.get_num_classes()}\")\n",
"\n",
"    print(\"\ud83c\udf89 Base Dataset class works!\")\n",
"\n",
"except Exception as e:\n",
"    print(f\"\u274c Error: {e}\")\n",
"    print(\"Make sure to implement the base Dataset class above!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"## Step 2: Understanding CIFAR-10 Dataset\n",
|
|
"\n",
|
|
"Now let's build a real dataset! We'll focus on **CIFAR-10** - the perfect dataset for learning data loading.\n",
|
|
"\n",
|
|
"### Why CIFAR-10?\n",
|
|
"- **Perfect size**: 170MB - large enough for optimization, small enough to manage\n",
|
|
"- **Real data**: 32x32 color images, 10 classes\n",
|
|
"- **Classic dataset**: Every ML student should know it\n",
|
|
"- **Good complexity**: Requires proper data loading techniques\n",
|
|
"\n",
|
|
"### The CIFAR-10 Format\n",
|
|
"```\n",
|
|
"File structure:\n",
|
|
"- data_batch_1: 10,000 images + labels\n",
|
|
"- data_batch_2: 10,000 images + labels\n",
|
|
"- ...\n",
|
|
"- test_batch: 10,000 test images\n",
|
|
"- batches.meta: Class names and metadata\n",
|
|
"\n",
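"Binary version (files named data_batch_1.bin, ...):\n",
"- Each record: 3073 bytes (1 label byte + 3072 pixel bytes)\n",
"- Record layout: [label, 1024 red bytes, 1024 green bytes, 1024 blue bytes], row-major within each channel\n",
"- 3 x 32 x 32 = 3072 bytes per image\n",
"\n",
"Python version (the one loaded in this module):\n",
"- Each batch file is a pickled dict\n",
"- 'data': 10000 x 3072 uint8 array, each row in the same channel-planar layout\n",
"- 'labels': list of 10,000 integers in the range 0-9\n",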
|
|
"```\n",
|
|
"\n",
|
|
"### Data Loading Challenges\n",
"- **File parsing**: The Python version stores pickled batch dicts; the binary version uses fixed-size 3073-byte records\n",
|
|
"- **Memory management**: 60,000 images need efficient handling\n",
|
|
"- **Batching**: Grouping samples for efficient processing\n",
|
|
"- **Preprocessing**: Normalization, augmentation, etc.\n",
|
|
"\n",
|
|
"Let's implement CIFAR-10 loading step by step!\n",
|
|
"\"\"\""
|
|
]
|
|
},
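To make the record layout concrete, here is a small sketch that decodes one record of the binary CIFAR-10 distribution: one label byte followed by 1024 red, 1024 green, and 1024 blue bytes. The function name `parse_record` is hypothetical and the record below is synthetic, so nothing is downloaded. Note that this module itself loads the pickled Python version rather than these raw records.

```python
import numpy as np

RECORD_BYTES = 1 + 3 * 32 * 32  # 1 label byte + 3072 pixel bytes = 3073


def parse_record(record: bytes):
    """Split one 3073-byte record into a (3, 32, 32) uint8 image and a label."""
    if len(record) != RECORD_BYTES:
        raise ValueError(f"expected {RECORD_BYTES} bytes, got {len(record)}")
    label = record[0]  # indexing bytes yields an int
    # Skip the label byte; the rest is channel-planar: R plane, G plane, B plane.
    pixels = np.frombuffer(record, dtype=np.uint8, offset=1)
    image = pixels.reshape(3, 32, 32)
    return image, label


# Synthetic record: label 7, red plane all 10, green all 20, blue all 30.
record = bytes([7]) + bytes([10]) * 1024 + bytes([20]) * 1024 + bytes([30]) * 1024
image, label = parse_record(record)
print(image.shape, label)
```

Because the channels are stored as separate planes, a plain `reshape(3, 32, 32)` is enough; no transpose is needed to reach channel-first layout.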
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"#| export\n",
"class CIFAR10Dataset(Dataset):\n",
"    \"\"\"\n",
"    CIFAR-10 Dataset: Load and manage CIFAR-10 image data.\n",
"\n",
"    CIFAR-10 contains 60,000 32x32 color images in 10 classes.\n",
"    Perfect for learning data loading and image processing.\n",
"\n",
"    Args:\n",
"        root_dir: Directory containing CIFAR-10 files\n",
"        train: If True, load training data. If False, load test data.\n",
"        download: If True, download dataset if not present\n",
"\n",
"    TODO: Implement CIFAR-10 dataset loading.\n",
"\n",
"    APPROACH:\n",
"    1. Handle dataset download if needed (with progress bar!)\n",
"    2. Parse binary files to extract images and labels\n",
"    3. Store data efficiently in memory\n",
"    4. Implement indexing and size methods\n",
"\n",
"    EXAMPLE:\n",
"    dataset = CIFAR10Dataset(\"data/cifar10/\", train=True)\n",
"    image, label = dataset[0]  # Get first image\n",
"    print(f\"Image shape: {image.shape}\")  # (3, 32, 32)\n",
"    print(f\"Label: {label}\")  # Tensor with class index\n",
"\n",
"    HINTS:\n",
"    - Use pickle to load binary files\n",
"    - Each batch file contains 'data' and 'labels' keys\n",
"    - Reshape data to (3, 32, 32) format\n",
"    - Store images and labels as separate lists\n",
"    - Add progress bar with urllib.request.urlretrieve(url, filename, reporthook=progress_function)\n",
"    - Progress function receives (block_num, block_size, total_size) parameters\n",
"    \"\"\"\n",
"\n",
"    def __init__(self, root_dir: str, train: bool = True, download: bool = True):\n",
"        \"\"\"\n",
"        Initialize CIFAR-10 dataset.\n",
"\n",
"        Args:\n",
"            root_dir: Directory to store/load dataset\n",
"            train: If True, load training data. If False, load test data.\n",
"            download: If True, download dataset if not present\n",
"\n",
"        TODO: Implement CIFAR-10 initialization.\n",
"\n",
"        STEP-BY-STEP:\n",
"        1. Create root directory if it doesn't exist\n",
"        2. Download dataset if needed and not present (with progress bar!)\n",
"        3. Load binary files and parse data\n",
"        4. Store images and labels in memory\n",
"        5. Set up class names\n",
"\n",
"        EXAMPLE:\n",
"        CIFAR10Dataset(\"data/cifar10/\", train=True)\n",
"        creates a dataset with 50,000 training images\n",
"\n",
"        PROGRESS BAR HINT:\n",
"        def show_progress(block_num, block_size, total_size):\n",
"            downloaded = block_num * block_size\n",
"            percent = (downloaded * 100) // total_size\n",
"            print(f\"\\\\rDownloading: {percent}%\", end='', flush=True)\n",
"        \"\"\"\n",
"        raise NotImplementedError(\"Student implementation required\")\n",
"\n",
"    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
"        \"\"\"\n",
"        Get a single image and label by index.\n",
"\n",
"        Args:\n",
"            index: Index of the sample to retrieve\n",
"\n",
"        Returns:\n",
"            Tuple of (image, label) tensors\n",
"\n",
"        TODO: Implement sample retrieval.\n",
"\n",
"        STEP-BY-STEP:\n",
"        1. Get image from self.images[index]\n",
"        2. Get label from self.labels[index]\n",
"        3. Return (Tensor(image), Tensor(label))\n",
"\n",
"        EXAMPLE:\n",
"        image, label = dataset[0]\n",
"        image.shape should be (3, 32, 32)\n",
"        label should be integer 0-9\n",
"        \"\"\"\n",
"        raise NotImplementedError(\"Student implementation required\")\n",
"\n",
"    def __len__(self) -> int:\n",
"        \"\"\"\n",
"        Get the total number of samples in the dataset.\n",
"\n",
"        TODO: Return the length of the dataset.\n",
"\n",
"        STEP-BY-STEP:\n",
"        1. Return len(self.images)\n",
"\n",
"        EXAMPLE:\n",
"        Training set: 50,000 samples\n",
"        Test set: 10,000 samples\n",
"        \"\"\"\n",
"        raise NotImplementedError(\"Student implementation required\")\n",
"\n",
"    def get_num_classes(self) -> int:\n",
"        \"\"\"\n",
"        Get the number of classes in CIFAR-10.\n",
"\n",
"        TODO: Return the number of classes.\n",
"\n",
"        STEP-BY-STEP:\n",
"        1. CIFAR-10 has 10 classes\n",
"        2. Return 10\n",
"\n",
"        EXAMPLE:\n",
"        Returns 10 for CIFAR-10\n",
"        \"\"\"\n",
"        raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
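The progress-bar hint above relies on the `reporthook` callback of `urllib.request.urlretrieve`, which is called with the block count so far, the block size in bytes, and the total size (which can be -1 or 0 when the server does not report it). One way to keep that logic easy to check is to separate formatting from printing. The helper name `format_progress` and the ASCII bar style are assumptions made for this sketch.

```python
def format_progress(block_num, block_size, total_size, bar_length=20):
    """Return a one-line progress string for a urlretrieve-style reporthook."""
    downloaded = block_num * block_size
    if total_size <= 0:
        # Total size unknown: report only the amount downloaded so far.
        return f"Downloaded: {downloaded / (1024 * 1024):.1f} MB"
    # The last block can overshoot the file size, so clamp at 100%.
    percent = min(100, downloaded * 100 // total_size)
    filled = percent * bar_length // 100
    return "[" + "#" * filled + "." * (bar_length - filled) + f"] {percent}%"


def show_progress(block_num, block_size, total_size):
    """Reporthook: redraw the progress line in place."""
    print("\r" + format_progress(block_num, block_size, total_size), end="", flush=True)


print(format_progress(5, 10, 100))
```

A call such as `urllib.request.urlretrieve(url, filename, reporthook=show_progress)` would then redraw the bar after each block.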
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"#| hide\n",
"#| export\n",
"class CIFAR10Dataset(Dataset):\n",
"    \"\"\"CIFAR-10 Dataset: Load and manage CIFAR-10 image data.\"\"\"\n",
"\n",
"    def __init__(self, root_dir: str, train: bool = True, download: bool = True):\n",
"        self.root_dir = root_dir\n",
"        self.train = train\n",
"        self.class_names = ['airplane', 'car', 'bird', 'cat', 'deer',\n",
"                            'dog', 'frog', 'horse', 'ship', 'truck']\n",
"\n",
"        # Create directory if it doesn't exist\n",
"        os.makedirs(root_dir, exist_ok=True)\n",
"\n",
"        # Download if needed\n",
"        if download:\n",
"            self._download_if_needed()\n",
"\n",
"        # Load data\n",
"        self._load_data()\n",
"\n",
"    def _download_if_needed(self):\n",
"        \"\"\"Download CIFAR-10 if not present.\"\"\"\n",
"        cifar_path = os.path.join(self.root_dir, \"cifar-10-batches-py\")\n",
"        if not os.path.exists(cifar_path):\n",
"            print(\"\ud83d\udd04 Downloading CIFAR-10 dataset...\")\n",
"            print(\"\ud83d\udce6 Size: ~170MB (this may take a few minutes)\")\n",
"            url = \"https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\"\n",
"            filename = os.path.join(self.root_dir, \"cifar-10-python.tar.gz\")\n",
"\n",
"            try:\n",
"                # Download with progress bar\n",
"                def show_progress(block_num, block_size, total_size):\n",
"                    \"\"\"Show download progress bar.\"\"\"\n",
"                    downloaded = block_num * block_size\n",
"                    if total_size > 0:\n",
"                        percent = min(100, (downloaded * 100) // total_size)\n",
"                        bar_length = 50\n",
"                        filled_length = (percent * bar_length) // 100\n",
"                        bar = '\u2588' * filled_length + '\u2591' * (bar_length - filled_length)\n",
"\n",
"                        # Convert bytes to MB\n",
"                        downloaded_mb = downloaded / (1024 * 1024)\n",
"                        total_mb = total_size / (1024 * 1024)\n",
"\n",
"                        print(f\"\\r\ud83d\udce5 [{bar}] {percent}% ({downloaded_mb:.1f}/{total_mb:.1f} MB)\", end='', flush=True)\n",
"                    else:\n",
"                        # Fallback if total size unknown\n",
"                        downloaded_mb = downloaded / (1024 * 1024)\n",
"                        print(f\"\\r\ud83d\udce5 Downloaded: {downloaded_mb:.1f} MB\", end='', flush=True)\n",
"\n",
"                urllib.request.urlretrieve(url, filename, reporthook=show_progress)\n",
"                print()  # New line after progress bar\n",
"\n",
"                # Extract\n",
"                print(\"\ud83d\udcc2 Extracting CIFAR-10 files...\")\n",
"                with tarfile.open(filename, 'r:gz') as tar:\n",
"                    tar.extractall(self.root_dir, filter='data')\n",
"\n",
"                # Clean up\n",
"                os.remove(filename)\n",
"                print(\"\u2705 CIFAR-10 downloaded and extracted successfully!\")\n",
"\n",
"            except Exception as e:\n",
"                print(f\"\\n\u274c Download failed: {e}\")\n",
"                print(\"Please download CIFAR-10 manually from https://www.cs.toronto.edu/~kriz/cifar.html\")\n",
"\n",
"    def _load_data(self):\n",
"        \"\"\"Load CIFAR-10 data from binary files.\"\"\"\n",
"        cifar_path = os.path.join(self.root_dir, \"cifar-10-batches-py\")\n",
"\n",
"        self.images = []\n",
"        self.labels = []\n",
"\n",
"        if self.train:\n",
"            # Load training batches\n",
"            for i in range(1, 6):\n",
"                batch_file = os.path.join(cifar_path, f\"data_batch_{i}\")\n",
"                if os.path.exists(batch_file):\n",
"                    with open(batch_file, 'rb') as f:\n",
"                        batch = pickle.load(f, encoding='bytes')\n",
"                        # Convert bytes keys to strings\n",
"                        batch = {k.decode('utf-8') if isinstance(k, bytes) else k: v for k, v in batch.items()}\n",
"\n",
"                        # Extract images and labels\n",
"                        images = batch['data'].reshape(-1, 3, 32, 32).astype(np.float32)\n",
"                        labels = batch['labels']\n",
"\n",
"                        self.images.extend(images)\n",
"                        self.labels.extend(labels)\n",
"        else:\n",
"            # Load test batch\n",
"            test_file = os.path.join(cifar_path, \"test_batch\")\n",
"            if os.path.exists(test_file):\n",
"                with open(test_file, 'rb') as f:\n",
"                    batch = pickle.load(f, encoding='bytes')\n",
"                    # Convert bytes keys to strings\n",
"                    batch = {k.decode('utf-8') if isinstance(k, bytes) else k: v for k, v in batch.items()}\n",
"\n",
"                    # Extract images and labels\n",
"                    self.images = batch['data'].reshape(-1, 3, 32, 32).astype(np.float32)\n",
"                    self.labels = batch['labels']\n",
"\n",
"        print(f\"\u2705 Loaded {len(self.images)} {'training' if self.train else 'test'} samples\")\n",
"\n",
"    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
"        \"\"\"Get a single image and label by index.\"\"\"\n",
"        image = Tensor(self.images[index])\n",
"        label = Tensor(np.array(self.labels[index]))\n",
"        return image, label\n",
"\n",
"    def __len__(self) -> int:\n",
"        \"\"\"Get the total number of samples in the dataset.\"\"\"\n",
"        return len(self.images)\n",
"\n",
"    def get_num_classes(self) -> int:\n",
"        \"\"\"Get the number of classes in CIFAR-10.\"\"\"\n",
"        return 10"
|
|
]
|
|
},
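The batch-loading step can be tried without downloading anything by writing a tiny fake batch and reading it back the same way. This is a sketch under assumptions: `load_batch` is a hypothetical standalone helper, and the fake batch mimics only the `data` and `labels` entries with 4 all-zero images. Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np


def load_batch(path):
    """Load one pickled batch and return (images, labels) as NumPy arrays."""
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    # Keys may be bytes (b"data") depending on how the file was pickled.
    batch = {k.decode("utf-8") if isinstance(k, bytes) else k: v for k, v in batch.items()}
    # Each row holds 3072 values in channel-planar order: reshape to (N, 3, 32, 32).
    images = batch["data"].reshape(-1, 3, 32, 32).astype(np.float32)
    labels = np.asarray(batch["labels"], dtype=np.int64)
    return images, labels


# Fake batch with bytes keys, standing in for a real data_batch_N file.
fake = {b"data": np.zeros((4, 3072), dtype=np.uint8), b"labels": [0, 1, 2, 3]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_batch_fake")
    with open(path, "wb") as f:
        pickle.dump(fake, f)
    images, labels = load_batch(path)

print(images.shape, images.dtype, labels.tolist())
```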
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"### \ud83e\uddea Test Your CIFAR-10 Dataset\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"# Test CIFAR-10 dataset (skip download for now)\n",
"print(\"Testing CIFAR-10 dataset...\")\n",
"\n",
"try:\n",
"    # Create a mock dataset for testing without download\n",
"    class MockCIFAR10Dataset(Dataset):\n",
"        def __init__(self, size, train=True):\n",
"            self.size = size\n",
"            self.train = train\n",
"            # randint's upper bound is exclusive, so 256 covers the full uint8 range\n",
"            self.data = [np.random.randint(0, 256, (3, 32, 32), dtype=np.uint8) for _ in range(size)]\n",
"            self.labels = [np.random.randint(0, 10) for _ in range(size)]\n",
"\n",
"        def __getitem__(self, index):\n",
"            return Tensor(self.data[index].astype(np.float32)), Tensor(np.array(self.labels[index]))\n",
"\n",
"        def __len__(self):\n",
"            return self.size\n",
"\n",
"        def get_num_classes(self):\n",
"            return 10\n",
"\n",
"    # Test the dataset\n",
"    dataset = MockCIFAR10Dataset(50)\n",
"    print(f\"\u2705 Dataset created with {len(dataset)} samples\")\n",
"\n",
"    # Test indexing\n",
"    image, label = dataset[0]\n",
"    print(f\"\u2705 Image shape: {image.shape}\")\n",
"    print(f\"\u2705 Label: {label}\")\n",
"    print(f\"\u2705 Number of classes: {dataset.get_num_classes()}\")\n",
"\n",
"    # Test multiple samples\n",
"    for i in range(3):\n",
"        img, lbl = dataset[i]\n",
"        print(f\"\u2705 Sample {i}: {img.shape}, class {int(lbl.data)}\")\n",
"\n",
"    print(\"\ud83c\udf89 CIFAR-10 dataset structure works!\")\n",
"\n",
"except Exception as e:\n",
"    print(f\"\u274c Error: {e}\")\n",
"    print(\"Make sure to implement the CIFAR-10 dataset above!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"### \ud83d\udc41\ufe0f Visual Feedback: See Your Data!\n",
|
|
"\n",
|
|
"Let's add a visualization function to actually **see** the CIFAR-10 images we're loading. \n",
|
|
"This provides immediate visual feedback and builds intuition about the data.\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"def show_cifar10_samples(dataset, num_samples=8, title=\"CIFAR-10 Samples\"):\n",
"    \"\"\"\n",
"    Display a grid of CIFAR-10 images with their labels.\n",
"\n",
"    Args:\n",
"        dataset: CIFAR-10 dataset\n",
"        num_samples: Number of samples to display\n",
"        title: Title for the plot\n",
"\n",
"    TODO: Implement visualization function.\n",
"\n",
"    APPROACH:\n",
"    1. Create a matplotlib subplot grid\n",
"    2. Get random samples from dataset\n",
"    3. Display each image with its class label\n",
"    4. Handle the image format (CHW -> HWC, normalize to 0-1)\n",
"\n",
"    EXAMPLE:\n",
"    show_cifar10_samples(dataset, num_samples=8)\n",
"    # Shows 8 CIFAR-10 images in a 2x4 grid\n",
"\n",
"    HINTS:\n",
"    - Use plt.subplots() to create grid\n",
"    - Convert image from (C, H, W) to (H, W, C) for display\n",
"    - Normalize pixel values to [0, 1] range\n",
"    - Use dataset.class_names for labels\n",
"\n",
"    NOTE: This is a development/learning tool, not part of the core package.\n",
"    \"\"\"\n",
"    raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"#| hide\n",
"def show_cifar10_samples(dataset, num_samples=8, title=\"CIFAR-10 Samples\"):\n",
"    \"\"\"Display a grid of CIFAR-10 images with their labels.\"\"\"\n",
"    if not _should_show_plots():\n",
"        return\n",
"\n",
"    # Create subplot grid (round cols up so an odd num_samples still fits)\n",
"    rows = 2\n",
"    cols = (num_samples + rows - 1) // rows\n",
"    # squeeze=False keeps axes 2-D even when cols == 1\n",
"    fig, axes = plt.subplots(rows, cols, figsize=(12, 6), squeeze=False)\n",
"    fig.suptitle(title, fontsize=16)\n",
"\n",
"    # Get random samples\n",
"    indices = np.random.choice(len(dataset), num_samples, replace=False)\n",
"\n",
"    for i, idx in enumerate(indices):\n",
"        row = i // cols\n",
"        col = i % cols\n",
"\n",
"        # Get image and label\n",
"        image, label = dataset[idx]\n",
"\n",
"        # Convert from (C, H, W) to (H, W, C) and normalize to [0, 1]\n",
"        if hasattr(image, 'data'):\n",
"            img_data = image.data\n",
"        else:\n",
"            img_data = image\n",
"\n",
"        # Handle different tensor formats\n",
"        if img_data.shape[0] == 3:  # (C, H, W)\n",
"            img_display = np.transpose(img_data, (1, 2, 0))\n",
"        else:\n",
"            img_display = img_data\n",
"\n",
"        # Normalize to [0, 1] range\n",
"        img_display = img_display.astype(np.float32)\n",
"        if img_display.max() > 1.0:\n",
"            img_display = img_display / 255.0\n",
"\n",
"        # Ensure values are in [0, 1]\n",
"        img_display = np.clip(img_display, 0, 1)\n",
"\n",
"        # Display image\n",
"        ax = axes[row, col]\n",
"        ax.imshow(img_display)\n",
"        ax.axis('off')\n",
"\n",
"        # Add label\n",
"        if hasattr(label, 'data'):\n",
"            label_idx = int(label.data)\n",
"        else:\n",
"            label_idx = int(label)\n",
"\n",
"        if hasattr(dataset, 'class_names'):\n",
"            class_name = dataset.class_names[label_idx]\n",
"            ax.set_title(f'{class_name} ({label_idx})', fontsize=10)\n",
"        else:\n",
"            ax.set_title(f'Class {label_idx}', fontsize=10)\n",
"\n",
"    # Hide any unused grid cells (when num_samples is odd)\n",
"    for j in range(num_samples, rows * cols):\n",
"        axes[j // cols, j % cols].axis('off')\n",
"\n",
"    plt.tight_layout()\n",
"    plt.show()"
|
|
]
|
|
},
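The image conversion inside the visualization function (channel-first to channel-last, then scaling into [0, 1]) is easy to check on its own, without matplotlib. The helper name `to_displayable` is hypothetical, and the heuristic of dividing by 255 whenever the maximum exceeds 1.0 is the same assumption the function above makes.

```python
import numpy as np


def to_displayable(image):
    """Convert a (C, H, W) image to (H, W, C) float32 values in [0, 1]."""
    img = np.asarray(image, dtype=np.float32)
    if img.ndim == 3 and img.shape[0] == 3:
        # Move the channel axis last, which is what image viewers expect.
        img = np.transpose(img, (1, 2, 0))
    if img.max() > 1.0:
        # Assume 0-255 pixel values and rescale.
        img = img / 255.0
    return np.clip(img, 0.0, 1.0)


chw = np.full((3, 32, 32), 255, dtype=np.uint8)  # an all-white test image
hwc = to_displayable(chw)
print(hwc.shape, hwc.min(), hwc.max())
```

One limitation of this heuristic: an image that is already stored on a 0-255 scale but happens to be almost black (maximum of 1 or less) would be left unscaled.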
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"# Test visual feedback with real CIFAR-10 data\n",
"print(\"\ud83c\udfa8 Testing visual feedback with real CIFAR-10...\")\n",
"\n",
"try:\n",
"    # Create real CIFAR-10 dataset for visualization\n",
"    import tempfile\n",
"    import os\n",
"\n",
"    with tempfile.TemporaryDirectory() as temp_dir:\n",
"        # Load real CIFAR-10 dataset\n",
"        cifar_dataset = CIFAR10Dataset(temp_dir, train=True, download=True)\n",
"\n",
"        print(f\"\u2705 Loaded {len(cifar_dataset)} real CIFAR-10 samples\")\n",
"        print(f\"\u2705 Class names: {cifar_dataset.class_names}\")\n",
"\n",
"        # Show sample images\n",
"        if _should_show_plots():\n",
"            print(\"\ud83d\uddbc\ufe0f Displaying sample images...\")\n",
"            show_cifar10_samples(cifar_dataset, num_samples=8, title=\"Real CIFAR-10 Training Samples\")\n",
"\n",
"        # Show some statistics\n",
"        sample_images = [cifar_dataset[i][0] for i in range(100)]\n",
"        pixel_values = [img.data for img in sample_images]\n",
"        all_pixels = np.concatenate([pixels.flatten() for pixels in pixel_values])\n",
"\n",
"        print(f\"\u2705 Pixel value range: [{all_pixels.min():.1f}, {all_pixels.max():.1f}]\")\n",
"        print(f\"\u2705 Mean pixel value: {all_pixels.mean():.1f}\")\n",
"        print(f\"\u2705 Std pixel value: {all_pixels.std():.1f}\")\n",
"\n",
"        print(\"\ud83c\udf89 Visual feedback works! You can see your data!\")\n",
"\n",
"except Exception as e:\n",
"    print(f\"\u274c Error: {e}\")\n",
"    print(\"Make sure CIFAR-10 dataset is implemented correctly!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"## Step 3: Understanding Data Loading\n",
|
|
"\n",
|
|
"Now let's build a **DataLoader** to efficiently batch and iterate through our dataset.\n",
|
|
"\n",
|
|
"### Why DataLoaders Matter\n",
|
|
"- **Batching**: Process multiple samples at once (GPU efficiency)\n",
|
|
"- **Shuffling**: Randomize order for better training\n",
|
|
"- **Memory management**: Handle large datasets efficiently\n",
|
|
"- **I/O optimization**: Production loaders can load data in parallel with training (ours stays single-threaded)\n",
|
|
"\n",
|
|
"### The DataLoader Pattern\n",
|
|
"```\n",
|
|
"Dataset: [sample1, sample2, sample3, ...]\n",
|
|
"DataLoader: [[batch1], [batch2], [batch3], ...]\n",
|
|
"```\n",
|
|
"\n",
|
|
"### Systems Thinking\n",
|
|
"- **Batch size**: Trade-off between memory and speed\n",
|
|
"- **Shuffling**: Keeps the model from learning patterns in the data order\n",
|
|
"- **Iteration**: Efficient looping through data\n",
|
|
"- **Memory**: Manage large datasets that don't fit in RAM\n",
|
|
"\n",
|
|
"Let's implement a DataLoader!\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| export\n",
|
|
"class DataLoader:\n",
|
|
" \"\"\"\n",
|
|
" DataLoader: Efficiently batch and iterate through datasets.\n",
|
|
" \n",
|
|
" Provides batching, shuffling, and efficient iteration over datasets.\n",
|
|
" Essential for training neural networks efficiently.\n",
|
|
" \n",
|
|
" Args:\n",
|
|
" dataset: Dataset to load from\n",
|
|
" batch_size: Number of samples per batch\n",
|
|
" shuffle: Whether to shuffle data each epoch\n",
|
|
" \n",
|
|
" TODO: Implement DataLoader with batching and shuffling.\n",
|
|
" \n",
|
|
" APPROACH:\n",
|
|
" 1. Store dataset and configuration\n",
|
|
" 2. Implement __iter__ to yield batches\n",
|
|
" 3. Handle shuffling and batching logic\n",
|
|
" 4. Stack individual samples into batches\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
|
|
" for batch_images, batch_labels in dataloader:\n",
|
|
" print(f\"Batch shape: {batch_images.shape}\") # (32, 3, 32, 32)\n",
|
|
" \n",
|
|
" HINTS:\n",
|
|
" - Use np.random.permutation for shuffling\n",
|
|
" - Stack samples using np.stack\n",
|
|
" - Yield batches as (batch_data, batch_labels)\n",
|
|
" - Handle last batch that might be smaller\n",
|
|
" \"\"\"\n",
|
|
" \n",
|
|
" def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):\n",
|
|
" \"\"\"\n",
|
|
" Initialize DataLoader.\n",
|
|
" \n",
|
|
" Args:\n",
|
|
" dataset: Dataset to load from\n",
|
|
" batch_size: Number of samples per batch\n",
|
|
" shuffle: Whether to shuffle data each epoch\n",
|
|
" \n",
|
|
" TODO: Store configuration and dataset.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Store dataset as self.dataset\n",
|
|
" 2. Store batch_size as self.batch_size\n",
|
|
" 3. Store shuffle as self.shuffle\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" DataLoader(dataset, batch_size=32, shuffle=True)\n",
|
|
" \"\"\"\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:\n",
|
|
" \"\"\"\n",
|
|
" Iterate through dataset in batches.\n",
|
|
" \n",
|
|
" Returns:\n",
|
|
" Iterator yielding (batch_data, batch_labels) tuples\n",
|
|
" \n",
|
|
" TODO: Implement batching and shuffling logic.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Create indices list: list(range(len(dataset)))\n",
|
|
" 2. Shuffle indices if self.shuffle is True\n",
|
|
" 3. Loop through indices in batch_size chunks\n",
|
|
" 4. For each batch: collect samples, stack them, yield batch\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" for batch_data, batch_labels in dataloader:\n",
|
|
" # batch_data.shape: (batch_size, ...)\n",
|
|
" # batch_labels.shape: (batch_size,)\n",
|
|
" \"\"\"\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def __len__(self) -> int:\n",
|
|
" \"\"\"\n",
|
|
" Get the number of batches per epoch.\n",
|
|
" \n",
|
|
" TODO: Calculate number of batches.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Get dataset size: len(self.dataset)\n",
|
|
" 2. Calculate: (dataset_size + batch_size - 1) // batch_size\n",
|
|
" 3. This handles the last partial batch correctly\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" Dataset size: 100, batch_size: 32\n",
|
|
" Number of batches: 4 (32, 32, 32, 4)\n",
|
|
" \"\"\"\n",
|
|
" raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| hide\n",
|
|
"#| export\n",
|
|
"class DataLoader:\n",
|
|
" \"\"\"DataLoader: Efficiently batch and iterate through datasets.\"\"\"\n",
|
|
" \n",
|
|
" def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):\n",
|
|
" self.dataset = dataset\n",
|
|
" self.batch_size = batch_size\n",
|
|
" self.shuffle = shuffle\n",
|
|
" \n",
|
|
" def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:\n",
|
|
" \"\"\"Iterate through dataset in batches.\"\"\"\n",
|
|
" # Create indices\n",
|
|
" indices = list(range(len(self.dataset)))\n",
|
|
" \n",
|
|
" # Shuffle if requested\n",
|
|
" if self.shuffle:\n",
|
|
" np.random.shuffle(indices)\n",
|
|
" \n",
|
|
" # Generate batches\n",
|
|
" for i in range(0, len(indices), self.batch_size):\n",
|
|
" batch_indices = indices[i:i + self.batch_size]\n",
|
|
" \n",
|
|
" # Collect samples for this batch\n",
|
|
" batch_data = []\n",
|
|
" batch_labels = []\n",
|
|
" \n",
|
|
" for idx in batch_indices:\n",
|
|
" data, label = self.dataset[idx]\n",
|
|
" batch_data.append(data.data)\n",
|
|
" batch_labels.append(label.data)\n",
|
|
" \n",
|
|
" # Stack into batches\n",
|
|
" batch_data = np.stack(batch_data, axis=0)\n",
|
|
" batch_labels = np.stack(batch_labels, axis=0)\n",
|
|
" \n",
|
|
" yield Tensor(batch_data), Tensor(batch_labels)\n",
|
|
" \n",
|
|
" def __len__(self) -> int:\n",
|
|
" \"\"\"Get the number of batches per epoch.\"\"\"\n",
|
|
" return (len(self.dataset) + self.batch_size - 1) // self.batch_size"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"### \ud83e\uddea Test Your DataLoader\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Test DataLoader\n",
|
|
"print(\"Testing DataLoader...\")\n",
|
|
"\n",
|
|
"try:\n",
|
|
" # Create a test dataset\n",
|
|
" class SimpleDataset(Dataset):\n",
|
|
" def __init__(self, size=100):\n",
|
|
" self.size = size\n",
|
|
" self.data = [np.random.randn(3, 32, 32) for _ in range(size)]\n",
|
|
" self.labels = [i % 10 for i in range(size)]\n",
|
|
" \n",
|
|
" def __getitem__(self, index):\n",
|
|
" return Tensor(self.data[index]), Tensor(np.array(self.labels[index]))\n",
|
|
" \n",
|
|
" def __len__(self):\n",
|
|
" return self.size\n",
|
|
" \n",
|
|
" def get_num_classes(self):\n",
|
|
" return 10\n",
|
|
" \n",
|
|
" # Test DataLoader\n",
|
|
" dataset = SimpleDataset(100)\n",
|
|
" dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
|
|
" \n",
|
|
" print(f\"\u2705 Dataset size: {len(dataset)}\")\n",
|
|
" print(f\"\u2705 Number of batches: {len(dataloader)}\")\n",
|
|
" \n",
|
|
" # Test iteration\n",
|
|
" batch_count = 0\n",
|
|
" for batch_data, batch_labels in dataloader:\n",
|
|
" batch_count += 1\n",
|
|
" print(f\"\u2705 Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}\")\n",
|
|
" if batch_count >= 3: # Only show first 3 batches\n",
|
|
" break\n",
|
|
" \n",
|
|
" print(\"\ud83c\udf89 DataLoader works!\")\n",
|
|
" \n",
|
|
"except Exception as e:\n",
|
|
" print(f\"\u274c Error: {e}\")\n",
|
|
" print(\"Make sure to implement the DataLoader above!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Test DataLoader with visual feedback using real CIFAR-10\n",
|
|
"print(\"\ud83c\udfa8 Testing DataLoader with visual feedback...\")\n",
|
|
"\n",
|
|
"try:\n",
|
|
" import tempfile\n",
|
|
" \n",
|
|
" with tempfile.TemporaryDirectory() as temp_dir:\n",
|
|
" # Load real CIFAR-10 dataset\n",
|
|
" cifar_dataset = CIFAR10Dataset(temp_dir, train=True, download=True)\n",
|
|
" \n",
|
|
" # Create DataLoader\n",
|
|
" dataloader = DataLoader(cifar_dataset, batch_size=16, shuffle=True)\n",
|
|
" \n",
|
|
" print(f\"\u2705 Created DataLoader with {len(dataloader)} batches\")\n",
|
|
" \n",
|
|
" # Get first batch\n",
|
|
" batch_data, batch_labels = next(iter(dataloader))\n",
|
|
" print(f\"\u2705 First batch shape: {batch_data.shape}\")\n",
|
|
" print(f\"\u2705 First batch labels: {batch_labels.shape}\")\n",
|
|
" \n",
|
|
" # Show first few images from the batch\n",
|
|
" print(\"\ud83d\uddbc\ufe0f Displaying first batch images...\")\n",
|
|
" \n",
|
|
" # Create a temporary dataset-like object for visualization\n",
|
|
" class BatchDataset:\n",
|
|
" def __init__(self, batch_data, batch_labels, class_names):\n",
|
|
" self.batch_data = batch_data\n",
|
|
" self.batch_labels = batch_labels\n",
|
|
" self.class_names = class_names\n",
|
|
" \n",
|
|
" def __getitem__(self, index):\n",
|
|
" return Tensor(self.batch_data.data[index]), Tensor(self.batch_labels.data[index])\n",
|
|
" \n",
|
|
" def __len__(self):\n",
|
|
" return self.batch_data.shape[0]\n",
|
|
" \n",
|
|
" batch_dataset = BatchDataset(batch_data, batch_labels, cifar_dataset.class_names)\n",
|
|
" show_cifar10_samples(batch_dataset, num_samples=8, title=\"DataLoader Batch - Real CIFAR-10\")\n",
|
|
" \n",
|
|
" print(\"\ud83c\udf89 DataLoader visual feedback works!\")\n",
|
|
" \n",
|
|
"except Exception as e:\n",
|
|
" print(f\"\u274c Error: {e}\")\n",
|
|
" print(\"Make sure DataLoader and visualization are implemented correctly!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"## Step 4: Understanding Data Preprocessing\n",
|
|
"\n",
|
|
"Finally, let's build a **Normalizer** to preprocess our data for better training.\n",
|
|
"\n",
|
|
"### Why Normalization Matters\n",
|
|
"- **Gradient stability**: Reduces the risk of exploding/vanishing gradients\n",
|
|
"- **Training speed**: Faster convergence\n",
|
|
"- **Numerical stability**: Prevents overflow/underflow\n",
|
|
"- **Consistent scales**: All features have similar ranges\n",
|
|
"\n",
|
|
"### Common Normalization Techniques\n",
|
|
"- **Min-Max**: Scale to [0, 1] range\n",
|
|
"- **Z-score**: Zero mean, unit variance\n",
|
|
"- **ImageNet**: Specific mean/std for pretrained models\n",
|
|
"\n",
|
|
"### The Normalization Process\n",
|
|
"```\n",
|
|
"Raw Data: [0, 255] pixel values\n",
|
|
"Normalized: [0, 1] range (min-max) or zero mean, unit variance (z-score)\n",
|
|
"```\n",
|
|
"\n",
|
|
"Let's implement a flexible normalizer!\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| export\n",
|
|
"class Normalizer:\n",
|
|
" \"\"\"\n",
|
|
" Data Normalizer: Standardize data for better training.\n",
|
|
" \n",
|
|
" Computes mean and standard deviation from training data,\n",
|
|
" then applies normalization to new data.\n",
|
|
" \n",
|
|
" TODO: Implement data normalization.\n",
|
|
" \n",
|
|
" APPROACH:\n",
|
|
" 1. Fit: Compute mean and std from training data\n",
|
|
" 2. Transform: Apply normalization using computed stats\n",
|
|
" 3. Handle both single tensors and batches\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" normalizer = Normalizer()\n",
|
|
" normalizer.fit(training_data) # Compute stats\n",
|
|
" normalized = normalizer.transform(new_data) # Apply normalization\n",
|
|
" \n",
|
|
" HINTS:\n",
|
|
" - Store mean and std as instance variables\n",
|
|
" - Use np.mean and np.std for statistics\n",
|
|
" - Apply: (data - mean) / std\n",
|
|
" - Handle division by zero (add small epsilon)\n",
|
|
" \"\"\"\n",
|
|
" \n",
|
|
" def __init__(self):\n",
|
|
" \"\"\"\n",
|
|
" Initialize normalizer.\n",
|
|
" \n",
|
|
" TODO: Initialize mean and std to None.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Set self.mean = None\n",
|
|
" 2. Set self.std = None\n",
|
|
" 3. Set self.epsilon = 1e-8 (for numerical stability)\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" normalizer = Normalizer()\n",
|
|
" \"\"\"\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def fit(self, data: List[Tensor]):\n",
|
|
" \"\"\"\n",
|
|
" Compute normalization statistics from training data.\n",
|
|
" \n",
|
|
" Args:\n",
|
|
" data: List of tensors to compute statistics from\n",
|
|
" \n",
|
|
" TODO: Compute mean and standard deviation.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Stack all tensors: np.stack([t.data for t in data])\n",
|
|
" 2. Compute mean: np.mean(stacked_data)\n",
|
|
" 3. Compute std: np.std(stacked_data)\n",
|
|
" 4. Store as self.mean and self.std\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" normalizer.fit([tensor1, tensor2, tensor3])\n",
|
|
" \"\"\"\n",
|
|
" raise NotImplementedError(\"Student implementation required\")\n",
|
|
" \n",
|
|
" def transform(self, data: Union[Tensor, List[Tensor]]) -> Union[Tensor, List[Tensor]]:\n",
|
|
" \"\"\"\n",
|
|
" Apply normalization to data.\n",
|
|
" \n",
|
|
" Args:\n",
|
|
" data: Tensor or list of tensors to normalize\n",
|
|
" \n",
|
|
" Returns:\n",
|
|
" Normalized tensor(s)\n",
|
|
" \n",
|
|
" TODO: Apply normalization using computed statistics.\n",
|
|
" \n",
|
|
" STEP-BY-STEP:\n",
|
|
" 1. Check if mean and std are computed (not None)\n",
|
|
" 2. If single tensor: apply (data - mean) / (std + epsilon)\n",
|
|
" 3. If list: apply to each tensor in the list\n",
|
|
" 4. Return normalized data\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" normalized = normalizer.transform(tensor)\n",
|
|
" \"\"\"\n",
|
|
" raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| hide\n",
|
|
"#| export\n",
|
|
"class Normalizer:\n",
|
|
" \"\"\"Data Normalizer: Standardize data for better training.\"\"\"\n",
|
|
" \n",
|
|
" def __init__(self):\n",
|
|
" self.mean = None\n",
|
|
" self.std = None\n",
|
|
" self.epsilon = 1e-8\n",
|
|
" \n",
|
|
" def fit(self, data: List[Tensor]):\n",
|
|
" \"\"\"Compute normalization statistics from training data.\"\"\"\n",
|
|
" # Stack all data\n",
|
|
" all_data = np.stack([t.data for t in data])\n",
|
|
" \n",
|
|
" # Compute statistics\n",
|
|
" self.mean = np.mean(all_data)\n",
|
|
" self.std = np.std(all_data)\n",
|
|
" \n",
|
|
" print(f\"\u2705 Computed normalization stats: mean={self.mean:.4f}, std={self.std:.4f}\")\n",
|
|
" \n",
|
|
" def transform(self, data: Union[Tensor, List[Tensor]]) -> Union[Tensor, List[Tensor]]:\n",
|
|
" \"\"\"Apply normalization to data.\"\"\"\n",
|
|
" if self.mean is None or self.std is None:\n",
|
|
" raise ValueError(\"Must call fit() before transform()\")\n",
|
|
" \n",
|
|
" if isinstance(data, list):\n",
|
|
" # Transform list of tensors\n",
|
|
" return [Tensor((t.data - self.mean) / (self.std + self.epsilon)) for t in data]\n",
|
|
" else:\n",
|
|
" # Transform single tensor\n",
|
|
" return Tensor((data.data - self.mean) / (self.std + self.epsilon))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"### \ud83e\uddea Test Your Normalizer\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Test Normalizer\n",
|
|
"print(\"Testing Normalizer...\")\n",
|
|
"\n",
|
|
"try:\n",
|
|
" # Create test data\n",
|
|
" data = [\n",
|
|
" Tensor(np.random.randn(3, 32, 32) * 50 + 100), # Mean ~100, std ~50\n",
|
|
" Tensor(np.random.randn(3, 32, 32) * 50 + 100),\n",
|
|
" Tensor(np.random.randn(3, 32, 32) * 50 + 100)\n",
|
|
" ]\n",
|
|
" \n",
|
|
" # Test normalizer\n",
|
|
" normalizer = Normalizer()\n",
|
|
" \n",
|
|
" # Fit to data\n",
|
|
" normalizer.fit(data)\n",
|
|
" \n",
|
|
" # Transform data\n",
|
|
" normalized = normalizer.transform(data)\n",
|
|
" \n",
|
|
" # Check results\n",
|
|
" print(f\"\u2705 Original data mean: {np.mean([t.data for t in data]):.4f}\")\n",
|
|
" print(f\"\u2705 Original data std: {np.std([t.data for t in data]):.4f}\")\n",
|
|
" print(f\"\u2705 Normalized data mean: {np.mean([t.data for t in normalized]):.4f}\")\n",
|
|
" print(f\"\u2705 Normalized data std: {np.std([t.data for t in normalized]):.4f}\")\n",
|
|
" \n",
|
|
" # Test single tensor\n",
|
|
" single_tensor = Tensor(np.random.randn(3, 32, 32) * 50 + 100)\n",
|
|
" normalized_single = normalizer.transform(single_tensor)\n",
|
|
" print(f\"\u2705 Single tensor normalized: {normalized_single.shape}\")\n",
|
|
" \n",
|
|
" print(\"\ud83c\udf89 Normalizer works!\")\n",
|
|
" \n",
|
|
"except Exception as e:\n",
|
|
" print(f\"\u274c Error: {e}\")\n",
|
|
" print(\"Make sure to implement the Normalizer above!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"## Step 5: Building a Complete Data Pipeline\n",
|
|
"\n",
|
|
"Now let's put everything together into a complete data pipeline!\n",
|
|
"\n",
|
|
"### The Complete Pipeline\n",
|
|
"```\n",
|
|
"Raw Data \u2192 Dataset \u2192 DataLoader \u2192 Normalizer \u2192 Model\n",
|
|
"```\n",
|
|
"\n",
|
|
"This is the foundation of every machine learning system!\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| export\n",
|
|
"def create_data_pipeline(dataset_path: str = \"data/cifar10/\", \n",
|
|
" batch_size: int = 32, \n",
|
|
" normalize: bool = True,\n",
|
|
" shuffle: bool = True):\n",
|
|
" \"\"\"\n",
|
|
" Create a complete data pipeline for training.\n",
|
|
" \n",
|
|
" Args:\n",
|
|
" dataset_path: Path to dataset\n",
|
|
" batch_size: Batch size for training\n",
|
|
" normalize: Whether to normalize data\n",
|
|
" shuffle: Whether to shuffle data\n",
|
|
" \n",
|
|
" Returns:\n",
|
|
" Tuple of (train_loader, test_loader)\n",
|
|
" \n",
|
|
" TODO: Implement complete data pipeline.\n",
|
|
" \n",
|
|
" APPROACH:\n",
|
|
" 1. Create train and test datasets\n",
|
|
" 2. Create data loaders\n",
|
|
" 3. Fit normalizer on training data\n",
|
|
"    4. Return the train and test loaders\n",
|
|
" \n",
|
|
" EXAMPLE:\n",
|
|
" train_loader, test_loader = create_data_pipeline()\n",
|
|
" for batch_data, batch_labels in train_loader:\n",
|
|
" # Ready for training!\n",
|
|
" \"\"\"\n",
|
|
" raise NotImplementedError(\"Student implementation required\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#| hide\n",
|
|
"#| export\n",
|
|
"def create_data_pipeline(dataset_path: str = \"data/cifar10/\", \n",
|
|
" batch_size: int = 32, \n",
|
|
" normalize: bool = True,\n",
|
|
" shuffle: bool = True):\n",
|
|
" \"\"\"Create a complete data pipeline for training.\"\"\"\n",
|
|
" \n",
|
|
" print(\"\ud83d\udd27 Creating data pipeline...\")\n",
|
|
" \n",
|
|
" # Create datasets with real CIFAR-10 data\n",
|
|
" train_dataset = CIFAR10Dataset(dataset_path, train=True, download=True)\n",
|
|
" test_dataset = CIFAR10Dataset(dataset_path, train=False, download=True)\n",
|
|
" \n",
|
|
" # Create data loaders\n",
|
|
" train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle)\n",
|
|
" test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)\n",
|
|
" \n",
|
|
" # Create normalizer\n",
|
|
" normalizer = None\n",
|
|
" if normalize:\n",
|
|
" normalizer = Normalizer()\n",
|
|
" # Fit on a subset of training data for efficiency\n",
|
|
" sample_data = [train_dataset[i][0] for i in range(min(1000, len(train_dataset)))]\n",
|
|
" normalizer.fit(sample_data)\n",
|
|
" print(f\"\u2705 Computed normalization stats: mean={normalizer.mean:.4f}, std={normalizer.std:.4f}\")\n",
|
|
" \n",
|
|
" print(f\"\u2705 Pipeline created:\")\n",
|
|
" print(f\" - Training batches: {len(train_loader)}\")\n",
|
|
" print(f\" - Test batches: {len(test_loader)}\")\n",
|
|
" print(f\" - Batch size: {batch_size}\")\n",
|
|
" print(f\" - Normalization: {normalize}\")\n",
|
|
" \n",
|
|
" return train_loader, test_loader"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"### \ud83e\uddea Test Your Complete Data Pipeline\n",
|
|
"\"\"\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Test complete data pipeline\n",
|
|
"print(\"Testing complete data pipeline...\")\n",
|
|
"\n",
|
|
"try:\n",
|
|
" # Create pipeline\n",
|
|
" train_loader, test_loader = create_data_pipeline(\n",
|
|
" batch_size=16, normalize=True, shuffle=True\n",
|
|
" )\n",
|
|
" \n",
|
|
" # Test training loop\n",
|
|
" print(\"\\n\ud83d\udd25 Testing training loop:\")\n",
|
|
" for i, (batch_data, batch_labels) in enumerate(train_loader):\n",
|
|
" print(f\" Batch {i+1}: data {batch_data.shape}, labels {batch_labels.shape}\")\n",
|
|
" \n",
|
|
"        # Note: create_data_pipeline only fits the Normalizer; these batches are not normalized\n",
|
|
" \n",
|
|
" if i >= 2: # Only show first 3 batches\n",
|
|
" break\n",
|
|
" \n",
|
|
" # Test test loop\n",
|
|
" print(\"\\n\ud83e\uddea Testing test loop:\")\n",
|
|
" for i, (batch_data, batch_labels) in enumerate(test_loader):\n",
|
|
" print(f\" Test batch {i+1}: data {batch_data.shape}, labels {batch_labels.shape}\")\n",
|
|
" if i >= 1: # Only show first 2 batches\n",
|
|
" break\n",
|
|
" \n",
|
|
" print(\"\\n\ud83c\udf89 Complete data pipeline works!\")\n",
|
|
" print(\"Ready for training neural networks!\")\n",
|
|
" \n",
|
|
"except Exception as e:\n",
|
|
" print(f\"\u274c Error: {e}\")\n",
|
|
" print(\"Make sure to implement the data pipeline above!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Test complete pipeline with visual feedback\n",
|
|
"print(\"\ud83c\udfa8 Testing complete pipeline with visual feedback...\")\n",
|
|
"\n",
|
|
"try:\n",
|
|
" import tempfile\n",
|
|
" \n",
|
|
" with tempfile.TemporaryDirectory() as temp_dir:\n",
|
|
" # Create complete pipeline\n",
|
|
" train_loader, test_loader = create_data_pipeline(\n",
|
|
" dataset_path=temp_dir,\n",
|
|
" batch_size=16, \n",
|
|
" normalize=True, \n",
|
|
" shuffle=True\n",
|
|
" )\n",
|
|
" \n",
|
|
" # Get a batch from training data\n",
|
|
" train_batch_data, train_batch_labels = next(iter(train_loader))\n",
|
|
" print(f\"\u2705 Training batch shape: {train_batch_data.shape}\")\n",
|
|
" \n",
|
|
" # Get a batch from test data\n",
|
|
" test_batch_data, test_batch_labels = next(iter(test_loader))\n",
|
|
" print(f\"\u2705 Test batch shape: {test_batch_data.shape}\")\n",
|
|
" \n",
|
|
" # Show training batch images\n",
|
|
" print(\"\ud83d\uddbc\ufe0f Displaying training batch...\")\n",
|
|
" class PipelineBatchDataset:\n",
|
|
" def __init__(self, batch_data, batch_labels):\n",
|
|
" self.batch_data = batch_data\n",
|
|
" self.batch_labels = batch_labels\n",
|
|
" self.class_names = ['airplane', 'car', 'bird', 'cat', 'deer', \n",
|
|
" 'dog', 'frog', 'horse', 'ship', 'truck']\n",
|
|
" \n",
|
|
" def __getitem__(self, index):\n",
|
|
" return Tensor(self.batch_data.data[index]), Tensor(self.batch_labels.data[index])\n",
|
|
" \n",
|
|
" def __len__(self):\n",
|
|
" return self.batch_data.shape[0]\n",
|
|
" \n",
|
|
" train_batch_dataset = PipelineBatchDataset(train_batch_data, train_batch_labels)\n",
|
|
" show_cifar10_samples(train_batch_dataset, num_samples=8, title=\"Complete Pipeline - Training Batch\")\n",
|
|
" \n",
|
|
" # Show test batch images\n",
|
|
" print(\"\ud83d\uddbc\ufe0f Displaying test batch...\")\n",
|
|
" test_batch_dataset = PipelineBatchDataset(test_batch_data, test_batch_labels)\n",
|
|
" show_cifar10_samples(test_batch_dataset, num_samples=8, title=\"Complete Pipeline - Test Batch\")\n",
|
|
" \n",
|
|
" # Show data statistics\n",
|
|
"        print(f\"\u2705 Training batch range: [{train_batch_data.data.min():.3f}, {train_batch_data.data.max():.3f}]\")\n",
|
|
"        print(f\"\u2705 Training batch mean: {train_batch_data.data.mean():.3f}\")\n",
|
|
"        print(f\"\u2705 Training batch std: {train_batch_data.data.std():.3f}\")\n",
|
|
" \n",
|
|
" print(\"\ud83c\udf89 Complete pipeline visual feedback works!\")\n",
|
|
" print(\"\ud83d\ude80 You can see your entire data pipeline in action!\")\n",
|
|
" \n",
|
|
"except Exception as e:\n",
|
|
" print(f\"\u274c Error: {e}\")\n",
|
|
" print(\"Make sure complete pipeline and visualization work correctly!\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"\"\"\"\n",
|
|
"## \ud83c\udfaf Summary\n",
|
|
"\n",
|
|
"Congratulations! You've built a complete data loading system:\n",
|
|
"\n",
|
|
"### What You Built\n",
|
|
"1. **Dataset**: Abstract interface for data loading\n",
|
|
"2. **CIFAR10Dataset**: Real dataset implementation\n",
|
|
"3. **DataLoader**: Efficient batching and iteration\n",
|
|
"4. **Normalizer**: Data preprocessing for better training\n",
|
|
"5. **Data Pipeline**: Complete system integration\n",
|
|
"\n",
|
|
"### Key Concepts Learned\n",
|
|
"- **Data engineering**: The foundation of ML systems\n",
|
|
"- **Batching**: Efficient processing of multiple samples\n",
|
|
"- **Normalization**: Preprocessing for stable training\n",
|
|
"- **Systems thinking**: Memory, I/O, and performance considerations\n",
|
|
"\n",
|
|
"### Next Steps\n",
|
|
"- **Autograd**: Automatic differentiation for training\n",
|
|
"- **Training**: Optimization loops and loss functions\n",
|
|
"- **Advanced data**: Augmentation, distributed loading, etc.\n",
|
|
"\n",
|
|
"### Real-World Impact\n",
|
|
"This data loading system is the foundation of every ML pipeline:\n",
|
|
"- **Production systems**: Handle millions of samples efficiently\n",
|
|
"- **Research**: Enable experimentation with different datasets\n",
|
|
"- **MLOps**: Integrate with training and deployment pipelines\n",
|
|
"\n",
|
|
"You now understand how data flows through ML systems! \ud83d\ude80\n",
|
|
"\"\"\""
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"name": "python",
|
|
"version": "3.8.0"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
} |