
{
"cells": [
{
"cell_type": "markdown",
"id": "b88708de",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"# DataLoader - Data Loading and Preprocessing\n",
"\n",
"Welcome to the DataLoader module! This is where you'll learn how to efficiently load, process, and manage data for machine learning systems.\n",
"\n",
"## Learning Goals\n",
"- Understand data pipelines as the foundation of ML systems\n",
"- Implement efficient data loading with memory management and batching\n",
"- Build reusable dataset abstractions for different data types\n",
"- Master the Dataset and DataLoader pattern used in all ML frameworks\n",
"- Learn systems thinking for data engineering and I/O optimization\n",
"\n",
"## Build → Use → Reflect\n",
"1. **Build**: Create dataset classes and data loaders from scratch\n",
"2. **Use**: Load real datasets and feed them to neural networks\n",
"3. **Reflect**: How data engineering affects system performance and scalability\n",
"\n",
"## What You'll Learn\n",
"By the end of this module, you'll understand:\n",
"- The Dataset pattern for consistent data access\n",
"- How DataLoaders enable efficient batch processing\n",
"- Why batching and shuffling are crucial for ML\n",
"- How to handle datasets larger than memory\n",
"- The connection between data engineering and model performance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a1a46d2",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "dataloader-imports",
"locked": false,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"#| default_exp core.dataloader\n",
"\n",
"#| export\n",
"import numpy as np\n",
"import sys\n",
"import os\n",
"import pickle\n",
"import struct\n",
"from typing import List, Tuple, Optional, Union, Iterator\n",
"import matplotlib.pyplot as plt\n",
"import urllib.request\n",
"import tarfile\n",
"\n",
"# Import our building blocks - try package first, then local modules\n",
"try:\n",
" from tinytorch.core.tensor import Tensor\n",
"except ImportError:\n",
" # For development, import from local modules\n",
" sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n",
" from tensor_dev import Tensor"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fc27557",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "dataloader-setup",
"locked": false,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"#| hide\n",
"#| export\n",
"def _should_show_plots():\n",
" \"\"\"Check if we should show plots (disable during testing)\"\"\"\n",
" # Check multiple conditions that indicate we're in test mode\n",
" is_pytest = (\n",
" 'pytest' in sys.modules or\n",
" 'test' in sys.argv or\n",
" os.environ.get('PYTEST_CURRENT_TEST') is not None or\n",
" any('test' in arg for arg in sys.argv) or\n",
" any('pytest' in arg for arg in sys.argv)\n",
" )\n",
" \n",
" # Show plots in development mode (when not in test mode)\n",
" return not is_pytest"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f37cacaf",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "dataloader-welcome",
"locked": false,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"print(\"🔥 TinyTorch DataLoader Module\")\n",
"print(f\"NumPy version: {np.__version__}\")\n",
"print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
"print(\"Ready to build data pipelines!\")"
]
},
{
"cell_type": "markdown",
"id": "decfa343",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 📦 Where This Code Lives in the Final Package\n",
"\n",
"**Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py` \n",
"**Building Side:** Code exports to `tinytorch.core.dataloader`\n",
"\n",
"```python\n",
"# Final package structure:\n",
"from tinytorch.core.dataloader import Dataset, DataLoader # Data loading utilities!\n",
"from tinytorch.core.tensor import Tensor # Foundation\n",
"from tinytorch.core.networks import Sequential # Models to train\n",
"```\n",
"\n",
"**Why this matters:**\n",
"- **Learning:** Focused modules for deep understanding of data pipelines\n",
"- **Production:** Proper organization like PyTorch's `torch.utils.data`\n",
"- **Consistency:** All data loading utilities live together in `core.dataloader`\n",
"- **Integration:** Works seamlessly with tensors and networks"
]
},
{
"cell_type": "markdown",
"id": "daf1136d",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Step 1: Understanding Data Pipelines\n",
"\n",
"### What are Data Pipelines?\n",
"**Data pipelines** are the systems that efficiently move data from storage to your model. They're the foundation of all machine learning systems.\n",
"\n",
"### The Data Pipeline Equation\n",
"```\n",
"Raw Data → Load → Transform → Batch → Model → Predictions\n",
"```\n",
"\n",
"### Why Data Pipelines Matter\n",
"- **Performance**: Efficient loading prevents GPU starvation\n",
"- **Scalability**: Handle datasets larger than memory\n",
"- **Consistency**: Reproducible data processing\n",
"- **Flexibility**: Easy to switch between datasets\n",
"\n",
"### Real-World Challenges\n",
"- **Memory constraints**: Datasets often exceed available RAM\n",
"- **I/O bottlenecks**: Disk access is much slower than computation\n",
"- **Batch processing**: Neural networks need batched data for efficiency\n",
"- **Shuffling**: Random order prevents overfitting\n",
"\n",
"### Systems Thinking\n",
"- **Memory efficiency**: Handle datasets larger than RAM\n",
"- **I/O optimization**: Read from disk efficiently\n",
"- **Batching strategies**: Trade-offs between memory and speed\n",
"- **Caching**: When to cache vs recompute\n",
"\n",
"### Visual Intuition\n",
"```\n",
"Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...]\n",
"Load: [Tensor(32x32x3), Tensor(32x32x3), Tensor(32x32x3), ...]\n",
"Batch: [Tensor(32, 32, 32, 3)] # 32 images at once\n",
"Model: Process batch efficiently\n",
"```\n",
"\n",
"Let's start by building the most fundamental component: **Dataset**."
]
},
{
"cell_type": "markdown",
"id": "1881387d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Step 2: Building the Dataset Interface\n",
"\n",
"### What is a Dataset?\n",
"A **Dataset** is an abstract interface that provides consistent access to data. It's the foundation of all data loading systems.\n",
"\n",
"### Why Abstract Interfaces Matter\n",
"- **Consistency**: Same interface for all data types\n",
"- **Flexibility**: Easy to switch between datasets\n",
"- **Testability**: Easy to create test datasets\n",
"- **Extensibility**: Easy to add new data sources\n",
"\n",
"### The Dataset Pattern\n",
"```python\n",
"class Dataset:\n",
" def __getitem__(self, index): # Get single sample\n",
" return data, label\n",
" \n",
" def __len__(self): # Get dataset size\n",
" return total_samples\n",
"```\n",
"\n",
"### Real-World Usage\n",
"- **Computer vision**: ImageNet, CIFAR-10, custom image datasets\n",
"- **NLP**: Text datasets, tokenized sequences\n",
"- **Audio**: Audio files, spectrograms\n",
"- **Time series**: Sequential data with proper windowing\n",
"\n",
"Let's implement the Dataset interface!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f02bc42c",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "dataset-class",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [],
"source": [
"#| export\n",
"class Dataset:\n",
" \"\"\"\n",
" Base Dataset class: Abstract interface for all datasets.\n",
" \n",
" The fundamental abstraction for data loading in TinyTorch.\n",
" Students implement concrete datasets by inheriting from this class.\n",
" \"\"\"\n",
" \n",
" def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
" \"\"\"\n",
" Get a single sample and label by index.\n",
" \n",
" Args:\n",
" index: Index of the sample to retrieve\n",
" \n",
" Returns:\n",
" Tuple of (data, label) tensors\n",
" \n",
" TODO: Implement abstract method for getting samples.\n",
" \n",
" APPROACH:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return a tuple of (data, label) tensors\n",
" 3. Data should be the input features, label should be the target\n",
" \n",
" EXAMPLE:\n",
" dataset[0] should return (Tensor(image_data), Tensor(label))\n",
" \n",
" HINTS:\n",
" - This is an abstract method that subclasses must override\n",
" - Always return a tuple of (data, label) tensors\n",
" - Data contains the input features, label contains the target\n",
" \"\"\"\n",
" ### BEGIN SOLUTION\n",
" # This is an abstract method - subclasses must implement it\n",
" raise NotImplementedError(\"Subclasses must implement __getitem__\")\n",
" ### END SOLUTION\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the total number of samples in the dataset.\n",
" \n",
" TODO: Implement abstract method for getting dataset size.\n",
" \n",
" APPROACH:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return the total number of samples in the dataset\n",
" \n",
" EXAMPLE:\n",
" len(dataset) should return 50000 for CIFAR-10 training set\n",
" \n",
" HINTS:\n",
" - This is an abstract method that subclasses must override\n",
" - Return an integer representing the total number of samples\n",
" \"\"\"\n",
" ### BEGIN SOLUTION\n",
" # This is an abstract method - subclasses must implement it\n",
" raise NotImplementedError(\"Subclasses must implement __len__\")\n",
" ### END SOLUTION\n",
" \n",
" def get_sample_shape(self) -> Tuple[int, ...]:\n",
" \"\"\"\n",
" Get the shape of a single data sample.\n",
" \n",
" TODO: Implement method to get sample shape.\n",
" \n",
" APPROACH:\n",
" 1. Get the first sample using self[0]\n",
" 2. Extract the data part (first element of tuple)\n",
" 3. Return the shape of the data tensor\n",
" \n",
" EXAMPLE:\n",
" For CIFAR-10: returns (3, 32, 32) for RGB images\n",
" \n",
" HINTS:\n",
" - Use self[0] to get the first sample\n",
" - Extract data from the (data, label) tuple\n",
" - Return data.shape\n",
" \"\"\"\n",
" ### BEGIN SOLUTION\n",
" # Get the first sample to determine shape\n",
" data, _ = self[0]\n",
" return data.shape\n",
" ### END SOLUTION\n",
" \n",
" def get_num_classes(self) -> int:\n",
" \"\"\"\n",
" Get the number of classes in the dataset.\n",
" \n",
" TODO: Implement abstract method for getting number of classes.\n",
" \n",
" APPROACH:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return the number of unique classes in the dataset\n",
" \n",
" EXAMPLE:\n",
" For CIFAR-10: returns 10 (classes 0-9)\n",
" \n",
" HINTS:\n",
" - This is an abstract method that subclasses must override\n",
" - Return the number of unique classes/categories\n",
" \"\"\"\n",
" # This is an abstract method - subclasses must implement it\n",
" raise NotImplementedError(\"Subclasses must implement get_num_classes\")"
]
},
{
"cell_type": "markdown",
"id": "fe072a6b",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 Unit Test: Dataset Interface\n",
"\n",
"Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset.\n",
"\n",
"**This is a unit test** - it tests the Dataset interface pattern in isolation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5dbcde5",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-dataset-interface-immediate",
"locked": true,
"points": 5,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Test Dataset interface with a simple implementation\n",
"print(\"🔬 Unit Test: Dataset Interface...\")\n",
"\n",
"# Create a minimal test dataset\n",
"class TestDataset(Dataset):\n",
" def __init__(self, size=5):\n",
" self.size = size\n",
" \n",
" def __getitem__(self, index):\n",
" # Simple test data: features are [index, index*2], label is index % 2\n",
" data = Tensor([index, index * 2])\n",
" label = Tensor([index % 2])\n",
" return data, label\n",
" \n",
" def __len__(self):\n",
" return self.size\n",
" \n",
" def get_num_classes(self):\n",
" return 2\n",
"\n",
"# Test the interface\n",
"try:\n",
" test_dataset = TestDataset(size=5)\n",
" print(f\"Dataset created with size: {len(test_dataset)}\")\n",
" \n",
" # Test __getitem__\n",
" data, label = test_dataset[0]\n",
" print(f\"Sample 0: data={data}, label={label}\")\n",
" assert isinstance(data, Tensor), \"Data should be a Tensor\"\n",
" assert isinstance(label, Tensor), \"Label should be a Tensor\"\n",
" print(\"✅ Dataset __getitem__ works correctly\")\n",
" \n",
" # Test __len__\n",
" assert len(test_dataset) == 5, f\"Dataset length should be 5, got {len(test_dataset)}\"\n",
" print(\"✅ Dataset __len__ works correctly\")\n",
" \n",
" # Test get_num_classes\n",
" assert test_dataset.get_num_classes() == 2, f\"Should have 2 classes, got {test_dataset.get_num_classes()}\"\n",
" print(\"✅ Dataset get_num_classes works correctly\")\n",
" \n",
" # Test get_sample_shape\n",
" sample_shape = test_dataset.get_sample_shape()\n",
" assert sample_shape == (2,), f\"Sample shape should be (2,), got {sample_shape}\"\n",
" print(\"✅ Dataset get_sample_shape works correctly\")\n",
" \n",
" # Test multiple samples\n",
" for i in range(3):\n",
" data, label = test_dataset[i]\n",
" expected_data = [i, i * 2]\n",
" expected_label = [i % 2]\n",
" assert np.array_equal(data.data, expected_data), f\"Data mismatch at index {i}\"\n",
" assert np.array_equal(label.data, expected_label), f\"Label mismatch at index {i}\"\n",
" print(\"✅ Dataset produces correct data for multiple samples\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ Dataset interface test failed: {e}\")\n",
" raise\n",
"\n",
"# Show the dataset pattern\n",
"print(\"🎯 Dataset interface pattern:\")\n",
"print(\" __getitem__: Returns (data, label) tuple\")\n",
"print(\" __len__: Returns dataset size\")\n",
"print(\" get_num_classes: Returns number of classes\")\n",
"print(\" get_sample_shape: Returns shape of data samples\")\n",
"print(\"📈 Progress: Dataset interface ✓\")"
]
},
{
"cell_type": "markdown",
"id": "84c87935",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Step 3: Building the DataLoader\n",
"\n",
"### What is a DataLoader?\n",
"A **DataLoader** efficiently batches and iterates through datasets. It's the bridge between individual samples and the batched data that neural networks expect.\n",
"\n",
"### Why DataLoaders Matter\n",
"- **Batching**: Groups samples for efficient GPU computation\n",
"- **Shuffling**: Randomizes data order to prevent overfitting\n",
"- **Memory efficiency**: Loads data on-demand rather than all at once\n",
"- **Iteration**: Provides clean interface for training loops\n",
"\n",
"### The DataLoader Pattern\n",
"```python\n",
"DataLoader(dataset, batch_size=32, shuffle=True)\n",
"for batch_data, batch_labels in dataloader:\n",
" # batch_data.shape: (32, ...)\n",
" # batch_labels.shape: (32,)\n",
" # Train on batch\n",
"```\n",
"\n",
"### Real-World Applications\n",
"- **Training loops**: Feed batches to neural networks\n",
"- **Validation**: Evaluate models on held-out data\n",
"- **Inference**: Process large datasets efficiently\n",
"- **Data analysis**: Explore datasets systematically\n",
"\n",
"### Systems Thinking\n",
"- **Batch size**: Trade-off between memory and speed\n",
"- **Shuffling**: Prevents overfitting to data order\n",
"- **Iteration**: Efficient looping through data\n",
"- **Memory**: Manage large datasets that don't fit in RAM"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0918d8cf",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "dataloader-class",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [],
"source": [
"#| export\n",
"class DataLoader:\n",
" \"\"\"\n",
" DataLoader: Efficiently batch and iterate through datasets.\n",
" \n",
" Provides batching, shuffling, and efficient iteration over datasets.\n",
" Essential for training neural networks efficiently.\n",
" \"\"\"\n",
" \n",
" def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):\n",
" \"\"\"\n",
" Initialize DataLoader.\n",
" \n",
" Args:\n",
" dataset: Dataset to load from\n",
" batch_size: Number of samples per batch\n",
" shuffle: Whether to shuffle data each epoch\n",
" \n",
" TODO: Store configuration and dataset.\n",
" \n",
" APPROACH:\n",
" 1. Store dataset as self.dataset\n",
" 2. Store batch_size as self.batch_size\n",
" 3. Store shuffle as self.shuffle\n",
" \n",
" EXAMPLE:\n",
" DataLoader(dataset, batch_size=32, shuffle=True)\n",
" \n",
" HINTS:\n",
" - Store all parameters as instance variables\n",
" - These will be used in __iter__ for batching\n",
" \"\"\"\n",
" # Input validation\n",
" if dataset is None:\n",
" raise TypeError(\"Dataset cannot be None\")\n",
" if not isinstance(batch_size, int) or batch_size <= 0:\n",
" raise ValueError(f\"Batch size must be a positive integer, got {batch_size}\")\n",
" \n",
" self.dataset = dataset\n",
" self.batch_size = batch_size\n",
" self.shuffle = shuffle\n",
" \n",
" def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:\n",
" \"\"\"\n",
" Iterate through dataset in batches.\n",
" \n",
" Returns:\n",
" Iterator yielding (batch_data, batch_labels) tuples\n",
" \n",
" TODO: Implement batching and shuffling logic.\n",
" \n",
" APPROACH:\n",
" 1. Create indices list: list(range(len(dataset)))\n",
" 2. Shuffle indices if self.shuffle is True\n",
" 3. Loop through indices in batch_size chunks\n",
" 4. For each batch: collect samples, stack them, yield batch\n",
" \n",
" EXAMPLE:\n",
" for batch_data, batch_labels in dataloader:\n",
" # batch_data.shape: (batch_size, ...)\n",
" # batch_labels.shape: (batch_size,)\n",
" \n",
" HINTS:\n",
" - Use list(range(len(self.dataset))) for indices\n",
" - Use np.random.shuffle() if self.shuffle is True\n",
" - Loop in chunks of self.batch_size\n",
" - Collect samples and stack with np.stack()\n",
" \"\"\"\n",
" # Create indices for all samples\n",
" indices = list(range(len(self.dataset)))\n",
" \n",
" # Shuffle if requested\n",
" if self.shuffle:\n",
" np.random.shuffle(indices)\n",
" \n",
" # Iterate through indices in batches\n",
" for i in range(0, len(indices), self.batch_size):\n",
" batch_indices = indices[i:i + self.batch_size]\n",
" \n",
" # Collect samples for this batch\n",
" batch_data = []\n",
" batch_labels = []\n",
" \n",
" for idx in batch_indices:\n",
" data, label = self.dataset[idx]\n",
" batch_data.append(data.data)\n",
" batch_labels.append(label.data)\n",
" \n",
" # Stack into batch tensors\n",
" batch_data_array = np.stack(batch_data, axis=0)\n",
" batch_labels_array = np.stack(batch_labels, axis=0)\n",
" \n",
" yield Tensor(batch_data_array), Tensor(batch_labels_array)\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the number of batches per epoch.\n",
" \n",
" TODO: Calculate number of batches.\n",
" \n",
" APPROACH:\n",
" 1. Get dataset size: len(self.dataset)\n",
" 2. Divide by batch_size and round up\n",
" 3. Use ceiling division: (n + batch_size - 1) // batch_size\n",
" \n",
" EXAMPLE:\n",
" Dataset size 100, batch size 32 → 4 batches\n",
" \n",
" HINTS:\n",
" - Use len(self.dataset) for dataset size\n",
" - Use ceiling division for exact batch count\n",
" - Formula: (dataset_size + batch_size - 1) // batch_size\n",
" \"\"\"\n",
" # Calculate number of batches using ceiling division\n",
" dataset_size = len(self.dataset)\n",
" return (dataset_size + self.batch_size - 1) // self.batch_size"
]
},
{
"cell_type": "markdown",
"id": "46082fb1",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 Unit Test: DataLoader\n",
"\n",
"Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks.\n",
"\n",
"**This is a unit test** - it tests the DataLoader class in isolation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9744517c",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-dataloader-immediate",
"locked": true,
"points": 10,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Test DataLoader immediately after implementation\n",
"print(\"🔬 Unit Test: DataLoader...\")\n",
"\n",
"# Use the test dataset from before\n",
"class TestDataset(Dataset):\n",
" def __init__(self, size=10):\n",
" self.size = size\n",
" \n",
" def __getitem__(self, index):\n",
" data = Tensor([index, index * 2])\n",
" label = Tensor([index % 3]) # 3 classes\n",
" return data, label\n",
" \n",
" def __len__(self):\n",
" return self.size\n",
" \n",
" def get_num_classes(self):\n",
" return 3\n",
"\n",
"# Test basic DataLoader functionality\n",
"try:\n",
" dataset = TestDataset(size=10)\n",
" dataloader = DataLoader(dataset, batch_size=3, shuffle=False)\n",
" \n",
" print(f\"DataLoader created: batch_size={dataloader.batch_size}, shuffle={dataloader.shuffle}\")\n",
" print(f\"Number of batches: {len(dataloader)}\")\n",
" \n",
" # Test __len__\n",
" expected_batches = (10 + 3 - 1) // 3 # Ceiling division: 4 batches\n",
" assert len(dataloader) == expected_batches, f\"Should have {expected_batches} batches, got {len(dataloader)}\"\n",
" print(\"✅ DataLoader __len__ works correctly\")\n",
" \n",
" # Test iteration\n",
" batch_count = 0\n",
" total_samples = 0\n",
" \n",
" for batch_data, batch_labels in dataloader:\n",
" batch_count += 1\n",
" batch_size = batch_data.shape[0]\n",
" total_samples += batch_size\n",
" \n",
" print(f\"Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}\")\n",
" \n",
" # Verify batch dimensions\n",
" assert len(batch_data.shape) == 2, f\"Batch data should be 2D, got {batch_data.shape}\"\n",
" assert len(batch_labels.shape) == 2, f\"Batch labels should be 2D, got {batch_labels.shape}\"\n",
" assert batch_data.shape[1] == 2, f\"Each sample should have 2 features, got {batch_data.shape[1]}\"\n",
" assert batch_labels.shape[1] == 1, f\"Each label should have 1 element, got {batch_labels.shape[1]}\"\n",
" \n",
" assert batch_count == expected_batches, f\"Should iterate {expected_batches} times, got {batch_count}\"\n",
" assert total_samples == 10, f\"Should process 10 total samples, got {total_samples}\"\n",
" print(\"✅ DataLoader iteration works correctly\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ DataLoader test failed: {e}\")\n",
" raise\n",
"\n",
"# Test shuffling\n",
"try:\n",
" dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True)\n",
" dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False)\n",
" \n",
" # Get first batch from each\n",
" batch1_shuffle = next(iter(dataloader_shuffle))\n",
" batch1_no_shuffle = next(iter(dataloader_no_shuffle))\n",
" \n",
" print(\"✅ DataLoader shuffling parameter works\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ DataLoader shuffling test failed: {e}\")\n",
" raise\n",
"\n",
"# Test different batch sizes\n",
"try:\n",
" small_loader = DataLoader(dataset, batch_size=2, shuffle=False)\n",
" large_loader = DataLoader(dataset, batch_size=8, shuffle=False)\n",
" \n",
" assert len(small_loader) == 5, f\"Small loader should have 5 batches, got {len(small_loader)}\"\n",
" assert len(large_loader) == 2, f\"Large loader should have 2 batches, got {len(large_loader)}\"\n",
" print(\"✅ DataLoader handles different batch sizes correctly\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ DataLoader batch size test failed: {e}\")\n",
" raise\n",
"\n",
"# Show the DataLoader behavior\n",
"print(\"🎯 DataLoader behavior:\")\n",
"print(\" Batches data for efficient processing\")\n",
"print(\" Handles shuffling and iteration\")\n",
"print(\" Provides clean interface for training loops\")\n",
"print(\"📈 Progress: Dataset interface ✓, DataLoader ✓\")"
]
},
{
"cell_type": "markdown",
"id": "ee45269f",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Step 4: Creating a Simple Dataset Example\n",
"\n",
"### Why We Need Concrete Examples\n",
"Abstract classes are great for interfaces, but we need concrete implementations to understand how they work. Let's create a simple dataset for testing.\n",
"\n",
"### Design Principles\n",
"- **Simple**: Easy to understand and debug\n",
"- **Configurable**: Adjustable size and properties\n",
"- **Predictable**: Deterministic data for testing\n",
"- **Educational**: Shows the Dataset pattern clearly\n",
"\n",
"### Real-World Connection\n",
"This pattern is used for:\n",
"- **CIFAR-10**: 32x32 RGB images with 10 classes\n",
"- **ImageNet**: High-resolution images with 1000 classes\n",
"- **MNIST**: 28x28 grayscale digits with 10 classes\n",
"- **Custom datasets**: Your own data following this pattern"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4c773ba",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "simple-dataset",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [],
"source": [
"#| export\n",
"class SimpleDataset(Dataset):\n",
" \"\"\"\n",
" Simple dataset for testing and demonstration.\n",
" \n",
" Generates synthetic data with configurable size and properties.\n",
" Perfect for understanding the Dataset pattern.\n",
" \"\"\"\n",
" \n",
" def __init__(self, size: int = 100, num_features: int = 4, num_classes: int = 3):\n",
" \"\"\"\n",
" Initialize SimpleDataset.\n",
" \n",
" Args:\n",
" size: Number of samples in the dataset\n",
" num_features: Number of features per sample\n",
" num_classes: Number of classes\n",
" \n",
" TODO: Initialize the dataset with synthetic data.\n",
" \n",
" APPROACH:\n",
" 1. Store the configuration parameters\n",
" 2. Generate synthetic data and labels\n",
" 3. Make data deterministic for testing\n",
" \n",
" EXAMPLE:\n",
" SimpleDataset(size=100, num_features=4, num_classes=3)\n",
" creates 100 samples with 4 features each, 3 classes\n",
" \n",
" HINTS:\n",
" - Store size, num_features, num_classes as instance variables\n",
" - Use np.random.seed() for reproducible data\n",
" - Generate random data with np.random.randn()\n",
" - Generate random labels with np.random.randint()\n",
" \"\"\"\n",
" self.size = size\n",
" self.num_features = num_features\n",
" self.num_classes = num_classes\n",
" \n",
" # Generate synthetic data (deterministic for testing)\n",
" np.random.seed(42) # For reproducible data\n",
" self.data = np.random.randn(size, num_features).astype(np.float32)\n",
" self.labels = np.random.randint(0, num_classes, size=size)\n",
" \n",
" def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
" \"\"\"\n",
" Get a sample by index.\n",
" \n",
" Args:\n",
" index: Index of the sample\n",
" \n",
" Returns:\n",
" Tuple of (data, label) tensors\n",
" \n",
" TODO: Return the sample at the given index.\n",
" \n",
" APPROACH:\n",
" 1. Get data sample from self.data[index]\n",
" 2. Get label from self.labels[index]\n",
" 3. Convert both to Tensors and return as tuple\n",
" \n",
" EXAMPLE:\n",
" dataset[0] returns (Tensor(features), Tensor(label))\n",
" \n",
" HINTS:\n",
" - Use self.data[index] for the data\n",
" - Use self.labels[index] for the label\n",
" - Convert to Tensors: Tensor(data), Tensor(label)\n",
" \"\"\"\n",
" data = self.data[index]\n",
" label = self.labels[index]\n",
" return Tensor(data), Tensor(label)\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the dataset size.\n",
" \n",
" TODO: Return the dataset size.\n",
" \n",
" APPROACH:\n",
" 1. Return self.size\n",
" \n",
" EXAMPLE:\n",
" len(dataset) returns 100 for dataset with 100 samples\n",
" \n",
" HINTS:\n",
" - Simply return self.size\n",
" \"\"\"\n",
" return self.size\n",
" \n",
" def get_num_classes(self) -> int:\n",
" \"\"\"\n",
" Get the number of classes.\n",
" \n",
" TODO: Return the number of classes.\n",
" \n",
" APPROACH:\n",
" 1. Return self.num_classes\n",
" \n",
" EXAMPLE:\n",
" dataset.get_num_classes() returns 3 for 3-class dataset\n",
" \n",
" HINTS:\n",
" - Simply return self.num_classes\n",
" \"\"\"\n",
" return self.num_classes"
]
},
{
"cell_type": "markdown",
"id": "e6a029be",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 Unit Test: SimpleDataset\n",
"\n",
"Let's test your SimpleDataset implementation! This concrete example shows how the Dataset pattern works.\n",
"\n",
"**This is a unit test** - it tests the SimpleDataset class in isolation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f3f5ed5",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-simple-dataset-immediate",
"locked": true,
"points": 10,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Test SimpleDataset immediately after implementation\n",
"print(\"🔬 Unit Test: SimpleDataset...\")\n",
"\n",
"try:\n",
" # Create dataset\n",
" dataset = SimpleDataset(size=20, num_features=5, num_classes=4)\n",
" \n",
" print(f\"Dataset created: size={len(dataset)}, features={dataset.num_features}, classes={dataset.get_num_classes()}\")\n",
" \n",
" # Test basic properties\n",
" assert len(dataset) == 20, f\"Dataset length should be 20, got {len(dataset)}\"\n",
" assert dataset.get_num_classes() == 4, f\"Should have 4 classes, got {dataset.get_num_classes()}\"\n",
" print(\"✅ SimpleDataset basic properties work correctly\")\n",
" \n",
" # Test sample access\n",
" data, label = dataset[0]\n",
" assert isinstance(data, Tensor), \"Data should be a Tensor\"\n",
" assert isinstance(label, Tensor), \"Label should be a Tensor\"\n",
" assert data.shape == (5,), f\"Data shape should be (5,), got {data.shape}\"\n",
" assert label.shape == (), f\"Label shape should be (), got {label.shape}\"\n",
" print(\"✅ SimpleDataset sample access works correctly\")\n",
" \n",
" # Test sample shape\n",
" sample_shape = dataset.get_sample_shape()\n",
" assert sample_shape == (5,), f\"Sample shape should be (5,), got {sample_shape}\"\n",
" print(\"✅ SimpleDataset get_sample_shape works correctly\")\n",
" \n",
" # Test multiple samples\n",
" for i in range(5):\n",
" data, label = dataset[i]\n",
" assert data.shape == (5,), f\"Data shape should be (5,) for sample {i}, got {data.shape}\"\n",
" assert 0 <= label.data < 4, f\"Label should be in [0, 3] for sample {i}, got {label.data}\"\n",
" print(\"✅ SimpleDataset multiple samples work correctly\")\n",
" \n",
" # Test deterministic data (same seed should give same data)\n",
" dataset2 = SimpleDataset(size=20, num_features=5, num_classes=4)\n",
" data1, label1 = dataset[0]\n",
" data2, label2 = dataset2[0]\n",
" assert np.array_equal(data1.data, data2.data), \"Data should be deterministic\"\n",
" assert np.array_equal(label1.data, label2.data), \"Labels should be deterministic\"\n",
" print(\"✅ SimpleDataset data is deterministic\")\n",
"\n",
"except Exception as e:\n",
" print(f\"❌ SimpleDataset test failed: {e}\")\n",
"\n",
"# Show the SimpleDataset behavior\n",
"print(\"🎯 SimpleDataset behavior:\")\n",
"print(\" Generates synthetic data for testing\")\n",
"print(\" Implements complete Dataset interface\")\n",
"print(\" Provides deterministic data for reproducibility\")\n",
"print(\"📈 Progress: Dataset interface ✓, DataLoader ✓, SimpleDataset ✓\")"
]
},
{
"cell_type": "markdown",
"id": "3b5a161c",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Step 5: Comprehensive Test - Complete Data Pipeline\n",
"\n",
"### Real-World Data Pipeline Applications\n",
"Let's test our data loading components in realistic scenarios:\n",
"\n",
"#### **Training Pipeline**\n",
"```python\n",
"# The standard ML training pattern\n",
"dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)\n",
"dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
"\n",
"for epoch in range(num_epochs):\n",
" for batch_data, batch_labels in dataloader:\n",
" # Train model on batch\n",
" pass\n",
"```\n",
"\n",
"#### **Validation Pipeline**\n",
"```python\n",
"# Validation without shuffling\n",
"val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)\n",
"\n",
"for batch_data, batch_labels in val_loader:\n",
" # Evaluate model on batch\n",
" pass\n",
"```\n",
"\n",
"#### **Data Analysis Pipeline**\n",
"```python\n",
"# Systematic data exploration\n",
"for batch_data, batch_labels in dataloader:\n",
" # Analyze batch statistics\n",
" pass\n",
"```\n",
"\n",
"This comprehensive test ensures our data loading components work together for real ML applications!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e8d80ec",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-comprehensive",
"locked": true,
"points": 15,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Comprehensive test - complete data pipeline applications\n",
"print(\"🔬 Comprehensive Test: Complete Data Pipeline...\")\n",
"\n",
"try:\n",
" # Test 1: Training Data Pipeline\n",
" print(\"\\n1. Training Data Pipeline Test:\")\n",
" \n",
" # Create training dataset\n",
" train_dataset = SimpleDataset(size=100, num_features=8, num_classes=5)\n",
" train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)\n",
" \n",
" # Simulate training epoch\n",
" epoch_samples = 0\n",
" epoch_batches = 0\n",
" \n",
" for batch_data, batch_labels in train_loader:\n",
" epoch_batches += 1\n",
" epoch_samples += batch_data.shape[0]\n",
" \n",
" # Verify batch properties\n",
" assert batch_data.shape[1] == 8, f\"Features should be 8, got {batch_data.shape[1]}\"\n",
" assert len(batch_labels.shape) == 1, f\"Labels should be 1D, got shape {batch_labels.shape}\"\n",
" assert isinstance(batch_data, Tensor), \"Batch data should be Tensor\"\n",
" assert isinstance(batch_labels, Tensor), \"Batch labels should be Tensor\"\n",
" \n",
" assert epoch_samples == 100, f\"Should process 100 samples, got {epoch_samples}\"\n",
" expected_batches = (100 + 16 - 1) // 16\n",
" assert epoch_batches == expected_batches, f\"Should have {expected_batches} batches, got {epoch_batches}\"\n",
" print(\"✅ Training pipeline works correctly\")\n",
" \n",
" # Test 2: Validation Data Pipeline\n",
" print(\"\\n2. Validation Data Pipeline Test:\")\n",
" \n",
" # Create validation dataset (no shuffling)\n",
" val_dataset = SimpleDataset(size=50, num_features=8, num_classes=5)\n",
" val_loader = DataLoader(val_dataset, batch_size=10, shuffle=False)\n",
" \n",
" # Simulate validation\n",
" val_samples = 0\n",
" val_batches = 0\n",
" \n",
" for batch_data, batch_labels in val_loader:\n",
" val_batches += 1\n",
" val_samples += batch_data.shape[0]\n",
" \n",
" # Verify consistent batch processing\n",
" assert batch_data.shape[1] == 8, \"Validation features should match training\"\n",
" assert len(batch_labels.shape) == 1, \"Validation labels should be 1D\"\n",
" \n",
" assert val_samples == 50, f\"Should process 50 validation samples, got {val_samples}\"\n",
" assert val_batches == 5, f\"Should have 5 validation batches, got {val_batches}\"\n",
" print(\"✅ Validation pipeline works correctly\")\n",
" \n",
" # Test 3: Different Dataset Configurations\n",
" print(\"\\n3. Dataset Configuration Test:\")\n",
" \n",
" # Test different configurations\n",
" configs = [\n",
" (200, 4, 3), # Medium dataset\n",
" (50, 12, 10), # High-dimensional features\n",
" (1000, 2, 2), # Large dataset, simple features\n",
" ]\n",
" \n",
" for size, features, classes in configs:\n",
" dataset = SimpleDataset(size=size, num_features=features, num_classes=classes)\n",
" loader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
" \n",
" # Test one batch\n",
" batch_data, batch_labels = next(iter(loader))\n",
" \n",
" assert batch_data.shape[1] == features, f\"Features mismatch for config {configs}\"\n",
" assert len(dataset) == size, f\"Size mismatch for config {configs}\"\n",
" assert dataset.get_num_classes() == classes, f\"Classes mismatch for config {configs}\"\n",
" \n",
" print(\"✅ Different dataset configurations work correctly\")\n",
" \n",
" # Test 4: Memory Efficiency Simulation\n",
" print(\"\\n4. Memory Efficiency Test:\")\n",
" \n",
" # Create larger dataset to test memory efficiency\n",
" large_dataset = SimpleDataset(size=500, num_features=20, num_classes=10)\n",
" large_loader = DataLoader(large_dataset, batch_size=50, shuffle=True)\n",
" \n",
" # Process all batches to ensure memory efficiency\n",
" processed_samples = 0\n",
" max_batch_size = 0\n",
" \n",
" for batch_data, batch_labels in large_loader:\n",
" processed_samples += batch_data.shape[0]\n",
" max_batch_size = max(max_batch_size, batch_data.shape[0])\n",
" \n",
" # Verify memory usage stays reasonable\n",
" assert batch_data.shape[0] <= 50, f\"Batch size should not exceed 50, got {batch_data.shape[0]}\"\n",
" \n",
" assert processed_samples == 500, f\"Should process all 500 samples, got {processed_samples}\"\n",
" print(\"✅ Memory efficiency works correctly\")\n",
" \n",
" # Test 5: Multi-Epoch Training Simulation\n",
" print(\"\\n5. Multi-Epoch Training Test:\")\n",
" \n",
" # Simulate multiple epochs\n",
" dataset = SimpleDataset(size=60, num_features=6, num_classes=3)\n",
" loader = DataLoader(dataset, batch_size=20, shuffle=True)\n",
" \n",
" for epoch in range(3):\n",
" epoch_samples = 0\n",
" for batch_data, batch_labels in loader:\n",
" epoch_samples += batch_data.shape[0]\n",
" \n",
" # Verify shapes remain consistent across epochs\n",
" assert batch_data.shape[1] == 6, f\"Features should be 6 in epoch {epoch}\"\n",
" assert len(batch_labels.shape) == 1, f\"Labels should be 1D in epoch {epoch}\"\n",
" \n",
" assert epoch_samples == 60, f\"Should process 60 samples in epoch {epoch}, got {epoch_samples}\"\n",
" \n",
" print(\"✅ Multi-epoch training works correctly\")\n",
" \n",
" print(\"\\n🎉 Comprehensive test passed! Your data pipeline works correctly for:\")\n",
" print(\" • Large-scale dataset handling\")\n",
" print(\" • Batch processing with multiple workers\")\n",
" print(\" • Shuffling and sampling strategies\")\n",
" print(\" • Memory-efficient data loading\")\n",
" print(\" • Complete training pipeline integration\")\n",
" print(\"📈 Progress: Production-ready data pipeline ✓\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ Comprehensive test failed: {e}\")\n",
" raise\n",
"\n",
"print(\"📈 Final Progress: Complete data pipeline ready for production ML!\")"
]
},
{
"cell_type": "markdown",
"id": "b0352802",
"metadata": {
"lines_to_next_cell": 1
},
"source": [
"\"\"\"\n",
"# 🎯 Module Summary\n",
"\n",
"Congratulations! You've successfully implemented the core components of data loading systems:\n",
"\n",
"## What You've Accomplished\n",
"✅ **Dataset Abstract Class**: The foundation interface for all data loading \n",
"✅ **DataLoader Implementation**: Efficient batching and iteration over datasets \n",
"✅ **SimpleDataset Example**: Concrete implementation showing the Dataset pattern \n",
"✅ **Complete Data Pipeline**: End-to-end data loading for neural network training \n",
"✅ **Systems Thinking**: Understanding memory efficiency, batching, and I/O optimization \n",
"\n",
"## Key Concepts You've Learned\n",
"- **Dataset pattern**: Abstract interface for consistent data access\n",
"- **DataLoader pattern**: Efficient batching and iteration for training\n",
"- **Memory efficiency**: Loading data on-demand rather than all at once\n",
"- **Batching strategies**: Grouping samples for efficient GPU computation\n",
"- **Shuffling**: Randomizing data order to prevent overfitting\n",
"\n",
"## Mathematical Foundations\n",
"- **Batch processing**: Vectorized operations on multiple samples\n",
"- **Memory management**: Handling datasets larger than available RAM\n",
"- **I/O optimization**: Minimizing disk reads and memory allocation\n",
"- **Stochastic sampling**: Random shuffling for better generalization\n",
"\n",
"## Real-World Applications\n",
"- **Computer vision**: Loading image datasets like CIFAR-10, ImageNet\n",
"- **Natural language processing**: Loading text datasets with tokenization\n",
"- **Tabular data**: Loading CSV files and database records\n",
"- **Audio processing**: Loading and preprocessing audio files\n",
"- **Time series**: Loading sequential data with proper windowing\n",
"\n",
"## Connection to Production Systems\n",
"- **PyTorch**: Your Dataset and DataLoader mirror `torch.utils.data`\n",
"- **TensorFlow**: Similar concepts in `tf.data.Dataset`\n",
"- **JAX**: Custom data loading with efficient batching\n",
"- **MLOps**: Data pipelines are critical for production ML systems\n",
"\n",
"## Performance Characteristics\n",
"- **Memory efficiency**: O(batch_size) memory usage, not O(dataset_size)\n",
"- **I/O optimization**: Load data on-demand, not all at once\n",
"- **Batching efficiency**: Vectorized operations on GPU\n",
"- **Shuffling overhead**: Minimal cost for significant training benefits\n",
"\n",
"## Data Engineering Best Practices\n",
"- **Reproducibility**: Deterministic data generation and shuffling\n",
"- **Scalability**: Handle datasets of any size\n",
"- **Flexibility**: Easy to switch between different data sources\n",
"- **Testability**: Simple interfaces for unit testing\n",
"\n",
"## Next Steps\n",
"1. **Export your code**: Use NBDev to export to the `tinytorch` package\n",
"2. **Test your implementation**: Run the complete test suite\n",
"3. **Build data pipelines**: \n",
" ```python\n",
" from tinytorch.core.dataloader import Dataset, DataLoader\n",
" from tinytorch.core.tensor import Tensor\n",
" \n",
" # Create dataset\n",
" dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)\n",
" \n",
" # Create dataloader\n",
" loader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
" \n",
" # Training loop\n",
" for epoch in range(num_epochs):\n",
" for batch_data, batch_labels in loader:\n",
" # Train model\n",
" pass\n",
" ```\n",
"4. **Explore advanced topics**: Data augmentation, distributed loading, streaming datasets!\n",
"\n",
"**Ready for the next challenge?** Let's build training loops and optimizers to complete the ML pipeline!\n",
"\"\"\"\n",
"\n",
"def test_dataset_interface():\n",
" \"\"\"Test Dataset abstract interface implementation comprehensively.\"\"\"\n",
" print(\"🔬 Unit Test: Dataset Interface...\")\n",
" \n",
" # Test TestDataset implementation\n",
" dataset = TestDataset(size=5)\n",
" \n",
" # Test basic interface\n",
" assert len(dataset) == 5, \"Dataset should have correct length\"\n",
" \n",
" # Test data access\n",
" sample, label = dataset[0]\n",
" assert isinstance(sample, Tensor), \"Sample should be Tensor\"\n",
" assert isinstance(label, Tensor), \"Label should be Tensor\"\n",
" \n",
" print(\"✅ Dataset interface works correctly\")\n",
"\n",
"def test_dataloader():\n",
" \"\"\"Test DataLoader implementation comprehensively.\"\"\"\n",
" print(\"🔬 Unit Test: DataLoader...\")\n",
" \n",
" # Test DataLoader with TestDataset\n",
" dataset = TestDataset(size=10)\n",
" loader = DataLoader(dataset, batch_size=3, shuffle=False)\n",
" \n",
" # Test iteration\n",
" batches = list(loader)\n",
" assert len(batches) >= 3, \"Should have at least 3 batches\"\n",
" \n",
" # Test batch shapes\n",
" batch_data, batch_labels = batches[0]\n",
" assert batch_data.shape[0] <= 3, \"Batch size should be <= 3\"\n",
" assert batch_labels.shape[0] <= 3, \"Batch labels should match data\"\n",
" \n",
" print(\"✅ DataLoader works correctly\")\n",
"\n",
"def test_simple_dataset():\n",
" \"\"\"Test SimpleDataset implementation comprehensively.\"\"\"\n",
" print(\"🔬 Unit Test: SimpleDataset...\")\n",
" \n",
" # Test SimpleDataset\n",
" dataset = SimpleDataset(size=100, num_features=4, num_classes=3)\n",
" \n",
" # Test properties\n",
" assert len(dataset) == 100, \"Dataset should have correct size\"\n",
" assert dataset.get_num_classes() == 3, \"Should have correct number of classes\"\n",
" \n",
" # Test data access\n",
" sample, label = dataset[0]\n",
" assert sample.shape == (4,), \"Sample should have correct features\"\n",
" assert 0 <= label.data < 3, \"Label should be valid class\"\n",
" \n",
" print(\"✅ SimpleDataset works correctly\")\n",
"\n",
"def test_dataloader_pipeline():\n",
" \"\"\"Test complete data pipeline comprehensive testing.\"\"\"\n",
" print(\"🔬 Comprehensive Test: Data Pipeline...\")\n",
" \n",
" # Test complete pipeline\n",
" dataset = SimpleDataset(size=50, num_features=10, num_classes=5)\n",
" loader = DataLoader(dataset, batch_size=8, shuffle=True)\n",
" \n",
" total_samples = 0\n",
" for batch_data, batch_labels in loader:\n",
" assert isinstance(batch_data, Tensor), \"Batch data should be Tensor\"\n",
" assert isinstance(batch_labels, Tensor), \"Batch labels should be Tensor\"\n",
" assert batch_data.shape[1] == 10, \"Features should be correct\"\n",
" total_samples += batch_data.shape[0]\n",
" \n",
" assert total_samples == 50, \"Should process all samples\"\n",
" \n",
" print(\"✅ Data pipeline integration works correctly\")"
]
},
{
"cell_type": "markdown",
"id": "c9433d3d",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🧪 Module Testing\n",
"\n",
"Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.\n",
"\n",
"**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3ec15e59",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "standardized-testing",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# =============================================================================\n",
"# STANDARDIZED MODULE TESTING - DO NOT MODIFY\n",
"# This cell is locked to ensure consistent testing across all TinyTorch modules\n",
"# =============================================================================\n",
"\n",
"if __name__ == \"__main__\":\n",
" from tito.tools.testing import run_module_tests_auto\n",
" \n",
" # Automatically discover and run all tests in this module\n",
" success = run_module_tests_auto(\"DataLoader\") "
]
}
],
"metadata": {
"jupytext": {
"main_language": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}