# --- # jupyter: # jupytext: # text_representation: # extension: .py # format_name: percent # format_version: '1.3' # jupytext_version: 1.17.1 # --- # %% [markdown] """ # DataLoader - Data Loading and Preprocessing Welcome to the DataLoader module! This is where you'll learn how to efficiently load, process, and manage data for machine learning systems. ## Learning Goals - Understand data pipelines as the foundation of ML systems - Implement efficient data loading with memory management and batching - Build reusable dataset abstractions for different data types - Master the Dataset and DataLoader pattern used in all ML frameworks - Learn systems thinking for data engineering and I/O optimization ## Build โ†’ Use โ†’ Reflect 1. **Build**: Create dataset classes and data loaders from scratch 2. **Use**: Load real datasets and feed them to neural networks 3. **Reflect**: How data engineering affects system performance and scalability ## What You'll Learn By the end of this module, you'll understand: - The Dataset pattern for consistent data access - How DataLoaders enable efficient batch processing - Why batching and shuffling are crucial for ML - How to handle datasets larger than memory - The connection between data engineering and model performance """ # %% nbgrader={"grade": false, "grade_id": "dataloader-imports", "locked": false, "schema_version": 3, "solution": false, "task": false} #| default_exp core.dataloader #| export import numpy as np import sys import os import pickle import struct from typing import List, Tuple, Optional, Union, Iterator import matplotlib.pyplot as plt import urllib.request import tarfile # Import our building blocks - try package first, then local modules try: from tinytorch.core.tensor import Tensor except ImportError: # For development, import from local modules sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor')) from tensor_dev import Tensor # %% nbgrader={"grade": false, "grade_id": "dataloader-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false} print("๐Ÿ”ฅ TinyTorch DataLoader Module") print(f"NumPy version: {np.__version__}") print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}") print("Ready to build data pipelines!") # %% [markdown] """ ## ๐Ÿ“ฆ Where This Code Lives in the Final Package **Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py` **Building Side:** Code exports to `tinytorch.core.dataloader` ```python # Final package structure: from tinytorch.core.dataloader import Dataset, DataLoader # Data loading utilities! from tinytorch.core.tensor import Tensor # Foundation from tinytorch.core.networks import Sequential # Models to train ``` **Why this matters:** - **Learning:** Focused modules for deep understanding of data pipelines - **Production:** Proper organization like PyTorch's `torch.utils.data` - **Consistency:** All data loading utilities live together in `core.dataloader` - **Integration:** Works seamlessly with tensors and networks """ # %% [markdown] """ ## ๐Ÿ”ง DEVELOPMENT """ # %% [markdown] """ ## Step 1: Understanding Data Pipelines ### What are Data Pipelines? **Data pipelines** are the systems that efficiently move data from storage to your model. They're the foundation of all machine learning systems. 
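In code, the full pipeline you will assemble in this module looks roughly like the sketch below. This is a preview only: `MyDataset` and `model` are hypothetical placeholders, and the `Dataset` and `DataLoader` classes are implemented step by step later in this module.

```python
# Preview sketch of the pipeline built in this module (placeholder names)
dataset = MyDataset("path/to/files")                        # Dataset: indexed access to samples
loader = DataLoader(dataset, batch_size=32, shuffle=True)   # DataLoader: shuffling + batching
for batch_data, batch_labels in loader:                     # one batch of Tensors at a time
    predictions = model(batch_data)                         # feed batches to the model
```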
### The Data Pipeline Equation ``` Raw Data โ†’ Load โ†’ Transform โ†’ Batch โ†’ Model โ†’ Predictions ``` ### Why Data Pipelines Matter - **Performance**: Efficient loading prevents GPU starvation - **Scalability**: Handle datasets larger than memory - **Consistency**: Reproducible data processing - **Flexibility**: Easy to switch between datasets ### Real-World Challenges - **Memory constraints**: Datasets often exceed available RAM - **I/O bottlenecks**: Disk access is much slower than computation - **Batch processing**: Neural networks need batched data for efficiency - **Shuffling**: Random order prevents overfitting ### Systems Thinking - **Memory efficiency**: Handle datasets larger than RAM - **I/O optimization**: Read from disk efficiently - **Batching strategies**: Trade-offs between memory and speed - **Caching**: When to cache vs recompute ### Visual Intuition ``` Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...] Load: [Tensor(32x32x3), Tensor(32x32x3), Tensor(32x32x3), ...] Batch: [Tensor(32, 32, 32, 3)] # 32 images at once Model: Process batch efficiently ``` Let's start by building the most fundamental component: **Dataset**. """ # %% [markdown] """ ## Step 2: Building the Dataset Interface ### What is a Dataset? A **Dataset** is an abstract interface that provides consistent access to data. It's the foundation of all data loading systems. ### Why Abstract Interfaces Matter - **Consistency**: Same interface for all data types - **Flexibility**: Easy to switch between datasets - **Testability**: Easy to create test datasets - **Extensibility**: Easy to add new data sources ### The Dataset Pattern ```python class Dataset: def __getitem__(self, index): # Get single sample return data, label def __len__(self): # Get dataset size return total_samples ``` ### Real-World Usage - **Computer vision**: ImageNet, CIFAR-10, custom image datasets - **NLP**: Text datasets, tokenized sequences - **Audio**: Audio files, spectrograms - **Time series**: Sequential data with proper windowing Let's implement the Dataset interface! """ # %% nbgrader={"grade": false, "grade_id": "dataset-class", "locked": false, "schema_version": 3, "solution": true, "task": false} #| export class Dataset: """ Base Dataset class: Abstract interface for all datasets. The fundamental abstraction for data loading in TinyTorch. Students implement concrete datasets by inheriting from this class. """ def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]: """ Get a single sample and label by index. Args: index: Index of the sample to retrieve Returns: Tuple of (data, label) tensors TODO: Implement abstract method for getting samples. APPROACH: 1. This is an abstract method - subclasses will implement it 2. Return a tuple of (data, label) tensors 3. Data should be the input features, label should be the target EXAMPLE: dataset[0] should return (Tensor(image_data), Tensor(label)) HINTS: - This is an abstract method that subclasses must override - Always return a tuple of (data, label) tensors - Data contains the input features, label contains the target """ ### BEGIN SOLUTION # This is an abstract method - subclasses must implement it raise NotImplementedError("Subclasses must implement __getitem__") ### END SOLUTION def __len__(self) -> int: """ Get the total number of samples in the dataset. TODO: Implement abstract method for getting dataset size. APPROACH: 1. This is an abstract method - subclasses will implement it 2. 
Return the total number of samples in the dataset EXAMPLE: len(dataset) should return 50000 for CIFAR-10 training set HINTS: - This is an abstract method that subclasses must override - Return an integer representing the total number of samples """ ### BEGIN SOLUTION # This is an abstract method - subclasses must implement it raise NotImplementedError("Subclasses must implement __len__") ### END SOLUTION def get_sample_shape(self) -> Tuple[int, ...]: """ Get the shape of a single data sample. TODO: Implement method to get sample shape. APPROACH: 1. Get the first sample using self[0] 2. Extract the data part (first element of tuple) 3. Return the shape of the data tensor EXAMPLE: For CIFAR-10: returns (3, 32, 32) for RGB images HINTS: - Use self[0] to get the first sample - Extract data from the (data, label) tuple - Return data.shape """ ### BEGIN SOLUTION # Get the first sample to determine shape data, _ = self[0] return data.shape ### END SOLUTION def get_num_classes(self) -> int: """ Get the number of classes in the dataset. TODO: Implement abstract method for getting number of classes. APPROACH: 1. This is an abstract method - subclasses will implement it 2. Return the number of unique classes in the dataset EXAMPLE: For CIFAR-10: returns 10 (classes 0-9) HINTS: - This is an abstract method that subclasses must override - Return the number of unique classes/categories """ # This is an abstract method - subclasses must implement it raise NotImplementedError("Subclasses must implement get_num_classes") # %% [markdown] """ ### ๐Ÿงช Unit Test: Dataset Interface Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset. **This is a unit test** - it tests the Dataset interface pattern in isolation. 
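As a reference point, here is a minimal concrete subclass that wraps two in-memory NumPy arrays. This is an illustrative sketch only (`ArrayDataset` is not part of the module), but it follows the same three-method pattern the test below exercises:

```python
import numpy as np

# Illustrative only: a concrete Dataset wrapping two in-memory arrays
class ArrayDataset(Dataset):
    def __init__(self, features: np.ndarray, labels: np.ndarray):
        self.features = features
        self.labels = labels

    def __getitem__(self, index):
        # Return one (data, label) pair as Tensors
        return Tensor(self.features[index]), Tensor(self.labels[index])

    def __len__(self):
        return len(self.features)

    def get_num_classes(self):
        return int(self.labels.max()) + 1  # assumes labels are 0..C-1

pairs = ArrayDataset(np.zeros((8, 3), dtype=np.float32), np.array([0, 1] * 4))
print(len(pairs), pairs[0])
```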
""" # %% nbgrader={"grade": true, "grade_id": "test-dataset-interface-immediate", "locked": true, "points": 5, "schema_version": 3, "solution": false, "task": false} # Test Dataset interface with a simple implementation print("๐Ÿ”ฌ Unit Test: Dataset Interface...") # Create a minimal test dataset class TestDataset(Dataset): def __init__(self, size=5): self.size = size def __getitem__(self, index): # Simple test data: features are [index, index*2], label is index % 2 data = Tensor([index, index * 2]) label = Tensor([index % 2]) return data, label def __len__(self): return self.size def get_num_classes(self): return 2 # Test the interface try: test_dataset = TestDataset(size=5) print(f"Dataset created with size: {len(test_dataset)}") # Test __getitem__ data, label = test_dataset[0] print(f"Sample 0: data={data}, label={label}") assert isinstance(data, Tensor), "Data should be a Tensor" assert isinstance(label, Tensor), "Label should be a Tensor" print("โœ… Dataset __getitem__ works correctly") # Test __len__ assert len(test_dataset) == 5, f"Dataset length should be 5, got {len(test_dataset)}" print("โœ… Dataset __len__ works correctly") # Test get_num_classes assert test_dataset.get_num_classes() == 2, f"Should have 2 classes, got {test_dataset.get_num_classes()}" print("โœ… Dataset get_num_classes works correctly") # Test get_sample_shape sample_shape = test_dataset.get_sample_shape() assert sample_shape == (2,), f"Sample shape should be (2,), got {sample_shape}" print("โœ… Dataset get_sample_shape works correctly") # Test multiple samples for i in range(3): data, label = test_dataset[i] expected_data = [i, i * 2] expected_label = [i % 2] assert np.array_equal(data.data, expected_data), f"Data mismatch at index {i}" assert np.array_equal(label.data, expected_label), f"Label mismatch at index {i}" print("โœ… Dataset produces correct data for multiple samples") except Exception as e: print(f"โŒ Dataset interface test failed: {e}") raise # Show the dataset pattern print("๐ŸŽฏ Dataset interface pattern:") print(" __getitem__: Returns (data, label) tuple") print(" __len__: Returns dataset size") print(" get_num_classes: Returns number of classes") print(" get_sample_shape: Returns shape of data samples") print("๐Ÿ“ˆ Progress: Dataset interface โœ“") # %% [markdown] """ ## Step 3: Building the DataLoader ### What is a DataLoader? A **DataLoader** efficiently batches and iterates through datasets. It's the bridge between individual samples and the batched data that neural networks expect. ### Why DataLoaders Matter - **Batching**: Groups samples for efficient GPU computation - **Shuffling**: Randomizes data order to prevent overfitting - **Memory efficiency**: Loads data on-demand rather than all at once - **Iteration**: Provides clean interface for training loops ### The DataLoader Pattern ```python DataLoader(dataset, batch_size=32, shuffle=True) for batch_data, batch_labels in dataloader: # batch_data.shape: (32, ...) 
# batch_labels.shape: (32,) # Train on batch ``` ### Real-World Applications - **Training loops**: Feed batches to neural networks - **Validation**: Evaluate models on held-out data - **Inference**: Process large datasets efficiently - **Data analysis**: Explore datasets systematically ### Systems Thinking - **Batch size**: Trade-off between memory and speed - **Shuffling**: Prevents overfitting to data order - **Iteration**: Efficient looping through data - **Memory**: Manage large datasets that don't fit in RAM """ # %% nbgrader={"grade": false, "grade_id": "dataloader-class", "locked": false, "schema_version": 3, "solution": true, "task": false} #| export class DataLoader: """ DataLoader: Efficiently batch and iterate through datasets. Provides batching, shuffling, and efficient iteration over datasets. Essential for training neural networks efficiently. """ def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True): """ Initialize DataLoader. Args: dataset: Dataset to load from batch_size: Number of samples per batch shuffle: Whether to shuffle data each epoch TODO: Store configuration and dataset. APPROACH: 1. Store dataset as self.dataset 2. Store batch_size as self.batch_size 3. Store shuffle as self.shuffle EXAMPLE: DataLoader(dataset, batch_size=32, shuffle=True) HINTS: - Store all parameters as instance variables - These will be used in __iter__ for batching """ # Input validation if dataset is None: raise TypeError("Dataset cannot be None") if not isinstance(batch_size, int) or batch_size <= 0: raise ValueError(f"Batch size must be a positive integer, got {batch_size}") self.dataset = dataset self.batch_size = batch_size self.shuffle = shuffle def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]: """ Iterate through dataset in batches. Returns: Iterator yielding (batch_data, batch_labels) tuples TODO: Implement batching and shuffling logic. APPROACH: 1. Create indices list: list(range(len(dataset))) 2. Shuffle indices if self.shuffle is True 3. Loop through indices in batch_size chunks 4. For each batch: collect samples, stack them, yield batch EXAMPLE: for batch_data, batch_labels in dataloader: # batch_data.shape: (batch_size, ...) # batch_labels.shape: (batch_size,) HINTS: - Use list(range(len(self.dataset))) for indices - Use np.random.shuffle() if self.shuffle is True - Loop in chunks of self.batch_size - Collect samples and stack with np.stack() """ # Create indices for all samples indices = list(range(len(self.dataset))) # Shuffle if requested if self.shuffle: np.random.shuffle(indices) # Iterate through indices in batches for i in range(0, len(indices), self.batch_size): batch_indices = indices[i:i + self.batch_size] # Collect samples for this batch batch_data = [] batch_labels = [] for idx in batch_indices: data, label = self.dataset[idx] batch_data.append(data.data) batch_labels.append(label.data) # Stack into batch tensors batch_data_array = np.stack(batch_data, axis=0) batch_labels_array = np.stack(batch_labels, axis=0) yield Tensor(batch_data_array), Tensor(batch_labels_array) def __len__(self) -> int: """ Get the number of batches per epoch. TODO: Calculate number of batches. APPROACH: 1. Get dataset size: len(self.dataset) 2. Divide by batch_size and round up 3. 
Use ceiling division: (n + batch_size - 1) // batch_size EXAMPLE: Dataset size 100, batch size 32 โ†’ 4 batches HINTS: - Use len(self.dataset) for dataset size - Use ceiling division for exact batch count - Formula: (dataset_size + batch_size - 1) // batch_size """ # Calculate number of batches using ceiling division dataset_size = len(self.dataset) return (dataset_size + self.batch_size - 1) // self.batch_size # %% [markdown] """ ### ๐Ÿงช Unit Test: DataLoader Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks. **This is a unit test** - it tests the DataLoader class in isolation. """ # %% nbgrader={"grade": true, "grade_id": "test-dataloader-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} # Test DataLoader immediately after implementation print("๐Ÿ”ฌ Unit Test: DataLoader...") # Use the test dataset from before class TestDataset(Dataset): def __init__(self, size=10): self.size = size def __getitem__(self, index): data = Tensor([index, index * 2]) label = Tensor([index % 3]) # 3 classes return data, label def __len__(self): return self.size def get_num_classes(self): return 3 # Test basic DataLoader functionality try: dataset = TestDataset(size=10) dataloader = DataLoader(dataset, batch_size=3, shuffle=False) print(f"DataLoader created: batch_size={dataloader.batch_size}, shuffle={dataloader.shuffle}") print(f"Number of batches: {len(dataloader)}") # Test __len__ expected_batches = (10 + 3 - 1) // 3 # Ceiling division: 4 batches assert len(dataloader) == expected_batches, f"Should have {expected_batches} batches, got {len(dataloader)}" print("โœ… DataLoader __len__ works correctly") # Test iteration batch_count = 0 total_samples = 0 for batch_data, batch_labels in dataloader: batch_count += 1 batch_size = batch_data.shape[0] total_samples += batch_size print(f"Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}") # Verify batch dimensions assert len(batch_data.shape) == 2, f"Batch data should be 2D, got {batch_data.shape}" assert len(batch_labels.shape) == 2, f"Batch labels should be 2D, got {batch_labels.shape}" assert batch_data.shape[1] == 2, f"Each sample should have 2 features, got {batch_data.shape[1]}" assert batch_labels.shape[1] == 1, f"Each label should have 1 element, got {batch_labels.shape[1]}" assert batch_count == expected_batches, f"Should iterate {expected_batches} times, got {batch_count}" assert total_samples == 10, f"Should process 10 total samples, got {total_samples}" print("โœ… DataLoader iteration works correctly") except Exception as e: print(f"โŒ DataLoader test failed: {e}") raise # Test shuffling try: dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True) dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False) # Get first batch from each batch1_shuffle = next(iter(dataloader_shuffle)) batch1_no_shuffle = next(iter(dataloader_no_shuffle)) print("โœ… DataLoader shuffling parameter works") except Exception as e: print(f"โŒ DataLoader shuffling test failed: {e}") raise # Test different batch sizes try: small_loader = DataLoader(dataset, batch_size=2, shuffle=False) large_loader = DataLoader(dataset, batch_size=8, shuffle=False) assert len(small_loader) == 5, f"Small loader should have 5 batches, got {len(small_loader)}" assert len(large_loader) == 2, f"Large loader should have 2 batches, got {len(large_loader)}" print("โœ… DataLoader handles different batch sizes correctly") except Exception as e: 
print(f"โŒ DataLoader batch size test failed: {e}") raise # Show the DataLoader behavior print("๐ŸŽฏ DataLoader behavior:") print(" Batches data for efficient processing") print(" Handles shuffling and iteration") print(" Provides clean interface for training loops") print("๐Ÿ“ˆ Progress: Dataset interface โœ“, DataLoader โœ“") # %% [markdown] """ ## Step 4: Creating a Simple Dataset Example ### Why We Need Concrete Examples Abstract classes are great for interfaces, but we need concrete implementations to understand how they work. Let's create a simple dataset for testing. ### Design Principles - **Simple**: Easy to understand and debug - **Configurable**: Adjustable size and properties - **Predictable**: Deterministic data for testing - **Educational**: Shows the Dataset pattern clearly ### Real-World Connection This pattern is used for: - **CIFAR-10**: 32x32 RGB images with 10 classes - **ImageNet**: High-resolution images with 1000 classes - **MNIST**: 28x28 grayscale digits with 10 classes - **Custom datasets**: Your own data following this pattern """ # %% nbgrader={"grade": false, "grade_id": "simple-dataset", "locked": false, "schema_version": 3, "solution": true, "task": false} #| export class SimpleDataset(Dataset): """ Simple dataset for testing and demonstration. Generates synthetic data with configurable size and properties. Perfect for understanding the Dataset pattern. """ def __init__(self, size: int = 100, num_features: int = 4, num_classes: int = 3): """ Initialize SimpleDataset. Args: size: Number of samples in the dataset num_features: Number of features per sample num_classes: Number of classes TODO: Initialize the dataset with synthetic data. APPROACH: 1. Store the configuration parameters 2. Generate synthetic data and labels 3. Make data deterministic for testing EXAMPLE: SimpleDataset(size=100, num_features=4, num_classes=3) creates 100 samples with 4 features each, 3 classes HINTS: - Store size, num_features, num_classes as instance variables - Use np.random.seed() for reproducible data - Generate random data with np.random.randn() - Generate random labels with np.random.randint() """ self.size = size self.num_features = num_features self.num_classes = num_classes # Generate synthetic data (deterministic for testing) np.random.seed(42) # For reproducible data self.data = np.random.randn(size, num_features).astype(np.float32) self.labels = np.random.randint(0, num_classes, size=size) def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]: """ Get a sample by index. Args: index: Index of the sample Returns: Tuple of (data, label) tensors TODO: Return the sample at the given index. APPROACH: 1. Get data sample from self.data[index] 2. Get label from self.labels[index] 3. Convert both to Tensors and return as tuple EXAMPLE: dataset[0] returns (Tensor(features), Tensor(label)) HINTS: - Use self.data[index] for the data - Use self.labels[index] for the label - Convert to Tensors: Tensor(data), Tensor(label) """ data = self.data[index] label = self.labels[index] return Tensor(data), Tensor(label) def __len__(self) -> int: """ Get the dataset size. TODO: Return the dataset size. APPROACH: 1. Return self.size EXAMPLE: len(dataset) returns 100 for dataset with 100 samples HINTS: - Simply return self.size """ return self.size def get_num_classes(self) -> int: """ Get the number of classes. TODO: Return the number of classes. APPROACH: 1. 
Return self.num_classes EXAMPLE: dataset.get_num_classes() returns 3 for 3-class dataset HINTS: - Simply return self.num_classes """ return self.num_classes # %% [markdown] """ ### 🧪 Unit Test: SimpleDataset Let's test your SimpleDataset implementation! This concrete example shows how the Dataset pattern works. **This is a unit test** - it tests the SimpleDataset class in isolation. """ # %% nbgrader={"grade": true, "grade_id": "test-simple-dataset-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false} # Test SimpleDataset immediately after implementation print("🔬 Unit Test: SimpleDataset...") try: # Create dataset dataset = SimpleDataset(size=20, num_features=5, num_classes=4) print(f"Dataset created: size={len(dataset)}, features={dataset.num_features}, classes={dataset.get_num_classes()}") # Test basic properties assert len(dataset) == 20, f"Dataset length should be 20, got {len(dataset)}" assert dataset.get_num_classes() == 4, f"Should have 4 classes, got {dataset.get_num_classes()}" print("✅ SimpleDataset basic properties work correctly") # Test sample access data, label = dataset[0] assert isinstance(data, Tensor), "Data should be a Tensor" assert isinstance(label, Tensor), "Label should be a Tensor" assert data.shape == (5,), f"Data shape should be (5,), got {data.shape}" assert label.shape == (), f"Label shape should be (), got {label.shape}" print("✅ SimpleDataset sample access works correctly") # Test sample shape sample_shape = dataset.get_sample_shape() assert sample_shape == (5,), f"Sample shape should be (5,), got {sample_shape}" print("✅ SimpleDataset get_sample_shape works correctly") # Test multiple samples for i in range(5): data, label = dataset[i] assert data.shape == (5,), f"Data shape should be (5,) for sample {i}, got {data.shape}" assert 0 <= label.data < 4, f"Label should be in [0, 3] for sample {i}, got {label.data}" print("✅ SimpleDataset multiple samples work correctly") # Test deterministic data (same seed should give same data) dataset2 = SimpleDataset(size=20, num_features=5, num_classes=4) data1, label1 = dataset[0] data2, label2 = dataset2[0] assert np.array_equal(data1.data, data2.data), "Data should be deterministic" assert np.array_equal(label1.data, label2.data), "Labels should be deterministic" print("✅ SimpleDataset data is deterministic") except Exception as e: print(f"❌ SimpleDataset test failed: {e}") raise # Show the SimpleDataset behavior print("🎯 SimpleDataset behavior:") print(" Generates synthetic data for testing") print(" Implements complete Dataset interface") print(" Provides deterministic data for reproducibility") print("📈 Progress: Dataset interface ✓, DataLoader ✓, SimpleDataset ✓") # %% [markdown] """ ## Step 5: Comprehensive Test - Complete Data Pipeline ### Real-World Data Pipeline Applications Let's test our data loading components in realistic scenarios: #### **Training Pipeline** ```python # The standard ML training pattern dataset = SimpleDataset(size=1000, num_features=10, num_classes=5) dataloader = DataLoader(dataset, batch_size=32, shuffle=True) for epoch in range(num_epochs): for batch_data, batch_labels in dataloader: # Train model on batch pass ``` #### **Validation Pipeline** ```python # Validation without shuffling val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False) for batch_data, batch_labels in val_loader: # Evaluate model on batch pass ``` #### **Data Analysis Pipeline** ```python # Systematic data exploration for batch_data,
batch_labels in dataloader: # Analyze batch statistics pass ``` This comprehensive test ensures our data loading components work together for real ML applications! """ # %% nbgrader={"grade": true, "grade_id": "test-comprehensive", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false} # Comprehensive test - complete data pipeline applications print("๐Ÿ”ฌ Comprehensive Test: Complete Data Pipeline...") try: # Test 1: Training Data Pipeline print("\n1. Training Data Pipeline Test:") # Create training dataset train_dataset = SimpleDataset(size=100, num_features=8, num_classes=5) train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True) # Simulate training epoch epoch_samples = 0 epoch_batches = 0 for batch_data, batch_labels in train_loader: epoch_batches += 1 epoch_samples += batch_data.shape[0] # Verify batch properties assert batch_data.shape[1] == 8, f"Features should be 8, got {batch_data.shape[1]}" assert len(batch_labels.shape) == 1, f"Labels should be 1D, got shape {batch_labels.shape}" assert isinstance(batch_data, Tensor), "Batch data should be Tensor" assert isinstance(batch_labels, Tensor), "Batch labels should be Tensor" assert epoch_samples == 100, f"Should process 100 samples, got {epoch_samples}" expected_batches = (100 + 16 - 1) // 16 assert epoch_batches == expected_batches, f"Should have {expected_batches} batches, got {epoch_batches}" print("โœ… Training pipeline works correctly") # Test 2: Validation Data Pipeline print("\n2. Validation Data Pipeline Test:") # Create validation dataset (no shuffling) val_dataset = SimpleDataset(size=50, num_features=8, num_classes=5) val_loader = DataLoader(val_dataset, batch_size=10, shuffle=False) # Simulate validation val_samples = 0 val_batches = 0 for batch_data, batch_labels in val_loader: val_batches += 1 val_samples += batch_data.shape[0] # Verify consistent batch processing assert batch_data.shape[1] == 8, "Validation features should match training" assert len(batch_labels.shape) == 1, "Validation labels should be 1D" assert val_samples == 50, f"Should process 50 validation samples, got {val_samples}" assert val_batches == 5, f"Should have 5 validation batches, got {val_batches}" print("โœ… Validation pipeline works correctly") # Test 3: Different Dataset Configurations print("\n3. Dataset Configuration Test:") # Test different configurations configs = [ (200, 4, 3), # Medium dataset (50, 12, 10), # High-dimensional features (1000, 2, 2), # Large dataset, simple features ] for size, features, classes in configs: dataset = SimpleDataset(size=size, num_features=features, num_classes=classes) loader = DataLoader(dataset, batch_size=32, shuffle=True) # Test one batch batch_data, batch_labels = next(iter(loader)) assert batch_data.shape[1] == features, f"Features mismatch for config {configs}" assert len(dataset) == size, f"Size mismatch for config {configs}" assert dataset.get_num_classes() == classes, f"Classes mismatch for config {configs}" print("โœ… Different dataset configurations work correctly") # Test 4: Memory Efficiency Simulation print("\n4. 
Memory Efficiency Test:") # Create larger dataset to test memory efficiency large_dataset = SimpleDataset(size=500, num_features=20, num_classes=10) large_loader = DataLoader(large_dataset, batch_size=50, shuffle=True) # Process all batches to ensure memory efficiency processed_samples = 0 max_batch_size = 0 for batch_data, batch_labels in large_loader: processed_samples += batch_data.shape[0] max_batch_size = max(max_batch_size, batch_data.shape[0]) # Verify memory usage stays reasonable assert batch_data.shape[0] <= 50, f"Batch size should not exceed 50, got {batch_data.shape[0]}" assert processed_samples == 500, f"Should process all 500 samples, got {processed_samples}" print("✅ Memory efficiency works correctly") # Test 5: Multi-Epoch Training Simulation print("\n5. Multi-Epoch Training Test:") # Simulate multiple epochs dataset = SimpleDataset(size=60, num_features=6, num_classes=3) loader = DataLoader(dataset, batch_size=20, shuffle=True) for epoch in range(3): epoch_samples = 0 for batch_data, batch_labels in loader: epoch_samples += batch_data.shape[0] # Verify shapes remain consistent across epochs assert batch_data.shape[1] == 6, f"Features should be 6 in epoch {epoch}" assert len(batch_labels.shape) == 1, f"Labels should be 1D in epoch {epoch}" assert epoch_samples == 60, f"Should process 60 samples in epoch {epoch}, got {epoch_samples}" print("✅ Multi-epoch training works correctly") print("\n🎉 Comprehensive test passed! Your data pipeline works correctly for:") print(" • Large-scale dataset handling") print(" • Batch processing with configurable batch sizes") print(" • Shuffling and sampling strategies") print(" • Memory-efficient data loading") print(" • Complete training pipeline integration") print("📈 Progress: Production-ready data pipeline ✓") except Exception as e: print(f"❌ Comprehensive test failed: {e}") raise print("📈 Final Progress: Complete data pipeline ready for production ML!") # %% [markdown] """ ## 🔧 DEVELOPMENT ### 🧪 Unit Test: Dataset Interface Implementation This test validates the abstract Dataset interface, ensuring proper inheritance, method implementation, and interface compliance for creating custom datasets in the TinyTorch data loading pipeline. """ # %% def test_unit_dataset_interface(): """Unit test for the Dataset abstract interface implementation.""" print("🔬 Unit Test: Dataset Interface...") # Test TestDataset implementation dataset = TestDataset(size=5) # Test basic interface assert len(dataset) == 5, "Dataset should have correct length" # Test data access sample, label = dataset[0] assert isinstance(sample, Tensor), "Sample should be Tensor" assert isinstance(label, Tensor), "Label should be Tensor" print("✅ Dataset interface works correctly") # %% [markdown] """ ### 🧪 Unit Test: DataLoader Implementation This test validates the DataLoader class functionality, ensuring proper batch creation, iteration capability, and integration with datasets for efficient data loading in machine learning training pipelines.
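One behavior worth checking by hand: because `__len__` uses ceiling division and no batch is dropped, the final batch can be smaller than `batch_size`. A quick sketch using the classes built above (the specific numbers are chosen only for illustration):

```python
# 10 samples with batch_size=3 -> expected batch sizes [3, 3, 3, 1] and len(loader) == 4
loader = DataLoader(SimpleDataset(size=10, num_features=2, num_classes=2),
                    batch_size=3, shuffle=False)
sizes = [batch_data.shape[0] for batch_data, _ in loader]
print(sizes, len(loader))
```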
""" # %% def test_unit_dataloader(): """Unit test for the DataLoader implementation.""" print("๐Ÿ”ฌ Unit Test: DataLoader...") # Test DataLoader with TestDataset dataset = TestDataset(size=10) loader = DataLoader(dataset, batch_size=3, shuffle=False) # Test iteration batches = list(loader) assert len(batches) >= 3, "Should have at least 3 batches" # Test batch shapes batch_data, batch_labels = batches[0] assert batch_data.shape[0] <= 3, "Batch size should be <= 3" assert batch_labels.shape[0] <= 3, "Batch labels should match data" print("โœ… DataLoader works correctly") # %% [markdown] """ ### ๐Ÿงช Unit Test: Simple Dataset Implementation This test validates the SimpleDataset class, ensuring it can handle real-world data scenarios including proper data storage, indexing, and compatibility with the DataLoader for practical machine learning workflows. """ # %% def test_unit_simple_dataset(): """Unit test for the SimpleDataset implementation.""" print("๐Ÿ”ฌ Unit Test: SimpleDataset...") # Test SimpleDataset dataset = SimpleDataset(size=100, num_features=4, num_classes=3) # Test properties assert len(dataset) == 100, "Dataset should have correct size" assert dataset.get_num_classes() == 3, "Should have correct number of classes" # Test data access sample, label = dataset[0] assert sample.shape == (4,), "Sample should have correct features" assert 0 <= label.data < 3, "Label should be valid class" print("โœ… SimpleDataset works correctly") # %% [markdown] """ ### ๐Ÿงช Unit Test: Complete Data Pipeline Integration This comprehensive test validates the entire data pipeline from dataset creation through DataLoader batching, ensuring all components work together seamlessly for end-to-end machine learning data processing workflows. """ # %% def test_unit_dataloader_pipeline(): """Comprehensive unit test for the complete data pipeline.""" print("๐Ÿ”ฌ Comprehensive Test: Data Pipeline...") # Test complete pipeline dataset = SimpleDataset(size=50, num_features=10, num_classes=5) loader = DataLoader(dataset, batch_size=8, shuffle=True) total_samples = 0 for batch_data, batch_labels in loader: assert isinstance(batch_data, Tensor), "Batch data should be Tensor" assert isinstance(batch_labels, Tensor), "Batch labels should be Tensor" assert batch_data.shape[1] == 10, "Features should be correct" total_samples += batch_data.shape[0] assert total_samples == 50, "Should process all samples" print("โœ… Data pipeline integration works correctly") # %% [markdown] # %% [markdown] """ ## ๐Ÿงช Module Testing Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly. **This testing section is locked** - it provides consistent feedback across all modules and cannot be modified. """ # %% nbgrader={"grade": false, "grade_id": "standardized-testing", "locked": true, "schema_version": 3, "solution": false, "task": false} # ============================================================================= # STANDARDIZED MODULE TESTING - DO NOT MODIFY # This cell is locked to ensure consistent testing across all TinyTorch modules # ============================================================================= # %% [markdown] """ ## ๐Ÿ”ฌ Integration Test: DataLoader with Tensors """ # %% def test_module_dataloader_tensor_yield(): """ Integration test for the DataLoader and Tensor classes. Tests that the DataLoader correctly yields batches of Tensors. """ print("๐Ÿ”ฌ Running Integration Test: DataLoader with Tensors...") # 1. 
Create a simple dataset dataset = SimpleDataset(size=50, num_features=8, num_classes=4) # 2. Create a DataLoader dataloader = DataLoader(dataset, batch_size=10, shuffle=False) # 3. Get one batch from the dataloader data_batch, labels_batch = next(iter(dataloader)) # 4. Assert the batch contents are correct assert isinstance(data_batch, Tensor), "Data batch should be a Tensor" assert data_batch.shape == (10, 8), f"Expected data shape (10, 8), but got {data_batch.shape}" assert isinstance(labels_batch, Tensor), "Labels batch should be a Tensor" assert labels_batch.shape == (10,), f"Expected labels shape (10,), but got {labels_batch.shape}" print("โœ… Integration Test Passed: DataLoader correctly yields batches of Tensors.") # Run the integration test test_module_dataloader_tensor_yield() # %% [markdown] """ ## ๐ŸŽฏ MODULE SUMMARY: Data Loading and Processing Congratulations! You've successfully implemented professional data loading systems: ### What You've Accomplished โœ… **DataLoader Class**: Efficient batch processing with memory management โœ… **Dataset Integration**: Seamless compatibility with Tensor operations โœ… **Batch Processing**: Optimized data loading for training โœ… **Memory Management**: Efficient handling of large datasets โœ… **Real Applications**: Image classification, regression, and more ### Key Concepts You've Learned - **Batch processing**: How to efficiently process data in chunks - **Memory management**: Handling large datasets without memory overflow - **Data iteration**: Creating efficient data loading pipelines - **Integration patterns**: How data loaders work with neural networks - **Performance optimization**: Balancing speed and memory usage ### Professional Skills Developed - **Data engineering**: Building robust data processing pipelines - **Memory optimization**: Efficient handling of large datasets - **API design**: Clean interfaces for data loading operations - **Integration testing**: Ensuring data loaders work with neural networks ### Ready for Advanced Applications Your data loading implementations now enable: - **Large-scale training**: Processing datasets too big for memory - **Real-time learning**: Streaming data for online learning - **Multi-modal data**: Handling images, text, and structured data - **Production systems**: Robust data pipelines for deployment ### Connection to Real ML Systems Your implementations mirror production systems: - **PyTorch**: `torch.utils.data.DataLoader` provides identical functionality - **TensorFlow**: `tf.data.Dataset` implements similar concepts - **Industry Standard**: Every major ML framework uses these exact patterns ### Next Steps 1. **Export your code**: `tito export 08_dataloader` 2. **Test your implementation**: `tito test 08_dataloader` 3. **Build training pipelines**: Combine with neural networks for complete ML systems 4. **Move to Module 9**: Add automatic differentiation for training! **Ready for autograd?** Your data loading systems are now ready for real training! """