
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.17.1
# ---

# %% [markdown]
"""
# Module 6: DataLoader - Data Loading and Preprocessing

Welcome to the DataLoader module! This is where you'll learn how to efficiently load, process, and manage data for machine learning systems.

## Learning Goals
- Understand data pipelines as the foundation of ML systems
- Implement efficient data loading with memory management and batching
- Build reusable dataset abstractions for different data types
- Master the Dataset and DataLoader pattern popularized by PyTorch's `torch.utils.data`
- Learn systems thinking for data engineering and I/O optimization

## Build → Use → Reflect
1. **Build**: Create dataset classes and data loaders from scratch
2. **Use**: Load real datasets and feed them to neural networks
3. **Reflect**: How data engineering affects system performance and scalability

## What You'll Learn
By the end of this module, you'll understand:
- The Dataset pattern for consistent data access
- How DataLoaders enable efficient batch processing
- Why batching and shuffling are crucial for ML
- How to handle datasets larger than memory
- The connection between data engineering and model performance
"""

# %% nbgrader={"grade": false, "grade_id": "dataloader-imports", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| default_exp core.dataloader

#| export
import numpy as np
import sys
import os
import pickle
import struct
from typing import List, Tuple, Optional, Union, Iterator
import matplotlib.pyplot as plt
import urllib.request
import tarfile

# Import our building blocks - try package first, then local modules
try:
    from tinytorch.core.tensor import Tensor
except ImportError:
    # For development, import from local modules
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
    from tensor_dev import Tensor

# %% nbgrader={"grade": false, "grade_id": "dataloader-setup", "locked": false, "schema_version": 3, "solution": false, "task": false}
#| hide
#| export
def _should_show_plots():
    """Check if we should show plots (disable during testing)"""
    # Test mode is indicated by a loaded pytest module, pytest's environment
    # variable, or any command-line argument containing 'test' (this also
    # matches 'pytest')
    is_pytest = (
        'pytest' in sys.modules or
        os.environ.get('PYTEST_CURRENT_TEST') is not None or
        any('test' in arg for arg in sys.argv)
    )

    # Show plots in development mode (when not in test mode)
    return not is_pytest

# %% nbgrader={"grade": false, "grade_id": "dataloader-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false}
print("🔥 TinyTorch DataLoader Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print("Ready to build data pipelines!")

# %% [markdown]
"""
## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py`
**Building Side:** Code exports to `tinytorch.core.dataloader`

```python
# Final package structure:
from tinytorch.core.dataloader import Dataset, DataLoader  # Data loading utilities!
from tinytorch.core.tensor import Tensor  # Foundation
from tinytorch.core.networks import Sequential  # Models to train
```

**Why this matters:**
- **Learning:** Focused modules for deep understanding of data pipelines
- **Production:** Proper organization like PyTorch's `torch.utils.data`
- **Consistency:** All data loading utilities live together in `core.dataloader`
- **Integration:** Works seamlessly with tensors and networks
"""

# %% [markdown]
"""
## Step 1: Understanding Data Pipelines

### What are Data Pipelines?
**Data pipelines** are the systems that efficiently move data from storage to your model. They're the foundation of all machine learning systems.

### The Data Pipeline Equation
```
Raw Data → Load → Transform → Batch → Model → Predictions
```

### Why Data Pipelines Matter
- **Performance**: Efficient loading prevents GPU starvation
- **Scalability**: Handle datasets larger than memory
- **Consistency**: Reproducible data processing
- **Flexibility**: Easy to switch between datasets

### Real-World Challenges
- **Memory constraints**: Datasets often exceed available RAM
- **I/O bottlenecks**: Disk access is much slower than computation
- **Batch processing**: Neural networks need batched data for efficiency
- **Shuffling**: Randomizing sample order keeps consecutive batches from being correlated and stops the model from learning the order of the data

### Systems Thinking
- **Memory efficiency**: Handle datasets larger than RAM
- **I/O optimization**: Read from disk efficiently
- **Batching strategies**: Trade-offs between memory and speed
- **Caching**: When to cache vs recompute
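
One way to see the memory-efficiency point is a memory-mapped file. The following is a minimal sketch, not part of TinyTorch: the file name, sizes, and contents are made up for illustration, and `np.memmap` is just one possible technique for reading samples on demand.

```python
import os
import tempfile

import numpy as np

# Hypothetical on-disk dataset: 1000 samples with 16 float32 features each.
num_samples, num_features = 1000, 16
path = os.path.join(tempfile.mkdtemp(), "features.bin")

# Write the data once (in a real pipeline this file would already exist).
np.arange(num_samples * num_features, dtype=np.float32).reshape(
    num_samples, num_features
).tofile(path)

# Map the file instead of reading it: rows are fetched only when indexed.
features = np.memmap(path, dtype=np.float32, mode="r",
                     shape=(num_samples, num_features))

# Copy a single row into ordinary memory; the rest stays on disk.
sample = np.array(features[123])
print(sample.shape, sample[0])
```

A `__getitem__` built on this idea would touch only the rows it is asked for, which is what lets a dataset exceed available RAM.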

### Visual Intuition
```
Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...]
Load: [Tensor(32x32x3), Tensor(32x32x3), Tensor(32x32x3), ...]
Batch: [Tensor(32, 32, 32, 3)]  # 32 images at once
Model: Process batch efficiently
```
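
The shapes in the diagram can be checked with plain NumPy. This sketch uses zero-filled arrays as stand-ins for loaded images, so the values are placeholders; only the shapes matter.

```python
import numpy as np

# 32 stand-in "images", each 32x32 pixels with 3 color channels.
images = [np.zeros((32, 32, 3), dtype=np.float32) for _ in range(32)]

# np.stack adds a new leading axis, which becomes the batch dimension.
batch = np.stack(images, axis=0)
print(batch.shape)  # (32, 32, 32, 3)
```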

Let's start by building the most fundamental component: **Dataset**.
"""

# %% [markdown]
"""
## Step 2: Building the Dataset Interface

### What is a Dataset?
A **Dataset** is an abstract interface that provides consistent access to data. It's the foundation of all data loading systems.

### Why Abstract Interfaces Matter
- **Consistency**: Same interface for all data types
- **Flexibility**: Easy to switch between datasets
- **Testability**: Easy to create test datasets
- **Extensibility**: Easy to add new data sources

### The Dataset Pattern
```python
class Dataset:
    def __getitem__(self, index):  # Get single sample
        return data, label

    def __len__(self):  # Get dataset size
        return total_samples
```
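
As a rough illustration of why these two methods are enough, here is a self-contained sketch. `SquaresDataset` is a hypothetical class, and it returns NumPy arrays in place of TinyTorch `Tensor` objects so that it runs on its own.

```python
import numpy as np

class SquaresDataset:
    # Hypothetical example: sample i has feature [i] and label i * i.
    def __init__(self, size):
        self.size = size

    def __getitem__(self, index):
        # Raising IndexError past the end lets a plain for-loop stop cleanly.
        if not 0 <= index < self.size:
            raise IndexError(index)
        return np.array([index], dtype=np.float32), np.array(index * index)

    def __len__(self):
        return self.size

dataset = SquaresDataset(4)
data, label = dataset[3]          # indexing calls __getitem__
print(len(dataset), data, label)  # len() calls __len__

# Iteration falls back to __getitem__ with indices 0, 1, 2, ...
pairs = [(int(x[0]), int(y)) for x, y in dataset]
```

Any code that relies only on indexing and `len()` can then work with every dataset that follows the pattern.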

### Real-World Usage
- **Computer vision**: ImageNet, CIFAR-10, custom image datasets
- **NLP**: Text datasets, tokenized sequences
- **Audio**: Audio files, spectrograms
- **Time series**: Sequential data with proper windowing
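
For the time-series case, "windowing" means each sample is a slice of consecutive steps. The sketch below assumes one common setup, predicting the next value from the previous `window` values; the series and the helper `get_window` are hypothetical.

```python
import numpy as np

# Hypothetical series: 10 time steps with values 0..9.
series = np.arange(10, dtype=np.float32)
window = 3  # number of past steps used as input

def get_window(index):
    # Input: `window` consecutive values; label: the value that follows them.
    return series[index:index + window], series[index + window]

# The last valid window must leave room for its label.
num_samples = len(series) - window

x, y = get_window(0)                          # x = [0, 1, 2], y = 3
x_last, y_last = get_window(num_samples - 1)  # x = [6, 7, 8], y = 9
```

In a Dataset, `get_window` would be the body of `__getitem__` and `num_samples` the return value of `__len__`.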

Let's implement the Dataset interface!
"""

# %% nbgrader={"grade": false, "grade_id": "dataset-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class Dataset:
    """
    Base Dataset class: Abstract interface for all datasets.

    The fundamental abstraction for data loading in TinyTorch.
    Students implement concrete datasets by inheriting from this class.
    """

    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:
        """
        Get a single sample and label by index.

        Args:
            index: Index of the sample to retrieve

        Returns:
            Tuple of (data, label) tensors

        TODO: Implement abstract method for getting samples.

        APPROACH:
        1. This is an abstract method - subclasses will implement it
        2. Return a tuple of (data, label) tensors
        3. Data should be the input features, label should be the target

        EXAMPLE:
        dataset[0] should return (Tensor(image_data), Tensor(label))

        HINTS:
        - This is an abstract method that subclasses must override
        - Always return a tuple of (data, label) tensors
        - Data contains the input features, label contains the target
        """
        ### BEGIN SOLUTION
        # This is an abstract method - subclasses must implement it
        raise NotImplementedError("Subclasses must implement __getitem__")
        ### END SOLUTION

    def __len__(self) -> int:
        """
        Get the total number of samples in the dataset.

        TODO: Implement abstract method for getting dataset size.

        APPROACH:
        1. This is an abstract method - subclasses will implement it
        2. Return the total number of samples in the dataset

        EXAMPLE:
        len(dataset) should return 50000 for CIFAR-10 training set

        HINTS:
        - This is an abstract method that subclasses must override
        - Return an integer representing the total number of samples
        """
        ### BEGIN SOLUTION
        # This is an abstract method - subclasses must implement it
        raise NotImplementedError("Subclasses must implement __len__")
        ### END SOLUTION

    def get_sample_shape(self) -> Tuple[int, ...]:
        """
        Get the shape of a single data sample.

        TODO: Implement method to get sample shape.

        APPROACH:
        1. Get the first sample using self[0]
        2. Extract the data part (first element of tuple)
        3. Return the shape of the data tensor

        EXAMPLE:
        For CIFAR-10: returns (3, 32, 32) for RGB images

        HINTS:
        - Use self[0] to get the first sample
        - Extract data from the (data, label) tuple
        - Return data.shape
        """
        ### BEGIN SOLUTION
        # Get the first sample to determine shape
        data, _ = self[0]
        return data.shape
        ### END SOLUTION

    def get_num_classes(self) -> int:
        """
        Get the number of classes in the dataset.

        TODO: Implement abstract method for getting number of classes.

        APPROACH:
        1. This is an abstract method - subclasses will implement it
        2. Return the number of unique classes in the dataset

        EXAMPLE:
        For CIFAR-10: returns 10 (classes 0-9)

        HINTS:
        - This is an abstract method that subclasses must override
        - Return the number of unique classes/categories
        """
        ### BEGIN SOLUTION
        # This is an abstract method - subclasses must implement it
        raise NotImplementedError("Subclasses must implement get_num_classes")
        ### END SOLUTION

# %% [markdown]
"""
### 🧪 Unit Test: Dataset Interface

Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset.

**This is a unit test** - it tests the Dataset interface pattern in isolation.
"""

# %% nbgrader={"grade": true, "grade_id": "test-dataset-interface-immediate", "locked": true, "points": 5, "schema_version": 3, "solution": false, "task": false}
# Test Dataset interface with a simple implementation
print("🔬 Unit Test: Dataset Interface...")

# Create a minimal test dataset
class TestDataset(Dataset):
    def __init__(self, size=5):
        self.size = size

    def __getitem__(self, index):
        # Simple test data: features are [index, index*2], label is index % 2
        data = Tensor([index, index * 2])
        label = Tensor([index % 2])
        return data, label

    def __len__(self):
        return self.size

    def get_num_classes(self):
        return 2

# Test the interface
try:
    test_dataset = TestDataset(size=5)
    print(f"Dataset created with size: {len(test_dataset)}")

    # Test __getitem__
    data, label = test_dataset[0]
    print(f"Sample 0: data={data}, label={label}")
    assert isinstance(data, Tensor), "Data should be a Tensor"
    assert isinstance(label, Tensor), "Label should be a Tensor"
    print("✅ Dataset __getitem__ works correctly")

    # Test __len__
    assert len(test_dataset) == 5, f"Dataset length should be 5, got {len(test_dataset)}"
    print("✅ Dataset __len__ works correctly")

    # Test get_num_classes
    assert test_dataset.get_num_classes() == 2, f"Should have 2 classes, got {test_dataset.get_num_classes()}"
    print("✅ Dataset get_num_classes works correctly")

    # Test get_sample_shape
    sample_shape = test_dataset.get_sample_shape()
    assert sample_shape == (2,), f"Sample shape should be (2,), got {sample_shape}"
    print("✅ Dataset get_sample_shape works correctly")

    # Test multiple samples
    for i in range(3):
        data, label = test_dataset[i]
        expected_data = [i, i * 2]
        expected_label = [i % 2]
        assert np.array_equal(data.data, expected_data), f"Data mismatch at index {i}"
        assert np.array_equal(label.data, expected_label), f"Label mismatch at index {i}"
    print("✅ Dataset produces correct data for multiple samples")

except Exception as e:
    print(f"❌ Dataset interface test failed: {e}")
    raise

# Show the dataset pattern
print("🎯 Dataset interface pattern:")
print("   __getitem__: Returns (data, label) tuple")
print("   __len__: Returns dataset size")
print("   get_num_classes: Returns number of classes")
print("   get_sample_shape: Returns shape of data samples")
print("📈 Progress: Dataset interface ✓")

# %% [markdown]
"""
## Step 3: Building the DataLoader

### What is a DataLoader?
A **DataLoader** efficiently batches and iterates through datasets. It's the bridge between individual samples and the batched data that neural networks expect.

### Why DataLoaders Matter
- **Batching**: Groups samples for efficient GPU computation
- **Shuffling**: Randomizes data order each epoch so the model cannot learn from the order of the samples
- **Memory efficiency**: Loads data on-demand rather than all at once
- **Iteration**: Provides clean interface for training loops

### The DataLoader Pattern
```python
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch_data, batch_labels in dataloader:
    # batch_data.shape: (32, ...)
    # batch_labels.shape: (32, ...)  # (32,) when each label is a scalar
    # Train on batch
```

### Real-World Applications
- **Training loops**: Feed batches to neural networks
- **Validation**: Evaluate models on held-out data
- **Inference**: Process large datasets efficiently
- **Data analysis**: Explore datasets systematically

### Systems Thinking
- **Batch size**: Trade-off between memory and speed
- **Shuffling**: Removes any dependence on the order in which the data was stored
- **Iteration**: Efficient looping through data
- **Memory**: Manage large datasets that don't fit in RAM
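
The batching and shuffling mechanics can be sketched with indices alone, before any data is involved. This is an illustration under assumptions: the sizes are arbitrary, and a seeded `np.random.default_rng` is used only to make the sketch repeatable.

```python
import numpy as np

dataset_size, batch_size = 10, 3

# Ceiling division counts the batches, including the short final one.
num_batches = (dataset_size + batch_size - 1) // batch_size

rng = np.random.default_rng(0)           # seeded only for repeatability
indices = rng.permutation(dataset_size)  # a shuffled copy of 0..9

# Slice the shuffled indices into consecutive chunks of batch_size.
batches = [indices[i:i + batch_size] for i in range(0, dataset_size, batch_size)]
batch_lengths = [len(b) for b in batches]
print(num_batches, batch_lengths)  # 4 [3, 3, 3, 1]
```

The last batch holds the single leftover index, which is why code that consumes batches should not assume every batch is full.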
"""

# %% nbgrader={"grade": false, "grade_id": "dataloader-class", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class DataLoader:
    """
    DataLoader: Efficiently batch and iterate through datasets.

    Provides batching, shuffling, and efficient iteration over datasets.
    Essential for training neural networks efficiently.
    """

    def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):
        """
        Initialize DataLoader.

        Args:
            dataset: Dataset to load from
            batch_size: Number of samples per batch
            shuffle: Whether to shuffle data each epoch

        TODO: Store configuration and dataset.

        APPROACH:
        1. Store dataset as self.dataset
        2. Store batch_size as self.batch_size
        3. Store shuffle as self.shuffle

        EXAMPLE:
        DataLoader(dataset, batch_size=32, shuffle=True)

        HINTS:
        - Store all parameters as instance variables
        - These will be used in __iter__ for batching
        """
        if batch_size <= 0:
            raise ValueError(f"batch_size must be positive, got {batch_size}")
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:
        """
        Iterate through dataset in batches.

        Returns:
            Iterator yielding (batch_data, batch_labels) tuples

        TODO: Implement batching and shuffling logic.

        APPROACH:
        1. Create indices list: list(range(len(dataset)))
        2. Shuffle indices if self.shuffle is True
        3. Loop through indices in batch_size chunks
        4. For each batch: collect samples, stack them, yield batch

        EXAMPLE:
        for batch_data, batch_labels in dataloader:
            # batch_data.shape: (batch_size, ...); the final batch may be smaller
            # batch_labels.shape: (batch_size, ...), e.g. (batch_size,) for scalar labels

        HINTS:
        - Use list(range(len(self.dataset))) for indices
        - Use np.random.shuffle() if self.shuffle is True
        - Loop in chunks of self.batch_size
        - Collect samples and stack with np.stack()
        """
        # Create indices for all samples
        indices = list(range(len(self.dataset)))

        # Shuffle if requested
        if self.shuffle:
            np.random.shuffle(indices)

        # Iterate through indices in batches
        for i in range(0, len(indices), self.batch_size):
            batch_indices = indices[i:i + self.batch_size]

            # Collect samples for this batch
            batch_data = []
            batch_labels = []

            for idx in batch_indices:
                data, label = self.dataset[idx]
                batch_data.append(data.data)
                batch_labels.append(label.data)

            # Stack into batch tensors
            batch_data_array = np.stack(batch_data, axis=0)
            batch_labels_array = np.stack(batch_labels, axis=0)

            yield Tensor(batch_data_array), Tensor(batch_labels_array)

    def __len__(self) -> int:
        """
        Get the number of batches per epoch.

        TODO: Calculate number of batches.

        APPROACH:
        1. Get dataset size: len(self.dataset)
        2. Divide by batch_size and round up
        3. Use ceiling division: (n + batch_size - 1) // batch_size

        EXAMPLE:
        Dataset size 100, batch size 32 → 4 batches

        HINTS:
        - Use len(self.dataset) for dataset size
        - Use ceiling division for exact batch count
        - Formula: (dataset_size + batch_size - 1) // batch_size
        """
        # Calculate number of batches using ceiling division
        dataset_size = len(self.dataset)
        return (dataset_size + self.batch_size - 1) // self.batch_size

# %% [markdown]
"""
### 🧪 Unit Test: DataLoader

Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks.

**This is a unit test** - it tests the DataLoader class in isolation.
"""

# %% nbgrader={"grade": true, "grade_id": "test-dataloader-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
# Test DataLoader immediately after implementation
print("🔬 Unit Test: DataLoader...")

# Redefine the test dataset, this time with 10 samples and 3 classes
class TestDataset(Dataset):
    def __init__(self, size=10):
        self.size = size

    def __getitem__(self, index):
        data = Tensor([index, index * 2])
        label = Tensor([index % 3])  # 3 classes
        return data, label

    def __len__(self):
        return self.size

    def get_num_classes(self):
        return 3

# Test basic DataLoader functionality
try:
    dataset = TestDataset(size=10)
    dataloader = DataLoader(dataset, batch_size=3, shuffle=False)

    print(f"DataLoader created: batch_size={dataloader.batch_size}, shuffle={dataloader.shuffle}")
    print(f"Number of batches: {len(dataloader)}")

    # Test __len__
    expected_batches = (10 + 3 - 1) // 3  # Ceiling division: 4 batches
    assert len(dataloader) == expected_batches, f"Should have {expected_batches} batches, got {len(dataloader)}"
    print("✅ DataLoader __len__ works correctly")

    # Test iteration
    batch_count = 0
    total_samples = 0

    for batch_data, batch_labels in dataloader:
        batch_count += 1
        batch_size = batch_data.shape[0]
        total_samples += batch_size

        print(f"Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}")

        # Verify batch dimensions
        assert len(batch_data.shape) == 2, f"Batch data should be 2D, got {batch_data.shape}"
        assert len(batch_labels.shape) == 2, f"Batch labels should be 2D, got {batch_labels.shape}"
        assert batch_data.shape[1] == 2, f"Each sample should have 2 features, got {batch_data.shape[1]}"
        assert batch_labels.shape[1] == 1, f"Each label should have 1 element, got {batch_labels.shape[1]}"

    assert batch_count == expected_batches, f"Should iterate {expected_batches} times, got {batch_count}"
    assert total_samples == 10, f"Should process 10 total samples, got {total_samples}"
    print("✅ DataLoader iteration works correctly")

except Exception as e:
    print(f"❌ DataLoader test failed: {e}")
    raise

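# Test shuffling
try:
    dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True)
    dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False)

    # Without shuffling, the first batch holds samples 0-4 in order
    first_data, _ = next(iter(dataloader_no_shuffle))
    assert np.array_equal(first_data.data[:, 0], [0, 1, 2, 3, 4]), "Unshuffled order should be sequential"

    # With shuffling, the order may change but every sample appears exactly once
    seen = sorted(int(row[0]) for batch_data, _ in dataloader_shuffle for row in batch_data.data)
    assert seen == list(range(10)), f"Shuffled epoch should cover all samples, got {seen}"

    print("✅ DataLoader shuffling parameter works")

except Exception as e:
    print(f"❌ DataLoader shuffling test failed: {e}")
    raise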

# Test different batch sizes
try:
    small_loader = DataLoader(dataset, batch_size=2, shuffle=False)
    large_loader = DataLoader(dataset, batch_size=8, shuffle=False)

    assert len(small_loader) == 5, f"Small loader should have 5 batches, got {len(small_loader)}"
    assert len(large_loader) == 2, f"Large loader should have 2 batches, got {len(large_loader)}"
    print("✅ DataLoader handles different batch sizes correctly")

except Exception as e:
    print(f"❌ DataLoader batch size test failed: {e}")
    raise

# Show the DataLoader behavior
print("🎯 DataLoader behavior:")
print("   Batches data for efficient processing")
print("   Handles shuffling and iteration")
print("   Provides clean interface for training loops")
print("📈 Progress: Dataset interface ✓, DataLoader ✓")

# %% [markdown]
"""
## Step 4: Creating a Simple Dataset Example

### Why We Need Concrete Examples
Abstract classes are great for interfaces, but we need concrete implementations to understand how they work. Let's create a simple dataset for testing.

### Design Principles
- **Simple**: Easy to understand and debug
- **Configurable**: Adjustable size and properties
- **Predictable**: Deterministic data for testing
- **Educational**: Shows the Dataset pattern clearly
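
The "Predictable" principle comes down to seeding the random generator. The sketch below is an illustration rather than the module's implementation: it uses a local `np.random.default_rng` generator, and `make_data` is a hypothetical helper.

```python
import numpy as np

def make_data(seed, size=5, num_features=2):
    # A local generator leaves NumPy's global random state untouched.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((size, num_features)).astype(np.float32)

a = make_data(42)
b = make_data(42)  # same seed: identical values
c = make_data(7)   # different seed: different values

print(np.array_equal(a, b), np.array_equal(a, c))  # True False
```

Tests can then compare two independently constructed datasets and expect them to match exactly.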

### Real-World Connection
This pattern is used for:
- **CIFAR-10**: 32x32 RGB images with 10 classes
- **ImageNet**: High-resolution images with 1000 classes
- **MNIST**: 28x28 grayscale digits with 10 classes
- **Custom datasets**: Your own data following this pattern
"""

# %% nbgrader={"grade": false, "grade_id": "simple-dataset", "locked": false, "schema_version": 3, "solution": true, "task": false}
#| export
class SimpleDataset(Dataset):
    """
    Simple dataset for testing and demonstration.

    Generates synthetic data with configurable size and properties.
    Perfect for understanding the Dataset pattern.
    """

    def __init__(self, size: int = 100, num_features: int = 4, num_classes: int = 3):
        """
        Initialize SimpleDataset.

        Args:
            size: Number of samples in the dataset
            num_features: Number of features per sample
            num_classes: Number of classes

        TODO: Initialize the dataset with synthetic data.

        APPROACH:
        1. Store the configuration parameters
        2. Generate synthetic data and labels
        3. Make data deterministic for testing

        EXAMPLE:
        SimpleDataset(size=100, num_features=4, num_classes=3)
        creates 100 samples with 4 features each, 3 classes

        HINTS:
        - Store size, num_features, num_classes as instance variables
        - Use np.random.seed() for reproducible data
        - Generate random data with np.random.randn()
        - Generate random labels with np.random.randint()
        """
        self.size = size
        self.num_features = num_features
        self.num_classes = num_classes

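        # Generate synthetic data (deterministic for testing). A local generator
        # yields the same values as np.random.seed(42) without resetting NumPy's
        # global random state, so DataLoader shuffling elsewhere is unaffected.
        rng = np.random.RandomState(42)
        self.data = rng.randn(size, num_features).astype(np.float32)
        self.labels = rng.randint(0, num_classes, size=size)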

    def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:
        """
        Get a sample by index.

        Args:
            index: Index of the sample

        Returns:
            Tuple of (data, label) tensors

        TODO: Return the sample at the given index.

        APPROACH:
        1. Get data sample from self.data[index]
        2. Get label from self.labels[index]
        3. Convert both to Tensors and return as tuple

        EXAMPLE:
        dataset[0] returns (Tensor(features), Tensor(label))

        HINTS:
        - Use self.data[index] for the data
        - Use self.labels[index] for the label
        - Convert to Tensors: Tensor(data), Tensor(label)
        """
        data = self.data[index]
        label = self.labels[index]
        return Tensor(data), Tensor(label)

    def __len__(self) -> int:
        """
        Get the dataset size.

        TODO: Return the dataset size.

        APPROACH:
        1. Return self.size

        EXAMPLE:
        len(dataset) returns 100 for dataset with 100 samples

        HINTS:
        - Simply return self.size
        """
        return self.size

    def get_num_classes(self) -> int:
        """
        Get the number of classes.

        TODO: Return the number of classes.

        APPROACH:
        1. Return self.num_classes

        EXAMPLE:
        dataset.get_num_classes() returns 3 for 3-class dataset

        HINTS:
        - Simply return self.num_classes
        """
        return self.num_classes

# %% [markdown]
"""
### 🧪 Unit Test: SimpleDataset

Let's test your SimpleDataset implementation! This concrete example shows how the Dataset pattern works.

**This is a unit test** - it tests the SimpleDataset class in isolation.
"""

# %% nbgrader={"grade": true, "grade_id": "test-simple-dataset-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false}
# Test SimpleDataset immediately after implementation
print("🔬 Unit Test: SimpleDataset...")

try:
    # Create dataset
    dataset = SimpleDataset(size=20, num_features=5, num_classes=4)

    print(f"Dataset created: size={len(dataset)}, features={dataset.num_features}, classes={dataset.get_num_classes()}")

    # Test basic properties
    assert len(dataset) == 20, f"Dataset length should be 20, got {len(dataset)}"
    assert dataset.get_num_classes() == 4, f"Should have 4 classes, got {dataset.get_num_classes()}"
    print("✅ SimpleDataset basic properties work correctly")

    # Test sample access
    data, label = dataset[0]
    assert isinstance(data, Tensor), "Data should be a Tensor"
    assert isinstance(label, Tensor), "Label should be a Tensor"
    assert data.shape == (5,), f"Data shape should be (5,), got {data.shape}"
    assert label.shape == (), f"Label shape should be (), got {label.shape}"
    print("✅ SimpleDataset sample access works correctly")

    # Test sample shape
    sample_shape = dataset.get_sample_shape()
    assert sample_shape == (5,), f"Sample shape should be (5,), got {sample_shape}"
    print("✅ SimpleDataset get_sample_shape works correctly")

    # Test multiple samples
    for i in range(5):
        data, label = dataset[i]
        assert data.shape == (5,), f"Data shape should be (5,) for sample {i}, got {data.shape}"
        assert 0 <= label.data < 4, f"Label should be in [0, 3] for sample {i}, got {label.data}"
    print("✅ SimpleDataset multiple samples work correctly")

    # Test deterministic data (same seed should give same data)
    dataset2 = SimpleDataset(size=20, num_features=5, num_classes=4)
    data1, label1 = dataset[0]
    data2, label2 = dataset2[0]
    assert np.array_equal(data1.data, data2.data), "Data should be deterministic"
    assert np.array_equal(label1.data, label2.data), "Labels should be deterministic"
    print("✅ SimpleDataset data is deterministic")

except Exception as e:
    print(f"❌ SimpleDataset test failed: {e}")
    raise

# Show the SimpleDataset behavior
print("🎯 SimpleDataset behavior:")
print("   Generates synthetic data for testing")
print("   Implements complete Dataset interface")
print("   Provides deterministic data for reproducibility")
print("📈 Progress: Dataset interface ✓, DataLoader ✓, SimpleDataset ✓")

# %% [markdown]
"""
## Step 5: Integration Test - Complete Data Pipeline

### Real-World Data Pipeline Applications
Let's test our data loading components in realistic scenarios:

#### **Training Pipeline**
```python
# The standard ML training pattern
dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for epoch in range(num_epochs):
    for batch_data, batch_labels in dataloader:
        # Train model on batch
        pass
```
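
What "shuffle each epoch" means for the training loop can be sketched with indices only. The helper `epoch_order` and the seed are hypothetical; the point is that each epoch draws a fresh permutation but still visits every sample once.

```python
import numpy as np

def epoch_order(num_samples, rng):
    # One fresh permutation per epoch, as a list of sample indices.
    return rng.permutation(num_samples).tolist()

rng = np.random.default_rng(0)  # hypothetical seed, for repeatability only
orders = [epoch_order(8, rng) for _ in range(3)]

for epoch, order in enumerate(orders):
    # Every epoch covers each sample exactly once, in its own order.
    assert sorted(order) == list(range(8))
    print(epoch, order)
```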

#### **Validation Pipeline**
```python
# Validation without shuffling
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)

for batch_data, batch_labels in val_loader:
    # Evaluate model on batch
    pass
```

#### **Data Analysis Pipeline**
```python
# Systematic data exploration
for batch_data, batch_labels in dataloader:
    # Analyze batch statistics
    pass
```
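
As one possible reading of "analyze batch statistics", the sketch below accumulates a per-feature mean and a label histogram one batch at a time. It is self-contained: `iter_batches` is a hypothetical stand-in for a DataLoader, and the data is random.

```python
import numpy as np

def iter_batches(features, labels, batch_size):
    # Hypothetical stand-in for a DataLoader: yields consecutive slices.
    for start in range(0, len(features), batch_size):
        yield features[start:start + batch_size], labels[start:start + batch_size]

rng = np.random.default_rng(0)
features = rng.standard_normal((100, 4))
labels = rng.integers(0, 3, size=100)

feature_sum = np.zeros(4)
class_counts = np.zeros(3, dtype=np.int64)
seen = 0

for batch_x, batch_y in iter_batches(features, labels, batch_size=32):
    feature_sum += batch_x.sum(axis=0)                 # running per-feature sum
    class_counts += np.bincount(batch_y, minlength=3)  # running label histogram
    seen += len(batch_x)

running_mean = feature_sum / seen
print(seen, class_counts.sum(), np.allclose(running_mean, features.mean(axis=0)))
```

Because only sums and counts are kept between batches, the same loop works when the full dataset does not fit in memory.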

This integration test ensures our data loading components work together for real ML applications!
"""

# %% nbgrader={"grade": true, "grade_id": "test-integration", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false}
# Integration test - complete data pipeline applications
print("🔬 Integration Test: Complete Data Pipeline...")

try:
    # Test 1: Training Data Pipeline
    print("\n1. Training Data Pipeline Test:")

    # Create training dataset
    train_dataset = SimpleDataset(size=100, num_features=8, num_classes=5)
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

    # Simulate training epoch
    epoch_samples = 0
    epoch_batches = 0

    for batch_data, batch_labels in train_loader:
        epoch_batches += 1
        epoch_samples += batch_data.shape[0]

        # Verify batch properties
        assert batch_data.shape[1] == 8, f"Features should be 8, got {batch_data.shape[1]}"
        assert len(batch_labels.shape) == 1, f"Labels should be 1D, got shape {batch_labels.shape}"
        assert isinstance(batch_data, Tensor), "Batch data should be Tensor"
        assert isinstance(batch_labels, Tensor), "Batch labels should be Tensor"

    assert epoch_samples == 100, f"Should process 100 samples, got {epoch_samples}"
    expected_batches = (100 + 16 - 1) // 16
    assert epoch_batches == expected_batches, f"Should have {expected_batches} batches, got {epoch_batches}"
    print("✅ Training pipeline works correctly")

    # Test 2: Validation Data Pipeline
    print("\n2. Validation Data Pipeline Test:")

    # Create validation dataset (no shuffling)
    val_dataset = SimpleDataset(size=50, num_features=8, num_classes=5)
    val_loader = DataLoader(val_dataset, batch_size=10, shuffle=False)

    # Simulate validation
    val_samples = 0
    val_batches = 0

    for batch_data, batch_labels in val_loader:
        val_batches += 1
        val_samples += batch_data.shape[0]

        # Verify consistent batch processing
        assert batch_data.shape[1] == 8, "Validation features should match training"
        assert len(batch_labels.shape) == 1, "Validation labels should be 1D"

    assert val_samples == 50, f"Should process 50 validation samples, got {val_samples}"
    assert val_batches == 5, f"Should have 5 validation batches, got {val_batches}"
    print("✅ Validation pipeline works correctly")

    # Test 3: Different Dataset Configurations
    print("\n3. Dataset Configuration Test:")

    # Test different configurations
    configs = [
        (200, 4, 3),   # Medium dataset
        (50, 12, 10),  # High-dimensional features
        (1000, 2, 2),  # Large dataset, simple features
    ]

    for size, features, classes in configs:
        dataset = SimpleDataset(size=size, num_features=features, num_classes=classes)
        loader = DataLoader(dataset, batch_size=32, shuffle=True)

        # Test one batch
        batch_data, batch_labels = next(iter(loader))

        assert batch_data.shape[1] == features, f"Features mismatch for config {(size, features, classes)}"
        assert len(dataset) == size, f"Size mismatch for config {(size, features, classes)}"
        assert dataset.get_num_classes() == classes, f"Classes mismatch for config {(size, features, classes)}"

    print("✅ Different dataset configurations work correctly")

    # Test 4: Memory Efficiency Simulation
    print("\n4. Memory Efficiency Test:")

    # Create larger dataset to test memory efficiency
    large_dataset = SimpleDataset(size=500, num_features=20, num_classes=10)
    large_loader = DataLoader(large_dataset, batch_size=50, shuffle=True)

    # Process all batches to ensure memory efficiency
    processed_samples = 0
    max_batch_size = 0

    for batch_data, batch_labels in large_loader:
        processed_samples += batch_data.shape[0]
        max_batch_size = max(max_batch_size, batch_data.shape[0])

        # Verify memory usage stays reasonable
        assert batch_data.shape[0] <= 50, f"Batch size should not exceed 50, got {batch_data.shape[0]}"

    assert processed_samples == 500, f"Should process all 500 samples, got {processed_samples}"
    assert max_batch_size == 50, f"Largest batch should be 50, got {max_batch_size}"
    print("✅ Memory efficiency works correctly")

    # Test 5: Multi-Epoch Training Simulation
    print("\n5. Multi-Epoch Training Test:")

    # Simulate multiple epochs
    dataset = SimpleDataset(size=60, num_features=6, num_classes=3)
    loader = DataLoader(dataset, batch_size=20, shuffle=True)

    for epoch in range(3):
        epoch_samples = 0
        for batch_data, batch_labels in loader:
            epoch_samples += batch_data.shape[0]

            # Verify shapes remain consistent across epochs
            assert batch_data.shape[1] == 6, f"Features should be 6 in epoch {epoch}"
            assert len(batch_labels.shape) == 1, f"Labels should be 1D in epoch {epoch}"

        assert epoch_samples == 60, f"Should process 60 samples in epoch {epoch}, got {epoch_samples}"

    print("✅ Multi-epoch training works correctly")

    print("\n🎉 Integration test passed! Your data pipeline works correctly for:")
    print("  • Training with shuffled batches")
    print("  • Validation with deterministic order")
    print("  • Different dataset configurations")
    print("  • Memory-efficient processing")
    print("  • Multi-epoch training scenarios")

except Exception as e:
    print(f"❌ Integration test failed: {e}")
    raise

print("📈 Final Progress: Complete data pipeline ready for production ML!")

# %% [markdown]
"""
## 🎯 Module Summary

Congratulations! You've successfully implemented the core components of data loading systems:

### What You've Accomplished
- ✅ **Dataset Abstract Class**: The foundation interface for all data loading
- ✅ **DataLoader Implementation**: Efficient batching and iteration over datasets
- ✅ **SimpleDataset Example**: Concrete implementation showing the Dataset pattern
- ✅ **Complete Data Pipeline**: End-to-end data loading for neural network training
- ✅ **Systems Thinking**: Understanding memory efficiency, batching, and I/O optimization

### Key Concepts You've Learned
- **Dataset pattern**: Abstract interface for consistent data access
- **DataLoader pattern**: Efficient batching and iteration for training
- **Memory efficiency**: Loading data on-demand rather than all at once
- **Batching strategies**: Grouping samples for efficient GPU computation
- **Shuffling**: Randomizing sample order each epoch so batches are not biased by the dataset's original ordering
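
The batching and shuffling ideas above can be sketched without the DataLoader you built. This is a minimal illustration in plain Python, not the module's implementation; `iter_batches` is a hypothetical helper that only produces sample indices.

```python
import random

def iter_batches(num_samples, batch_size, shuffle=True, seed=None):
    # Hypothetical helper: yields one list of sample indices per batch.
    indices = list(range(num_samples))
    if shuffle:
        # A seeded generator keeps the order repeatable for a given seed.
        random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        # The final slice may be shorter than batch_size.
        yield indices[start:start + batch_size]

batches = list(iter_batches(num_samples=10, batch_size=4, shuffle=True, seed=0))
batch_sizes = [len(batch) for batch in batches]
seen = sorted(index for batch in batches for index in batch)
```

Here `batch_sizes` is `[4, 4, 2]`, and `seen` contains every index from 0 to 9 exactly once: shuffling changes the order of samples, never which samples are visited.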

### Mathematical Foundations
- **Batch processing**: Vectorized operations on multiple samples
- **Memory management**: Handling datasets larger than available RAM
- **I/O optimization**: Minimizing disk reads and memory allocation
- **Stochastic sampling**: Random shuffling for better generalization

### Real-World Applications
- **Computer vision**: Loading image datasets like CIFAR-10, ImageNet
- **Natural language processing**: Loading text datasets with tokenization
- **Tabular data**: Loading CSV files and database records
- **Audio processing**: Loading and preprocessing audio files
- **Time series**: Loading sequential data with proper windowing
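
As one example of what windowing can mean for sequential data, the sketch below slices a series into overlapping windows and pairs each window with the value that follows it. `sliding_windows` is a hypothetical helper, and next-value prediction is only one possible setup.

```python
def sliding_windows(series, window, stride=1):
    # Hypothetical helper: each window is a contiguous slice of the series.
    last_start = len(series) - window
    return [series[i:i + window] for i in range(0, last_start + 1, stride)]

series = [10, 11, 12, 13, 14, 15]
windows = sliding_windows(series, window=3, stride=1)

# Pair each window with the value right after it; the last window has no successor.
pairs = [(w, series[i + 3]) for i, w in enumerate(windows[:-1])]
```

Windows like these should usually be built before any shuffling, so that the order within each window is preserved.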

### Connection to Production Systems
- **PyTorch**: Your Dataset and DataLoader mirror `torch.utils.data`
- **TensorFlow**: Similar concepts in `tf.data.Dataset`
- **JAX**: Custom data loading with efficient batching
- **MLOps**: Data pipelines are critical for production ML systems

### Performance Characteristics
- **Memory efficiency**: O(batch_size) working memory per step when samples are loaded on demand, rather than O(dataset_size)
- **I/O optimization**: Load data on-demand, not all at once
- **Batching efficiency**: Vectorized operations on GPU
- **Shuffling overhead**: Permuting indices costs O(dataset_size) once per epoch, which is small next to the cost of training on the data
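
The memory claim depends on the dataset producing samples on demand. A minimal sketch, assuming a hypothetical `LazySquares` dataset and `lazy_batches` helper (neither is part of this module): only the samples of the current batch exist in memory at any time, even though the dataset reports a million entries.

```python
class LazySquares:
    # Hypothetical dataset: computes each sample when asked instead of storing them all.
    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        features = [float(index), float(index) ** 2]
        label = index % 2
        return features, label

def lazy_batches(dataset, batch_size):
    # Generator: builds one batch at a time, so earlier batches can be freed.
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        samples = [dataset[i] for i in range(start, stop)]
        yield [s[0] for s in samples], [s[1] for s in samples]

dataset = LazySquares(size=1_000_000)
first_features, first_labels = next(lazy_batches(dataset, batch_size=4))
```

A dataset that keeps all of its samples in an in-memory array still occupies O(dataset_size) memory; batching then bounds only the size of each step's working set.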

### Data Engineering Best Practices
- **Reproducibility**: Deterministic data generation and shuffling
- **Scalability**: Handle datasets of any size
- **Flexibility**: Easy to switch between different data sources
- **Testability**: Simple interfaces for unit testing
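
Reproducible shuffling can be sketched by deriving each epoch's order from a fixed base seed. This is an illustration under stated assumptions, not how this module's DataLoader shuffles; `epoch_order` and the `base_seed + epoch` scheme are hypothetical.

```python
import random

def epoch_order(num_samples, base_seed, epoch):
    # Hypothetical helper: one independent, repeatable permutation per epoch.
    rng = random.Random(base_seed + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

# Two "runs" with the same base seed produce identical per-epoch orders.
run_a = [epoch_order(8, base_seed=42, epoch=e) for e in range(3)]
run_b = [epoch_order(8, base_seed=42, epoch=e) for e in range(3)]
same_across_runs = run_a == run_b
```

Because each epoch uses its own seed, the order typically differs from epoch to epoch while staying identical across reruns, which makes training curves easier to compare and bugs easier to reproduce.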

### Next Steps
1. **Export your code**: Use NBDev to export to the `tinytorch` package
2. **Test your implementation**: Run the complete test suite
3. **Build data pipelines**:
   ```python
   # Assumes SimpleDataset is exported alongside Dataset and DataLoader
   from tinytorch.core.dataloader import Dataset, DataLoader, SimpleDataset
   from tinytorch.core.tensor import Tensor

   # Create dataset
   dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)

   # Create dataloader
   loader = DataLoader(dataset, batch_size=32, shuffle=True)

   # Training loop
   num_epochs = 3  # example value
   for epoch in range(num_epochs):
       for batch_data, batch_labels in loader:
           # Train model on this batch
           pass
   ```
4. **Explore advanced topics**: Data augmentation, distributed loading, streaming datasets!

**Ready for the next challenge?** Let's build training loops and optimizers to complete the ML pipeline!
"""