
{
"cells": [
{
"cell_type": "markdown",
"id": "b88708de",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"# DataLoader - Data Loading and Preprocessing\n",
"\n",
"Welcome to the DataLoader module! This is where you'll learn how to efficiently load, process, and manage data for machine learning systems.\n",
"\n",
"## Learning Goals\n",
"- Understand data pipelines as the foundation of ML systems\n",
"- Implement efficient data loading with memory management and batching\n",
"- Build reusable dataset abstractions for different data types\n",
"- Master the Dataset and DataLoader pattern used in all ML frameworks\n",
"- Learn systems thinking for data engineering and I/O optimization\n",
"\n",
"## Build → Use → Reflect\n",
"1. **Build**: Create dataset classes and data loaders from scratch\n",
"2. **Use**: Load real datasets and feed them to neural networks\n",
"3. **Reflect**: How data engineering affects system performance and scalability\n",
"\n",
"## What You'll Learn\n",
"By the end of this module, you'll understand:\n",
"- The Dataset pattern for consistent data access\n",
"- How DataLoaders enable efficient batch processing\n",
"- Why batching and shuffling are crucial for ML\n",
"- How to handle datasets larger than memory\n",
"- The connection between data engineering and model performance"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a1a46d2",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "dataloader-imports",
"locked": false,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"#| default_exp core.dataloader\n",
"\n",
"#| export\n",
"import numpy as np\n",
"import sys\n",
"import os\n",
"import pickle\n",
"import struct\n",
"from typing import List, Tuple, Optional, Union, Iterator\n",
"import matplotlib.pyplot as plt\n",
"import urllib.request\n",
"import tarfile\n",
"\n",
"# Import our building blocks - try package first, then local modules\n",
"try:\n",
" from tinytorch.core.tensor import Tensor\n",
"except ImportError:\n",
" # For development, import from local modules\n",
" sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n",
" from tensor_dev import Tensor"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9fc27557",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "dataloader-setup",
"locked": false,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"#| hide\n",
"#| export\n",
"def _should_show_plots():\n",
" \"\"\"Check if we should show plots (disable during testing)\"\"\"\n",
" # Check multiple conditions that indicate we're in test mode\n",
" is_pytest = (\n",
" 'pytest' in sys.modules or\n",
" 'test' in sys.argv or\n",
" os.environ.get('PYTEST_CURRENT_TEST') is not None or\n",
" any('test' in arg for arg in sys.argv) or\n",
" any('pytest' in arg for arg in sys.argv)\n",
" )\n",
" \n",
" # Show plots in development mode (when not in test mode)\n",
" return not is_pytest"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f37cacaf",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "dataloader-welcome",
"locked": false,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"print(\"🔥 TinyTorch DataLoader Module\")\n",
"print(f\"NumPy version: {np.__version__}\")\n",
"print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n",
"print(\"Ready to build data pipelines!\")"
]
},
{
"cell_type": "markdown",
"id": "decfa343",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 📦 Where This Code Lives in the Final Package\n",
"\n",
"**Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py` \n",
"**Building Side:** Code exports to `tinytorch.core.dataloader`\n",
"\n",
"```python\n",
"# Final package structure:\n",
"from tinytorch.core.dataloader import Dataset, DataLoader # Data loading utilities!\n",
"from tinytorch.core.tensor import Tensor # Foundation\n",
"from tinytorch.core.networks import Sequential # Models to train\n",
"```\n",
"\n",
"**Why this matters:**\n",
"- **Learning:** Focused modules for deep understanding of data pipelines\n",
"- **Production:** Proper organization like PyTorch's `torch.utils.data`\n",
"- **Consistency:** All data loading utilities live together in `core.dataloader`\n",
"- **Integration:** Works seamlessly with tensors and networks"
]
},
{
"cell_type": "markdown",
"id": "daf1136d",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Step 1: Understanding Data Pipelines\n",
"\n",
"### What are Data Pipelines?\n",
"**Data pipelines** are the systems that efficiently move data from storage to your model. They're the foundation of all machine learning systems.\n",
"\n",
"### The Data Pipeline Equation\n",
"```\n",
"Raw Data → Load → Transform → Batch → Model → Predictions\n",
"```\n",
"\n",
"### Why Data Pipelines Matter\n",
"- **Performance**: Efficient loading prevents GPU starvation\n",
"- **Scalability**: Handle datasets larger than memory\n",
"- **Consistency**: Reproducible data processing\n",
"- **Flexibility**: Easy to switch between datasets\n",
"\n",
"### Real-World Challenges\n",
"- **Memory constraints**: Datasets often exceed available RAM\n",
"- **I/O bottlenecks**: Disk access is much slower than computation\n",
"- **Batch processing**: Neural networks need batched data for efficiency\n",
"- **Shuffling**: Random order prevents overfitting\n",
"\n",
"### Systems Thinking\n",
"- **Memory efficiency**: Handle datasets larger than RAM\n",
"- **I/O optimization**: Read from disk efficiently\n",
"- **Batching strategies**: Trade-offs between memory and speed\n",
"- **Caching**: When to cache vs recompute\n",
"\n",
"### Visual Intuition\n",
"```\n",
"Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...]\n",
"Load: [Tensor(32x32x3), Tensor(32x32x3), Tensor(32x32x3), ...]\n",
"Batch: [Tensor(32, 32, 32, 3)] # 32 images at once\n",
"Model: Process batch efficiently\n",
"```\n",
"\n",
"Let's start by building the most fundamental component: **Dataset**."
]
},
{
"cell_type": "markdown",
"id": "1881387d",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Step 2: Building the Dataset Interface\n",
"\n",
"### What is a Dataset?\n",
"A **Dataset** is an abstract interface that provides consistent access to data. It's the foundation of all data loading systems.\n",
"\n",
"### Why Abstract Interfaces Matter\n",
"- **Consistency**: Same interface for all data types\n",
"- **Flexibility**: Easy to switch between datasets\n",
"- **Testability**: Easy to create test datasets\n",
"- **Extensibility**: Easy to add new data sources\n",
"\n",
"### The Dataset Pattern\n",
"```python\n",
"class Dataset:\n",
" def __getitem__(self, index): # Get single sample\n",
" return data, label\n",
" \n",
" def __len__(self): # Get dataset size\n",
" return total_samples\n",
"```\n",
"\n",
"### Real-World Usage\n",
"- **Computer vision**: ImageNet, CIFAR-10, custom image datasets\n",
"- **NLP**: Text datasets, tokenized sequences\n",
"- **Audio**: Audio files, spectrograms\n",
"- **Time series**: Sequential data with proper windowing\n",
"\n",
"Let's implement the Dataset interface!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f02bc42c",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "dataset-class",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [],
"source": [
"#| export\n",
"class Dataset:\n",
" \"\"\"\n",
" Base Dataset class: Abstract interface for all datasets.\n",
" \n",
" The fundamental abstraction for data loading in TinyTorch.\n",
" Students implement concrete datasets by inheriting from this class.\n",
" \"\"\"\n",
" \n",
" def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
" \"\"\"\n",
" Get a single sample and label by index.\n",
" \n",
" Args:\n",
" index: Index of the sample to retrieve\n",
" \n",
" Returns:\n",
" Tuple of (data, label) tensors\n",
" \n",
" TODO: Implement abstract method for getting samples.\n",
" \n",
" APPROACH:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return a tuple of (data, label) tensors\n",
" 3. Data should be the input features, label should be the target\n",
" \n",
" EXAMPLE:\n",
" dataset[0] should return (Tensor(image_data), Tensor(label))\n",
" \n",
" HINTS:\n",
" - This is an abstract method that subclasses must override\n",
" - Always return a tuple of (data, label) tensors\n",
" - Data contains the input features, label contains the target\n",
" \"\"\"\n",
" ### BEGIN SOLUTION\n",
" # This is an abstract method - subclasses must implement it\n",
" raise NotImplementedError(\"Subclasses must implement __getitem__\")\n",
" ### END SOLUTION\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the total number of samples in the dataset.\n",
" \n",
" TODO: Implement abstract method for getting dataset size.\n",
" \n",
" APPROACH:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return the total number of samples in the dataset\n",
" \n",
" EXAMPLE:\n",
" len(dataset) should return 50000 for CIFAR-10 training set\n",
" \n",
" HINTS:\n",
" - This is an abstract method that subclasses must override\n",
" - Return an integer representing the total number of samples\n",
" \"\"\"\n",
" ### BEGIN SOLUTION\n",
" # This is an abstract method - subclasses must implement it\n",
" raise NotImplementedError(\"Subclasses must implement __len__\")\n",
" ### END SOLUTION\n",
" \n",
" def get_sample_shape(self) -> Tuple[int, ...]:\n",
" \"\"\"\n",
" Get the shape of a single data sample.\n",
" \n",
" TODO: Implement method to get sample shape.\n",
" \n",
" APPROACH:\n",
" 1. Get the first sample using self[0]\n",
" 2. Extract the data part (first element of tuple)\n",
" 3. Return the shape of the data tensor\n",
" \n",
" EXAMPLE:\n",
" For CIFAR-10: returns (3, 32, 32) for RGB images\n",
" \n",
" HINTS:\n",
" - Use self[0] to get the first sample\n",
" - Extract data from the (data, label) tuple\n",
" - Return data.shape\n",
" \"\"\"\n",
" ### BEGIN SOLUTION\n",
" # Get the first sample to determine shape\n",
" data, _ = self[0]\n",
" return data.shape\n",
" ### END SOLUTION\n",
" \n",
" def get_num_classes(self) -> int:\n",
" \"\"\"\n",
" Get the number of classes in the dataset.\n",
" \n",
" TODO: Implement abstract method for getting number of classes.\n",
" \n",
" APPROACH:\n",
" 1. This is an abstract method - subclasses will implement it\n",
" 2. Return the number of unique classes in the dataset\n",
" \n",
" EXAMPLE:\n",
" For CIFAR-10: returns 10 (classes 0-9)\n",
" \n",
" HINTS:\n",
" - This is an abstract method that subclasses must override\n",
" - Return the number of unique classes/categories\n",
" \"\"\"\n",
" # This is an abstract method - subclasses must implement it\n",
" raise NotImplementedError(\"Subclasses must implement get_num_classes\")"
]
},
{
"cell_type": "markdown",
"id": "fe072a6b",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 Unit Test: Dataset Interface\n",
"\n",
"Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset.\n",
"\n",
"**This is a unit test** - it tests the Dataset interface pattern in isolation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f5dbcde5",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-dataset-interface-immediate",
"locked": true,
"points": 5,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Test Dataset interface with a simple implementation\n",
"print(\"🔬 Unit Test: Dataset Interface...\")\n",
"\n",
"# Create a minimal test dataset\n",
"class TestDataset(Dataset):\n",
" def __init__(self, size=5):\n",
" self.size = size\n",
" \n",
" def __getitem__(self, index):\n",
" # Simple test data: features are [index, index*2], label is index % 2\n",
" data = Tensor([index, index * 2])\n",
" label = Tensor([index % 2])\n",
" return data, label\n",
" \n",
" def __len__(self):\n",
" return self.size\n",
" \n",
" def get_num_classes(self):\n",
" return 2\n",
"\n",
"# Test the interface\n",
"try:\n",
" test_dataset = TestDataset(size=5)\n",
" print(f\"Dataset created with size: {len(test_dataset)}\")\n",
" \n",
" # Test __getitem__\n",
" data, label = test_dataset[0]\n",
" print(f\"Sample 0: data={data}, label={label}\")\n",
" assert isinstance(data, Tensor), \"Data should be a Tensor\"\n",
" assert isinstance(label, Tensor), \"Label should be a Tensor\"\n",
" print(\"✅ Dataset __getitem__ works correctly\")\n",
" \n",
" # Test __len__\n",
" assert len(test_dataset) == 5, f\"Dataset length should be 5, got {len(test_dataset)}\"\n",
" print(\"✅ Dataset __len__ works correctly\")\n",
" \n",
" # Test get_num_classes\n",
" assert test_dataset.get_num_classes() == 2, f\"Should have 2 classes, got {test_dataset.get_num_classes()}\"\n",
" print(\"✅ Dataset get_num_classes works correctly\")\n",
" \n",
" # Test get_sample_shape\n",
" sample_shape = test_dataset.get_sample_shape()\n",
" assert sample_shape == (2,), f\"Sample shape should be (2,), got {sample_shape}\"\n",
" print(\"✅ Dataset get_sample_shape works correctly\")\n",
" \n",
" # Test multiple samples\n",
" for i in range(3):\n",
" data, label = test_dataset[i]\n",
" expected_data = [i, i * 2]\n",
" expected_label = [i % 2]\n",
" assert np.array_equal(data.data, expected_data), f\"Data mismatch at index {i}\"\n",
" assert np.array_equal(label.data, expected_label), f\"Label mismatch at index {i}\"\n",
" print(\"✅ Dataset produces correct data for multiple samples\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ Dataset interface test failed: {e}\")\n",
" raise\n",
"\n",
"# Show the dataset pattern\n",
"print(\"🎯 Dataset interface pattern:\")\n",
"print(\" __getitem__: Returns (data, label) tuple\")\n",
"print(\" __len__: Returns dataset size\")\n",
"print(\" get_num_classes: Returns number of classes\")\n",
"print(\" get_sample_shape: Returns shape of data samples\")\n",
"print(\"📈 Progress: Dataset interface ✓\")"
]
},
{
"cell_type": "markdown",
"id": "84c87935",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Step 3: Building the DataLoader\n",
"\n",
"### What is a DataLoader?\n",
"A **DataLoader** efficiently batches and iterates through datasets. It's the bridge between individual samples and the batched data that neural networks expect.\n",
"\n",
"### Why DataLoaders Matter\n",
"- **Batching**: Groups samples for efficient GPU computation\n",
"- **Shuffling**: Randomizes data order to prevent overfitting\n",
"- **Memory efficiency**: Loads data on-demand rather than all at once\n",
"- **Iteration**: Provides clean interface for training loops\n",
"\n",
"### The DataLoader Pattern\n",
"```python\n",
"DataLoader(dataset, batch_size=32, shuffle=True)\n",
"for batch_data, batch_labels in dataloader:\n",
" # batch_data.shape: (32, ...)\n",
" # batch_labels.shape: (32,)\n",
" # Train on batch\n",
"```\n",
"\n",
"### Real-World Applications\n",
"- **Training loops**: Feed batches to neural networks\n",
"- **Validation**: Evaluate models on held-out data\n",
"- **Inference**: Process large datasets efficiently\n",
"- **Data analysis**: Explore datasets systematically\n",
"\n",
"### Systems Thinking\n",
"- **Batch size**: Trade-off between memory and speed\n",
"- **Shuffling**: Prevents overfitting to data order\n",
"- **Iteration**: Efficient looping through data\n",
"- **Memory**: Manage large datasets that don't fit in RAM"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0918d8cf",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "dataloader-class",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [],
"source": [
"#| export\n",
"class DataLoader:\n",
" \"\"\"\n",
" DataLoader: Efficiently batch and iterate through datasets.\n",
" \n",
" Provides batching, shuffling, and efficient iteration over datasets.\n",
" Essential for training neural networks efficiently.\n",
" \"\"\"\n",
" \n",
" def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):\n",
" \"\"\"\n",
" Initialize DataLoader.\n",
" \n",
" Args:\n",
" dataset: Dataset to load from\n",
" batch_size: Number of samples per batch\n",
" shuffle: Whether to shuffle data each epoch\n",
" \n",
" TODO: Store configuration and dataset.\n",
" \n",
" APPROACH:\n",
" 1. Store dataset as self.dataset\n",
" 2. Store batch_size as self.batch_size\n",
" 3. Store shuffle as self.shuffle\n",
" \n",
" EXAMPLE:\n",
" DataLoader(dataset, batch_size=32, shuffle=True)\n",
" \n",
" HINTS:\n",
" - Store all parameters as instance variables\n",
" - These will be used in __iter__ for batching\n",
" \"\"\"\n",
" # Input validation\n",
" if dataset is None:\n",
" raise TypeError(\"Dataset cannot be None\")\n",
" if not isinstance(batch_size, int) or batch_size <= 0:\n",
" raise ValueError(f\"Batch size must be a positive integer, got {batch_size}\")\n",
" \n",
" self.dataset = dataset\n",
" self.batch_size = batch_size\n",
" self.shuffle = shuffle\n",
" \n",
" def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:\n",
" \"\"\"\n",
" Iterate through dataset in batches.\n",
" \n",
" Returns:\n",
" Iterator yielding (batch_data, batch_labels) tuples\n",
" \n",
" TODO: Implement batching and shuffling logic.\n",
" \n",
" APPROACH:\n",
" 1. Create indices list: list(range(len(dataset)))\n",
" 2. Shuffle indices if self.shuffle is True\n",
" 3. Loop through indices in batch_size chunks\n",
" 4. For each batch: collect samples, stack them, yield batch\n",
" \n",
" EXAMPLE:\n",
" for batch_data, batch_labels in dataloader:\n",
" # batch_data.shape: (batch_size, ...)\n",
" # batch_labels.shape: (batch_size,)\n",
" \n",
" HINTS:\n",
" - Use list(range(len(self.dataset))) for indices\n",
" - Use np.random.shuffle() if self.shuffle is True\n",
" - Loop in chunks of self.batch_size\n",
" - Collect samples and stack with np.stack()\n",
" \"\"\"\n",
" # Create indices for all samples\n",
" indices = list(range(len(self.dataset)))\n",
" \n",
" # Shuffle if requested\n",
" if self.shuffle:\n",
" np.random.shuffle(indices)\n",
" \n",
" # Iterate through indices in batches\n",
" for i in range(0, len(indices), self.batch_size):\n",
" batch_indices = indices[i:i + self.batch_size]\n",
" \n",
" # Collect samples for this batch\n",
" batch_data = []\n",
" batch_labels = []\n",
" \n",
" for idx in batch_indices:\n",
" data, label = self.dataset[idx]\n",
" batch_data.append(data.data)\n",
" batch_labels.append(label.data)\n",
" \n",
" # Stack into batch tensors\n",
" batch_data_array = np.stack(batch_data, axis=0)\n",
" batch_labels_array = np.stack(batch_labels, axis=0)\n",
" \n",
" yield Tensor(batch_data_array), Tensor(batch_labels_array)\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the number of batches per epoch.\n",
" \n",
" TODO: Calculate number of batches.\n",
" \n",
" APPROACH:\n",
" 1. Get dataset size: len(self.dataset)\n",
" 2. Divide by batch_size and round up\n",
" 3. Use ceiling division: (n + batch_size - 1) // batch_size\n",
" \n",
" EXAMPLE:\n",
" Dataset size 100, batch size 32 → 4 batches\n",
" \n",
" HINTS:\n",
" - Use len(self.dataset) for dataset size\n",
" - Use ceiling division for exact batch count\n",
" - Formula: (dataset_size + batch_size - 1) // batch_size\n",
" \"\"\"\n",
" # Calculate number of batches using ceiling division\n",
" dataset_size = len(self.dataset)\n",
" return (dataset_size + self.batch_size - 1) // self.batch_size"
]
},
{
"cell_type": "markdown",
"id": "46082fb1",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 Unit Test: DataLoader\n",
"\n",
"Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks.\n",
"\n",
"**This is a unit test** - it tests the DataLoader class in isolation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9744517c",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-dataloader-immediate",
"locked": true,
"points": 10,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Test DataLoader immediately after implementation\n",
"print(\"🔬 Unit Test: DataLoader...\")\n",
"\n",
"# Use the test dataset from before\n",
"class TestDataset(Dataset):\n",
" def __init__(self, size=10):\n",
" self.size = size\n",
" \n",
" def __getitem__(self, index):\n",
" data = Tensor([index, index * 2])\n",
" label = Tensor([index % 3]) # 3 classes\n",
" return data, label\n",
" \n",
" def __len__(self):\n",
" return self.size\n",
" \n",
" def get_num_classes(self):\n",
" return 3\n",
"\n",
"# Test basic DataLoader functionality\n",
"try:\n",
" dataset = TestDataset(size=10)\n",
" dataloader = DataLoader(dataset, batch_size=3, shuffle=False)\n",
" \n",
" print(f\"DataLoader created: batch_size={dataloader.batch_size}, shuffle={dataloader.shuffle}\")\n",
" print(f\"Number of batches: {len(dataloader)}\")\n",
" \n",
" # Test __len__\n",
" expected_batches = (10 + 3 - 1) // 3 # Ceiling division: 4 batches\n",
" assert len(dataloader) == expected_batches, f\"Should have {expected_batches} batches, got {len(dataloader)}\"\n",
" print(\"✅ DataLoader __len__ works correctly\")\n",
" \n",
" # Test iteration\n",
" batch_count = 0\n",
" total_samples = 0\n",
" \n",
" for batch_data, batch_labels in dataloader:\n",
" batch_count += 1\n",
" batch_size = batch_data.shape[0]\n",
" total_samples += batch_size\n",
" \n",
" print(f\"Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}\")\n",
" \n",
" # Verify batch dimensions\n",
" assert len(batch_data.shape) == 2, f\"Batch data should be 2D, got {batch_data.shape}\"\n",
" assert len(batch_labels.shape) == 2, f\"Batch labels should be 2D, got {batch_labels.shape}\"\n",
" assert batch_data.shape[1] == 2, f\"Each sample should have 2 features, got {batch_data.shape[1]}\"\n",
" assert batch_labels.shape[1] == 1, f\"Each label should have 1 element, got {batch_labels.shape[1]}\"\n",
" \n",
" assert batch_count == expected_batches, f\"Should iterate {expected_batches} times, got {batch_count}\"\n",
" assert total_samples == 10, f\"Should process 10 total samples, got {total_samples}\"\n",
" print(\"✅ DataLoader iteration works correctly\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ DataLoader test failed: {e}\")\n",
" raise\n",
"\n",
"# Test shuffling\n",
"try:\n",
" dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True)\n",
" dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False)\n",
" \n",
" # Get first batch from each\n",
" batch1_shuffle = next(iter(dataloader_shuffle))\n",
" batch1_no_shuffle = next(iter(dataloader_no_shuffle))\n",
" \n",
" print(\"✅ DataLoader shuffling parameter works\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ DataLoader shuffling test failed: {e}\")\n",
" raise\n",
"\n",
"# Test different batch sizes\n",
"try:\n",
" small_loader = DataLoader(dataset, batch_size=2, shuffle=False)\n",
" large_loader = DataLoader(dataset, batch_size=8, shuffle=False)\n",
" \n",
" assert len(small_loader) == 5, f\"Small loader should have 5 batches, got {len(small_loader)}\"\n",
" assert len(large_loader) == 2, f\"Large loader should have 2 batches, got {len(large_loader)}\"\n",
" print(\"✅ DataLoader handles different batch sizes correctly\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ DataLoader batch size test failed: {e}\")\n",
" raise\n",
"\n",
"# Show the DataLoader behavior\n",
"print(\"🎯 DataLoader behavior:\")\n",
"print(\" Batches data for efficient processing\")\n",
"print(\" Handles shuffling and iteration\")\n",
"print(\" Provides clean interface for training loops\")\n",
"print(\"📈 Progress: Dataset interface ✓, DataLoader ✓\")"
]
},
{
"cell_type": "markdown",
"id": "ee45269f",
"metadata": {
"cell_marker": "\"\"\"",
"lines_to_next_cell": 1
},
"source": [
"## Step 4: Creating a Simple Dataset Example\n",
"\n",
"### Why We Need Concrete Examples\n",
"Abstract classes are great for interfaces, but we need concrete implementations to understand how they work. Let's create a simple dataset for testing.\n",
"\n",
"### Design Principles\n",
"- **Simple**: Easy to understand and debug\n",
"- **Configurable**: Adjustable size and properties\n",
"- **Predictable**: Deterministic data for testing\n",
"- **Educational**: Shows the Dataset pattern clearly\n",
"\n",
"### Real-World Connection\n",
"This pattern is used for:\n",
"- **CIFAR-10**: 32x32 RGB images with 10 classes\n",
"- **ImageNet**: High-resolution images with 1000 classes\n",
"- **MNIST**: 28x28 grayscale digits with 10 classes\n",
"- **Custom datasets**: Your own data following this pattern"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4c773ba",
"metadata": {
"lines_to_next_cell": 1,
"nbgrader": {
"grade": false,
"grade_id": "simple-dataset",
"locked": false,
"schema_version": 3,
"solution": true,
"task": false
}
},
"outputs": [],
"source": [
"#| export\n",
"class SimpleDataset(Dataset):\n",
" \"\"\"\n",
" Simple dataset for testing and demonstration.\n",
" \n",
" Generates synthetic data with configurable size and properties.\n",
" Perfect for understanding the Dataset pattern.\n",
" \"\"\"\n",
" \n",
" def __init__(self, size: int = 100, num_features: int = 4, num_classes: int = 3):\n",
" \"\"\"\n",
" Initialize SimpleDataset.\n",
" \n",
" Args:\n",
" size: Number of samples in the dataset\n",
" num_features: Number of features per sample\n",
" num_classes: Number of classes\n",
" \n",
" TODO: Initialize the dataset with synthetic data.\n",
" \n",
" APPROACH:\n",
" 1. Store the configuration parameters\n",
" 2. Generate synthetic data and labels\n",
" 3. Make data deterministic for testing\n",
" \n",
" EXAMPLE:\n",
" SimpleDataset(size=100, num_features=4, num_classes=3)\n",
" creates 100 samples with 4 features each, 3 classes\n",
" \n",
" HINTS:\n",
" - Store size, num_features, num_classes as instance variables\n",
" - Use np.random.seed() for reproducible data\n",
" - Generate random data with np.random.randn()\n",
" - Generate random labels with np.random.randint()\n",
" \"\"\"\n",
" self.size = size\n",
" self.num_features = num_features\n",
" self.num_classes = num_classes\n",
" \n",
" # Generate synthetic data (deterministic for testing)\n",
" np.random.seed(42) # For reproducible data\n",
" self.data = np.random.randn(size, num_features).astype(np.float32)\n",
" self.labels = np.random.randint(0, num_classes, size=size)\n",
" \n",
" def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n",
" \"\"\"\n",
" Get a sample by index.\n",
" \n",
" Args:\n",
" index: Index of the sample\n",
" \n",
" Returns:\n",
" Tuple of (data, label) tensors\n",
" \n",
" TODO: Return the sample at the given index.\n",
" \n",
" APPROACH:\n",
" 1. Get data sample from self.data[index]\n",
" 2. Get label from self.labels[index]\n",
" 3. Convert both to Tensors and return as tuple\n",
" \n",
" EXAMPLE:\n",
" dataset[0] returns (Tensor(features), Tensor(label))\n",
" \n",
" HINTS:\n",
" - Use self.data[index] for the data\n",
" - Use self.labels[index] for the label\n",
" - Convert to Tensors: Tensor(data), Tensor(label)\n",
" \"\"\"\n",
" data = self.data[index]\n",
" label = self.labels[index]\n",
" return Tensor(data), Tensor(label)\n",
" \n",
" def __len__(self) -> int:\n",
" \"\"\"\n",
" Get the dataset size.\n",
" \n",
" TODO: Return the dataset size.\n",
" \n",
" APPROACH:\n",
" 1. Return self.size\n",
" \n",
" EXAMPLE:\n",
" len(dataset) returns 100 for dataset with 100 samples\n",
" \n",
" HINTS:\n",
" - Simply return self.size\n",
" \"\"\"\n",
" return self.size\n",
" \n",
" def get_num_classes(self) -> int:\n",
" \"\"\"\n",
" Get the number of classes.\n",
" \n",
" TODO: Return the number of classes.\n",
" \n",
" APPROACH:\n",
" 1. Return self.num_classes\n",
" \n",
" EXAMPLE:\n",
" dataset.get_num_classes() returns 3 for 3-class dataset\n",
" \n",
" HINTS:\n",
" - Simply return self.num_classes\n",
" \"\"\"\n",
" return self.num_classes"
]
},
{
"cell_type": "markdown",
"id": "e6a029be",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"### 🧪 Unit Test: SimpleDataset\n",
"\n",
"Let's test your SimpleDataset implementation! This concrete example shows how the Dataset pattern works.\n",
"\n",
"**This is a unit test** - it tests the SimpleDataset class in isolation."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0f3f5ed5",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-simple-dataset-immediate",
"locked": true,
"points": 10,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Test SimpleDataset immediately after implementation\n",
"print(\"🔬 Unit Test: SimpleDataset...\")\n",
"\n",
"try:\n",
" # Create dataset\n",
" dataset = SimpleDataset(size=20, num_features=5, num_classes=4)\n",
" \n",
" print(f\"Dataset created: size={len(dataset)}, features={dataset.num_features}, classes={dataset.get_num_classes()}\")\n",
" \n",
" # Test basic properties\n",
" assert len(dataset) == 20, f\"Dataset length should be 20, got {len(dataset)}\"\n",
" assert dataset.get_num_classes() == 4, f\"Should have 4 classes, got {dataset.get_num_classes()}\"\n",
" print(\"✅ SimpleDataset basic properties work correctly\")\n",
" \n",
" # Test sample access\n",
" data, label = dataset[0]\n",
" assert isinstance(data, Tensor), \"Data should be a Tensor\"\n",
" assert isinstance(label, Tensor), \"Label should be a Tensor\"\n",
" assert data.shape == (5,), f\"Data shape should be (5,), got {data.shape}\"\n",
" assert label.shape == (), f\"Label shape should be (), got {label.shape}\"\n",
" print(\"✅ SimpleDataset sample access works correctly\")\n",
" \n",
" # Test sample shape\n",
" sample_shape = dataset.get_sample_shape()\n",
" assert sample_shape == (5,), f\"Sample shape should be (5,), got {sample_shape}\"\n",
" print(\"✅ SimpleDataset get_sample_shape works correctly\")\n",
" \n",
" # Test multiple samples\n",
" for i in range(5):\n",
" data, label = dataset[i]\n",
" assert data.shape == (5,), f\"Data shape should be (5,) for sample {i}, got {data.shape}\"\n",
" assert 0 <= label.data < 4, f\"Label should be in [0, 3] for sample {i}, got {label.data}\"\n",
" print(\"✅ SimpleDataset multiple samples work correctly\")\n",
" \n",
" # Test deterministic data (same seed should give same data)\n",
" dataset2 = SimpleDataset(size=20, num_features=5, num_classes=4)\n",
" data1, label1 = dataset[0]\n",
" data2, label2 = dataset2[0]\n",
" assert np.array_equal(data1.data, data2.data), \"Data should be deterministic\"\n",
" assert np.array_equal(label1.data, label2.data), \"Labels should be deterministic\"\n",
" print(\"✅ SimpleDataset data is deterministic\")\n",
"\n",
"except Exception as e:\n",
" print(f\"❌ SimpleDataset test failed: {e}\")\n",
"\n",
"# Show the SimpleDataset behavior\n",
"print(\"🎯 SimpleDataset behavior:\")\n",
"print(\" Generates synthetic data for testing\")\n",
"print(\" Implements complete Dataset interface\")\n",
"print(\" Provides deterministic data for reproducibility\")\n",
"print(\"📈 Progress: Dataset interface ✓, DataLoader ✓, SimpleDataset ✓\")"
]
},
{
"cell_type": "markdown",
"id": "3b5a161c",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## Step 5: Comprehensive Test - Complete Data Pipeline\n",
"\n",
"### Real-World Data Pipeline Applications\n",
"Let's test our data loading components in realistic scenarios:\n",
"\n",
"#### **Training Pipeline**\n",
"```python\n",
"# The standard ML training pattern\n",
"dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)\n",
"dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
"\n",
"for epoch in range(num_epochs):\n",
" for batch_data, batch_labels in dataloader:\n",
" # Train model on batch\n",
" pass\n",
"```\n",
"\n",
"#### **Validation Pipeline**\n",
"```python\n",
"# Validation without shuffling\n",
"val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)\n",
"\n",
"for batch_data, batch_labels in val_loader:\n",
" # Evaluate model on batch\n",
" pass\n",
"```\n",
"\n",
"#### **Data Analysis Pipeline**\n",
"```python\n",
"# Systematic data exploration\n",
"for batch_data, batch_labels in dataloader:\n",
" # Analyze batch statistics\n",
" pass\n",
"```\n",
"\n",
"This comprehensive test ensures our data loading components work together for real ML applications!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e8d80ec",
"metadata": {
"nbgrader": {
"grade": true,
"grade_id": "test-comprehensive",
"locked": true,
"points": 15,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# Comprehensive test - complete data pipeline applications\n",
"print(\"🔬 Comprehensive Test: Complete Data Pipeline...\")\n",
"\n",
"try:\n",
" # Test 1: Training Data Pipeline\n",
" print(\"\\n1. Training Data Pipeline Test:\")\n",
" \n",
" # Create training dataset\n",
" train_dataset = SimpleDataset(size=100, num_features=8, num_classes=5)\n",
" train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)\n",
" \n",
" # Simulate training epoch\n",
" epoch_samples = 0\n",
" epoch_batches = 0\n",
" \n",
" for batch_data, batch_labels in train_loader:\n",
" epoch_batches += 1\n",
" epoch_samples += batch_data.shape[0]\n",
" \n",
" # Verify batch properties\n",
" assert batch_data.shape[1] == 8, f\"Features should be 8, got {batch_data.shape[1]}\"\n",
" assert len(batch_labels.shape) == 1, f\"Labels should be 1D, got shape {batch_labels.shape}\"\n",
" assert isinstance(batch_data, Tensor), \"Batch data should be Tensor\"\n",
" assert isinstance(batch_labels, Tensor), \"Batch labels should be Tensor\"\n",
" \n",
" assert epoch_samples == 100, f\"Should process 100 samples, got {epoch_samples}\"\n",
" expected_batches = (100 + 16 - 1) // 16\n",
" assert epoch_batches == expected_batches, f\"Should have {expected_batches} batches, got {epoch_batches}\"\n",
" print(\"✅ Training pipeline works correctly\")\n",
" \n",
" # Test 2: Validation Data Pipeline\n",
" print(\"\\n2. Validation Data Pipeline Test:\")\n",
" \n",
" # Create validation dataset (no shuffling)\n",
" val_dataset = SimpleDataset(size=50, num_features=8, num_classes=5)\n",
" val_loader = DataLoader(val_dataset, batch_size=10, shuffle=False)\n",
" \n",
" # Simulate validation\n",
" val_samples = 0\n",
" val_batches = 0\n",
" \n",
" for batch_data, batch_labels in val_loader:\n",
" val_batches += 1\n",
" val_samples += batch_data.shape[0]\n",
" \n",
" # Verify consistent batch processing\n",
" assert batch_data.shape[1] == 8, \"Validation features should match training\"\n",
" assert len(batch_labels.shape) == 1, \"Validation labels should be 1D\"\n",
" \n",
" assert val_samples == 50, f\"Should process 50 validation samples, got {val_samples}\"\n",
" assert val_batches == 5, f\"Should have 5 validation batches, got {val_batches}\"\n",
" print(\"✅ Validation pipeline works correctly\")\n",
" \n",
" # Test 3: Different Dataset Configurations\n",
" print(\"\\n3. Dataset Configuration Test:\")\n",
" \n",
" # Test different configurations\n",
" configs = [\n",
" (200, 4, 3), # Medium dataset\n",
" (50, 12, 10), # High-dimensional features\n",
" (1000, 2, 2), # Large dataset, simple features\n",
" ]\n",
" \n",
" for size, features, classes in configs:\n",
" dataset = SimpleDataset(size=size, num_features=features, num_classes=classes)\n",
" loader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
" \n",
" # Test one batch\n",
" batch_data, batch_labels = next(iter(loader))\n",
" \n",
" assert batch_data.shape[1] == features, f\"Features mismatch for config {configs}\"\n",
" assert len(dataset) == size, f\"Size mismatch for config {configs}\"\n",
" assert dataset.get_num_classes() == classes, f\"Classes mismatch for config {configs}\"\n",
" \n",
" print(\"✅ Different dataset configurations work correctly\")\n",
" \n",
" # Test 4: Memory Efficiency Simulation\n",
" print(\"\\n4. Memory Efficiency Test:\")\n",
" \n",
" # Create larger dataset to test memory efficiency\n",
" large_dataset = SimpleDataset(size=500, num_features=20, num_classes=10)\n",
" large_loader = DataLoader(large_dataset, batch_size=50, shuffle=True)\n",
" \n",
" # Process all batches to ensure memory efficiency\n",
" processed_samples = 0\n",
" max_batch_size = 0\n",
" \n",
" for batch_data, batch_labels in large_loader:\n",
" processed_samples += batch_data.shape[0]\n",
" max_batch_size = max(max_batch_size, batch_data.shape[0])\n",
" \n",
" # Verify memory usage stays reasonable\n",
" assert batch_data.shape[0] <= 50, f\"Batch size should not exceed 50, got {batch_data.shape[0]}\"\n",
" \n",
" assert processed_samples == 500, f\"Should process all 500 samples, got {processed_samples}\"\n",
" print(\"✅ Memory efficiency works correctly\")\n",
" \n",
" # Test 5: Multi-Epoch Training Simulation\n",
" print(\"\\n5. Multi-Epoch Training Test:\")\n",
" \n",
" # Simulate multiple epochs\n",
" dataset = SimpleDataset(size=60, num_features=6, num_classes=3)\n",
" loader = DataLoader(dataset, batch_size=20, shuffle=True)\n",
" \n",
" for epoch in range(3):\n",
" epoch_samples = 0\n",
" for batch_data, batch_labels in loader:\n",
" epoch_samples += batch_data.shape[0]\n",
" \n",
" # Verify shapes remain consistent across epochs\n",
" assert batch_data.shape[1] == 6, f\"Features should be 6 in epoch {epoch}\"\n",
" assert len(batch_labels.shape) == 1, f\"Labels should be 1D in epoch {epoch}\"\n",
" \n",
" assert epoch_samples == 60, f\"Should process 60 samples in epoch {epoch}, got {epoch_samples}\"\n",
" \n",
" print(\"✅ Multi-epoch training works correctly\")\n",
" \n",
" print(\"\\n🎉 Comprehensive test passed! Your data pipeline works correctly for:\")\n",
" print(\" • Large-scale dataset handling\")\n",
" print(\" • Batch processing with multiple workers\")\n",
" print(\" • Shuffling and sampling strategies\")\n",
" print(\" • Memory-efficient data loading\")\n",
" print(\" • Complete training pipeline integration\")\n",
" print(\"📈 Progress: Production-ready data pipeline ✓\")\n",
" \n",
"except Exception as e:\n",
" print(f\"❌ Comprehensive test failed: {e}\")\n",
" raise\n",
"\n",
"print(\"📈 Final Progress: Complete data pipeline ready for production ML!\")"
]
},
{
"cell_type": "markdown",
"id": "b0352802",
"metadata": {
"lines_to_next_cell": 1
},
"source": [
"\"\"\"\n",
"# 🎯 Module Summary\n",
"\n",
"Congratulations! You've successfully implemented the core components of data loading systems:\n",
"\n",
"## What You've Accomplished\n",
"✅ **Dataset Abstract Class**: The foundation interface for all data loading \n",
"✅ **DataLoader Implementation**: Efficient batching and iteration over datasets \n",
"✅ **SimpleDataset Example**: Concrete implementation showing the Dataset pattern \n",
"✅ **Complete Data Pipeline**: End-to-end data loading for neural network training \n",
"✅ **Systems Thinking**: Understanding memory efficiency, batching, and I/O optimization \n",
"\n",
"## Key Concepts You've Learned\n",
"- **Dataset pattern**: Abstract interface for consistent data access\n",
"- **DataLoader pattern**: Efficient batching and iteration for training\n",
"- **Memory efficiency**: Loading data on-demand rather than all at once\n",
"- **Batching strategies**: Grouping samples for efficient GPU computation\n",
"- **Shuffling**: Randomizing data order to prevent overfitting\n",
"\n",
"## Mathematical Foundations\n",
"- **Batch processing**: Vectorized operations on multiple samples\n",
"- **Memory management**: Handling datasets larger than available RAM\n",
"- **I/O optimization**: Minimizing disk reads and memory allocation\n",
"- **Stochastic sampling**: Random shuffling for better generalization\n",
"\n",
"## Real-World Applications\n",
"- **Computer vision**: Loading image datasets like CIFAR-10, ImageNet\n",
"- **Natural language processing**: Loading text datasets with tokenization\n",
"- **Tabular data**: Loading CSV files and database records\n",
"- **Audio processing**: Loading and preprocessing audio files\n",
"- **Time series**: Loading sequential data with proper windowing\n",
"\n",
"## Connection to Production Systems\n",
"- **PyTorch**: Your Dataset and DataLoader mirror `torch.utils.data`\n",
"- **TensorFlow**: Similar concepts in `tf.data.Dataset`\n",
"- **JAX**: Custom data loading with efficient batching\n",
"- **MLOps**: Data pipelines are critical for production ML systems\n",
"\n",
"## Performance Characteristics\n",
"- **Memory efficiency**: O(batch_size) memory usage, not O(dataset_size)\n",
"- **I/O optimization**: Load data on-demand, not all at once\n",
"- **Batching efficiency**: Vectorized operations on GPU\n",
"- **Shuffling overhead**: Minimal cost for significant training benefits\n",
"\n",
"## Data Engineering Best Practices\n",
"- **Reproducibility**: Deterministic data generation and shuffling\n",
"- **Scalability**: Handle datasets of any size\n",
"- **Flexibility**: Easy to switch between different data sources\n",
"- **Testability**: Simple interfaces for unit testing\n",
"\n",
"## Next Steps\n",
"1. **Export your code**: Use NBDev to export to the `tinytorch` package\n",
"2. **Test your implementation**: Run the complete test suite\n",
"3. **Build data pipelines**: \n",
" ```python\n",
" from tinytorch.core.dataloader import Dataset, DataLoader\n",
" from tinytorch.core.tensor import Tensor\n",
" \n",
" # Create dataset\n",
" dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)\n",
" \n",
" # Create dataloader\n",
" loader = DataLoader(dataset, batch_size=32, shuffle=True)\n",
" \n",
" # Training loop\n",
" for epoch in range(num_epochs):\n",
" for batch_data, batch_labels in loader:\n",
" # Train model\n",
" pass\n",
" ```\n",
"4. **Explore advanced topics**: Data augmentation, distributed loading, streaming datasets!\n",
"\n",
"**Ready for the next challenge?** Let's build training loops and optimizers to complete the ML pipeline!\n",
"\"\"\"\n",
"\n",
"def test_dataset_interface():\n",
" \"\"\"Test Dataset abstract interface implementation comprehensively.\"\"\"\n",
" print(\"🔬 Unit Test: Dataset Interface...\")\n",
" \n",
" # Test TestDataset implementation\n",
" dataset = TestDataset(size=5)\n",
" \n",
" # Test basic interface\n",
" assert len(dataset) == 5, \"Dataset should have correct length\"\n",
" \n",
" # Test data access\n",
" sample, label = dataset[0]\n",
" assert isinstance(sample, Tensor), \"Sample should be Tensor\"\n",
" assert isinstance(label, Tensor), \"Label should be Tensor\"\n",
" \n",
" print(\"✅ Dataset interface works correctly\")\n",
"\n",
"def test_dataloader():\n",
" \"\"\"Test DataLoader implementation comprehensively.\"\"\"\n",
" print(\"🔬 Unit Test: DataLoader...\")\n",
" \n",
" # Test DataLoader with TestDataset\n",
" dataset = TestDataset(size=10)\n",
" loader = DataLoader(dataset, batch_size=3, shuffle=False)\n",
" \n",
" # Test iteration\n",
" batches = list(loader)\n",
" assert len(batches) >= 3, \"Should have at least 3 batches\"\n",
" \n",
" # Test batch shapes\n",
" batch_data, batch_labels = batches[0]\n",
" assert batch_data.shape[0] <= 3, \"Batch size should be <= 3\"\n",
" assert batch_labels.shape[0] <= 3, \"Batch labels should match data\"\n",
" \n",
" print(\"✅ DataLoader works correctly\")\n",
"\n",
"def test_simple_dataset():\n",
" \"\"\"Test SimpleDataset implementation comprehensively.\"\"\"\n",
" print(\"🔬 Unit Test: SimpleDataset...\")\n",
" \n",
" # Test SimpleDataset\n",
" dataset = SimpleDataset(size=100, num_features=4, num_classes=3)\n",
" \n",
" # Test properties\n",
" assert len(dataset) == 100, \"Dataset should have correct size\"\n",
" assert dataset.get_num_classes() == 3, \"Should have correct number of classes\"\n",
" \n",
" # Test data access\n",
" sample, label = dataset[0]\n",
" assert sample.shape == (4,), \"Sample should have correct features\"\n",
" assert 0 <= label.data < 3, \"Label should be valid class\"\n",
" \n",
" print(\"✅ SimpleDataset works correctly\")\n",
"\n",
"def test_dataloader_pipeline():\n",
" \"\"\"Test complete data pipeline comprehensive testing.\"\"\"\n",
" print(\"🔬 Comprehensive Test: Data Pipeline...\")\n",
" \n",
" # Test complete pipeline\n",
" dataset = SimpleDataset(size=50, num_features=10, num_classes=5)\n",
" loader = DataLoader(dataset, batch_size=8, shuffle=True)\n",
" \n",
" total_samples = 0\n",
" for batch_data, batch_labels in loader:\n",
" assert isinstance(batch_data, Tensor), \"Batch data should be Tensor\"\n",
" assert isinstance(batch_labels, Tensor), \"Batch labels should be Tensor\"\n",
" assert batch_data.shape[1] == 10, \"Features should be correct\"\n",
" total_samples += batch_data.shape[0]\n",
" \n",
" assert total_samples == 50, \"Should process all samples\"\n",
" \n",
" print(\"✅ Data pipeline integration works correctly\")"
]
},
{
"cell_type": "markdown",
"id": "c9433d3d",
"metadata": {
"cell_marker": "\"\"\""
},
"source": [
"## 🧪 Module Testing\n",
"\n",
"Time to test your implementation! This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.\n",
"\n",
"**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3ec15e59",
"metadata": {
"nbgrader": {
"grade": false,
"grade_id": "standardized-testing",
"locked": true,
"schema_version": 3,
"solution": false,
"task": false
}
},
"outputs": [],
"source": [
"# =============================================================================\n",
"# STANDARDIZED MODULE TESTING - DO NOT MODIFY\n",
"# This cell is locked to ensure consistent testing across all TinyTorch modules\n",
"# =============================================================================\n",
"\n",
"if __name__ == \"__main__\":\n",
" from tito.tools.testing import run_module_tests_auto\n",
" \n",
" # Automatically discover and run all tests in this module\n",
" success = run_module_tests_auto(\"DataLoader\") "
]
}
],
"metadata": {
"jupytext": {
"main_language": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}