{ "cells": [ { "cell_type": "markdown", "id": "b88708de", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "# DataLoader - Data Loading and Preprocessing\n", "\n", "Welcome to the DataLoader module! This is where you'll learn how to efficiently load, process, and manage data for machine learning systems.\n", "\n", "## Learning Goals\n", "- Understand data pipelines as the foundation of ML systems\n", "- Implement efficient data loading with memory management and batching\n", "- Build reusable dataset abstractions for different data types\n", "- Master the Dataset and DataLoader pattern used in all ML frameworks\n", "- Learn systems thinking for data engineering and I/O optimization\n", "\n", "## Build → Use → Reflect\n", "1. **Build**: Create dataset classes and data loaders from scratch\n", "2. **Use**: Load real datasets and feed them to neural networks\n", "3. **Reflect**: How data engineering affects system performance and scalability\n", "\n", "## What You'll Learn\n", "By the end of this module, you'll understand:\n", "- The Dataset pattern for consistent data access\n", "- How DataLoaders enable efficient batch processing\n", "- Why batching and shuffling are crucial for ML\n", "- How to handle datasets larger than memory\n", "- The connection between data engineering and model performance" ] }, { "cell_type": "code", "execution_count": null, "id": "8a1a46d2", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "dataloader-imports", "locked": false, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "#| default_exp core.dataloader\n", "\n", "#| export\n", "import numpy as np\n", "import sys\n", "import os\n", "import pickle\n", "import struct\n", "from typing import List, Tuple, Optional, Union, Iterator\n", "import matplotlib.pyplot as plt\n", "import urllib.request\n", "import tarfile\n", "\n", "# Import our building blocks - try package first, then local modules\n", "try:\n", " from 
tinytorch.core.tensor import Tensor\n", "except ImportError:\n", " # For development, import from local modules\n", " sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))\n", " from tensor_dev import Tensor" ] }, { "cell_type": "code", "execution_count": null, "id": "9fc27557", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "dataloader-setup", "locked": false, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "#| hide\n", "#| export\n", "def _should_show_plots():\n", " \"\"\"Check if we should show plots (disable during testing)\"\"\"\n", " # Check multiple conditions that indicate we're in test mode\n", " is_pytest = (\n", " 'pytest' in sys.modules or\n", " 'test' in sys.argv or\n", " os.environ.get('PYTEST_CURRENT_TEST') is not None or\n", " any('test' in arg for arg in sys.argv) or\n", " any('pytest' in arg for arg in sys.argv)\n", " )\n", " \n", " # Show plots in development mode (when not in test mode)\n", " return not is_pytest" ] }, { "cell_type": "code", "execution_count": null, "id": "f37cacaf", "metadata": { "nbgrader": { "grade": false, "grade_id": "dataloader-welcome", "locked": false, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "print(\"🔥 TinyTorch DataLoader Module\")\n", "print(f\"NumPy version: {np.__version__}\")\n", "print(f\"Python version: {sys.version_info.major}.{sys.version_info.minor}\")\n", "print(\"Ready to build data pipelines!\")" ] }, { "cell_type": "markdown", "id": "decfa343", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "## 📦 Where This Code Lives in the Final Package\n", "\n", "**Learning Side:** You work in `modules/source/06_dataloader/dataloader_dev.py` \n", "**Building Side:** Code exports to `tinytorch.core.dataloader`\n", "\n", "```python\n", "# Final package structure:\n", "from tinytorch.core.dataloader import Dataset, DataLoader # Data loading utilities!\n", "from 
tinytorch.core.tensor import Tensor # Foundation\n", "from tinytorch.core.networks import Sequential # Models to train\n", "```\n", "\n", "**Why this matters:**\n", "- **Learning:** Focused modules for deep understanding of data pipelines\n", "- **Production:** Proper organization like PyTorch's `torch.utils.data`\n", "- **Consistency:** All data loading utilities live together in `core.dataloader`\n", "- **Integration:** Works seamlessly with tensors and networks" ] }, { "cell_type": "markdown", "id": "daf1136d", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "## Step 1: Understanding Data Pipelines\n", "\n", "### What are Data Pipelines?\n", "**Data pipelines** are the systems that efficiently move data from storage to your model. They're the foundation of all machine learning systems.\n", "\n", "### The Data Pipeline Equation\n", "```\n", "Raw Data → Load → Transform → Batch → Model → Predictions\n", "```\n", "\n", "### Why Data Pipelines Matter\n", "- **Performance**: Efficient loading prevents GPU starvation\n", "- **Scalability**: Handle datasets larger than memory\n", "- **Consistency**: Reproducible data processing\n", "- **Flexibility**: Easy to switch between datasets\n", "\n", "### Real-World Challenges\n", "- **Memory constraints**: Datasets often exceed available RAM\n", "- **I/O bottlenecks**: Disk access is much slower than computation\n", "- **Batch processing**: Neural networks need batched data for efficiency\n", "- **Shuffling**: Randomizing sample order keeps the model from fitting to the order of the data\n", "\n", "### Systems Thinking\n", "- **Memory efficiency**: Handle datasets larger than RAM\n", "- **I/O optimization**: Read from disk efficiently\n", "- **Batching strategies**: Trade-offs between memory and speed\n", "- **Caching**: When to cache vs recompute\n", "\n", "### Visual Intuition\n", "```\n", "Raw Files: [image1.jpg, image2.jpg, image3.jpg, ...]\n", "Load: [Tensor(32x32x3), Tensor(32x32x3), Tensor(32x32x3), ...]\n", "Batch: [Tensor(32, 32, 32, 3)] # 32 
images at once\n", "Model: Process batch efficiently\n", "```\n", "\n", "Let's start by building the most fundamental component: **Dataset**." ] }, { "cell_type": "markdown", "id": "1881387d", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "## Step 2: Building the Dataset Interface\n", "\n", "### What is a Dataset?\n", "A **Dataset** is an abstract interface that provides consistent access to data. It's the foundation of all data loading systems.\n", "\n", "### Why Abstract Interfaces Matter\n", "- **Consistency**: Same interface for all data types\n", "- **Flexibility**: Easy to switch between datasets\n", "- **Testability**: Easy to create test datasets\n", "- **Extensibility**: Easy to add new data sources\n", "\n", "### The Dataset Pattern\n", "```python\n", "class Dataset:\n", " def __getitem__(self, index): # Get single sample\n", " return data, label\n", " \n", " def __len__(self): # Get dataset size\n", " return total_samples\n", "```\n", "\n", "### Real-World Usage\n", "- **Computer vision**: ImageNet, CIFAR-10, custom image datasets\n", "- **NLP**: Text datasets, tokenized sequences\n", "- **Audio**: Audio files, spectrograms\n", "- **Time series**: Sequential data with proper windowing\n", "\n", "Let's implement the Dataset interface!" 
] }, { "cell_type": "code", "execution_count": null, "id": "f02bc42c", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "dataset-class", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#| export\n", "class Dataset:\n", " \"\"\"\n", " Base Dataset class: Abstract interface for all datasets.\n", " \n", " The fundamental abstraction for data loading in TinyTorch.\n", " Students implement concrete datasets by inheriting from this class.\n", " \"\"\"\n", " \n", " def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n", " \"\"\"\n", " Get a single sample and label by index.\n", " \n", " Args:\n", " index: Index of the sample to retrieve\n", " \n", " Returns:\n", " Tuple of (data, label) tensors\n", " \n", " TODO: Implement abstract method for getting samples.\n", " \n", " APPROACH:\n", " 1. This is an abstract method - subclasses will implement it\n", " 2. Return a tuple of (data, label) tensors\n", " 3. Data should be the input features, label should be the target\n", " \n", " EXAMPLE:\n", " dataset[0] should return (Tensor(image_data), Tensor(label))\n", " \n", " HINTS:\n", " - This is an abstract method that subclasses must override\n", " - Always return a tuple of (data, label) tensors\n", " - Data contains the input features, label contains the target\n", " \"\"\"\n", " ### BEGIN SOLUTION\n", " # This is an abstract method - subclasses must implement it\n", " raise NotImplementedError(\"Subclasses must implement __getitem__\")\n", " ### END SOLUTION\n", " \n", " def __len__(self) -> int:\n", " \"\"\"\n", " Get the total number of samples in the dataset.\n", " \n", " TODO: Implement abstract method for getting dataset size.\n", " \n", " APPROACH:\n", " 1. This is an abstract method - subclasses will implement it\n", " 2. 
Return the total number of samples in the dataset\n", " \n", " EXAMPLE:\n", " len(dataset) should return 50000 for CIFAR-10 training set\n", " \n", " HINTS:\n", " - This is an abstract method that subclasses must override\n", " - Return an integer representing the total number of samples\n", " \"\"\"\n", " ### BEGIN SOLUTION\n", " # This is an abstract method - subclasses must implement it\n", " raise NotImplementedError(\"Subclasses must implement __len__\")\n", " ### END SOLUTION\n", " \n", " def get_sample_shape(self) -> Tuple[int, ...]:\n", " \"\"\"\n", " Get the shape of a single data sample.\n", " \n", " TODO: Implement method to get sample shape.\n", " \n", " APPROACH:\n", " 1. Get the first sample using self[0]\n", " 2. Extract the data part (first element of tuple)\n", " 3. Return the shape of the data tensor\n", " \n", " EXAMPLE:\n", " For CIFAR-10: returns (3, 32, 32) for RGB images\n", " \n", " HINTS:\n", " - Use self[0] to get the first sample\n", " - Extract data from the (data, label) tuple\n", " - Return data.shape\n", " \"\"\"\n", " ### BEGIN SOLUTION\n", " # Get the first sample to determine shape\n", " data, _ = self[0]\n", " return data.shape\n", " ### END SOLUTION\n", " \n", " def get_num_classes(self) -> int:\n", " \"\"\"\n", " Get the number of classes in the dataset.\n", " \n", " TODO: Implement abstract method for getting number of classes.\n", " \n", " APPROACH:\n", " 1. This is an abstract method - subclasses will implement it\n", " 2. 
Return the number of unique classes in the dataset\n", " \n", " EXAMPLE:\n", " For CIFAR-10: returns 10 (classes 0-9)\n", " \n", " HINTS:\n", " - This is an abstract method that subclasses must override\n", " - Return the number of unique classes/categories\n", " \"\"\"\n", " # This is an abstract method - subclasses must implement it\n", " raise NotImplementedError(\"Subclasses must implement get_num_classes\")" ] }, { "cell_type": "markdown", "id": "fe072a6b", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "### 🧪 Unit Test: Dataset Interface\n", "\n", "Let's understand the Dataset interface! While we can't test the abstract class directly, we'll create a simple test dataset.\n", "\n", "**This is a unit test** - it tests the Dataset interface pattern in isolation." ] }, { "cell_type": "code", "execution_count": null, "id": "f5dbcde5", "metadata": { "nbgrader": { "grade": true, "grade_id": "test-dataset-interface-immediate", "locked": true, "points": 5, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Test Dataset interface with a simple implementation\n", "print(\"🔬 Unit Test: Dataset Interface...\")\n", "\n", "# Create a minimal test dataset\n", "class TestDataset(Dataset):\n", " def __init__(self, size=5):\n", " self.size = size\n", " \n", " def __getitem__(self, index):\n", " # Simple test data: features are [index, index*2], label is index % 2\n", " data = Tensor([index, index * 2])\n", " label = Tensor([index % 2])\n", " return data, label\n", " \n", " def __len__(self):\n", " return self.size\n", " \n", " def get_num_classes(self):\n", " return 2\n", "\n", "# Test the interface\n", "try:\n", " test_dataset = TestDataset(size=5)\n", " print(f\"Dataset created with size: {len(test_dataset)}\")\n", " \n", " # Test __getitem__\n", " data, label = test_dataset[0]\n", " print(f\"Sample 0: data={data}, label={label}\")\n", " assert isinstance(data, Tensor), \"Data should be a Tensor\"\n", " assert 
isinstance(label, Tensor), \"Label should be a Tensor\"\n", " print(\"✅ Dataset __getitem__ works correctly\")\n", " \n", " # Test __len__\n", " assert len(test_dataset) == 5, f\"Dataset length should be 5, got {len(test_dataset)}\"\n", " print(\"✅ Dataset __len__ works correctly\")\n", " \n", " # Test get_num_classes\n", " assert test_dataset.get_num_classes() == 2, f\"Should have 2 classes, got {test_dataset.get_num_classes()}\"\n", " print(\"✅ Dataset get_num_classes works correctly\")\n", " \n", " # Test get_sample_shape\n", " sample_shape = test_dataset.get_sample_shape()\n", " assert sample_shape == (2,), f\"Sample shape should be (2,), got {sample_shape}\"\n", " print(\"✅ Dataset get_sample_shape works correctly\")\n", " \n", " # Test multiple samples\n", " for i in range(3):\n", " data, label = test_dataset[i]\n", " expected_data = [i, i * 2]\n", " expected_label = [i % 2]\n", " assert np.array_equal(data.data, expected_data), f\"Data mismatch at index {i}\"\n", " assert np.array_equal(label.data, expected_label), f\"Label mismatch at index {i}\"\n", " print(\"✅ Dataset produces correct data for multiple samples\")\n", " \n", "except Exception as e:\n", " print(f\"❌ Dataset interface test failed: {e}\")\n", " raise\n", "\n", "# Show the dataset pattern\n", "print(\"🎯 Dataset interface pattern:\")\n", "print(\" __getitem__: Returns (data, label) tuple\")\n", "print(\" __len__: Returns dataset size\")\n", "print(\" get_num_classes: Returns number of classes\")\n", "print(\" get_sample_shape: Returns shape of data samples\")\n", "print(\"📈 Progress: Dataset interface ✓\")" ] }, { "cell_type": "markdown", "id": "84c87935", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "## Step 3: Building the DataLoader\n", "\n", "### What is a DataLoader?\n", "A **DataLoader** efficiently batches and iterates through datasets. 
It's the bridge between individual samples and the batched data that neural networks expect.\n", "\n", "### Why DataLoaders Matter\n", "- **Batching**: Groups samples for efficient GPU computation\n", "- **Shuffling**: Randomizes data order so the model does not overfit to sample order\n", "- **Memory efficiency**: Loads data on-demand rather than all at once\n", "- **Iteration**: Provides clean interface for training loops\n", "\n", "### The DataLoader Pattern\n", "```python\n", "dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n", "for batch_data, batch_labels in dataloader:\n", " # batch_data.shape: (32, ...)\n", " # batch_labels.shape: (32,)\n", " # Train on batch\n", "```\n", "\n", "### Real-World Applications\n", "- **Training loops**: Feed batches to neural networks\n", "- **Validation**: Evaluate models on held-out data\n", "- **Inference**: Process large datasets efficiently\n", "- **Data analysis**: Explore datasets systematically\n", "\n", "### Systems Thinking\n", "- **Batch size**: Trade-off between memory and speed\n", "- **Shuffling**: Prevents overfitting to data order\n", "- **Iteration**: Efficient looping through data\n", "- **Memory**: Manage large datasets that don't fit in RAM" ] }, { "cell_type": "code", "execution_count": null, "id": "0918d8cf", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "dataloader-class", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#| export\n", "class DataLoader:\n", " \"\"\"\n", " DataLoader: Efficiently batch and iterate through datasets.\n", " \n", " Provides batching, shuffling, and efficient iteration over datasets.\n", " Essential for training neural networks efficiently.\n", " \"\"\"\n", " \n", " def __init__(self, dataset: Dataset, batch_size: int = 32, shuffle: bool = True):\n", " \"\"\"\n", " Initialize DataLoader.\n", " \n", " Args:\n", " dataset: Dataset to load from\n", " batch_size: Number of samples per batch\n", " shuffle: Whether to 
shuffle data each epoch\n", " \n", " TODO: Store configuration and dataset.\n", " \n", " APPROACH:\n", " 1. Store dataset as self.dataset\n", " 2. Store batch_size as self.batch_size\n", " 3. Store shuffle as self.shuffle\n", " \n", " EXAMPLE:\n", " DataLoader(dataset, batch_size=32, shuffle=True)\n", " \n", " HINTS:\n", " - Store all parameters as instance variables\n", " - These will be used in __iter__ for batching\n", " \"\"\"\n", " # Input validation\n", " if dataset is None:\n", " raise TypeError(\"Dataset cannot be None\")\n", " if not isinstance(batch_size, int) or batch_size <= 0:\n", " raise ValueError(f\"Batch size must be a positive integer, got {batch_size}\")\n", " \n", " self.dataset = dataset\n", " self.batch_size = batch_size\n", " self.shuffle = shuffle\n", " \n", " def __iter__(self) -> Iterator[Tuple[Tensor, Tensor]]:\n", " \"\"\"\n", " Iterate through dataset in batches.\n", " \n", " Returns:\n", " Iterator yielding (batch_data, batch_labels) tuples\n", " \n", " TODO: Implement batching and shuffling logic.\n", " \n", " APPROACH:\n", " 1. Create indices list: list(range(len(dataset)))\n", " 2. Shuffle indices if self.shuffle is True\n", " 3. Loop through indices in batch_size chunks\n", " 4. 
For each batch: collect samples, stack them, yield batch\n", " \n", " EXAMPLE:\n", " for batch_data, batch_labels in dataloader:\n", " # batch_data.shape: (batch_size, ...)\n", " # batch_labels.shape: (batch_size,)\n", " \n", " HINTS:\n", " - Use list(range(len(self.dataset))) for indices\n", " - Use np.random.shuffle() if self.shuffle is True\n", " - Loop in chunks of self.batch_size\n", " - Collect samples and stack with np.stack()\n", " \"\"\"\n", " # Create indices for all samples\n", " indices = list(range(len(self.dataset)))\n", " \n", " # Shuffle if requested\n", " if self.shuffle:\n", " np.random.shuffle(indices)\n", " \n", " # Iterate through indices in batches\n", " for i in range(0, len(indices), self.batch_size):\n", " batch_indices = indices[i:i + self.batch_size]\n", " \n", " # Collect samples for this batch\n", " batch_data = []\n", " batch_labels = []\n", " \n", " for idx in batch_indices:\n", " data, label = self.dataset[idx]\n", " batch_data.append(data.data)\n", " batch_labels.append(label.data)\n", " \n", " # Stack into batch tensors\n", " batch_data_array = np.stack(batch_data, axis=0)\n", " batch_labels_array = np.stack(batch_labels, axis=0)\n", " \n", " yield Tensor(batch_data_array), Tensor(batch_labels_array)\n", " \n", " def __len__(self) -> int:\n", " \"\"\"\n", " Get the number of batches per epoch.\n", " \n", " TODO: Calculate number of batches.\n", " \n", " APPROACH:\n", " 1. Get dataset size: len(self.dataset)\n", " 2. Divide by batch_size and round up\n", " 3. 
Use ceiling division: (n + batch_size - 1) // batch_size\n", " \n", " EXAMPLE:\n", " Dataset size 100, batch size 32 → 4 batches\n", " \n", " HINTS:\n", " - Use len(self.dataset) for dataset size\n", " - Use ceiling division for exact batch count\n", " - Formula: (dataset_size + batch_size - 1) // batch_size\n", " \"\"\"\n", " # Calculate number of batches using ceiling division\n", " dataset_size = len(self.dataset)\n", " return (dataset_size + self.batch_size - 1) // self.batch_size" ] }, { "cell_type": "markdown", "id": "46082fb1", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "### 🧪 Unit Test: DataLoader\n", "\n", "Let's test your DataLoader implementation! This is the heart of efficient data loading for neural networks.\n", "\n", "**This is a unit test** - it tests the DataLoader class in isolation." ] }, { "cell_type": "code", "execution_count": null, "id": "9744517c", "metadata": { "nbgrader": { "grade": true, "grade_id": "test-dataloader-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Test DataLoader immediately after implementation\n", "print(\"🔬 Unit Test: DataLoader...\")\n", "\n", "# Use the test dataset from before\n", "class TestDataset(Dataset):\n", " def __init__(self, size=10):\n", " self.size = size\n", " \n", " def __getitem__(self, index):\n", " data = Tensor([index, index * 2])\n", " label = Tensor([index % 3]) # 3 classes\n", " return data, label\n", " \n", " def __len__(self):\n", " return self.size\n", " \n", " def get_num_classes(self):\n", " return 3\n", "\n", "# Test basic DataLoader functionality\n", "try:\n", " dataset = TestDataset(size=10)\n", " dataloader = DataLoader(dataset, batch_size=3, shuffle=False)\n", " \n", " print(f\"DataLoader created: batch_size={dataloader.batch_size}, shuffle={dataloader.shuffle}\")\n", " print(f\"Number of batches: {len(dataloader)}\")\n", " \n", " # Test __len__\n", " expected_batches = (10 + 3 - 1) // 3 
# Ceiling division: 4 batches\n", " assert len(dataloader) == expected_batches, f\"Should have {expected_batches} batches, got {len(dataloader)}\"\n", " print(\"✅ DataLoader __len__ works correctly\")\n", " \n", " # Test iteration\n", " batch_count = 0\n", " total_samples = 0\n", " \n", " for batch_data, batch_labels in dataloader:\n", " batch_count += 1\n", " batch_size = batch_data.shape[0]\n", " total_samples += batch_size\n", " \n", " print(f\"Batch {batch_count}: data shape {batch_data.shape}, labels shape {batch_labels.shape}\")\n", " \n", " # Verify batch dimensions\n", " assert len(batch_data.shape) == 2, f\"Batch data should be 2D, got {batch_data.shape}\"\n", " assert len(batch_labels.shape) == 2, f\"Batch labels should be 2D, got {batch_labels.shape}\"\n", " assert batch_data.shape[1] == 2, f\"Each sample should have 2 features, got {batch_data.shape[1]}\"\n", " assert batch_labels.shape[1] == 1, f\"Each label should have 1 element, got {batch_labels.shape[1]}\"\n", " \n", " assert batch_count == expected_batches, f\"Should iterate {expected_batches} times, got {batch_count}\"\n", " assert total_samples == 10, f\"Should process 10 total samples, got {total_samples}\"\n", " print(\"✅ DataLoader iteration works correctly\")\n", " \n", "except Exception as e:\n", " print(f\"❌ DataLoader test failed: {e}\")\n", " raise\n", "\n", "# Test shuffling\n", "try:\n", " dataloader_shuffle = DataLoader(dataset, batch_size=5, shuffle=True)\n", " dataloader_no_shuffle = DataLoader(dataset, batch_size=5, shuffle=False)\n", " \n", " # Get first batch from each\n", " batch1_shuffle = next(iter(dataloader_shuffle))\n", " batch1_no_shuffle = next(iter(dataloader_no_shuffle))\n", " \n", " print(\"✅ DataLoader shuffling parameter works\")\n", " \n", "except Exception as e:\n", " print(f\"❌ DataLoader shuffling test failed: {e}\")\n", " raise\n", "\n", "# Test different batch sizes\n", "try:\n", " small_loader = DataLoader(dataset, batch_size=2, shuffle=False)\n", " 
large_loader = DataLoader(dataset, batch_size=8, shuffle=False)\n", " \n", " assert len(small_loader) == 5, f\"Small loader should have 5 batches, got {len(small_loader)}\"\n", " assert len(large_loader) == 2, f\"Large loader should have 2 batches, got {len(large_loader)}\"\n", " print(\"✅ DataLoader handles different batch sizes correctly\")\n", " \n", "except Exception as e:\n", " print(f\"❌ DataLoader batch size test failed: {e}\")\n", " raise\n", "\n", "# Show the DataLoader behavior\n", "print(\"🎯 DataLoader behavior:\")\n", "print(\" Batches data for efficient processing\")\n", "print(\" Handles shuffling and iteration\")\n", "print(\" Provides clean interface for training loops\")\n", "print(\"📈 Progress: Dataset interface ✓, DataLoader ✓\")" ] }, { "cell_type": "markdown", "id": "ee45269f", "metadata": { "cell_marker": "\"\"\"", "lines_to_next_cell": 1 }, "source": [ "## Step 4: Creating a Simple Dataset Example\n", "\n", "### Why We Need Concrete Examples\n", "Abstract classes are great for interfaces, but we need concrete implementations to understand how they work. 
Let's create a simple dataset for testing.\n", "\n", "### Design Principles\n", "- **Simple**: Easy to understand and debug\n", "- **Configurable**: Adjustable size and properties\n", "- **Predictable**: Deterministic data for testing\n", "- **Educational**: Shows the Dataset pattern clearly\n", "\n", "### Real-World Connection\n", "This pattern is used for:\n", "- **CIFAR-10**: 32x32 RGB images with 10 classes\n", "- **ImageNet**: High-resolution images with 1000 classes\n", "- **MNIST**: 28x28 grayscale digits with 10 classes\n", "- **Custom datasets**: Your own data following this pattern" ] }, { "cell_type": "code", "execution_count": null, "id": "d4c773ba", "metadata": { "lines_to_next_cell": 1, "nbgrader": { "grade": false, "grade_id": "simple-dataset", "locked": false, "schema_version": 3, "solution": true, "task": false } }, "outputs": [], "source": [ "#| export\n", "class SimpleDataset(Dataset):\n", " \"\"\"\n", " Simple dataset for testing and demonstration.\n", " \n", " Generates synthetic data with configurable size and properties.\n", " Perfect for understanding the Dataset pattern.\n", " \"\"\"\n", " \n", " def __init__(self, size: int = 100, num_features: int = 4, num_classes: int = 3):\n", " \"\"\"\n", " Initialize SimpleDataset.\n", " \n", " Args:\n", " size: Number of samples in the dataset\n", " num_features: Number of features per sample\n", " num_classes: Number of classes\n", " \n", " TODO: Initialize the dataset with synthetic data.\n", " \n", " APPROACH:\n", " 1. Store the configuration parameters\n", " 2. Generate synthetic data and labels\n", " 3. 
Make data deterministic for testing\n", " \n", " EXAMPLE:\n", " SimpleDataset(size=100, num_features=4, num_classes=3)\n", " creates 100 samples with 4 features each, 3 classes\n", " \n", " HINTS:\n", " - Store size, num_features, num_classes as instance variables\n", " - Use np.random.seed() for reproducible data\n", " - Generate random data with np.random.randn()\n", " - Generate random labels with np.random.randint()\n", " \"\"\"\n", " self.size = size\n", " self.num_features = num_features\n", " self.num_classes = num_classes\n", " \n", " # Generate synthetic data (deterministic for testing)\n", " np.random.seed(42) # For reproducible data\n", " self.data = np.random.randn(size, num_features).astype(np.float32)\n", " self.labels = np.random.randint(0, num_classes, size=size)\n", " \n", " def __getitem__(self, index: int) -> Tuple[Tensor, Tensor]:\n", " \"\"\"\n", " Get a sample by index.\n", " \n", " Args:\n", " index: Index of the sample\n", " \n", " Returns:\n", " Tuple of (data, label) tensors\n", " \n", " TODO: Return the sample at the given index.\n", " \n", " APPROACH:\n", " 1. Get data sample from self.data[index]\n", " 2. Get label from self.labels[index]\n", " 3. Convert both to Tensors and return as tuple\n", " \n", " EXAMPLE:\n", " dataset[0] returns (Tensor(features), Tensor(label))\n", " \n", " HINTS:\n", " - Use self.data[index] for the data\n", " - Use self.labels[index] for the label\n", " - Convert to Tensors: Tensor(data), Tensor(label)\n", " \"\"\"\n", " data = self.data[index]\n", " label = self.labels[index]\n", " return Tensor(data), Tensor(label)\n", " \n", " def __len__(self) -> int:\n", " \"\"\"\n", " Get the dataset size.\n", " \n", " TODO: Return the dataset size.\n", " \n", " APPROACH:\n", " 1. 
Return self.size\n", " \n", " EXAMPLE:\n", " len(dataset) returns 100 for dataset with 100 samples\n", " \n", " HINTS:\n", " - Simply return self.size\n", " \"\"\"\n", " return self.size\n", " \n", " def get_num_classes(self) -> int:\n", " \"\"\"\n", " Get the number of classes.\n", " \n", " TODO: Return the number of classes.\n", " \n", " APPROACH:\n", " 1. Return self.num_classes\n", " \n", " EXAMPLE:\n", " dataset.get_num_classes() returns 3 for 3-class dataset\n", " \n", " HINTS:\n", " - Simply return self.num_classes\n", " \"\"\"\n", " return self.num_classes" ] }, { "cell_type": "markdown", "id": "e6a029be", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "### 🧪 Unit Test: SimpleDataset\n", "\n", "Let's test your SimpleDataset implementation! This concrete example shows how the Dataset pattern works.\n", "\n", "**This is a unit test** - it tests the SimpleDataset class in isolation." ] }, { "cell_type": "code", "execution_count": null, "id": "0f3f5ed5", "metadata": { "nbgrader": { "grade": true, "grade_id": "test-simple-dataset-immediate", "locked": true, "points": 10, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Test SimpleDataset immediately after implementation\n", "print(\"🔬 Unit Test: SimpleDataset...\")\n", "\n", "try:\n", " # Create dataset\n", " dataset = SimpleDataset(size=20, num_features=5, num_classes=4)\n", " \n", " print(f\"Dataset created: size={len(dataset)}, features={dataset.num_features}, classes={dataset.get_num_classes()}\")\n", " \n", " # Test basic properties\n", " assert len(dataset) == 20, f\"Dataset length should be 20, got {len(dataset)}\"\n", " assert dataset.get_num_classes() == 4, f\"Should have 4 classes, got {dataset.get_num_classes()}\"\n", " print(\"✅ SimpleDataset basic properties work correctly\")\n", " \n", " # Test sample access\n", " data, label = dataset[0]\n", " assert isinstance(data, Tensor), \"Data should be a Tensor\"\n", " assert isinstance(label, Tensor), 
\"Label should be a Tensor\"\n", " assert data.shape == (5,), f\"Data shape should be (5,), got {data.shape}\"\n", " assert label.shape == (), f\"Label shape should be (), got {label.shape}\"\n", " print(\"✅ SimpleDataset sample access works correctly\")\n", " \n", " # Test sample shape\n", " sample_shape = dataset.get_sample_shape()\n", " assert sample_shape == (5,), f\"Sample shape should be (5,), got {sample_shape}\"\n", " print(\"✅ SimpleDataset get_sample_shape works correctly\")\n", " \n", " # Test multiple samples\n", " for i in range(5):\n", " data, label = dataset[i]\n", " assert data.shape == (5,), f\"Data shape should be (5,) for sample {i}, got {data.shape}\"\n", " assert 0 <= label.data < 4, f\"Label should be in [0, 3] for sample {i}, got {label.data}\"\n", " print(\"✅ SimpleDataset multiple samples work correctly\")\n", " \n", " # Test deterministic data (same seed should give same data)\n", " dataset2 = SimpleDataset(size=20, num_features=5, num_classes=4)\n", " data1, label1 = dataset[0]\n", " data2, label2 = dataset2[0]\n", " assert np.array_equal(data1.data, data2.data), \"Data should be deterministic\"\n", " assert np.array_equal(label1.data, label2.data), \"Labels should be deterministic\"\n", " print(\"✅ SimpleDataset data is deterministic\")\n", "\n", "except Exception as e:\n", " print(f\"❌ SimpleDataset test failed: {e}\")\n", " raise\n", "\n", "# Show the SimpleDataset behavior\n", "print(\"🎯 SimpleDataset behavior:\")\n", "print(\" Generates synthetic data for testing\")\n", "print(\" Implements complete Dataset interface\")\n", "print(\" Provides deterministic data for reproducibility\")\n", "print(\"📈 Progress: Dataset interface ✓, DataLoader ✓, SimpleDataset ✓\")" ] }, { "cell_type": "markdown", "id": "3b5a161c", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "## Step 5: Comprehensive Test - Complete Data Pipeline\n", "\n", "### Real-World Data Pipeline Applications\n", "Let's test our data loading components in 
realistic scenarios:\n", "\n", "#### **Training Pipeline**\n", "```python\n", "# The standard ML training pattern\n", "dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)\n", "dataloader = DataLoader(dataset, batch_size=32, shuffle=True)\n", "\n", "for epoch in range(num_epochs):\n", " for batch_data, batch_labels in dataloader:\n", " # Train model on batch\n", " pass\n", "```\n", "\n", "#### **Validation Pipeline**\n", "```python\n", "# Validation without shuffling\n", "val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)\n", "\n", "for batch_data, batch_labels in val_loader:\n", " # Evaluate model on batch\n", " pass\n", "```\n", "\n", "#### **Data Analysis Pipeline**\n", "```python\n", "# Systematic data exploration\n", "for batch_data, batch_labels in dataloader:\n", " # Analyze batch statistics\n", " pass\n", "```\n", "\n", "This comprehensive test ensures our data loading components work together for real ML applications!" ] }, { "cell_type": "code", "execution_count": null, "id": "5e8d80ec", "metadata": { "nbgrader": { "grade": true, "grade_id": "test-comprehensive", "locked": true, "points": 15, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# Comprehensive test - complete data pipeline applications\n", "print(\"🔬 Comprehensive Test: Complete Data Pipeline...\")\n", "\n", "try:\n", " # Test 1: Training Data Pipeline\n", " print(\"\\n1. 
Training Data Pipeline Test:\")\n", "    \n", "    # Create training dataset\n", "    train_dataset = SimpleDataset(size=100, num_features=8, num_classes=5)\n", "    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)\n", "    \n", "    # Simulate training epoch\n", "    epoch_samples = 0\n", "    epoch_batches = 0\n", "    \n", "    for batch_data, batch_labels in train_loader:\n", "        epoch_batches += 1\n", "        epoch_samples += batch_data.shape[0]\n", "        \n", "        # Verify batch properties\n", "        assert batch_data.shape[1] == 8, f\"Features should be 8, got {batch_data.shape[1]}\"\n", "        assert len(batch_labels.shape) == 1, f\"Labels should be 1D, got shape {batch_labels.shape}\"\n", "        assert isinstance(batch_data, Tensor), \"Batch data should be Tensor\"\n", "        assert isinstance(batch_labels, Tensor), \"Batch labels should be Tensor\"\n", "    \n", "    assert epoch_samples == 100, f\"Should process 100 samples, got {epoch_samples}\"\n", "    # Ceiling division: 100 samples in batches of 16 gives 7 batches (the last one partial)\n", "    expected_batches = (100 + 16 - 1) // 16\n", "    assert epoch_batches == expected_batches, f\"Should have {expected_batches} batches, got {epoch_batches}\"\n", "    print(\"✅ Training pipeline works correctly\")\n", "    \n", "    # Test 2: Validation Data Pipeline\n", "    print(\"\\n2.
Validation Data Pipeline Test:\")\n", "    \n", "    # Create validation dataset (no shuffling)\n", "    val_dataset = SimpleDataset(size=50, num_features=8, num_classes=5)\n", "    val_loader = DataLoader(val_dataset, batch_size=10, shuffle=False)\n", "    \n", "    # Simulate validation\n", "    val_samples = 0\n", "    val_batches = 0\n", "    \n", "    for batch_data, batch_labels in val_loader:\n", "        val_batches += 1\n", "        val_samples += batch_data.shape[0]\n", "        \n", "        # Verify consistent batch processing\n", "        assert batch_data.shape[1] == 8, \"Validation features should match training\"\n", "        assert len(batch_labels.shape) == 1, \"Validation labels should be 1D\"\n", "    \n", "    assert val_samples == 50, f\"Should process 50 validation samples, got {val_samples}\"\n", "    assert val_batches == 5, f\"Should have 5 validation batches, got {val_batches}\"\n", "    print(\"✅ Validation pipeline works correctly\")\n", "    \n", "    # Test 3: Different Dataset Configurations\n", "    print(\"\\n3. Dataset Configuration Test:\")\n", "    \n", "    # Test different configurations\n", "    configs = [\n", "        (200, 4, 3),    # Medium dataset\n", "        (50, 12, 10),   # High-dimensional features\n", "        (1000, 2, 2),   # Large dataset, simple features\n", "    ]\n", "    \n", "    for size, features, classes in configs:\n", "        dataset = SimpleDataset(size=size, num_features=features, num_classes=classes)\n", "        loader = DataLoader(dataset, batch_size=32, shuffle=True)\n", "        \n", "        # Test one batch\n", "        batch_data, batch_labels = next(iter(loader))\n", "        \n", "        # Report the failing configuration, not the whole list\n", "        assert batch_data.shape[1] == features, f\"Features mismatch for config {(size, features, classes)}\"\n", "        assert len(dataset) == size, f\"Size mismatch for config {(size, features, classes)}\"\n", "        assert dataset.get_num_classes() == classes, f\"Classes mismatch for config {(size, features, classes)}\"\n", "    \n", "    print(\"✅ Different dataset configurations work correctly\")\n", "    \n", "    # Test 4: Memory Efficiency Simulation\n", "    print(\"\\n4.
Memory Efficiency Test:\")\n", "    \n", "    # Create larger dataset to test memory efficiency\n", "    large_dataset = SimpleDataset(size=500, num_features=20, num_classes=10)\n", "    large_loader = DataLoader(large_dataset, batch_size=50, shuffle=True)\n", "    \n", "    # Process all batches, tracking the largest batch seen\n", "    processed_samples = 0\n", "    max_batch_size = 0\n", "    \n", "    for batch_data, batch_labels in large_loader:\n", "        processed_samples += batch_data.shape[0]\n", "        max_batch_size = max(max_batch_size, batch_data.shape[0])\n", "        \n", "        # Batch size bounds per-batch memory, so no batch may exceed 50 samples\n", "        assert batch_data.shape[0] <= 50, f\"Batch size should not exceed 50, got {batch_data.shape[0]}\"\n", "    \n", "    assert processed_samples == 500, f\"Should process all 500 samples, got {processed_samples}\"\n", "    print(\"✅ Memory efficiency works correctly\")\n", "    \n", "    # Test 5: Multi-Epoch Training Simulation\n", "    print(\"\\n5. Multi-Epoch Training Test:\")\n", "    \n", "    # Simulate multiple epochs\n", "    dataset = SimpleDataset(size=60, num_features=6, num_classes=3)\n", "    loader = DataLoader(dataset, batch_size=20, shuffle=True)\n", "    \n", "    for epoch in range(3):\n", "        epoch_samples = 0\n", "        for batch_data, batch_labels in loader:\n", "            epoch_samples += batch_data.shape[0]\n", "            \n", "            # Verify shapes remain consistent across epochs\n", "            assert batch_data.shape[1] == 6, f\"Features should be 6 in epoch {epoch}\"\n", "            assert len(batch_labels.shape) == 1, f\"Labels should be 1D in epoch {epoch}\"\n", "        \n", "        assert epoch_samples == 60, f\"Should process 60 samples in epoch {epoch}, got {epoch_samples}\"\n", "    \n", "    print(\"✅ Multi-epoch training works correctly\")\n", "    \n", "    print(\"\\n🎉 Comprehensive test passed!
Your data pipeline works correctly for:\")\n", "    print(\" • Training and validation batch iteration\")\n", "    print(\" • Datasets of different sizes and feature counts\")\n", "    print(\" • Shuffled and sequential loading\")\n", "    print(\" • Bounded batch sizes for memory-efficient loading\")\n", "    print(\" • Multi-epoch training loops\")\n", "    print(\"📈 Progress: Production-ready data pipeline ✓\")\n", "    \n", "except Exception as e:\n", "    print(f\"❌ Comprehensive test failed: {e}\")\n", "    raise\n", "\n", "print(\"📈 Final Progress: Complete data pipeline ready for production ML!\")" ] }, { "cell_type": "markdown", "id": "b0352802", "metadata": { "lines_to_next_cell": 1 }, "source": [ "\"\"\"\n", "# 🎯 Module Summary\n", "\n", "Congratulations! You've successfully implemented the core components of data loading systems:\n", "\n", "## What You've Accomplished\n", "✅ **Dataset Abstract Class**: The foundation interface for all data loading  \n", "✅ **DataLoader Implementation**: Efficient batching and iteration over datasets  \n", "✅ **SimpleDataset Example**: Concrete implementation showing the Dataset pattern  \n", "✅ **Complete Data Pipeline**: End-to-end data loading for neural network training  \n", "✅ **Systems Thinking**: Understanding memory efficiency, batching, and I/O optimization  \n", "\n", "## Key Concepts You've Learned\n", "- **Dataset pattern**: Abstract interface for consistent data access\n", "- **DataLoader pattern**: Efficient batching and iteration for training\n", "- **Memory efficiency**: Loading data on-demand rather than all at once\n", "- **Batching strategies**: Grouping samples for efficient GPU computation\n", "- **Shuffling**: Randomizing sample order each epoch so batches are not biased by dataset ordering\n", "\n", "## Mathematical Foundations\n", "- **Batch processing**: Vectorized operations on multiple samples\n", "- **Memory management**: Handling datasets larger than available RAM\n", "- **I/O optimization**: Minimizing disk reads and memory allocation\n", "-
**Stochastic sampling**: Random shuffling for better generalization\n", "\n", "## Real-World Applications\n", "- **Computer vision**: Loading image datasets like CIFAR-10, ImageNet\n", "- **Natural language processing**: Loading text datasets with tokenization\n", "- **Tabular data**: Loading CSV files and database records\n", "- **Audio processing**: Loading and preprocessing audio files\n", "- **Time series**: Loading sequential data with proper windowing\n", "\n", "## Connection to Production Systems\n", "- **PyTorch**: Your Dataset and DataLoader mirror `torch.utils.data`\n", "- **TensorFlow**: Similar concepts in `tf.data.Dataset`\n", "- **JAX**: No built-in data loader, so pipelines apply the same batching ideas through external loaders\n", "- **MLOps**: Data pipelines are critical for production ML systems\n", "\n", "## Performance Characteristics\n", "- **Memory efficiency**: O(batch_size) memory per batch when samples are loaded on demand, not O(dataset_size)\n", "- **I/O optimization**: Load data on-demand, not all at once\n", "- **Batching efficiency**: Vectorized operations on GPU\n", "- **Shuffling overhead**: Minimal cost for significant training benefits\n", "\n", "## Data Engineering Best Practices\n", "- **Reproducibility**: Deterministic data generation and shuffling\n", "- **Scalability**: Handle datasets of any size\n", "- **Flexibility**: Easy to switch between different data sources\n", "- **Testability**: Simple interfaces for unit testing\n", "\n", "## Next Steps\n", "1. **Export your code**: Use NBDev to export to the `tinytorch` package\n", "2. **Test your implementation**: Run the complete test suite\n", "3.
**Build data pipelines**:\n", "   ```python\n", "   from tinytorch.core.dataloader import Dataset, DataLoader\n", "   from tinytorch.core.tensor import Tensor\n", "   \n", "   # Create dataset (SimpleDataset is the class defined in this module)\n", "   dataset = SimpleDataset(size=1000, num_features=10, num_classes=5)\n", "   \n", "   # Create dataloader\n", "   loader = DataLoader(dataset, batch_size=32, shuffle=True)\n", "   \n", "   # Training loop\n", "   num_epochs = 3  # example value\n", "   for epoch in range(num_epochs):\n", "       for batch_data, batch_labels in loader:\n", "           # Train model\n", "           pass\n", "   ```\n", "4. **Explore advanced topics**: Data augmentation, distributed loading, streaming datasets!\n", "\n", "**Ready for the next challenge?** Let's build training loops and optimizers to complete the ML pipeline!\n", "\"\"\"\n", "\n", "def test_dataset_interface():\n", "    \"\"\"Test Dataset abstract interface implementation comprehensively.\"\"\"\n", "    print(\"🔬 Unit Test: Dataset Interface...\")\n", "    \n", "    # Test TestDataset implementation\n", "    dataset = TestDataset(size=5)\n", "    \n", "    # Test basic interface\n", "    assert len(dataset) == 5, \"Dataset should have correct length\"\n", "    \n", "    # Test data access\n", "    sample, label = dataset[0]\n", "    assert isinstance(sample, Tensor), \"Sample should be Tensor\"\n", "    assert isinstance(label, Tensor), \"Label should be Tensor\"\n", "    \n", "    print(\"✅ Dataset interface works correctly\")\n", "\n", "def test_dataloader():\n", "    \"\"\"Test DataLoader implementation comprehensively.\"\"\"\n", "    print(\"🔬 Unit Test: DataLoader...\")\n", "    \n", "    # Test DataLoader with TestDataset\n", "    dataset = TestDataset(size=10)\n", "    loader = DataLoader(dataset, batch_size=3, shuffle=False)\n", "    \n", "    # Test iteration\n", "    batches = list(loader)\n", "    assert len(batches) >= 3, \"Should have at least 3 batches\"\n", "    \n", "    # Test batch shapes\n", "    batch_data, batch_labels = batches[0]\n", "    assert batch_data.shape[0] <= 3, \"Batch size should be <= 3\"\n", "    assert batch_labels.shape[0] <= 3, \"Batch labels should match
data\"\n", "    \n", "    print(\"✅ DataLoader works correctly\")\n", "\n", "def test_simple_dataset():\n", "    \"\"\"Test SimpleDataset implementation comprehensively.\"\"\"\n", "    print(\"🔬 Unit Test: SimpleDataset...\")\n", "    \n", "    # Test SimpleDataset\n", "    dataset = SimpleDataset(size=100, num_features=4, num_classes=3)\n", "    \n", "    # Test properties\n", "    assert len(dataset) == 100, \"Dataset should have correct size\"\n", "    assert dataset.get_num_classes() == 3, \"Should have correct number of classes\"\n", "    \n", "    # Test data access\n", "    sample, label = dataset[0]\n", "    assert sample.shape == (4,), \"Sample should have correct features\"\n", "    assert 0 <= label.data < 3, \"Label should be valid class\"\n", "    \n", "    print(\"✅ SimpleDataset works correctly\")\n", "\n", "def test_dataloader_pipeline():\n", "    \"\"\"Test the complete data pipeline end to end.\"\"\"\n", "    print(\"🔬 Comprehensive Test: Data Pipeline...\")\n", "    \n", "    # Test complete pipeline\n", "    dataset = SimpleDataset(size=50, num_features=10, num_classes=5)\n", "    loader = DataLoader(dataset, batch_size=8, shuffle=True)\n", "    \n", "    total_samples = 0\n", "    for batch_data, batch_labels in loader:\n", "        assert isinstance(batch_data, Tensor), \"Batch data should be Tensor\"\n", "        assert isinstance(batch_labels, Tensor), \"Batch labels should be Tensor\"\n", "        assert batch_data.shape[1] == 10, \"Features should be correct\"\n", "        total_samples += batch_data.shape[0]\n", "    \n", "    assert total_samples == 50, \"Should process all samples\"\n", "    \n", "    print(\"✅ Data pipeline integration works correctly\")" ] }, { "cell_type": "markdown", "id": "c9433d3d", "metadata": { "cell_marker": "\"\"\"" }, "source": [ "## 🧪 Module Testing\n", "\n", "Time to test your implementation!
This section uses TinyTorch's standardized testing framework to ensure your implementation works correctly.\n", "\n", "**This testing section is locked** - it provides consistent feedback across all modules and cannot be modified." ] }, { "cell_type": "code", "execution_count": null, "id": "3ec15e59", "metadata": { "nbgrader": { "grade": false, "grade_id": "standardized-testing", "locked": true, "schema_version": 3, "solution": false, "task": false } }, "outputs": [], "source": [ "# =============================================================================\n", "# STANDARDIZED MODULE TESTING - DO NOT MODIFY\n", "# This cell is locked to ensure consistent testing across all TinyTorch modules\n", "# =============================================================================\n", "\n", "if __name__ == \"__main__\":\n", "    from tito.tools.testing import run_module_tests_auto\n", "    \n", "    # Automatically discover and run all tests in this module\n", "    success = run_module_tests_auto(\"DataLoader\")" ] } ], "metadata": { "jupytext": { "main_language": "python" } }, "nbformat": 4, "nbformat_minor": 5 }