Update documentation after module reordering

All module references updated to reflect new ordering:
- Module 15: Quantization (was 16)
- Module 16: Compression (was 17)
- Module 17: Memoization (was 15)

Updated by module-developer and website-manager agents:
- Module ABOUT files with correct numbers and prerequisites
- Cross-references and "What's Next" chains
- Website navigation (_toc.yml) and content
- Learning path progression in LEARNING_PATH.md
- Profile milestone completion message (Module 17)

Pedagogical flow now: Profile → Quantize → Prune → Cache → Accelerate
---
title: "DataLoader - Data Pipeline Engineering"
description: "Build production-grade data loading infrastructure for training at scale"
difficulty: 3
time_estimate: "5-6 hours"
prerequisites: ["Tensor", "Layers", "Training"]
next_steps: ["Spatial (CNNs)"]
learning_objectives:
- "Design scalable data pipeline architectures for production ML systems"
- "Implement efficient dataset abstractions with batching and streaming"
- "Build preprocessing pipelines for normalization and data augmentation"
- "Understand memory-efficient data loading patterns for large datasets"
- "Apply systems thinking to I/O optimization and throughput engineering"
---
# 08. DataLoader
**🏛️ ARCHITECTURE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours
## Overview
Build the data engineering infrastructure that feeds neural networks. This module implements production-grade data loading, preprocessing, and batching systems—the critical backbone that enables training on real-world datasets like CIFAR-10.
## Learning Objectives
By completing this module, you will be able to:
1. **Design scalable data pipeline architectures** for production ML systems with proper abstractions and interfaces
2. **Implement efficient dataset abstractions** with batching, shuffling, and streaming for memory-efficient training
3. **Build preprocessing pipelines** for normalization, augmentation, and transformation with fit-transform patterns
4. **Understand memory-efficient data loading patterns** for large datasets that don't fit in RAM
5. **Apply systems thinking** to I/O optimization, caching strategies, and throughput engineering
## Why This Matters
### Production Context
Every production ML system depends on robust data infrastructure:
- **Netflix** uses sophisticated data pipelines to train recommendation models on billions of viewing records
- **Tesla** processes terabytes of driving sensor data through efficient loading pipelines for autonomous driving
- **OpenAI** built custom data loaders to train GPT models on hundreds of billions of tokens
- **Meta** developed PyTorch's DataLoader (which you're reimplementing) to power research and production
### Historical Context
Data loading evolved from bottleneck to optimized system:
- **Early ML (pre-2010)**: Small datasets fit entirely in memory; data loading was an afterthought
- **ImageNet Era (2012)**: AlexNet required efficient loading of 1.2M images; preprocessing became critical
- **Big Data ML (2015+)**: Streaming data pipelines became necessary for datasets too large for memory
- **Modern Scale (2020+)**: Data loading is now a first-class systems problem with dedicated infrastructure teams
The patterns you're building are the same ones used in production at scale.
## Pedagogical Pattern: Build → Use → Analyze
### 1. Build
Implement from first principles:
- Dataset abstraction with Python protocols (`__getitem__`, `__len__`)
- DataLoader with batching, shuffling, and iteration
- CIFAR-10 dataset loader with binary file parsing
- Normalizer with fit-transform pattern
- Memory-efficient streaming for large datasets
### 2. Use
Apply to real problems:
- Load and preprocess CIFAR-10 (50,000 training images)
- Create train/test data loaders with proper batching
- Build preprocessing pipelines for normalization
- Integrate with training loops from Module 07
- Measure throughput and identify bottlenecks
### 3. Analyze
Deep-dive into systems behavior:
- Profile memory usage patterns with different batch sizes
- Measure I/O throughput and identify disk bottlenecks
- Compare streaming vs in-memory loading strategies
- Analyze the impact of shuffling on training dynamics
- Understand trade-offs between batch size and memory (see the sketch after this list)
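A back-of-envelope sketch for the batch-size/memory trade-off, assuming float32 CIFAR-10 samples; real training uses considerably more because activations and gradients dominate, so treat this as a lower bound:
```python
def batch_memory_mb(batch_size, shape=(3, 32, 32), bytes_per_elem=4):
    """Raw memory of one float32 batch, ignoring framework overhead."""
    n_elems = batch_size
    for dim in shape:
        n_elems *= dim
    return n_elems * bytes_per_elem / 1e6

for bs in (32, 128, 512):
    print(bs, f"{batch_memory_mb(bs):.2f} MB")  # 0.39, 1.57, 6.29 MB
```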
## Implementation Guide
### Core Components
**Dataset Abstraction**
```python
class Dataset:
    """Abstract base class for all datasets.

    Implements Python protocols for indexing and length.
    Subclasses must implement __getitem__ and __len__.
    """

    def __getitem__(self, index: int):
        """Return (data, label) for the given index."""
        raise NotImplementedError

    def __len__(self) -> int:
        """Return the total number of samples."""
        raise NotImplementedError
```
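To see the protocol in action, here is a minimal in-memory dataset; the `ArrayDataset` name and the random-array test data are illustrative, not part of the module's required API:
```python
import numpy as np

class ArrayDataset(Dataset):
    """Minimal dataset backed by in-memory arrays."""
    def __init__(self, data, labels):
        assert len(data) == len(labels), "data and labels must align"
        self.data = data
        self.labels = labels

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)

# 100 random "images" with integer labels in [0, 9]
dataset = ArrayDataset(np.random.randn(100, 3, 32, 32), np.arange(100) % 10)
print(len(dataset))   # 100
x, y = dataset[0]     # a (3, 32, 32) array and its label
```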
**DataLoader Implementation**
```python
import numpy as np

class DataLoader:
    """Efficient batch loading with shuffling support.

    Features:
    - Automatic batching with configurable batch size
    - Optional shuffling for training randomization
    - Optional dropping of the last incomplete batch so every batch is full-size
    - Memory-efficient iteration without loading all data at once
    """

    def __init__(self, dataset, batch_size=32, shuffle=False, drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last

    def __iter__(self):
        # Generate indices (shuffled or sequential)
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            np.random.shuffle(indices)
        # Yield batches
        for i in range(0, len(indices), self.batch_size):
            batch_indices = indices[i:i + self.batch_size]
            if len(batch_indices) < self.batch_size and self.drop_last:
                continue
            yield self._get_batch(batch_indices)

    def _get_batch(self, batch_indices):
        # Gather individual samples and stack them into batch arrays
        samples = [self.dataset[i] for i in batch_indices]
        data = np.stack([s[0] for s in samples])
        labels = np.array([s[1] for s in samples])
        return data, labels
```
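A quick usage sketch, building on the illustrative `ArrayDataset` above, shows the iterator protocol at work:
```python
loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
for data, labels in loader:
    # With drop_last=True, every batch has exactly batch_size samples
    assert data.shape == (32, 3, 32, 32)
    assert labels.shape == (32,)
```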
**CIFAR-10 Dataset Loader**
```python
class CIFAR10Dataset(Dataset):
    """Load the CIFAR-10 dataset with automatic download.

    CIFAR-10: 60,000 32x32 color images in 10 classes
    - 50,000 training images
    - 10,000 test images
    - Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
    """

    def __init__(self, root='./data', train=True, download=True):
        self.train = train
        if download:
            self._download(root)  # fetch and extract the archive (you implement this in step 2)
        self.data, self.labels = self._load_batch_files(root, train)  # parse the batch files (step 2)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)
```
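For reference, the Python version of CIFAR-10 ships as pickled batch files, each a dict holding 10,000 flat uint8 rows. Here is a sketch of what the training-split half of `_load_batch_files` might look like; the file names follow the official archive layout, error handling is omitted, and scaling to [0, 1] is one common convention rather than a requirement:
```python
import os
import pickle
import numpy as np

def load_cifar10_train(root):
    """Parse the five pickled training batches into (data, labels) arrays."""
    data, labels = [], []
    for i in range(1, 6):
        path = os.path.join(root, 'cifar-10-batches-py', f'data_batch_{i}')
        with open(path, 'rb') as f:
            batch = pickle.load(f, encoding='bytes')  # dict keys are bytes, e.g. b'data'
        data.append(batch[b'data'])     # (10000, 3072) uint8, R/G/B planes concatenated
        labels.extend(batch[b'labels'])
    # Reshape flat rows to (N, 3, 32, 32) and scale to [0, 1]
    data = np.concatenate(data).reshape(-1, 3, 32, 32).astype(np.float32) / 255.0
    return data, np.array(labels)
```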
**Preprocessing Pipeline**
```python
class Normalizer:
    """Normalize data using the fit-transform pattern.

    Fits statistics on training data, then applies them to all splits.
    This ensures consistent preprocessing across train/val/test.
    """

    def fit(self, data):
        """Compute mean and std from training data."""
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self

    def transform(self, data):
        """Apply normalization using fitted statistics."""
        return (data - self.mean) / (self.std + 1e-8)  # epsilon guards against division by zero

    def fit_transform(self, data):
        """Fit and transform in one step."""
        return self.fit(data).transform(data)
```
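One caveat: `axis=0` computes one statistic per element position, which is what you want for flat feature vectors. For the per-channel statistics mentioned in step 4 below, you reduce over samples and pixels together. A sketch of the adjustment, assuming `(N, 3, 32, 32)` arrays (`keepdims` preserves a broadcastable `(1, 3, 1, 1)` shape):
```python
import numpy as np

def fit_per_channel(data):
    """Compute one mean/std per RGB channel over all samples and pixels."""
    mean = data.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, 3, 1, 1)
    std = data.std(axis=(0, 2, 3), keepdims=True)
    return mean, std

train = np.random.rand(100, 3, 32, 32).astype(np.float32)
mean, std = fit_per_channel(train)
normalized = (train - mean) / (std + 1e-8)  # broadcasts across N, H, W
```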
### Step-by-Step Implementation
1. **Create Dataset Base Class**
   - Implement `__getitem__` and `__len__` protocols
   - Define the interface all datasets must follow
   - Test with a simple array-based dataset
2. **Build CIFAR-10 Loader**
   - Implement download and extraction logic
   - Parse the binary batch files (pickle format)
   - Reshape data from flat arrays to (3, 32, 32) images
   - Handle train/test split loading
3. **Implement DataLoader**
   - Create batching logic with configurable batch size
   - Add shuffling with random permutation
   - Implement the iterator protocol for Pythonic loops
   - Handle edge cases (last incomplete batch, empty dataset)
4. **Add Preprocessing**
   - Build Normalizer with the fit-transform pattern
   - Compute per-channel statistics for RGB images
   - Apply transformations efficiently across batches
   - Test normalization correctness (zero mean, unit variance)
5. **Integration Testing**
   - Load CIFAR-10 and create data loaders
   - Iterate through batches and verify shapes
   - Test with the actual training loop from Module 07
   - Measure data loading throughput (see the timing sketch after this list)
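For the throughput measurement in step 5, a minimal timing harness like the following is enough to spot whether loading can keep up with training. The function name is illustrative, and `loader` refers to the earlier usage sketch:
```python
import time

def measure_throughput(loader, warmup=1):
    """Time one full pass over the loader and report samples/sec."""
    # Warm-up passes exclude one-time costs (file opens, OS page cache)
    for _ in range(warmup):
        for _ in loader:
            pass
    start = time.perf_counter()
    n_samples = 0
    for data, labels in loader:
        n_samples += len(labels)
    elapsed = time.perf_counter() - start
    print(f"{n_samples / elapsed:.0f} samples/sec over {elapsed:.2f}s")

measure_throughput(loader)
```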
## Testing
### Inline Tests (During Development)
Run inline tests while building:
```bash
cd modules/08_dataloader
python dataloader_dev.py
```
Expected output:
```
Unit Test: Dataset abstraction...
✅ __getitem__ protocol works correctly
✅ __len__ returns correct size
✅ Indexing returns (data, label) tuples
Progress: Dataset Interface ✓
Unit Test: CIFAR-10 loading...
✅ Downloaded and extracted 170MB dataset
✅ Loaded 50,000 training samples
✅ Sample shape: (3, 32, 32), label range: [0, 9]
Progress: CIFAR-10 Dataset ✓
Unit Test: DataLoader batching...
✅ Batch shapes correct: (32, 3, 32, 32)
✅ Shuffling produces different orderings
✅ Iteration covers all samples exactly once
Progress: DataLoader ✓
```
### Export and Validate
After completing the module:
```bash
# Export to tinytorch package
tito export 08_dataloader
# Run integration tests
tito test 08_dataloader
```
### Comprehensive Test Coverage
The test suite validates:
- Dataset interface correctness
- CIFAR-10 loading and parsing
- Batch shape consistency
- Shuffling randomness
- Memory efficiency
- Preprocessing accuracy
## Where This Code Lives
```
tinytorch/
├── core/
│   └── dataloader.py   # Your implementation goes here
└── __init__.py         # Exposes DataLoader, Dataset, etc.
```
Usage in other modules:
```python
>>> from tinytorch.core.dataloader import DataLoader, CIFAR10Dataset
>>> dataset = CIFAR10Dataset(download=True)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
```
## Systems Thinking Questions
1. **Memory vs Throughput Trade-off**: Why does increasing batch size improve GPU utilization but increase memory usage? What's the optimal batch size for a 16GB GPU?
2. **Shuffling Impact**: How does shuffling affect training dynamics and convergence? Why is it critical for training but not for evaluation?
3. **I/O Bottlenecks**: Your GPU can process 1000 images/sec but your disk reads at 100 images/sec. Where's the bottleneck? How would you fix it?
4. **Preprocessing Placement**: Should preprocessing happen in the data loader or in the training loop? What are the trade-offs for CPU vs GPU preprocessing?
5. **Distributed Loading**: If you're training on 8 GPUs, how should you partition the dataset? What challenges arise with shuffling across multiple workers?
## Real-World Connections
### Industry Applications
**Netflix (Recommendation Systems)**
- Processes billions of viewing records through custom data pipelines
- Uses streaming loaders for datasets that don't fit in memory
- Implements sophisticated batching strategies for negative sampling
**Autonomous Vehicles (Tesla, Waymo)**
- Load terabytes of sensor data (camera, LIDAR, radar) for training
- Use multi-worker data loading to keep GPUs fully utilized
- Implement real-time preprocessing pipelines for online learning
**Large Language Models (OpenAI, Anthropic)**
- Stream hundreds of billions of tokens from distributed storage
- Use custom data loaders optimized for sequence data
- Implement efficient tokenization and batching for transformers
### Research Impact
This module teaches patterns from:
- PyTorch DataLoader (2016): The industry-standard data loading API
- TensorFlow Dataset API (2017): Google's approach to data pipelines
- NVIDIA DALI (2019): GPU-accelerated preprocessing for peak throughput
- WebDataset (2020): Efficient loading from cloud storage
## What's Next?
In **Module 09: Spatial (CNNs)**, you'll use these data loaders to train convolutional neural networks on CIFAR-10:
- Apply convolution operations to the RGB images you're loading
- Use your DataLoader to iterate through 50,000 training samples
- Achieve >75% accuracy on CIFAR-10 classification
- Understand how CNNs process spatial data efficiently
The data infrastructure you built here becomes critical—training CNNs requires efficient batch loading of image data with proper preprocessing.
---
**Ready to build production data infrastructure?** Open `modules/08_dataloader/dataloader_dev.py` and start implementing.