Update documentation after module reordering

All module references updated to reflect new ordering:
- Module 15: Quantization (was 16)
- Module 16: Compression (was 17)
- Module 17: Memoization (was 15)

Updated by module-developer and website-manager agents:
- Module ABOUT files with correct numbers and prerequisites
- Cross-references and "What's Next" chains
- Website navigation (_toc.yml) and content
- Learning path progression in LEARNING_PATH.md
- Profile milestone completion message (Module 17)

Pedagogical flow now: Profile → Quantize → Prune → Cache → Accelerate
---
title: "DataLoader - Data Pipeline Engineering"
description: "Build production-grade data loading infrastructure for training at scale"
difficulty: 3
time_estimate: "5-6 hours"
prerequisites: ["Tensor", "Layers", "Training"]
next_steps: ["Spatial (CNNs)"]
learning_objectives:
- "Design scalable data pipeline architectures for production ML systems"
- "Implement efficient dataset abstractions with batching and streaming"
- "Build preprocessing pipelines for normalization and data augmentation"
- "Understand memory-efficient data loading patterns for large datasets"
- "Apply systems thinking to I/O optimization and throughput engineering"
---
# 08. DataLoader
**🏛️ ARCHITECTURE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours
## Overview
Build the data engineering infrastructure that feeds neural networks. This module implements production-grade data loading, preprocessing, and batching systems—the critical backbone that enables training on real-world datasets like CIFAR-10.
## Learning Objectives
By completing this module, you will be able to:
1. **Design scalable data pipeline architectures** for production ML systems with proper abstractions and interfaces
2. **Implement efficient dataset abstractions** with batching, shuffling, and streaming for memory-efficient training
3. **Build preprocessing pipelines** for normalization, augmentation, and transformation with fit-transform patterns
4. **Understand memory-efficient data loading patterns** for large datasets that don't fit in RAM
5. **Apply systems thinking** to I/O optimization, caching strategies, and throughput engineering
## Why This Matters
### Production Context
Every production ML system depends on robust data infrastructure:
- **Netflix** uses sophisticated data pipelines to train recommendation models on billions of viewing records
- **Tesla** processes terabytes of driving sensor data through efficient loading pipelines for autonomous driving
- **OpenAI** built custom data loaders to train GPT models on hundreds of billions of tokens
- **Meta** developed PyTorch's DataLoader (which you're reimplementing) to power research and production
### Historical Context
Data loading evolved from bottleneck to optimized system:
- **Early ML (pre-2010)**: Small datasets fit entirely in memory; data loading was an afterthought
- **ImageNet Era (2012)**: AlexNet required efficient loading of 1.2M images; preprocessing became critical
- **Big Data ML (2015+)**: Streaming data pipelines became necessary for datasets too large for memory
- **Modern Scale (2020+)**: Data loading is now a first-class systems problem with dedicated infrastructure teams
The patterns you're building are the same ones used in production at scale.
## Pedagogical Pattern: Build → Use → Analyze
### 1. Build
Implement from first principles:
- Dataset abstraction with Python protocols (`__getitem__`, `__len__`)
- DataLoader with batching, shuffling, and iteration
- CIFAR-10 dataset loader with binary file parsing
- Normalizer with fit-transform pattern
- Memory-efficient streaming for large datasets
### 2. Use
Apply to real problems:
- Load and preprocess CIFAR-10 (50,000 training images)
- Create train/test data loaders with proper batching
- Build preprocessing pipelines for normalization
- Integrate with training loops from Module 07
- Measure throughput and identify bottlenecks
### 3. Analyze
Deep-dive into systems behavior:
- Profile memory usage patterns with different batch sizes
- Measure I/O throughput and identify disk bottlenecks
- Compare streaming vs in-memory loading strategies
- Analyze the impact of shuffling on training dynamics
- Understand trade-offs between batch size and memory (see the sketch after this list)
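A back-of-envelope sketch for the batch-size/memory trade-off, assuming float32 CIFAR-10 samples; real training uses considerably more because activations and gradients dominate, so treat this as a lower bound:
```python
def batch_memory_mb(batch_size, shape=(3, 32, 32), bytes_per_elem=4):
    """Raw memory of one float32 batch, ignoring framework overhead."""
    n_elems = batch_size
    for dim in shape:
        n_elems *= dim
    return n_elems * bytes_per_elem / 1e6

for bs in (32, 128, 512):
    print(bs, f"{batch_memory_mb(bs):.2f} MB")  # 0.39, 1.57, 6.29 MB
```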
## Implementation Guide
### Core Components
**Dataset Abstraction**
```python
class Dataset:
    """Abstract base class for all datasets.

    Implements Python protocols for indexing and length.
    Subclasses must implement __getitem__ and __len__.
    """

    def __getitem__(self, index: int):
        """Return (data, label) for the given index."""
        raise NotImplementedError

    def __len__(self) -> int:
        """Return the total number of samples."""
        raise NotImplementedError
```
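To see the protocol in action, here is a minimal in-memory dataset; the `ArrayDataset` name and the random-array test data are illustrative, not part of the module's required API:
```python
import numpy as np

class ArrayDataset(Dataset):
    """Minimal dataset backed by in-memory arrays."""
    def __init__(self, data, labels):
        assert len(data) == len(labels), "data and labels must align"
        self.data = data
        self.labels = labels

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)

# 100 random "images" with integer labels in [0, 9]
dataset = ArrayDataset(np.random.randn(100, 3, 32, 32), np.arange(100) % 10)
print(len(dataset))   # 100
x, y = dataset[0]     # a (3, 32, 32) array and its label
```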
**DataLoader Implementation**
```python
import numpy as np

class DataLoader:
    """Efficient batch loading with shuffling support.

    Features:
    - Automatic batching with configurable batch size
    - Optional shuffling for training randomization
    - Optional dropping of the last incomplete batch so every batch is full-size
    - Memory-efficient iteration without loading all data at once
    """

    def __init__(self, dataset, batch_size=32, shuffle=False, drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last

    def __iter__(self):
        # Generate indices (shuffled or sequential)
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            np.random.shuffle(indices)
        # Yield batches
        for i in range(0, len(indices), self.batch_size):
            batch_indices = indices[i:i + self.batch_size]
            if len(batch_indices) < self.batch_size and self.drop_last:
                continue
            yield self._get_batch(batch_indices)

    def _get_batch(self, batch_indices):
        # Gather individual samples and stack them into batch arrays
        samples = [self.dataset[i] for i in batch_indices]
        data = np.stack([s[0] for s in samples])
        labels = np.array([s[1] for s in samples])
        return data, labels
```
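A quick usage sketch, building on the illustrative `ArrayDataset` above, shows the iterator protocol at work:
```python
loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
for data, labels in loader:
    # With drop_last=True, every batch has exactly batch_size samples
    assert data.shape == (32, 3, 32, 32)
    assert labels.shape == (32,)
```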
**CIFAR-10 Dataset Loader**
```python
class CIFAR10Dataset(Dataset):
    """Load the CIFAR-10 dataset with automatic download.

    CIFAR-10: 60,000 32x32 color images in 10 classes
    - 50,000 training images
    - 10,000 test images
    - Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
    """

    def __init__(self, root='./data', train=True, download=True):
        self.train = train
        if download:
            self._download(root)  # fetch and extract the archive (you implement this in step 2)
        self.data, self.labels = self._load_batch_files(root, train)  # parse the batch files (step 2)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)
```
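For reference, the Python version of CIFAR-10 ships as pickled batch files, each a dict holding 10,000 flat uint8 rows. Here is a sketch of what the training-split half of `_load_batch_files` might look like; the file names follow the official archive layout, error handling is omitted, and scaling to [0, 1] is one common convention rather than a requirement:
```python
import os
import pickle
import numpy as np

def load_cifar10_train(root):
    """Parse the five pickled training batches into (data, labels) arrays."""
    data, labels = [], []
    for i in range(1, 6):
        path = os.path.join(root, 'cifar-10-batches-py', f'data_batch_{i}')
        with open(path, 'rb') as f:
            batch = pickle.load(f, encoding='bytes')  # dict keys are bytes, e.g. b'data'
        data.append(batch[b'data'])     # (10000, 3072) uint8, R/G/B planes concatenated
        labels.extend(batch[b'labels'])
    # Reshape flat rows to (N, 3, 32, 32) and scale to [0, 1]
    data = np.concatenate(data).reshape(-1, 3, 32, 32).astype(np.float32) / 255.0
    return data, np.array(labels)
```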
**Preprocessing Pipeline**
```python
class Normalizer:
    """Normalize data using the fit-transform pattern.

    Fits statistics on training data, then applies them to all splits.
    This ensures consistent preprocessing across train/val/test.
    """

    def fit(self, data):
        """Compute mean and std from training data."""
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self

    def transform(self, data):
        """Apply normalization using fitted statistics."""
        return (data - self.mean) / (self.std + 1e-8)  # epsilon guards against division by zero

    def fit_transform(self, data):
        """Fit and transform in one step."""
        return self.fit(data).transform(data)
```
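One caveat: `axis=0` computes one statistic per element position, which is what you want for flat feature vectors. For the per-channel statistics mentioned in step 4 below, you reduce over samples and pixels together. A sketch of the adjustment, assuming `(N, 3, 32, 32)` arrays (`keepdims` preserves a broadcastable `(1, 3, 1, 1)` shape):
```python
import numpy as np

def fit_per_channel(data):
    """Compute one mean/std per RGB channel over all samples and pixels."""
    mean = data.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, 3, 1, 1)
    std = data.std(axis=(0, 2, 3), keepdims=True)
    return mean, std

train = np.random.rand(100, 3, 32, 32).astype(np.float32)
mean, std = fit_per_channel(train)
normalized = (train - mean) / (std + 1e-8)  # broadcasts across N, H, W
```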
### Step-by-Step Implementation
1. **Create Dataset Base Class**
   - Implement `__getitem__` and `__len__` protocols
   - Define the interface all datasets must follow
   - Test with a simple array-based dataset
2. **Build CIFAR-10 Loader**
   - Implement download and extraction logic
   - Parse the binary batch files (pickle format)
   - Reshape data from flat arrays to (3, 32, 32) images
   - Handle train/test split loading
3. **Implement DataLoader**
   - Create batching logic with configurable batch size
   - Add shuffling with random permutation
   - Implement the iterator protocol for Pythonic loops
   - Handle edge cases (last incomplete batch, empty dataset)
4. **Add Preprocessing**
   - Build Normalizer with the fit-transform pattern
   - Compute per-channel statistics for RGB images
   - Apply transformations efficiently across batches
   - Test normalization correctness (zero mean, unit variance)
5. **Integration Testing**
   - Load CIFAR-10 and create data loaders
   - Iterate through batches and verify shapes
   - Test with the actual training loop from Module 07
   - Measure data loading throughput (see the timing sketch after this list)
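For the throughput measurement in step 5, a minimal timing harness like the following is enough to spot whether loading can keep up with training. The function name is illustrative, and `loader` refers to the earlier usage sketch:
```python
import time

def measure_throughput(loader, warmup=1):
    """Time one full pass over the loader and report samples/sec."""
    # Warm-up passes exclude one-time costs (file opens, OS page cache)
    for _ in range(warmup):
        for _ in loader:
            pass
    start = time.perf_counter()
    n_samples = 0
    for data, labels in loader:
        n_samples += len(labels)
    elapsed = time.perf_counter() - start
    print(f"{n_samples / elapsed:.0f} samples/sec over {elapsed:.2f}s")

measure_throughput(loader)
```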
## Testing
### Inline Tests (During Development)
Run inline tests while building:
```bash
cd modules/08_dataloader
python dataloader_dev.py
```
Expected output:
```
Unit Test: Dataset abstraction...
✅ __getitem__ protocol works correctly
✅ __len__ returns correct size
✅ Indexing returns (data, label) tuples
Progress: Dataset Interface ✓
Unit Test: CIFAR-10 loading...
✅ Downloaded and extracted 170MB dataset
✅ Loaded 50,000 training samples
✅ Sample shape: (3, 32, 32), label range: [0, 9]
Progress: CIFAR-10 Dataset ✓
Unit Test: DataLoader batching...
✅ Batch shapes correct: (32, 3, 32, 32)
✅ Shuffling produces different orderings
✅ Iteration covers all samples exactly once
Progress: DataLoader ✓
```
### Export and Validate
After completing the module:
```bash
# Export to tinytorch package
tito export 08_dataloader
# Run integration tests
tito test 08_dataloader
```
### Comprehensive Test Coverage
The test suite validates:
- Dataset interface correctness
- CIFAR-10 loading and parsing
- Batch shape consistency
- Shuffling randomness
- Memory efficiency
- Preprocessing accuracy
## Where This Code Lives
```
tinytorch/
├── core/
│   └── dataloader.py   # Your implementation goes here
└── __init__.py         # Exposes DataLoader, Dataset, etc.
```
Usage in other modules:
```python
>>> from tinytorch.core.dataloader import DataLoader, CIFAR10Dataset
>>> dataset = CIFAR10Dataset(download=True)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
```
## Systems Thinking Questions
1. **Memory vs Throughput Trade-off**: Why does increasing batch size improve GPU utilization but increase memory usage? What's the optimal batch size for a 16GB GPU?
2. **Shuffling Impact**: How does shuffling affect training dynamics and convergence? Why is it critical for training but not for evaluation?
3. **I/O Bottlenecks**: Your GPU can process 1000 images/sec but your disk reads at 100 images/sec. Where's the bottleneck? How would you fix it?
4. **Preprocessing Placement**: Should preprocessing happen in the data loader or in the training loop? What are the trade-offs for CPU vs GPU preprocessing?
5. **Distributed Loading**: If you're training on 8 GPUs, how should you partition the dataset? What challenges arise with shuffling across multiple workers?
## Real-World Connections
### Industry Applications
**Netflix (Recommendation Systems)**
- Processes billions of viewing records through custom data pipelines
- Uses streaming loaders for datasets that don't fit in memory
- Implements sophisticated batching strategies for negative sampling
**Autonomous Vehicles (Tesla, Waymo)**
- Load terabytes of sensor data (camera, LIDAR, radar) for training
- Use multi-worker data loading to keep GPUs fully utilized
- Implement real-time preprocessing pipelines for online learning
**Large Language Models (OpenAI, Anthropic)**
- Stream hundreds of billions of tokens from distributed storage
- Use custom data loaders optimized for sequence data
- Implement efficient tokenization and batching for transformers
### Research Impact
This module teaches patterns from:
- PyTorch DataLoader (2016): The industry-standard data loading API
- TensorFlow Dataset API (2017): Google's approach to data pipelines
- NVIDIA DALI (2019): GPU-accelerated preprocessing for peak throughput
- WebDataset (2020): Efficient loading from cloud storage
## What's Next?
In **Module 09: Spatial (CNNs)**, you'll use these data loaders to train convolutional neural networks on CIFAR-10:
- Apply convolution operations to the RGB images you're loading
- Use your DataLoader to iterate through 50,000 training samples
- Achieve >75% accuracy on CIFAR-10 classification
- Understand how CNNs process spatial data efficiently
The data infrastructure you built here becomes critical—training CNNs requires efficient batch loading of image data with proper preprocessing.
---
**Ready to build production data infrastructure?** Open `modules/08_dataloader/dataloader_dev.py` and start implementing.