Update documentation after module reordering

All module references updated to reflect the new ordering:
- Module 15: Quantization (was 16)
- Module 16: Compression (was 17)
- Module 17: Memoization (was 15)

Updated by the module-developer and website-manager agents:
- Module ABOUT files with correct numbers and prerequisites
- Cross-references and "What's Next" chains
- Website navigation (_toc.yml) and content
- Learning path progression in LEARNING_PATH.md
- Profile milestone completion message (Module 17)

Pedagogical flow is now: Profile → Quantize → Prune → Cache → Accelerate
modules/08_dataloader/ABOUT.md (new file, 332 lines)

---
title: "DataLoader - Data Pipeline Engineering"
description: "Build production-grade data loading infrastructure for training at scale"
difficulty: 3
time_estimate: "5-6 hours"
prerequisites: ["Tensor", "Layers", "Training"]
next_steps: ["Spatial (CNNs)"]
learning_objectives:
- "Design scalable data pipeline architectures for production ML systems"
- "Implement efficient dataset abstractions with batching and streaming"
- "Build preprocessing pipelines for normalization and data augmentation"
- "Understand memory-efficient data loading patterns for large datasets"
- "Apply systems thinking to I/O optimization and throughput engineering"
---

# 08. DataLoader

**🏛️ ARCHITECTURE TIER** | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours

## Overview

Build the data engineering infrastructure that feeds neural networks. This module implements production-grade data loading, preprocessing, and batching systems—the critical backbone that enables training on real-world datasets like CIFAR-10.

## Learning Objectives

By completing this module, you will be able to:

1. **Design scalable data pipeline architectures** for production ML systems with proper abstractions and interfaces
2. **Implement efficient dataset abstractions** with batching, shuffling, and streaming for memory-efficient training
3. **Build preprocessing pipelines** for normalization, augmentation, and transformation with fit-transform patterns
4. **Understand memory-efficient data loading patterns** for large datasets that don't fit in RAM
5. **Apply systems thinking** to I/O optimization, caching strategies, and throughput engineering

## Why This Matters

### Production Context

Every production ML system depends on robust data infrastructure:

- **Netflix** uses sophisticated data pipelines to train recommendation models on billions of viewing records
- **Tesla** processes terabytes of driving sensor data through efficient loading pipelines for autonomous driving
- **OpenAI** built custom data loaders to train GPT models on hundreds of billions of tokens
- **Meta** developed PyTorch's DataLoader (which you're reimplementing) to power research and production

### Historical Context

Data loading evolved from bottleneck to optimized system:

- **Early ML (pre-2010)**: Small datasets fit entirely in memory; data loading was an afterthought
- **ImageNet Era (2012)**: AlexNet required efficient loading of 1.2M images; preprocessing became critical
- **Big Data ML (2015+)**: Streaming data pipelines became necessary for datasets too large for memory
- **Modern Scale (2020+)**: Data loading is now a first-class systems problem with dedicated infrastructure teams

The patterns you're building are the same ones used in production at scale.

## Pedagogical Pattern: Build → Use → Analyze

### 1. Build

Implement from first principles:
- Dataset abstraction with Python protocols (`__getitem__`, `__len__`)
- DataLoader with batching, shuffling, and iteration
- CIFAR-10 dataset loader with binary file parsing
- Normalizer with fit-transform pattern
- Memory-efficient streaming for large datasets (see the streaming sketch below)
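
For the streaming bullet, a minimal sketch of the idea, assuming samples are sharded across several `.npz` files (the file layout and the `stream_batches` helper are illustrative, not part of the module's required API):

```python
import numpy as np

def stream_batches(paths, batch_size=32):
    """Yield (data, labels) batches while keeping only one shard in RAM.

    Assumes each path in `paths` points to an .npz archive with
    'data' and 'labels' arrays -- an illustrative layout.
    """
    for path in paths:
        with np.load(path) as archive:
            data, labels = archive["data"], archive["labels"]
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size], labels[i:i + batch_size]
```

Because the generator loads one shard at a time, peak memory stays proportional to the shard size rather than the full dataset.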

### 2. Use

Apply to real problems:
- Load and preprocess CIFAR-10 (50,000 training images)
- Create train/test data loaders with proper batching
- Build preprocessing pipelines for normalization
- Integrate with training loops from Module 07
- Measure throughput and identify bottlenecks

### 3. Analyze

Deep-dive into systems behavior:
- Profile memory usage patterns with different batch sizes
- Measure I/O throughput and identify disk bottlenecks
- Compare streaming vs in-memory loading strategies
- Analyze the impact of shuffling on training dynamics
- Understand trade-offs between batch size and memory

## Implementation Guide

### Core Components

**Dataset Abstraction**
```python
class Dataset:
    """Abstract base class for all datasets.

    Implements Python protocols for indexing and length.
    Subclasses must implement __getitem__ and __len__.
    """
    def __getitem__(self, index: int):
        """Return (data, label) for given index."""
        raise NotImplementedError

    def __len__(self) -> int:
        """Return total number of samples."""
        raise NotImplementedError
```
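
Before moving on, it helps to check both protocols with a trivial subclass; this `ArrayDataset` is an illustrative helper, not part of the module's required API:

```python
import numpy as np

class ArrayDataset(Dataset):
    """Wrap two parallel arrays of samples and labels (illustration only)."""
    def __init__(self, data, labels):
        assert len(data) == len(labels), "data and labels must align"
        self.data, self.labels = data, labels

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)

# Both protocols in action:
ds = ArrayDataset(np.zeros((10, 3, 32, 32)), np.arange(10))
assert len(ds) == 10
sample, label = ds[0]
```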

**DataLoader Implementation**
```python
import numpy as np

class DataLoader:
    """Efficient batch loading with shuffling support.

    Features:
    - Automatic batching with configurable batch size
    - Optional shuffling for training randomization
    - Optional dropping of the last incomplete batch (drop_last)
    - Memory-efficient iteration without loading all data
    """
    def __init__(self, dataset, batch_size=32, shuffle=False, drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last

    def __iter__(self):
        # Generate indices (shuffled or sequential)
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            np.random.shuffle(indices)

        # Yield batches
        for i in range(0, len(indices), self.batch_size):
            batch_indices = indices[i:i + self.batch_size]
            if len(batch_indices) < self.batch_size and self.drop_last:
                continue
            # _get_batch stacks the selected samples into arrays (see below)
            yield self._get_batch(batch_indices)
```
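
The loader above delegates to a `_get_batch` helper that is left for you to write; a minimal sketch, assuming samples are numpy-compatible arrays paired with scalar labels:

```python
import numpy as np

def _get_batch(self, batch_indices):
    """Stack individual (data, label) samples into batch arrays."""
    samples = [self.dataset[i] for i in batch_indices]
    data = np.stack([sample for sample, _ in samples])   # (B, ...) inputs
    labels = np.array([label for _, label in samples])   # (B,) labels
    return data, labels
```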

**CIFAR-10 Dataset Loader**
```python
class CIFAR10Dataset(Dataset):
    """Load CIFAR-10 dataset with automatic download.

    CIFAR-10: 60,000 32x32 color images in 10 classes
    - 50,000 training images
    - 10,000 test images
    - Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
    """
    def __init__(self, root='./data', train=True, download=True):
        self.train = train
        if download:
            self._download(root)
        self.data, self.labels = self._load_batch_files(root, train)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)
```
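
`_download` and `_load_batch_files` are the two helpers you implement. For the parsing half, a sketch assuming the standard python-version archive layout (`cifar-10-batches-py/` containing `data_batch_1` through `data_batch_5` and `test_batch`, each a pickled dict keyed by `b'data'` and `b'labels'`):

```python
import os
import pickle
import numpy as np

def _load_batch_files(self, root, train):
    """Parse CIFAR-10 pickle batches into (N, 3, 32, 32) images and labels."""
    base = os.path.join(root, 'cifar-10-batches-py')
    names = [f'data_batch_{i}' for i in range(1, 6)] if train else ['test_batch']
    data_parts, label_parts = [], []
    for name in names:
        with open(os.path.join(base, name), 'rb') as f:
            batch = pickle.load(f, encoding='bytes')  # dict keys are bytes
        data_parts.append(batch[b'data'])             # (10000, 3072) uint8 rows
        label_parts.extend(batch[b'labels'])
    # Each flat row stores the R, G, B planes back to back
    data = np.concatenate(data_parts).reshape(-1, 3, 32, 32)
    return data, np.array(label_parts)
```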

**Preprocessing Pipeline**
```python
class Normalizer:
    """Normalize data using fit-transform pattern.

    Fits statistics on training data, applies to all splits.
    Ensures consistent preprocessing across train/val/test.
    """
    def fit(self, data):
        """Compute mean and std from training data."""
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self

    def transform(self, data):
        """Apply normalization using fitted statistics."""
        return (data - self.mean) / (self.std + 1e-8)

    def fit_transform(self, data):
        """Fit and transform in one step."""
        return self.fit(data).transform(data)
```
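
Fit on the training split only, then reuse those statistics for validation and test. For image tensors you may prefer per-channel statistics over per-pixel ones; the variant below is an assumed extension, and the `train_images`/`test_images` names are placeholders:

```python
# Per-pixel statistics (axis=0), exactly as Normalizer computes them:
norm = Normalizer()
train_normed = norm.fit_transform(train_images)  # fit on train only
test_normed = norm.transform(test_images)        # reuse train statistics

# Per-channel variant for (N, 3, 32, 32) images -- one mean/std per channel:
mean = train_images.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, 3, 1, 1)
std = train_images.std(axis=(0, 2, 3), keepdims=True)
test_channel_normed = (test_images - mean) / (std + 1e-8)
```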

### Step-by-Step Implementation

1. **Create Dataset Base Class**
   - Implement `__getitem__` and `__len__` protocols
   - Define the interface all datasets must follow
   - Test with a simple array-based dataset

2. **Build CIFAR-10 Loader**
   - Implement download and extraction logic
   - Parse binary batch files (pickle format)
   - Reshape data from flat arrays to (3, 32, 32) images
   - Handle train/test split loading

3. **Implement DataLoader**
   - Create batching logic with configurable batch size
   - Add shuffling with random permutation
   - Implement the iterator protocol for Pythonic loops
   - Handle edge cases (last incomplete batch, empty dataset)

4. **Add Preprocessing**
   - Build Normalizer with fit-transform pattern
   - Compute per-channel statistics for RGB images
   - Apply transformations efficiently across batches
   - Test normalization correctness (zero mean, unit variance)

5. **Integration Testing**
   - Load CIFAR-10 and create data loaders
   - Iterate through batches and verify shapes
   - Test with the actual training loop from Module 07
   - Measure data loading throughput (see the timing sketch below)
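
For the throughput measurement in step 5, timing one full pass over the loader is enough; a minimal sketch, assuming the `DataLoader` built above:

```python
import time

def measure_throughput(loader):
    """Return images/sec for one full pass over a data loader."""
    start = time.perf_counter()
    n_images = 0
    for data, labels in loader:
        n_images += len(data)  # count samples, not batches
    return n_images / (time.perf_counter() - start)

# Larger batches amortize per-batch overhead -- compare a few sizes:
for bs in (8, 32, 128):
    loader = DataLoader(dataset, batch_size=bs, shuffle=False)
    print(f"batch_size={bs}: {measure_throughput(loader):.0f} images/sec")
```

(`dataset` here stands in for any `Dataset` instance, e.g. the CIFAR-10 loader.)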

## Testing

### Inline Tests (During Development)

Run inline tests while building:
```bash
cd modules/08_dataloader
python dataloader_dev.py
```

Expected output:
```
Unit Test: Dataset abstraction...
✅ __getitem__ protocol works correctly
✅ __len__ returns correct size
✅ Indexing returns (data, label) tuples
Progress: Dataset Interface ✓

Unit Test: CIFAR-10 loading...
✅ Downloaded and extracted 170MB dataset
✅ Loaded 50,000 training samples
✅ Sample shape: (3, 32, 32), label range: [0, 9]
Progress: CIFAR-10 Dataset ✓

Unit Test: DataLoader batching...
✅ Batch shapes correct: (32, 3, 32, 32)
✅ Shuffling produces different orderings
✅ Iteration covers all samples exactly once
Progress: DataLoader ✓
```

### Export and Validate

After completing the module:
```bash
# Export to tinytorch package
tito export 08_dataloader

# Run integration tests
tito test 08_dataloader
```

### Comprehensive Test Coverage

The test suite validates:
- Dataset interface correctness
- CIFAR-10 loading and parsing
- Batch shape consistency
- Shuffling randomness
- Memory efficiency
- Preprocessing accuracy (see the example checks below)
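
Checks like these are straightforward to write while developing; the pair below is illustrative (not the module's actual test suite), covering batch shapes and normalization:

```python
import numpy as np

def test_batch_shapes(dataset):
    """Every full batch should have the expected leading dimension."""
    loader = DataLoader(dataset, batch_size=32, drop_last=True)
    for data, labels in loader:
        assert data.shape == (32, 3, 32, 32)
        assert labels.shape == (32,)

def test_normalization(train_images):
    """A fitted transform should yield near-zero mean and unit variance."""
    flat = train_images.reshape(len(train_images), -1).astype(np.float64)
    normed = Normalizer().fit_transform(flat)
    assert abs(normed.mean()) < 1e-6
    assert abs(normed.std() - 1.0) < 1e-3
```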

## Where This Code Lives

```
tinytorch/
├── core/
│   └── dataloader.py    # Your implementation goes here
└── __init__.py          # Exposes DataLoader, Dataset, etc.

Usage in other modules:
>>> from tinytorch.core.dataloader import DataLoader, CIFAR10Dataset
>>> dataset = CIFAR10Dataset(download=True)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
```

## Systems Thinking Questions

1. **Memory vs Throughput Trade-off**: Why does increasing batch size improve GPU utilization but increase memory usage? What's the optimal batch size for a 16GB GPU?

2. **Shuffling Impact**: How does shuffling affect training dynamics and convergence? Why is it critical for training but not for evaluation?

3. **I/O Bottlenecks**: Your GPU can process 1000 images/sec but your disk reads at 100 images/sec. Where's the bottleneck? How would you fix it?

4. **Preprocessing Placement**: Should preprocessing happen in the data loader or in the training loop? What are the trade-offs for CPU vs GPU preprocessing?

5. **Distributed Loading**: If you're training on 8 GPUs, how should you partition the dataset? What challenges arise with shuffling across multiple workers?

## Real-World Connections

### Industry Applications

**Netflix (Recommendation Systems)**
- Processes billions of viewing records through custom data pipelines
- Uses streaming loaders for datasets that don't fit in memory
- Implements sophisticated batching strategies for negative sampling

**Autonomous Vehicles (Tesla, Waymo)**
- Load terabytes of sensor data (camera, LIDAR, radar) for training
- Use multi-worker data loading to keep GPUs fully utilized
- Implement real-time preprocessing pipelines for online learning

**Large Language Models (OpenAI, Anthropic)**
- Stream hundreds of billions of tokens from distributed storage
- Use custom data loaders optimized for sequence data
- Implement efficient tokenization and batching for transformers

### Research Impact

This module teaches patterns from:
- PyTorch DataLoader (2016): The industry-standard data loading API
- TensorFlow Dataset API (2017): Google's approach to data pipelines
- NVIDIA DALI (2019): GPU-accelerated preprocessing for peak throughput
- WebDataset (2020): Efficient loading from cloud storage

## What's Next?

In **Module 09: Spatial (CNNs)**, you'll use these data loaders to train convolutional neural networks on CIFAR-10:

- Apply convolution operations to the RGB images you're loading
- Use your DataLoader to iterate through 50,000 training samples
- Achieve >75% accuracy on CIFAR-10 classification
- Understand how CNNs process spatial data efficiently

The data infrastructure you built here becomes critical—training CNNs requires efficient batch loading of image data with proper preprocessing.

---

**Ready to build production data infrastructure?** Open `modules/08_dataloader/dataloader_dev.py` and start implementing.