---
title: DataLoader - Data Pipeline Engineering
description: Build production-grade data loading infrastructure for training at scale
difficulty: 3
time_estimate: 5-6 hours
prerequisites:
  - Tensor
  - Layers
  - Training
next_steps:
  - Spatial (CNNs)
learning_objectives:
  - Design scalable data pipeline architectures for production ML systems
  - Implement efficient dataset abstractions with batching and streaming
  - Build preprocessing pipelines for normalization and data augmentation
  - Understand memory-efficient data loading patterns for large datasets
  - Apply systems thinking to I/O optimization and throughput engineering
---

08. DataLoader

🏛️ ARCHITECTURE TIER | Difficulty: 3/4 | Time: 5-6 hours

Overview

Build the data engineering infrastructure that feeds neural networks. This module implements production-grade data loading, preprocessing, and batching systems—the critical backbone that enables training on real-world datasets like CIFAR-10.

Learning Objectives

By completing this module, you will be able to:

  1. Design scalable data pipeline architectures for production ML systems with proper abstractions and interfaces
  2. Implement efficient dataset abstractions with batching, shuffling, and streaming for memory-efficient training
  3. Build preprocessing pipelines for normalization, augmentation, and transformation with fit-transform patterns
  4. Understand memory-efficient data loading patterns for large datasets that don't fit in RAM
  5. Apply systems thinking to I/O optimization, caching strategies, and throughput engineering

Why This Matters

Production Context

Every production ML system depends on robust data infrastructure:

  • Netflix uses sophisticated data pipelines to train recommendation models on billions of viewing records
  • Tesla processes terabytes of driving sensor data through efficient loading pipelines for autonomous driving
  • OpenAI built custom data loaders to train GPT models on hundreds of billions of tokens
  • Meta developed PyTorch's DataLoader (which you're reimplementing) to power research and production

Historical Context

Data loading has evolved from a neglected bottleneck into an optimized system:

  • Early ML (pre-2010): Small datasets fit entirely in memory; data loading was an afterthought
  • ImageNet Era (2012): AlexNet required efficient loading of 1.2M images; preprocessing became critical
  • Big Data ML (2015+): Streaming data pipelines became necessary for datasets too large for memory
  • Modern Scale (2020+): Data loading is now a first-class systems problem with dedicated infrastructure teams

The patterns you're building are the same ones used in production at scale.

Pedagogical Pattern: Build → Use → Analyze

1. Build

Implement from first principles:

  • Dataset abstraction with Python protocols (__getitem__, __len__)
  • DataLoader with batching, shuffling, and iteration
  • CIFAR-10 dataset loader with binary file parsing
  • Normalizer with fit-transform pattern
  • Memory-efficient streaming for large datasets

2. Use

Apply to real problems:

  • Load and preprocess CIFAR-10 (50,000 training images)
  • Create train/test data loaders with proper batching
  • Build preprocessing pipelines for normalization
  • Integrate with training loops from Module 07
  • Measure throughput and identify bottlenecks

3. Analyze

Deep-dive into systems behavior:

  • Profile memory usage patterns with different batch sizes
  • Measure I/O throughput and identify disk bottlenecks
  • Compare streaming vs in-memory loading strategies
  • Analyze the impact of shuffling on training dynamics
  • Understand trade-offs between batch size and memory (sized concretely in the sketch below)
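
To make the batch-size/memory trade-off concrete, the input side of a CIFAR-10 batch can be sized by hand (a back-of-the-envelope sketch; real training also holds activations and gradients, which usually dominate):

batch_size = 32
bytes_per_image = 3 * 32 * 32 * 4           # C x H x W in float32
batch_bytes = batch_size * bytes_per_image
print(batch_bytes / 1024)                   # 384.0 -> ~384 KiB per input batch
# The full 50,000-image training set is ~586 MiB in float32,
# which is why uint8 storage with on-the-fly conversion matters.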

Implementation Guide

Core Components

Dataset Abstraction

class Dataset:
    """Abstract base class for all datasets.
    
    Implements Python protocols for indexing and length.
    Subclasses must implement __getitem__ and __len__.
    """
    def __getitem__(self, index: int):
        """Return (data, label) for given index."""
        raise NotImplementedError
    
    def __len__(self) -> int:
        """Return total number of samples."""
        raise NotImplementedError
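
To see these protocols in action, here is a minimal in-memory dataset (a sketch for testing; ArrayDataset is an illustrative name, not part of the module's required API):

import numpy as np

class ArrayDataset(Dataset):
    """Wrap parallel arrays of samples and labels as a Dataset."""
    def __init__(self, data, labels):
        assert len(data) == len(labels)
        self.data = data
        self.labels = labels
    
    def __getitem__(self, index):
        return self.data[index], self.labels[index]
    
    def __len__(self):
        return len(self.data)

# Ten 4-feature samples with integer labels
ds = ArrayDataset(np.random.rand(10, 4), np.arange(10))
x, y = ds[0]      # indexing works via __getitem__
print(len(ds))    # 10, via __len__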

DataLoader Implementation

import numpy as np

class DataLoader:
    """Efficient batch loading with shuffling support.
    
    Features:
    - Automatic batching with configurable batch size
    - Optional shuffling for training randomization
    - Optional dropping of the last incomplete batch (drop_last)
    - Memory-efficient iteration without loading all data at once
    """
    def __init__(self, dataset, batch_size=32, shuffle=False, drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last
    
    def __iter__(self):
        # Generate indices (shuffled or sequential)
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            np.random.shuffle(indices)
        
        # Yield batches
        for i in range(0, len(indices), self.batch_size):
            batch_indices = indices[i:i + self.batch_size]
            if len(batch_indices) < self.batch_size and self.drop_last:
                continue  # skip the final incomplete batch
            yield self._get_batch(batch_indices)
    
    def _get_batch(self, batch_indices):
        # Stack individual (data, label) samples into batched arrays
        samples = [self.dataset[i] for i in batch_indices]
        data = np.stack([sample[0] for sample in samples])
        labels = np.array([sample[1] for sample in samples])
        return data, labels
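
A quick sanity check of the iterator, assuming the hypothetical ArrayDataset sketched above:

loader = DataLoader(ArrayDataset(np.random.rand(100, 4), np.arange(100)),
                    batch_size=32, shuffle=True)
seen = 0
for data, labels in loader:
    seen += len(labels)    # batch sizes: 32, 32, 32, 4
print(seen)                # 100: every sample appears exactly once

With drop_last=True the final 4-sample batch would be skipped and seen would be 96.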

CIFAR-10 Dataset Loader

class CIFAR10Dataset(Dataset):
    """Load CIFAR-10 dataset with automatic download.
    
    CIFAR-10: 60,000 32x32 color images in 10 classes
    - 50,000 training images
    - 10,000 test images
    - Classes: airplane, car, bird, cat, deer, dog, frog, horse, ship, truck
    """
    def __init__(self, root='./data', train=True, download=True):
        self.train = train
        if download:
            self._download(root)
        self.data, self.labels = self._load_batch_files(root, train)
    
    def __getitem__(self, index):
        return self.data[index], self.labels[index]
    
    def __len__(self):
        return len(self.data)
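
The helper methods are left for you to implement. As a starting point, here is one plausible core of _load_batch_files, based on the published CIFAR-10 python-version format (pickled dicts keyed by b'data' and b'labels'); treat this as a sketch, since the exact paths depend on how you extract the archive:

import pickle
import numpy as np

def load_cifar_batch(path):
    """Parse one CIFAR-10 batch file into images and labels."""
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    # b'data' is a (10000, 3072) uint8 array; reshape to channel-first images
    data = batch[b'data'].reshape(-1, 3, 32, 32)
    labels = np.array(batch[b'labels'])
    return data, labels

# Training split: concatenate data_batch_1 ... data_batch_5;
# the test split lives in a single test_batch file.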

Preprocessing Pipeline

class Normalizer:
    """Normalize data using fit-transform pattern.
    
    Fits statistics on training data, applies to all splits.
    Ensures consistent preprocessing across train/val/test.
    """
    def fit(self, data):
        """Compute mean and std from training data."""
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self
    
    def transform(self, data):
        """Apply normalization using fitted statistics."""
        return (data - self.mean) / (self.std + 1e-8)
    
    def fit_transform(self, data):
        """Fit and transform in one step."""
        return self.fit(data).transform(data)
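
One subtlety: calling mean(axis=0) on a (N, 3, 32, 32) array produces per-pixel statistics of shape (3, 32, 32). For the per-channel RGB statistics used in step 4 below, reduce over the batch and spatial axes instead (a sketch):

import numpy as np

images = np.random.rand(1000, 3, 32, 32).astype(np.float32)

# One statistic per channel: reduce over batch, height, and width
mean = images.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, 3, 1, 1)
std = images.std(axis=(0, 2, 3), keepdims=True)

normalized = (images - mean) / (std + 1e-8)
print(normalized.mean(axis=(0, 2, 3)))   # ~[0, 0, 0]
print(normalized.std(axis=(0, 2, 3)))    # ~[1, 1, 1]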

Step-by-Step Implementation

  1. Create Dataset Base Class

    • Implement __getitem__ and __len__ protocols
    • Define the interface all datasets must follow
    • Test with simple array-based dataset
  2. Build CIFAR-10 Loader

    • Implement download and extraction logic
    • Parse binary batch files (pickle format)
    • Reshape data from flat arrays to (3, 32, 32) images
    • Handle train/test split loading
  3. Implement DataLoader

    • Create batching logic with configurable batch size
    • Add shuffling with random permutation
    • Implement iterator protocol for Pythonic loops
    • Handle edge cases (last incomplete batch, empty dataset)
  4. Add Preprocessing

    • Build Normalizer with fit-transform pattern
    • Compute per-channel statistics for RGB images
    • Apply transformations efficiently across batches
    • Test normalization correctness (zero mean, unit variance)
  5. Integration Testing

    • Load CIFAR-10 and create data loaders
    • Iterate through batches and verify shapes
    • Test with actual training loop from Module 07
    • Measure data loading throughput (a timing sketch follows this list)
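
A minimal throughput measurement, assuming the CIFAR10Dataset and DataLoader classes sketched above:

import time

dataset = CIFAR10Dataset(train=True, download=True)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

start = time.perf_counter()
images_seen = 0
for data, labels in loader:
    assert data.shape[1:] == (3, 32, 32)   # verify batch shapes as you go
    images_seen += len(labels)
elapsed = time.perf_counter() - start
print(f"{images_seen / elapsed:.0f} images/sec across {images_seen} samples")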

Testing

Inline Tests (During Development)

Run inline tests while building:

cd modules/08_dataloader
python dataloader_dev.py

Expected output:

Unit Test: Dataset abstraction...
✅ __getitem__ protocol works correctly
✅ __len__ returns correct size
✅ Indexing returns (data, label) tuples
Progress: Dataset Interface ✓

Unit Test: CIFAR-10 loading...
✅ Downloaded and extracted 170MB dataset
✅ Loaded 50,000 training samples
✅ Sample shape: (3, 32, 32), label range: [0, 9]
Progress: CIFAR-10 Dataset ✓

Unit Test: DataLoader batching...
✅ Batch shapes correct: (32, 3, 32, 32)
✅ Shuffling produces different orderings
✅ Iteration covers all samples exactly once
Progress: DataLoader ✓

Export and Validate

After completing the module:

# Export to tinytorch package
tito export 08_dataloader

# Run integration tests
tito test 08_dataloader

Comprehensive Test Coverage

The test suite validates:

  • Dataset interface correctness
  • CIFAR-10 loading and parsing
  • Batch shape consistency
  • Shuffling randomness
  • Memory efficiency
  • Preprocessing accuracy

Where This Code Lives

tinytorch/
├── core/
│   └── dataloader.py          # Your implementation goes here
└── __init__.py                # Exposes DataLoader, Dataset, etc.

Usage in other modules:
>>> from tinytorch.core.dataloader import DataLoader, CIFAR10Dataset
>>> dataset = CIFAR10Dataset(download=True)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)

Systems Thinking Questions

  1. Memory vs Throughput Trade-off: Why does increasing batch size improve GPU utilization but increase memory usage? What's the optimal batch size for a 16GB GPU?

  2. Shuffling Impact: How does shuffling affect training dynamics and convergence? Why is it critical for training but not for evaluation?

  3. I/O Bottlenecks: Your GPU can process 1000 images/sec but your disk reads at 100 images/sec. Where's the bottleneck? How would you fix it?

  4. Preprocessing Placement: Should preprocessing happen in the data loader or in the training loop? What are the trade-offs for CPU vs GPU preprocessing?

  5. Distributed Loading: If you're training on 8 GPUs, how should you partition the dataset? What challenges arise with shuffling across multiple workers?

Real-World Connections

Industry Applications

Netflix (Recommendation Systems)

  • Processes billions of viewing records through custom data pipelines
  • Uses streaming loaders for datasets that don't fit in memory
  • Implements sophisticated batching strategies for negative sampling

Autonomous Vehicles (Tesla, Waymo)

  • Load terabytes of sensor data (camera, LIDAR, radar) for training
  • Use multi-worker data loading to keep GPUs fully utilized
  • Implement real-time preprocessing pipelines for online learning

Large Language Models (OpenAI, Anthropic)

  • Stream hundreds of billions of tokens from distributed storage
  • Use custom data loaders optimized for sequence data
  • Implement efficient tokenization and batching for transformers

Research Impact

This module teaches patterns from:

  • PyTorch DataLoader (2016): The industry-standard data loading API
  • TensorFlow Dataset API (2017): Google's approach to data pipelines
  • NVIDIA DALI (2019): GPU-accelerated preprocessing for peak throughput
  • WebDataset (2020): Efficient loading from cloud storage

What's Next?

In Module 09: Spatial (CNNs), you'll use these data loaders to train convolutional neural networks on CIFAR-10:

  • Apply convolution operations to the RGB images you're loading
  • Use your DataLoader to iterate through 50,000 training samples
  • Achieve >75% accuracy on CIFAR-10 classification
  • Understand how CNNs process spatial data efficiently

The data infrastructure you built here becomes critical—training CNNs requires efficient batch loading of image data with proper preprocessing.


Ready to build production data infrastructure? Open modules/08_dataloader/dataloader_dev.py and start implementing.