---
title: DataLoader - Data Pipeline Engineering
description: Build production-grade data loading infrastructure for training at scale
difficulty: 3
time_estimate: 5-6 hours
prerequisites:
  - Tensor
  - Layers
  - Training
next_steps:
  - Spatial (CNNs)
learning_objectives:
  - Design scalable data pipeline architectures for production ML systems
  - Implement efficient dataset abstractions with batching and streaming
  - Build preprocessing pipelines for normalization and data augmentation
  - Understand memory-efficient data loading patterns for large datasets
  - Apply systems thinking to I/O optimization and throughput engineering
---

08. DataLoader

🏛️ ARCHITECTURE TIER | Difficulty: 3/4 | Time: 5-6 hours

Overview

Build the data engineering infrastructure that feeds neural networks. This module implements production-grade data loading, preprocessing, and batching systems—the critical backbone that enables training on real-world datasets like CIFAR-10.

Learning Objectives

By completing this module, you will be able to:

  1. Design scalable data pipeline architectures for production ML systems with proper abstractions and interfaces
  2. Implement efficient dataset abstractions with batching, shuffling, and streaming for memory-efficient training
  3. Build preprocessing pipelines for normalization, augmentation, and transformation with fit-transform patterns
  4. Understand memory-efficient data loading patterns for large datasets that don't fit in RAM
  5. Apply systems thinking to I/O optimization, caching strategies, and throughput engineering

Why This Matters

Production Context

Every production ML system depends on robust data infrastructure:

  • Netflix uses sophisticated data pipelines to train recommendation models on billions of viewing records
  • Tesla processes terabytes of driving sensor data through efficient loading pipelines for autonomous driving
  • OpenAI built custom data loaders to train GPT models on hundreds of billions of tokens
  • Meta developed PyTorch's DataLoader (which you're reimplementing) to power research and production

Historical Context

Data loading has evolved from a neglected bottleneck into an optimized system:

  • Early ML (pre-2010): Small datasets fit entirely in memory; data loading was an afterthought
  • ImageNet Era (2012): AlexNet required efficient loading of 1.2M images; preprocessing became critical
  • Big Data ML (2015+): Streaming data pipelines became necessary for datasets too large for memory
  • Modern Scale (2020+): Data loading is now a first-class systems problem with dedicated infrastructure teams

The patterns you're building are the same ones used in production at scale.

Pedagogical Pattern: Build → Use → Analyze

1. Build

Implement from first principles:

  • Dataset abstraction with Python protocols (__getitem__, __len__)
  • DataLoader with batching, shuffling, and iteration
  • CIFAR-10 dataset loader with binary file parsing
  • Normalizer with fit-transform pattern
  • Memory-efficient streaming for large datasets

2. Use

Apply to real problems:

  • Load and preprocess CIFAR-10 (50,000 training images)
  • Create train/test data loaders with proper batching
  • Build preprocessing pipelines for normalization
  • Integrate with training loops from Module 07
  • Measure throughput and identify bottlenecks

3. Analyze

Deep-dive into systems behavior:

  • Profile memory usage patterns with different batch sizes
  • Measure I/O throughput and identify disk bottlenecks
  • Compare streaming vs in-memory loading strategies
  • Analyze the impact of shuffling on training dynamics
  • Understand trade-offs between batch size and memory (sized concretely in the sketch below)
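
To make the batch-size/memory trade-off concrete, the input side of a CIFAR-10 batch can be sized by hand (a back-of-the-envelope sketch; real training also holds activations and gradients, which usually dominate):

batch_size = 32
bytes_per_image = 3 * 32 * 32 * 4           # C x H x W in float32
batch_bytes = batch_size * bytes_per_image
print(batch_bytes / 1024)                   # 384.0 -> ~384 KiB per input batch
# The full 50,000-image training set is ~586 MiB in float32,
# which is why uint8 storage with on-the-fly conversion matters.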

Implementation Guide

Core Components

Dataset Abstraction

class Dataset:
    """Abstract base class for all datasets.
    
    Implements Python protocols for indexing and length.
    Subclasses must implement __getitem__ and __len__.
    """
    def __getitem__(self, index: int):
        """Return (data, label) for given index."""
        raise NotImplementedError
    
    def __len__(self) -> int:
        """Return total number of samples."""
        raise NotImplementedError
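
To see these protocols in action, here is a minimal in-memory dataset (a sketch for testing; ArrayDataset is an illustrative name, not part of the module's required API):

import numpy as np

class ArrayDataset(Dataset):
    """Wrap parallel arrays of samples and labels as a Dataset."""
    def __init__(self, data, labels):
        assert len(data) == len(labels)
        self.data = data
        self.labels = labels
    
    def __getitem__(self, index):
        return self.data[index], self.labels[index]
    
    def __len__(self):
        return len(self.data)

# Ten 4-feature samples with integer labels
ds = ArrayDataset(np.random.rand(10, 4), np.arange(10))
x, y = ds[0]      # indexing works via __getitem__
print(len(ds))    # 10, via __len__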

DataLoader Implementation

import numpy as np

class DataLoader:
    """Efficient batch loading with shuffling support.
    
    Features:
    - Automatic batching with configurable batch size
    - Optional shuffling for training randomization
    - Optional dropping of the last incomplete batch (drop_last)
    - Memory-efficient iteration without loading all data at once
    """
    def __init__(self, dataset, batch_size=32, shuffle=False, drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last
    
    def __iter__(self):
        # Generate indices (shuffled or sequential)
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            np.random.shuffle(indices)
        
        # Yield batches
        for i in range(0, len(indices), self.batch_size):
            batch_indices = indices[i:i + self.batch_size]
            if len(batch_indices) < self.batch_size and self.drop_last:
                continue  # skip the final incomplete batch
            yield self._get_batch(batch_indices)
    
    def _get_batch(self, batch_indices):
        # Stack individual (data, label) samples into batched arrays
        samples = [self.dataset[i] for i in batch_indices]
        data = np.stack([sample[0] for sample in samples])
        labels = np.array([sample[1] for sample in samples])
        return data, labels
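
A quick sanity check of the iterator, assuming the hypothetical ArrayDataset sketched above:

loader = DataLoader(ArrayDataset(np.random.rand(100, 4), np.arange(100)),
                    batch_size=32, shuffle=True)
seen = 0
for data, labels in loader:
    seen += len(labels)    # batch sizes: 32, 32, 32, 4
print(seen)                # 100: every sample appears exactly once

With drop_last=True the final 4-sample batch would be skipped and seen would be 96.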

CIFAR-10 Dataset Loader

class CIFAR10Dataset(Dataset):
    """Load CIFAR-10 dataset with automatic download.
    
    CIFAR-10: 60,000 32x32 color images in 10 classes
    - 50,000 training images
    - 10,000 test images
    - Classes: airplane, car, bird, cat, deer, dog, frog, horse, ship, truck
    """
    def __init__(self, root='./data', train=True, download=True):
        self.train = train
        if download:
            self._download(root)
        self.data, self.labels = self._load_batch_files(root, train)
    
    def __getitem__(self, index):
        return self.data[index], self.labels[index]
    
    def __len__(self):
        return len(self.data)
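
The helper methods are left for you to implement. As a starting point, here is one plausible core of _load_batch_files, based on the published CIFAR-10 python-version format (pickled dicts keyed by b'data' and b'labels'); treat this as a sketch, since the exact paths depend on how you extract the archive:

import pickle
import numpy as np

def load_cifar_batch(path):
    """Parse one CIFAR-10 batch file into images and labels."""
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    # b'data' is a (10000, 3072) uint8 array; reshape to channel-first images
    data = batch[b'data'].reshape(-1, 3, 32, 32)
    labels = np.array(batch[b'labels'])
    return data, labels

# Training split: concatenate data_batch_1 ... data_batch_5;
# the test split lives in a single test_batch file.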

Preprocessing Pipeline

class Normalizer:
    """Normalize data using fit-transform pattern.
    
    Fits statistics on training data, applies to all splits.
    Ensures consistent preprocessing across train/val/test.
    """
    def fit(self, data):
        """Compute mean and std from training data."""
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0)
        return self
    
    def transform(self, data):
        """Apply normalization using fitted statistics."""
        return (data - self.mean) / (self.std + 1e-8)
    
    def fit_transform(self, data):
        """Fit and transform in one step."""
        return self.fit(data).transform(data)
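
One subtlety: calling mean(axis=0) on a (N, 3, 32, 32) array produces per-pixel statistics of shape (3, 32, 32). For the per-channel RGB statistics used in step 4 below, reduce over the batch and spatial axes instead (a sketch):

import numpy as np

images = np.random.rand(1000, 3, 32, 32).astype(np.float32)

# One statistic per channel: reduce over batch, height, and width
mean = images.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, 3, 1, 1)
std = images.std(axis=(0, 2, 3), keepdims=True)

normalized = (images - mean) / (std + 1e-8)
print(normalized.mean(axis=(0, 2, 3)))   # ~[0, 0, 0]
print(normalized.std(axis=(0, 2, 3)))    # ~[1, 1, 1]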

Step-by-Step Implementation

  1. Create Dataset Base Class

    • Implement __getitem__ and __len__ protocols
    • Define the interface all datasets must follow
    • Test with simple array-based dataset
  2. Build CIFAR-10 Loader

    • Implement download and extraction logic
    • Parse binary batch files (pickle format)
    • Reshape data from flat arrays to (3, 32, 32) images
    • Handle train/test split loading
  3. Implement DataLoader

    • Create batching logic with configurable batch size
    • Add shuffling with random permutation
    • Implement iterator protocol for Pythonic loops
    • Handle edge cases (last incomplete batch, empty dataset)
  4. Add Preprocessing

    • Build Normalizer with fit-transform pattern
    • Compute per-channel statistics for RGB images
    • Apply transformations efficiently across batches
    • Test normalization correctness (zero mean, unit variance)
  5. Integration Testing

    • Load CIFAR-10 and create data loaders
    • Iterate through batches and verify shapes
    • Test with actual training loop from Module 07
    • Measure data loading throughput (a timing sketch follows this list)
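
A minimal throughput measurement, assuming the CIFAR10Dataset and DataLoader classes sketched above:

import time

dataset = CIFAR10Dataset(train=True, download=True)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

start = time.perf_counter()
images_seen = 0
for data, labels in loader:
    assert data.shape[1:] == (3, 32, 32)   # verify batch shapes as you go
    images_seen += len(labels)
elapsed = time.perf_counter() - start
print(f"{images_seen / elapsed:.0f} images/sec across {images_seen} samples")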

Testing

Inline Tests (During Development)

Run inline tests while building:

cd modules/08_dataloader
python dataloader_dev.py

Expected output:

Unit Test: Dataset abstraction...
✅ __getitem__ protocol works correctly
✅ __len__ returns correct size
✅ Indexing returns (data, label) tuples
Progress: Dataset Interface ✓

Unit Test: CIFAR-10 loading...
✅ Downloaded and extracted 170MB dataset
✅ Loaded 50,000 training samples
✅ Sample shape: (3, 32, 32), label range: [0, 9]
Progress: CIFAR-10 Dataset ✓

Unit Test: DataLoader batching...
✅ Batch shapes correct: (32, 3, 32, 32)
✅ Shuffling produces different orderings
✅ Iteration covers all samples exactly once
Progress: DataLoader ✓

Export and Validate

After completing the module:

# Export to tinytorch package
tito export 08_dataloader

# Run integration tests
tito test 08_dataloader

Comprehensive Test Coverage

The test suite validates:

  • Dataset interface correctness
  • CIFAR-10 loading and parsing
  • Batch shape consistency
  • Shuffling randomness
  • Memory efficiency
  • Preprocessing accuracy

Where This Code Lives

tinytorch/
├── core/
│   └── dataloader.py          # Your implementation goes here
└── __init__.py                # Exposes DataLoader, Dataset, etc.

Usage in other modules:
>>> from tinytorch.core.dataloader import DataLoader, CIFAR10Dataset
>>> dataset = CIFAR10Dataset(download=True)
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)

Systems Thinking Questions

  1. Memory vs Throughput Trade-off: Why does increasing batch size improve GPU utilization but increase memory usage? What's the optimal batch size for a 16GB GPU?

  2. Shuffling Impact: How does shuffling affect training dynamics and convergence? Why is it critical for training but not for evaluation?

  3. I/O Bottlenecks: Your GPU can process 1000 images/sec but your disk reads at 100 images/sec. Where's the bottleneck? How would you fix it?

  4. Preprocessing Placement: Should preprocessing happen in the data loader or in the training loop? What are the trade-offs for CPU vs GPU preprocessing?

  5. Distributed Loading: If you're training on 8 GPUs, how should you partition the dataset? What challenges arise with shuffling across multiple workers?

Real-World Connections

Industry Applications

Netflix (Recommendation Systems)

  • Processes billions of viewing records through custom data pipelines
  • Uses streaming loaders for datasets that don't fit in memory
  • Implements sophisticated batching strategies for negative sampling

Autonomous Vehicles (Tesla, Waymo)

  • Load terabytes of sensor data (camera, LIDAR, radar) for training
  • Use multi-worker data loading to keep GPUs fully utilized
  • Implement real-time preprocessing pipelines for online learning

Large Language Models (OpenAI, Anthropic)

  • Stream hundreds of billions of tokens from distributed storage
  • Use custom data loaders optimized for sequence data
  • Implement efficient tokenization and batching for transformers

Research Impact

This module teaches patterns from:

  • PyTorch DataLoader (2016): The industry-standard data loading API
  • TensorFlow Dataset API (2017): Google's approach to data pipelines
  • NVIDIA DALI (2019): GPU-accelerated preprocessing for peak throughput
  • WebDataset (2020): Efficient loading from cloud storage

What's Next?

In Module 09: Spatial (CNNs), you'll use these data loaders to train convolutional neural networks on CIFAR-10:

  • Apply convolution operations to the RGB images you're loading
  • Use your DataLoader to iterate through 50,000 training samples
  • Achieve >75% accuracy on CIFAR-10 classification
  • Understand how CNNs process spatial data efficiently

The data infrastructure you built here becomes critical—training CNNs requires efficient batch loading of image data with proper preprocessing.


Ready to build production data infrastructure? Open modules/08_dataloader/dataloader_dev.py and start implementing.