
🔥 Module: DataLoader

📊 Module Info

  • Difficulty: Advanced
  • Time Estimate: 5-7 hours
  • Prerequisites: Tensor, Layers modules
  • Next Steps: Training, Networks modules

Build the data pipeline foundation of TinyTorch! This module implements efficient data loading, preprocessing, and batching systems - the critical infrastructure that feeds neural networks during training.

🎯 Learning Objectives

By the end of this module, you will:

  • Understand data engineering as the foundation of ML systems
  • Implement reusable dataset abstractions and interfaces
  • Build efficient data loaders with batching and shuffling
  • Create data preprocessing pipelines for normalization
  • Apply systems thinking to data I/O and memory management
  • Have a complete data pipeline ready for neural network training

📋 Module Structure

modules/dataloader/
├── README.md                 # 📖 This file - Module overview
├── dataloader_dev.py         # 🔧 Main development file  
├── dataloader_dev.ipynb      # 📓 Generated notebook (auto-created)
├── tests/
│   └── test_dataloader.py    # 🧪 Automated tests
└── check_dataloader.py       # ✅ Manual verification (coming soon)

🚀 Getting Started

Step 1: Complete Prerequisites

Make sure you've completed the foundational modules:

python bin/tito.py test --module setup    # Should pass
python bin/tito.py test --module tensor   # Should pass
python bin/tito.py test --module layers   # Should pass

Step 2: Open the DataLoader Development File

# Start from the dataloader module directory
cd modules/dataloader/

# Convert to notebook if needed
python bin/tito.py notebooks --module dataloader

# Open the development notebook
jupyter lab dataloader_dev.ipynb

Step 3: Work Through the Implementation

The development file guides you through building:

  1. Dataset base class - Abstract interface for all datasets
  2. CIFAR-10 implementation - Real dataset with binary file parsing
  3. DataLoader - Efficient batching and shuffling system
  4. Normalizer - Data preprocessing for stable training
  5. Complete pipeline - Integration of all components

Step 4: Export and Test

# Export your dataloader implementation
python bin/tito.py sync --module dataloader

# Test your implementation
python bin/tito.py test --module dataloader

📚 What You'll Implement

Core Data Infrastructure

You'll build a complete data loading system that supports:

1. Dataset Abstraction

# Abstract base class for all datasets
class Dataset:
    def __getitem__(self, index):
        # Return a single (sample, label) pair
        raise NotImplementedError
    
    def __len__(self):
        # Return the total number of samples
        raise NotImplementedError
    
    def get_num_classes(self):
        # Return the number of classes
        raise NotImplementedError

# Concrete implementation
dataset = CIFAR10Dataset("data/cifar10/", train=True)
image, label = dataset[0]  # Get first sample
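
To make the interface concrete before touching CIFAR-10, here is a minimal sketch of a toy dataset that implements the same three methods. The name InMemoryDataset and its list-based storage are assumptions for illustration, not part of TinyTorch. In the module it would subclass Dataset; it stands alone here so the snippet runs by itself.

```python
class InMemoryDataset:
    """Hypothetical toy dataset that keeps (sample, label) pairs in lists."""

    def __init__(self, samples, labels):
        if len(samples) != len(labels):
            raise ValueError("samples and labels must have the same length")
        self.samples = list(samples)
        self.labels = list(labels)

    def __getitem__(self, index):
        # Return one (sample, label) pair, as the interface requires
        return self.samples[index], self.labels[index]

    def __len__(self):
        return len(self.samples)

    def get_num_classes(self):
        # Count the distinct labels that actually occur
        return len(set(self.labels))


toy = InMemoryDataset([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], [0, 1, 0])
sample, label = toy[1]
```

A dataset like this is handy for testing a DataLoader quickly, because it needs no download and its contents are known in advance.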

2. Real Dataset Loading

# CIFAR-10 dataset with download and parsing
dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
print(f"Dataset size: {len(dataset)}")           # 50,000 training samples
print(f"Sample shape: {dataset.get_sample_shape()}")  # (3, 32, 32)
print(f"Classes: {dataset.get_num_classes()}")        # 10 classes
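
The parsing step can be sketched as follows, assuming the binary distribution of CIFAR-10, where each record is 1 label byte followed by 3072 pixel bytes (1024 red, 1024 green, 1024 blue). The function name parse_cifar10_records is hypothetical, and the sketch returns plain bytes rather than TinyTorch tensors.

```python
RECORD_BYTES = 1 + 3 * 32 * 32  # 1 label byte + 3072 pixel bytes


def parse_cifar10_records(raw):
    """Split raw bytes into (label, pixels) pairs.

    pixels is a bytes object of length 3072 in channel-major order
    (1024 red values, then 1024 green, then 1024 blue).
    """
    if len(raw) % RECORD_BYTES != 0:
        raise ValueError("length is not a multiple of the record size")
    records = []
    for start in range(0, len(raw), RECORD_BYTES):
        label = raw[start]  # indexing bytes yields an int in 0..255
        pixels = raw[start + 1 : start + RECORD_BYTES]
        records.append((label, pixels))
    return records


# Two synthetic records instead of a real file, so the sketch runs anywhere
fake = bytes([3]) + bytes(3072) + bytes([7]) + bytes([255]) * 3072
records = parse_cifar10_records(fake)
```

Reshaping each 3072-byte block to (3, 32, 32) then gives the sample shape shown above.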

3. Efficient Data Loading

# DataLoader with batching and shuffling
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch_images, batch_labels in dataloader:
    print(f"Batch shape: {batch_images.shape}")  # (32, 3, 32, 32); the final batch may be smaller
    print(f"Labels shape: {batch_labels.shape}")  # (32,)
    # Ready for neural network training!
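
One way batching and shuffling could fit together is sketched below. SimpleDataLoader and its seed parameter are hypothetical names, and batches are plain lists here; your implementation would stack samples into tensors instead.

```python
import random


class SimpleDataLoader:
    """Hypothetical minimal loader: batches any indexable dataset."""

    def __init__(self, dataset, batch_size=32, shuffle=False, seed=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __len__(self):
        # Ceiling division, so a final partial batch is counted
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            self.rng.shuffle(indices)  # new order on every pass
        for start in range(0, len(indices), self.batch_size):
            chunk = indices[start : start + self.batch_size]
            pairs = [self.dataset[i] for i in chunk]
            batch_data = [data for data, _ in pairs]
            batch_labels = [label for _, label in pairs]
            yield batch_data, batch_labels


# A list of (sample, label) tuples is enough to stand in for a Dataset here
toy_dataset = [([float(i)], i % 2) for i in range(10)]
loader = SimpleDataLoader(toy_dataset, batch_size=4, shuffle=True, seed=0)
batch_sizes = [len(labels) for _, labels in loader]
```

Shuffling the index list rather than the data keeps the dataset untouched and makes each epoch's order independent of the previous one.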

4. Data Preprocessing

# Normalizer for stable training
normalizer = Normalizer()
normalizer.fit(training_data)  # Compute statistics
normalized_data = normalizer.transform(test_data)  # Apply normalization
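
The fit/transform split can be illustrated with a small standardizer. This is a sketch under assumptions: the README does not specify which statistics the Normalizer uses, so mean and standard deviation are chosen here, and the class name SimpleNormalizer and the eps parameter are hypothetical.

```python
import statistics


class SimpleNormalizer:
    """Hypothetical standardizer: subtract the mean, divide by the std."""

    def __init__(self, eps=1e-8):
        self.eps = eps  # guards against division by zero for constant data
        self.mean = None
        self.std = None

    def fit(self, values):
        # Compute statistics once, on the training data only
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored statistics to any data (train or test)
        if self.mean is None:
            raise RuntimeError("call fit() before transform()")
        return [(v - self.mean) / (self.std + self.eps) for v in values]


train_pixels = [0, 64, 128, 192, 255]
normalizer = SimpleNormalizer().fit(train_pixels)
standardized = normalizer.transform(train_pixels)
```

Fitting on the training data and reusing those statistics for the test data keeps information about the test set out of preprocessing.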

5. Complete Pipeline

# One-function pipeline creation
train_loader, test_loader, normalizer = create_data_pipeline(
    dataset_path="data/cifar10/",
    batch_size=32,
    normalize=True,
    shuffle=True
)

Technical Requirements

Your data system must:

  • Handle multiple dataset types through a common interface
  • Efficiently load and parse binary data files
  • Support batching with configurable batch sizes
  • Implement shuffling for training randomization
  • Provide data normalization for stable training
  • Export to tinytorch.core.dataloader

🧪 Testing Your Implementation

Progressive Testing with Real Data

The tests follow the "Build → Use → Understand" pattern with real CIFAR-10 data:

# Run all tests (downloads real CIFAR-10 data)
python bin/tito.py test --module dataloader

# Run specific test categories
python -m pytest tests/test_dataloader.py::TestDatasetInterface -v      # Test abstract interface
python -m pytest tests/test_dataloader.py::TestCIFAR10Dataset -v        # Test real data loading
python -m pytest tests/test_dataloader.py::TestDataLoader -v            # Test batching real data
python -m pytest tests/test_dataloader.py::TestNormalizer -v            # Test normalizing real data
python -m pytest tests/test_dataloader.py::TestDataPipeline -v          # Test complete pipeline

Real Data Testing Flow

Each test builds on the previous component using actual CIFAR-10 data:

  1. Build Dataset → Test: Download and load real CIFAR-10 images (50,000 training, 10,000 test)
  2. Build DataLoader → Test: Batch real images with proper shuffling and iteration
  3. Build Normalizer → Test: Normalize real pixel values (0-255 range → standardized)
  4. Build Pipeline → Test: Complete pipeline with real data flow and preprocessing

Why Real Data Testing Matters

  • Real-world validation: Tests work with actual data students will use in training
  • Immediate feedback: See your pipeline working with real images, not fake data
  • Systems thinking: Understand I/O, memory, and performance with real data distributions
  • Debugging: Catch issues that only appear with real data (file formats, edge cases)

Note: The first test run downloads the ~170MB CIFAR-10 dataset and shows a progress bar. Subsequent runs use the cached data.

Interactive Testing with Visual Feedback

# Test in the notebook or Python REPL
from tinytorch.core.dataloader import Dataset, DataLoader, CIFAR10Dataset

# Create and test datasets with real data
dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
print(f"Loaded {len(dataset)} real CIFAR-10 samples")

# Test data loading
dataloader = DataLoader(dataset, batch_size=16)
for batch_data, batch_labels in dataloader:
    print(f"Real batch shape: {batch_data.shape}")  # (16, 3, 32, 32)
    print(f"Real labels: {batch_labels}")  # Actual CIFAR-10 classes
    break

🎨 Development Visual Feedback

The development notebook (dataloader_dev.py) includes visual feedback for learning:

# 👁️ SEE your data - Available in development notebook only
show_cifar10_samples(dataset, num_samples=8, title="My CIFAR-10 Data")

🎨 Visual Feedback Features (Development Only)

These features support both learning and debugging:

  • Download progress bar: Visual progress indicator during CIFAR-10 download (~170MB)
  • show_cifar10_samples(): Display a grid of CIFAR-10 images with class labels
  • Real image visualization: See actual airplanes, cars, birds, cats, etc.
  • Batch visualization: View what your DataLoader is producing
  • Pipeline visualization: See the complete data flow in action

Why Visual Feedback Matters:

  • Build confidence: See that your data pipeline is working correctly
  • Debug issues: Spot problems like incorrect normalization or corrupted images
  • Understand data: Build intuition about what your model will be learning from
  • Immediate feedback: Visual confirmation follows the "Build → Use → Understand" pattern

Note: Visual feedback is available in the development notebook (dataloader_dev.py) for learning purposes. The core package exports only the essential data loading components.

🎯 Success Criteria

Your data module is complete when:

  1. All tests pass: python bin/tito.py test --module dataloader
  2. Data classes import correctly: from tinytorch.core.dataloader import Dataset, DataLoader
  3. Dataset loading works: Can create datasets and access samples
  4. Batching works: DataLoader produces correct batch shapes
  5. Preprocessing works: Normalizer computes and applies statistics
  6. Pipeline works: Complete pipeline creates train/test loaders

💡 Implementation Tips

Start with the Interface

  1. Dataset base class - Define the abstract interface
  2. Simple test dataset - Create mock data for testing
  3. Basic DataLoader - Implement batching without shuffling
  4. Add shuffling - Randomize sample order
  5. Test frequently - Verify each component works

Design Patterns

class Dataset:
    def __getitem__(self, index):
        # Return (data, label) tuple
        return data_tensor, label_tensor
    
    def __len__(self):
        # Return total number of samples
        return self.num_samples

class DataLoader:
    def __iter__(self):
        # Yield batches of (batch_data, batch_labels)
        for batch_data, batch_labels in self._create_batches():
            yield batch_data, batch_labels

Systems Thinking

  • Memory management: Don't load entire dataset into RAM
  • I/O efficiency: Batch file operations when possible
  • Preprocessing: Compute statistics once, apply many times
  • Interface design: Make components easily swappable
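
The memory-management point can be sketched with a dataset that reads one fixed-size record from disk per access instead of holding the whole file in RAM. LazyRecordDataset is a hypothetical name, and the record layout (one label byte followed by pixel bytes) is an assumption carried over from the CIFAR-10 binary format.

```python
import os
import tempfile


class LazyRecordDataset:
    """Hypothetical dataset that reads fixed-size records on demand."""

    def __init__(self, path, record_bytes):
        self.path = path
        self.record_bytes = record_bytes
        # Only the file size is inspected up front, not its contents
        self.num_records = os.path.getsize(path) // record_bytes

    def __len__(self):
        return self.num_records

    def __getitem__(self, index):
        if not 0 <= index < self.num_records:
            raise IndexError(index)
        with open(self.path, "rb") as f:
            f.seek(index * self.record_bytes)  # jump straight to the record
            record = f.read(self.record_bytes)
        return record[1:], record[0]  # (pixels, label)


# Write a tiny two-record file so the sketch runs without real data
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    with open(path, "wb") as f:
        f.write(bytes([1]) + bytes(4) + bytes([2]) + bytes([9]) * 4)
    lazy = LazyRecordDataset(path, record_bytes=5)
    count = len(lazy)
    pixels, label = lazy[1]
```

Opening the file on every access trades I/O efficiency for low memory use. Reading a whole batch of records in one call is a natural next refinement.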

Common Challenges

  • Binary file parsing - CIFAR-10 uses a custom binary format
  • Batch size handling - Last batch may be smaller
  • Data type consistency - Convert to consistent types
  • Error handling - Provide helpful debugging messages
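
The batch-size challenge comes down to simple arithmetic, worked through below for the 50,000 CIFAR-10 training samples and a batch size of 32. The helper name batch_layout is hypothetical.

```python
def batch_layout(num_samples, batch_size):
    """Return (number of batches, size of the final batch)."""
    if num_samples == 0:
        return 0, 0
    full, remainder = divmod(num_samples, batch_size)
    if remainder == 0:
        return full, batch_size
    # A leftover partial batch adds one more, smaller batch
    return full + 1, remainder


num_batches, last_size = batch_layout(50_000, 32)
```

Here 50,000 = 1562 × 32 + 16, so an epoch has 1563 batches and the last one holds 16 samples. Shape assertions in tests should allow for that final batch.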

🔧 Advanced Features (Optional)

If you finish early, try implementing:

  • Data augmentation - Random transformations for training
  • Multi-worker loading - Parallel data loading
  • Caching - Store processed data for faster access
  • Different datasets - MNIST, Fashion-MNIST, etc.
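
As a starting point for the data augmentation idea, here is a minimal sketch of a random horizontal flip. The function name and its p and rng parameters are assumptions, and the image is a plain list of rows rather than a tensor.

```python
import random


def random_horizontal_flip(image, p=0.5, rng=random):
    """Flip an image left-to-right with probability p.

    image is a list of rows; each row is a list of pixel values.
    """
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]  # copy so callers can mutate safely


image = [[1, 2, 3], [4, 5, 6]]
always_flipped = random_horizontal_flip(image, p=1.0)
never_flipped = random_horizontal_flip(image, p=0.0)
```

Augmentations like this belong in the training path only, so the test data stays unchanged between runs.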

🚀 Next Steps

Once you complete the data module:

  1. Move to Autograd: cd modules/autograd/
  2. Build automatic differentiation: Enable gradient computation
  3. Combine with data: Train models on real datasets
  4. Prepare for training: Ready for the training module

🔗 Why Data Engineering Matters

Data engineering is the foundation of all ML systems:

  • Training loops need efficient data loading
  • Model performance depends on data quality
  • Production systems require scalable data pipelines
  • Research needs flexible data interfaces

Your data implementation will power all TinyTorch training!

📊 Real-World Connection

The patterns you'll implement are used in:

  • PyTorch DataLoader - Same interface and concepts
  • TensorFlow tf.data - Similar pipeline architecture
  • Production ML - Scalable data processing systems
  • Research - Flexible experimentation frameworks

🎉 Ready to Build?

The data module is where TinyTorch becomes a real ML system. You're about to create the infrastructure that will feed neural networks, enable training loops, and power production ML pipelines.

Focus on clean interfaces, efficient implementation, and systems thinking! 🔥