mirror of https://github.com/MLSysBook/TinyTorch.git synced 2026-03-12 02:43:35 -05:00


Vijay Janapa Reddi 9199199845 feat: Add comprehensive intermediate testing across all TinyTorch modules

- Add 17 intermediate test points across 6 modules for immediate student feedback
- Tensor module: Tests after creation, properties, arithmetic, and operators
- Activations module: Tests after each activation function (ReLU, Sigmoid, Tanh, Softmax)
- Layers module: Tests after matrix multiplication and Dense layer implementation
- Networks module: Tests after Sequential class and MLP creation
- CNN module: Tests after convolution, Conv2D layer, and flatten operations
- DataLoader module: Tests after Dataset interface and DataLoader class
- All tests include visual progress indicators and behavioral explanations
- Maintains NBGrader compliance with proper metadata and point allocation
- Enables steady forward progress and better debugging for students
- 100% test success rate across all modules and integration testing

2025-07-12 18:28:35 -04:00


🔥 Module: DataLoader

📊 Module Info

Difficulty: ⭐⭐⭐ Advanced
Time Estimate: 5-7 hours
Prerequisites: Tensor, Layers modules
Next Steps: Training, Networks modules

Build the data pipeline foundation of TinyTorch! This module implements efficient data loading, preprocessing, and batching systems - the critical infrastructure that feeds neural networks during training.

🎯 Learning Objectives

By the end of this module, you will:

✅ Understand data engineering as the foundation of ML systems
✅ Implement reusable dataset abstractions and interfaces
✅ Build efficient data loaders with batching and shuffling
✅ Create data preprocessing pipelines for normalization
✅ Apply systems thinking to data I/O and memory management
✅ Have a complete data pipeline ready for neural network training

📋 Module Structure

modules/dataloader/
├── README.md                 # 📖 This file - Module overview
├── dataloader_dev.py         # 🔧 Main development file  
├── dataloader_dev.ipynb      # 📓 Generated notebook (auto-created)
├── tests/
│   └── test_dataloader.py    # 🧪 Automated tests
└── check_dataloader.py       # ✅ Manual verification (coming soon)

🚀 Getting Started

Step 1: Complete Prerequisites

Make sure you've completed the foundational modules:

python bin/tito.py test --module setup    # Should pass
python bin/tito.py test --module tensor   # Should pass
python bin/tito.py test --module layers   # Should pass

Step 2: Open the Data Development File

# Convert to a notebook if needed (run from the repository root, where bin/tito.py lives)
python bin/tito.py notebooks --module dataloader

# Move into the dataloader module directory
cd modules/dataloader/

# Open the development notebook
jupyter lab dataloader_dev.ipynb

Step 3: Work Through the Implementation

The development file guides you through building:

Dataset base class - Abstract interface for all datasets
CIFAR-10 implementation - Real dataset with binary file parsing
DataLoader - Efficient batching and shuffling system
Normalizer - Data preprocessing for stable training
Complete pipeline - Integration of all components

Step 4: Export and Test

# Export your dataloader implementation
python bin/tito.py sync --module dataloader

# Test your implementation
python bin/tito.py test --module dataloader

📚 What You'll Implement

Core Data Infrastructure

You'll build a complete data loading system that supports:

1. Dataset Abstraction

# Abstract base class for all datasets
class Dataset:
    def __getitem__(self, index):
        # Return a single (sample, label) pair
        raise NotImplementedError
    
    def __len__(self):
        # Return the total number of samples
        raise NotImplementedError
    
    def get_num_classes(self):
        # Return the number of classes
        raise NotImplementedError

# Concrete implementation
dataset = CIFAR10Dataset("data/cifar10/", train=True)
image, label = dataset[0]  # Get first sample
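
To make the interface concrete before touching real files, here is a minimal sketch of a subclass that wraps arrays already held in memory. `ArrayDataset` is a hypothetical name, not part of TinyTorch, and NumPy arrays stand in for whatever tensor type your implementation returns. The `Dataset` class is repeated only so the snippet runs on its own.

```python
import numpy as np


class Dataset:
    """Stand-in for the module's abstract base class."""

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def get_num_classes(self):
        raise NotImplementedError


class ArrayDataset(Dataset):
    """Hypothetical dataset that wraps arrays already held in memory."""

    def __init__(self, data, labels):
        if len(data) != len(labels):
            raise ValueError("data and labels must have the same length")
        self.data = np.asarray(data, dtype=np.float32)
        self.labels = np.asarray(labels, dtype=np.int64)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)

    def get_num_classes(self):
        # Counts the distinct labels actually present in the data.
        return int(np.unique(self.labels).size)


toy = ArrayDataset(np.zeros((6, 3, 32, 32)), [0, 1, 2, 0, 1, 2])
image, label = toy[0]
print(len(toy), image.shape, toy.get_num_classes())  # 6 (3, 32, 32) 3
```

A small in-memory dataset like this is also handy as mock data when testing a loader, since it avoids the download step.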

2. Real Dataset Loading

# CIFAR-10 dataset with download and parsing
dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
print(f"Dataset size: {len(dataset)}")           # 50,000 training samples
print(f"Sample shape: {dataset.get_sample_shape()}")  # (3, 32, 32)
print(f"Classes: {dataset.get_num_classes()}")        # 10 classes

3. Efficient Data Loading

# DataLoader with batching and shuffling
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch_images, batch_labels in dataloader:
    print(f"Batch shape: {batch_images.shape}")  # (32, 3, 32, 32)
    print(f"Labels shape: {batch_labels.shape}")  # (32,)
    # Ready for neural network training!
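
The loop above relies on two mechanics: slicing a (possibly shuffled) index list into fixed-size chunks, and stacking the samples of each chunk into one array. The sketch below shows one way this might be done. `SimpleDataLoader` and its `seed` parameter are assumptions made for illustration and are not the module's actual API; NumPy arrays stand in for tensors.

```python
import numpy as np


class SimpleDataLoader:
    """Hypothetical loader: yields (batch_data, batch_labels) NumPy arrays."""

    def __init__(self, dataset, batch_size=32, shuffle=False, seed=None):
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        # Number of batches, counting a smaller final batch.
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        indices = np.arange(len(self.dataset))
        if self.shuffle:
            self.rng.shuffle(indices)  # a new order on every pass
        for start in range(0, len(indices), self.batch_size):
            chunk = indices[start:start + self.batch_size]
            samples = [self.dataset[int(i)] for i in chunk]
            data, labels = zip(*samples)
            yield np.stack(data), np.asarray(labels)


# Any object with __len__ and __getitem__ works; a list of pairs stands in here.
samples = [(np.full((3, 32, 32), i, dtype=np.float32), i % 10) for i in range(10)]
loader = SimpleDataLoader(samples, batch_size=4, shuffle=True, seed=0)
shapes = [data.shape for data, _ in loader]
print(len(loader), shapes)
```

With 10 samples and a batch size of 4, the final batch holds only 2 samples. Shuffling the indices rather than the data keeps the dataset itself untouched between epochs.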

4. Data Preprocessing

# Normalizer for stable training
normalizer = Normalizer()
normalizer.fit(training_data)  # Compute statistics
normalized_data = normalizer.transform(test_data)  # Apply normalization
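
The fit/transform split matters because statistics come from the training data only and are then reused unchanged on the test data. A minimal sketch of that idea follows, assuming per-channel standardization of `(N, C, H, W)` arrays. `SimpleNormalizer` and its `eps` parameter are hypothetical, and the module's own `Normalizer` may compute different statistics.

```python
import numpy as np


class SimpleNormalizer:
    """Hypothetical normalizer: per-channel standardization of (N, C, H, W) data."""

    def __init__(self, eps=1e-8):
        self.eps = eps  # guards against division by zero for constant channels
        self.mean = None
        self.std = None

    def fit(self, data):
        data = np.asarray(data, dtype=np.float64)
        # Reduce over samples, height, and width; keep the channel axis.
        self.mean = data.mean(axis=(0, 2, 3), keepdims=True)
        self.std = data.std(axis=(0, 2, 3), keepdims=True)
        return self

    def transform(self, data):
        if self.mean is None:
            raise RuntimeError("call fit() before transform()")
        data = np.asarray(data, dtype=np.float64)
        return (data - self.mean) / (self.std + self.eps)


# Random pixel values stand in for real images.
rng = np.random.default_rng(0)
train = rng.integers(0, 256, size=(100, 3, 32, 32))
test = rng.integers(0, 256, size=(20, 3, 32, 32))

normalizer = SimpleNormalizer().fit(train)
train_norm = normalizer.transform(train)
test_norm = normalizer.transform(test)  # reuses the training statistics
```

After the transform, each channel of the training data has mean close to 0 and standard deviation close to 1. The test data will be near those values but not exactly on them, which is expected.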

5. Complete Pipeline

# One-function pipeline creation
train_loader, test_loader, normalizer = create_data_pipeline(
    dataset_path="data/cifar10/",
    batch_size=32,
    normalize=True,
    shuffle=True
)

Technical Requirements

Your data system must:

Handle multiple dataset types through a common interface
Efficiently load and parse binary data files
Support batching with configurable batch sizes
Implement shuffling for training randomization
Provide data normalization for stable training
Export to tinytorch.core.dataloader

🧪 Testing Your Implementation

Progressive Testing with Real Data

The tests follow the "Build → Use → Understand" pattern with real CIFAR-10 data:

# Run all tests (downloads real CIFAR-10 data)
python bin/tito.py test --module dataloader

# Run specific test categories (from the modules/dataloader/ directory)
python -m pytest tests/test_dataloader.py::TestDatasetInterface -v      # Test abstract interface
python -m pytest tests/test_dataloader.py::TestCIFAR10Dataset -v        # Test real data loading
python -m pytest tests/test_dataloader.py::TestDataLoader -v            # Test batching real data
python -m pytest tests/test_dataloader.py::TestNormalizer -v            # Test normalizing real data
python -m pytest tests/test_dataloader.py::TestDataPipeline -v          # Test complete pipeline

Real Data Testing Flow

Each test builds on the previous component using actual CIFAR-10 data:

Build Dataset → Test: Download and load real CIFAR-10 images (50,000 training, 10,000 test)
Build DataLoader → Test: Batch real images with proper shuffling and iteration
Build Normalizer → Test: Normalize real pixel values (0-255 range → standardized)
Build Pipeline → Test: Complete pipeline with real data flow and preprocessing

Why Real Data Testing Matters

Real-world validation: Tests work with actual data students will use in training
Immediate feedback: See your pipeline working with real images, not fake data
Systems thinking: Understand I/O, memory, and performance with real data distributions
Debugging: Catch issues that only appear with real data (file formats, edge cases)

Note: The first test run downloads the ~170MB CIFAR-10 dataset and shows a progress bar. Subsequent runs use the cached data.

Interactive Testing with Visual Feedback

# Test in the notebook or Python REPL
from tinytorch.core.dataloader import Dataset, DataLoader, CIFAR10Dataset

# Create and test datasets with real data
dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
print(f"Loaded {len(dataset)} real CIFAR-10 samples")

# Test data loading
dataloader = DataLoader(dataset, batch_size=16)
for batch_data, batch_labels in dataloader:
    print(f"Real batch shape: {batch_data.shape}")  # (16, 3, 32, 32)
    print(f"Real labels: {batch_labels}")  # Actual CIFAR-10 classes
    break

🎨 Development Visual Feedback

The development notebook (dataloader_dev.py) includes visual feedback for learning:

# 👁️ SEE your data - Available in development notebook only
show_cifar10_samples(dataset, num_samples=8, title="My CIFAR-10 Data")

🎨 Visual Feedback Features (Development Only)

The notebook's visual aids cover each stage of the pipeline:

Download progress bar: Visual progress indicator during CIFAR-10 download (~170MB)
show_cifar10_samples(): Display a grid of CIFAR-10 images with class labels
Real image visualization: See actual airplanes, cars, birds, cats, etc.
Batch visualization: View what your DataLoader is producing
Pipeline visualization: See the complete data flow in action

Why Visual Feedback Matters:

Build confidence: See that your data pipeline is working correctly
Debug issues: Spot problems like incorrect normalization or corrupted images
Understand data: Build intuition about what your model will be learning from
Immediate feedback: Visual confirmation follows the "Build → Use → Understand" pattern

Note: Visual feedback is available in the development notebook (dataloader_dev.py) for learning purposes. The core package exports only the essential data loading components.

🎯 Success Criteria

Your data module is complete when:

All tests pass: python bin/tito.py test --module dataloader
Data classes import correctly: from tinytorch.core.dataloader import Dataset, DataLoader
Dataset loading works: Can create datasets and access samples
Batching works: DataLoader produces correct batch shapes
Preprocessing works: Normalizer computes and applies statistics
Pipeline works: Complete pipeline creates train/test loaders

💡 Implementation Tips

Start with the Interface

Dataset base class - Define the abstract interface
Simple test dataset - Create mock data for testing
Basic DataLoader - Implement batching without shuffling
Add shuffling - Randomize sample order
Test frequently - Verify each component works

Design Patterns

class Dataset:
    def __getitem__(self, index):
        # Return (data, label) tuple
        return data_tensor, label_tensor
    
    def __len__(self):
        # Return total number of samples
        return self.num_samples

class DataLoader:
    def __iter__(self):
        # Yield batches of (batch_data, batch_labels)
        for batch_data, batch_labels in self._create_batches():
            yield batch_data, batch_labels

Systems Thinking

Memory management: Don't load the entire dataset into RAM
I/O efficiency: Batch file operations when possible
Preprocessing: Compute statistics once, apply many times
Interface design: Make components easily swappable
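
The memory and preprocessing points combine naturally: normalization statistics can be accumulated one batch at a time, so the full dataset never needs to sit in memory at once. The sketch below keeps running sums per channel. `streaming_channel_stats` is a hypothetical helper, and `(N, C, H, W)` NumPy batches are an assumption.

```python
import numpy as np


def streaming_channel_stats(batches):
    """Per-channel mean and std over an iterable of (N, C, H, W) batches."""
    count = 0
    total = None
    total_sq = None
    for batch in batches:
        batch = np.asarray(batch, dtype=np.float64)
        if total is None:
            total = np.zeros(batch.shape[1])
            total_sq = np.zeros(batch.shape[1])
        count += batch.shape[0] * batch.shape[2] * batch.shape[3]
        total += batch.sum(axis=(0, 2, 3))
        total_sq += (batch ** 2).sum(axis=(0, 2, 3))
    if count == 0:
        raise ValueError("no batches provided")
    mean = total / count
    # Var[x] = E[x^2] - E[x]^2; clip tiny negatives caused by rounding.
    std = np.sqrt(np.maximum(total_sq / count - mean ** 2, 0.0))
    return mean, std


# The array here is small and fully in memory; the generator only mimics
# a loader that hands over one batch at a time.
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(64, 3, 8, 8)).astype(np.float64)
chunks = (data[i:i + 16] for i in range(0, 64, 16))
mean, std = streaming_channel_stats(chunks)
```

The sum-of-squares formula is simple but can lose precision when the mean is large relative to the spread. For pixel values in the 0-255 range with 64-bit floats it is adequate.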

Common Challenges

Binary file parsing - CIFAR-10 uses a custom format
Batch size handling - Last batch may be smaller
Data type consistency - Convert to consistent types
Error handling - Provide helpful debugging messages
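
Binary parsing and helpful error messages can be sketched together. This assumes the binary distribution of CIFAR-10, in which each record is one label byte followed by 3,072 pixel bytes (three 32x32 colour planes in R, G, B order). If your loader reads a different distribution of the dataset, the layout will differ. `parse_cifar10_records` is a hypothetical helper, not part of TinyTorch.

```python
import numpy as np

RECORD_BYTES = 1 + 3 * 32 * 32  # 1 label byte + 3072 pixel bytes


def parse_cifar10_records(raw):
    """Split raw bytes from a CIFAR-10 binary batch into images and labels."""
    if len(raw) % RECORD_BYTES != 0:
        raise ValueError(
            f"expected a multiple of {RECORD_BYTES} bytes, got {len(raw)}"
        )
    records = np.frombuffer(raw, dtype=np.uint8).reshape(-1, RECORD_BYTES)
    labels = records[:, 0].astype(np.int64)
    # Pixels are stored channel by channel (R, G, B), each plane row-major.
    images = records[:, 1:].reshape(-1, 3, 32, 32)
    return images, labels


# Synthetic two-record buffer so the sketch runs without downloading anything.
fake = bytes([3]) + bytes(3072) + bytes([7]) + bytes([255]) * 3072
images, labels = parse_cifar10_records(fake)
print(images.shape, labels.tolist())  # (2, 3, 32, 32) [3, 7]
```

Checking the buffer length up front turns a truncated download into a clear message rather than a confusing reshape error later on.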

🔧 Advanced Features (Optional)

If you finish early, try implementing:

Data augmentation - Random transformations for training
Multi-worker loading - Parallel data loading
Caching - Store processed data for faster access
Different datasets - MNIST, Fashion-MNIST, etc.
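
As one illustration of data augmentation, a random horizontal flip is a common starting point for image data. The sketch below assumes `(N, C, H, W)` NumPy batches. `random_horizontal_flip` and its `p` and `rng` parameters are hypothetical and not part of TinyTorch.

```python
import numpy as np


def random_horizontal_flip(batch, p=0.5, rng=None):
    """Flip each image in an (N, C, H, W) batch left-right with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    batch = np.asarray(batch)
    flip = rng.random(batch.shape[0]) < p  # one decision per image
    out = batch.copy()
    out[flip] = batch[flip][..., ::-1]  # reverse the width axis
    return out


batch = np.arange(2 * 1 * 2 * 3).reshape(2, 1, 2, 3)
always = random_horizontal_flip(batch, p=1.0)
never = random_horizontal_flip(batch, p=0.0)
```

Augmentation of this kind is normally applied to training batches only, so that evaluation sees the data unmodified.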

🚀 Next Steps

Once you complete the data module:

Move to Autograd: cd modules/autograd/
Build automatic differentiation: Enable gradient computation
Combine with data: Train models on real datasets
Prepare for training: Ready for the training module

🔗 Why Data Engineering Matters

Data engineering is the foundation of all ML systems:

Training loops need efficient data loading
Model performance depends on data quality
Production systems require scalable data pipelines
Research needs flexible data interfaces

Your data implementation will power all TinyTorch training!

📊 Real-World Connection

The patterns you'll implement are used in:

PyTorch DataLoader - Same interface and concepts
TensorFlow tf.data - Similar pipeline architecture
Production ML - Scalable data processing systems
Research - Flexible experimentation frameworks

🎉 Ready to Build?

The data module is where TinyTorch becomes a real ML system. You're about to create the infrastructure that will feed neural networks, enable training loops, and power production ML pipelines.

Focus on clean interfaces, efficient implementation, and systems thinking! 🔥