# 🔥 Module: DataLoader

## 📊 Module Info
- **Difficulty**: ⭐⭐⭐ Advanced
- **Time Estimate**: 5-7 hours
- **Prerequisites**: Tensor, Layers modules
- **Next Steps**: Training, Networks modules

Build the data pipeline foundation of TinyTorch! This module implements efficient data loading, preprocessing, and batching systems: the critical infrastructure that feeds neural networks during training.

## 🎯 Learning Objectives

By the end of this module, you will:
- ✅ Understand data engineering as the foundation of ML systems
- ✅ Implement reusable dataset abstractions and interfaces
- ✅ Build efficient data loaders with batching and shuffling
- ✅ Create data preprocessing pipelines for normalization
- ✅ Apply systems thinking to data I/O and memory management
- ✅ Have a complete data pipeline ready for neural network training

## 📋 Module Structure

```
modules/dataloader/
├── README.md                  # 📖 This file - Module overview
├── dataloader_dev.py          # 🔧 Main development file
├── dataloader_dev.ipynb       # 📓 Generated notebook (auto-created)
├── tests/
│   └── test_dataloader.py     # 🧪 Automated tests
└── check_dataloader.py        # ✅ Manual verification (coming soon)
```

## 🚀 Getting Started

### Step 1: Complete Prerequisites
Make sure you've completed the foundational modules:
```bash
tito test --module setup    # Should pass
tito test --module tensor   # Should pass
tito test --module layers   # Should pass
```

### Step 2: Open the Data Development File
```bash
# Start from the dataloader module directory
cd modules/dataloader/

# Convert to notebook if needed
tito notebooks --module dataloader

# Open the development notebook
jupyter lab dataloader_dev.ipynb
```

### Step 3: Work Through the Implementation
The development file guides you through building:
1. **Dataset base class** - Abstract interface for all datasets
2. **CIFAR-10 implementation** - Real dataset with binary file parsing
3. **DataLoader** - Efficient batching and shuffling system
4. **Normalizer** - Data preprocessing for stable training
5. **Complete pipeline** - Integration of all components

### Step 4: Export and Test
```bash
# Export your dataloader implementation
tito sync --module dataloader

# Test your implementation
tito test --module dataloader
```

## 📚 What You'll Implement

### Core Data Infrastructure
You'll build a complete data loading system that supports:

#### 1. Dataset Abstraction
```python
# Abstract base class for all datasets
class Dataset:
    def __getitem__(self, index):
        # Get single sample and label
        pass

    def __len__(self):
        # Get total number of samples
        pass

    def get_num_classes(self):
        # Get number of classes
        pass

# Concrete implementation
dataset = CIFAR10Dataset("data/cifar10/", train=True)
image, label = dataset[0]  # Get first sample
```

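To make the interface concrete, here is a minimal sketch of one way a subclass could satisfy it. `ListDataset` is a hypothetical helper invented for illustration (it is not part of TinyTorch), and plain Python lists stand in for tensors so the example runs without any dependencies.

```python
class Dataset:
    """Abstract interface: subclasses must supply indexing and length."""

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def get_num_classes(self):
        raise NotImplementedError


class ListDataset(Dataset):
    """Hypothetical in-memory dataset built from two parallel lists."""

    def __init__(self, samples, labels):
        if len(samples) != len(labels):
            raise ValueError("samples and labels must have the same length")
        self.samples = samples
        self.labels = labels

    def __getitem__(self, index):
        # Return one (sample, label) pair, as the interface requires.
        return self.samples[index], self.labels[index]

    def __len__(self):
        return len(self.samples)

    def get_num_classes(self):
        # Count the distinct labels that actually occur.
        return len(set(self.labels))


toy = ListDataset([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]], [0, 1, 0])
sample, label = toy[1]
print(len(toy), toy.get_num_classes(), sample, label)  # 3 2 [1.0, 0.0] 1
```

Raising `NotImplementedError` in the base class is one possible design choice: it turns a forgotten override into an immediate, readable error instead of a silent `None`.
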
#### 2. Real Dataset Loading
```python
# CIFAR-10 dataset with download and parsing
dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
print(f"Dataset size: {len(dataset)}")                # 50,000 training samples
print(f"Sample shape: {dataset.get_sample_shape()}")  # (3, 32, 32)
print(f"Classes: {dataset.get_num_classes()}")        # 10 classes
```

#### 3. Efficient Data Loading
```python
# DataLoader with batching and shuffling
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch_images, batch_labels in dataloader:
    print(f"Batch shape: {batch_images.shape}")   # (32, 3, 32, 32)
    print(f"Labels shape: {batch_labels.shape}")  # (32,)
    # Ready for neural network training!
```

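The batching and shuffling logic behind a loader like this can be sketched in a few lines. The following is an illustrative sketch only, not the module's reference solution: `SimpleDataLoader` and its `seed` parameter are assumptions made for the example, and batches are returned as plain lists rather than tensors.

```python
import random


class SimpleDataLoader:
    """Hypothetical loader: yields (batch_data, batch_labels) lists."""

    def __init__(self, dataset, batch_size=32, shuffle=False, seed=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)  # private RNG keeps runs reproducible

    def __len__(self):
        # Number of batches, counting a smaller final batch (ceiling division).
        return (len(self.dataset) + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            self.rng.shuffle(indices)  # reshuffled on every pass
        for start in range(0, len(indices), self.batch_size):
            chunk = indices[start:start + self.batch_size]
            pairs = [self.dataset[i] for i in chunk]
            batch_data = [data for data, _ in pairs]
            batch_labels = [label for _, label in pairs]
            yield batch_data, batch_labels


# Any object with __len__ and __getitem__ works; a list of pairs is enough here.
pairs = [([float(i)], i % 2) for i in range(10)]
loader = SimpleDataLoader(pairs, batch_size=4, shuffle=True, seed=0)
batch_sizes = [len(labels) for _, labels in loader]
print(batch_sizes)  # [4, 4, 2]
```

Shuffling the index list instead of the data itself keeps the dataset untouched and makes each epoch's order cheap to regenerate.
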
#### 4. Data Preprocessing
```python
# Normalizer for stable training
normalizer = Normalizer()
normalizer.fit(training_data)                      # Compute statistics
normalized_data = normalizer.transform(test_data)  # Apply normalization
```

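As a rough illustration of the fit/transform split, the sketch below standardizes values with `(x - mean) / std`. It assumes standardization is the intended normalization and works on flat lists of floats; `SimpleNormalizer` and its `eps` parameter are hypothetical names, not the module's actual `Normalizer`.

```python
import math


class SimpleNormalizer:
    """Hypothetical standardizer: (x - mean) / std over a flat list of floats."""

    def __init__(self, eps=1e-8):
        self.eps = eps  # guards against division by zero for constant data
        self.mean = None
        self.std = None

    def fit(self, values):
        # Compute the statistics once, from the training data only.
        n = len(values)
        self.mean = sum(values) / n
        variance = sum((v - self.mean) ** 2 for v in values) / n
        self.std = math.sqrt(variance)
        return self

    def transform(self, values):
        if self.mean is None:
            raise RuntimeError("call fit() before transform()")
        return [(v - self.mean) / (self.std + self.eps) for v in values]


train_pixels = [0.0, 64.0, 128.0, 255.0]
test_pixels = [32.0, 200.0]

norm = SimpleNormalizer().fit(train_pixels)
train_scaled = norm.transform(train_pixels)
test_scaled = norm.transform(test_pixels)  # reuses the training statistics
```

Fitting on the training split and reusing those statistics for the test split avoids leaking test-set information into preprocessing.
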
#### 5. Complete Pipeline
```python
# One-function pipeline creation
train_loader, test_loader, normalizer = create_data_pipeline(
    dataset_path="data/cifar10/",
    batch_size=32,
    normalize=True,
    shuffle=True
)
```

### Technical Requirements
Your data system must:
- Handle multiple dataset types through a common interface
- Efficiently load and parse binary data files
- Support batching with configurable batch sizes
- Implement shuffling for training randomization
- Provide data normalization for stable training
- Export to `tinytorch.core.dataloader`

## 🧪 Testing Your Implementation

### Progressive Testing with Real Data

The tests follow the **"Build → Use → Understand"** pattern with real CIFAR-10 data:

```bash
# Run all tests (downloads real CIFAR-10 data)
tito test --module dataloader

# Run specific test categories
python -m pytest tests/test_dataloader.py::TestDatasetInterface -v  # Test abstract interface
python -m pytest tests/test_dataloader.py::TestCIFAR10Dataset -v    # Test real data loading
python -m pytest tests/test_dataloader.py::TestDataLoader -v        # Test batching real data
python -m pytest tests/test_dataloader.py::TestNormalizer -v        # Test normalizing real data
python -m pytest tests/test_dataloader.py::TestDataPipeline -v      # Test complete pipeline
```

### Real Data Testing Flow

Each test builds on the previous component using actual CIFAR-10 data:

1. **Build Dataset** → **Test**: Download and load real CIFAR-10 images (50,000 training, 10,000 test)
2. **Build DataLoader** → **Test**: Batch real images with proper shuffling and iteration
3. **Build Normalizer** → **Test**: Normalize real pixel values (0-255 range → standardized)
4. **Build Pipeline** → **Test**: Complete pipeline with real data flow and preprocessing

### Why Real Data Testing Matters

- **Real-world validation**: Tests work with the actual data students will use in training
- **Immediate feedback**: See your pipeline working with real images, not fake data
- **Systems thinking**: Understand I/O, memory, and performance with real data distributions
- **Debugging**: Catch issues that only appear with real data (file formats, edge cases)

**Note**: The first test run downloads the ~170MB CIFAR-10 dataset and shows a progress bar. Subsequent runs use the cached data.

### Interactive Testing with Visual Feedback
```python
# Test in the notebook or Python REPL
from tinytorch.core.dataloader import Dataset, DataLoader, CIFAR10Dataset

# Create and test datasets with real data
dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
print(f"Loaded {len(dataset)} real CIFAR-10 samples")

# Test data loading
dataloader = DataLoader(dataset, batch_size=16)
for batch_data, batch_labels in dataloader:
    print(f"Real batch shape: {batch_data.shape}")  # (16, 3, 32, 32)
    print(f"Real labels: {batch_labels}")           # Actual CIFAR-10 classes
    break
```

### 🎨 Development Visual Feedback

The development notebook (`dataloader_dev.py`) includes **visual feedback** for learning:

```python
# 👁️ SEE your data - Available in development notebook only
show_cifar10_samples(dataset, num_samples=8, title="My CIFAR-10 Data")
```

### 🎨 Visual Feedback Features (Development Only)

The visual helpers in the development notebook support both learning and debugging:

- **Download progress bar**: Visual progress indicator during CIFAR-10 download (~170MB)
- **`show_cifar10_samples()`**: Display a grid of CIFAR-10 images with class labels
- **Real image visualization**: See actual airplanes, cars, birds, cats, etc.
- **Batch visualization**: View what your DataLoader is producing
- **Pipeline visualization**: See the complete data flow in action

**Why Visual Feedback Matters:**
- **Build confidence**: See that your data pipeline is working correctly
- **Debug issues**: Spot problems like incorrect normalization or corrupted images
- **Understand data**: Build intuition about what your model will be learning from
- **Immediate feedback**: Visual confirmation follows the "Build → Use → Understand" pattern

**Note**: Visual feedback is available in the development notebook (`dataloader_dev.py`) for learning purposes. The core package exports only the essential data loading components.

## 🎯 Success Criteria

Your data module is complete when:

1. **All tests pass**: `tito test --module dataloader`
2. **Data classes import correctly**: `from tinytorch.core.dataloader import Dataset, DataLoader`
3. **Dataset loading works**: Can create datasets and access samples
4. **Batching works**: DataLoader produces correct batch shapes
5. **Preprocessing works**: Normalizer computes and applies statistics
6. **Pipeline works**: Complete pipeline creates train/test loaders

## 💡 Implementation Tips

### Start with the Interface
1. **Dataset base class** - Define the abstract interface
2. **Simple test dataset** - Create mock data for testing
3. **Basic DataLoader** - Implement batching without shuffling
4. **Add shuffling** - Randomize sample order
5. **Test frequently** - Verify each component works

### Design Patterns
```python
class Dataset:
    def __getitem__(self, index):
        # Return (data, label) tuple.
        # data_tensor and label_tensor are placeholders for the sample at `index`.
        return data_tensor, label_tensor

    def __len__(self):
        # Return total number of samples
        return self.num_samples

class DataLoader:
    def __iter__(self):
        # Yield batches of (batch_data, batch_labels)
        for batch_data, batch_labels in self._create_batches():
            yield batch_data, batch_labels
```

### Systems Thinking
- **Memory management**: Don't load entire dataset into RAM
- **I/O efficiency**: Batch file operations when possible
- **Preprocessing**: Compute statistics once, apply many times
- **Interface design**: Make components easily swappable

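One way to act on the memory-management point is to read records from disk only when they are requested. The sketch below is an assumption-laden illustration: `LazyRecordDataset` is a hypothetical class, and the fixed record size (1 label byte followed by 3072 pixel bytes) is assumed for the example, not taken from the module's code.

```python
import os
import tempfile

RECORD_BYTES = 1 + 3072  # assumed record size: label byte + pixel bytes


class LazyRecordDataset:
    """Hypothetical dataset that reads one fixed-size record per access."""

    def __init__(self, path):
        self.path = path
        # The file size alone tells us how many records there are.
        self.num_samples = os.path.getsize(path) // RECORD_BYTES

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        with open(self.path, "rb") as f:
            f.seek(index * RECORD_BYTES)  # jump straight to the record
            record = f.read(RECORD_BYTES)
        return record[1:], record[0]  # (pixel bytes, label)


# Write three synthetic records to a temporary file and read one back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for value in (0, 5, 9):
        tmp.write(bytes([value]) + bytes([value]) * 3072)
lazy = LazyRecordDataset(tmp.name)
pixels, label = lazy[1]
print(len(lazy), label, len(pixels))  # 3 5 3072
os.remove(tmp.name)
```

Reopening the file on every access keeps the sketch short; a real loader would more likely keep the handle open or read several records per call, which is the I/O-efficiency point from the same list.
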
### Common Challenges
- **Binary file parsing** - CIFAR-10 uses a custom format
- **Batch size handling** - The last batch may be smaller
- **Data type consistency** - Convert to consistent types
- **Error handling** - Provide helpful debugging messages

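The binary parsing challenge can be sketched as follows. This assumes the record layout of the CIFAR-10 binary distribution (one label byte followed by 3072 pixel bytes, stored as three 1024-byte channel planes in R, G, B order); if the module reads the pickled Python version instead, the parsing differs. `parse_cifar10_records` is a hypothetical helper name.

```python
RECORD_BYTES = 1 + 3 * 32 * 32  # label byte + 3072 pixel bytes


def parse_cifar10_records(raw):
    """Split a bytes buffer into (label, planes) records.

    planes is a list of three channel planes (R, G, B), each a list of
    1024 integers in the range 0-255, stored row by row.
    """
    if len(raw) % RECORD_BYTES != 0:
        # A helpful message beats a silent off-by-one later on.
        raise ValueError(
            f"buffer of {len(raw)} bytes is not a multiple of {RECORD_BYTES}"
        )
    records = []
    for offset in range(0, len(raw), RECORD_BYTES):
        label = raw[offset]
        body = raw[offset + 1:offset + RECORD_BYTES]
        planes = [list(body[c * 1024:(c + 1) * 1024]) for c in range(3)]
        records.append((label, planes))
    return records


# Synthetic two-record buffer, so no download is needed to try the parser.
fake = bytes([3]) + bytes([10]) * 3072 + bytes([7]) + bytes([200]) * 3072
parsed = parse_cifar10_records(fake)
print([label for label, _ in parsed])  # [3, 7]
```

Testing the parser on a synthetic buffer first makes format mistakes visible before the ~170MB download is involved.
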
## 🔧 Advanced Features (Optional)

If you finish early, try implementing:
- **Data augmentation** - Random transformations for training
- **Multi-worker loading** - Parallel data loading
- **Caching** - Store processed data for faster access
- **Different datasets** - MNIST, Fashion-MNIST, etc.

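As a starting point for the data augmentation idea, here is a minimal sketch of a random horizontal flip. It treats a single-channel image as a list of rows; `random_horizontal_flip` and its `p` and `rng` parameters are hypothetical names chosen for the example.

```python
import random


def random_horizontal_flip(image, p=0.5, rng=random):
    """Return a mirrored copy of `image` with probability p.

    `image` is a list of rows; each row is a list of pixel values.
    The input is never modified.
    """
    if rng.random() < p:
        return [row[::-1] for row in image]  # reverse each row
    return [row[:] for row in image]         # unchanged copy


image = [[1, 2, 3],
         [4, 5, 6]]
always = random_horizontal_flip(image, p=1.0)
never = random_horizontal_flip(image, p=0.0)
print(always)  # [[3, 2, 1], [6, 5, 4]]
print(never)   # [[1, 2, 3], [4, 5, 6]]
```

Augmentations like this are normally applied to training samples only, so that evaluation data stays fixed.
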
## 🚀 Next Steps

Once you complete the data module:

1. **Move to Autograd**: `cd modules/autograd/`
2. **Build automatic differentiation**: Enable gradient computation
3. **Combine with data**: Train models on real datasets
4. **Prepare for training**: Ready for the training module

## 🔗 Why Data Engineering Matters

Data engineering is the foundation of all ML systems:
- **Training loops** need efficient data loading
- **Model performance** depends on data quality
- **Production systems** require scalable data pipelines
- **Research** needs flexible data interfaces

Your data implementation will power all TinyTorch training!

## 📊 Real-World Connection

The patterns you'll implement are used in:
- **PyTorch DataLoader** - Same interface and concepts
- **TensorFlow tf.data** - Similar pipeline architecture
- **Production ML** - Scalable data processing systems
- **Research** - Flexible experimentation frameworks

## 🎉 Ready to Build?

The data module is where TinyTorch becomes a real ML system. You're about to create the infrastructure that will feed neural networks, enable training loops, and power production ML pipelines.

Focus on clean interfaces, efficient implementation, and systems thinking! 🔥