# ๐Ÿ”ฅ Module: DataLoader ## ๐Ÿ“Š Module Info - **Difficulty**: โญโญโญ Advanced - **Time Estimate**: 5-7 hours - **Prerequisites**: Tensor, Layers modules - **Next Steps**: Training, Networks modules Build the data pipeline foundation of TinyTorch! This module implements efficient data loading, preprocessing, and batching systems - the critical infrastructure that feeds neural networks during training. ## ๐ŸŽฏ Learning Objectives By the end of this module, you will: - โœ… Understand data engineering as the foundation of ML systems - โœ… Implement reusable dataset abstractions and interfaces - โœ… Build efficient data loaders with batching and shuffling - โœ… Create data preprocessing pipelines for normalization - โœ… Apply systems thinking to data I/O and memory management - โœ… Have a complete data pipeline ready for neural network training ## ๐Ÿ“‹ Module Structure ``` modules/dataloader/ โ”œโ”€โ”€ README.md # ๐Ÿ“– This file - Module overview โ”œโ”€โ”€ dataloader_dev.py # ๐Ÿ”ง Main development file โ”œโ”€โ”€ dataloader_dev.ipynb # ๐Ÿ““ Generated notebook (auto-created) โ”œโ”€โ”€ tests/ โ”‚ โ””โ”€โ”€ test_dataloader.py # ๐Ÿงช Automated tests โ””โ”€โ”€ check_dataloader.py # โœ… Manual verification (coming soon) ``` ## ๐Ÿš€ Getting Started ### Step 1: Complete Prerequisites Make sure you've completed the foundational modules: ```bash tito test --module setup # Should pass tito test --module tensor # Should pass tito test --module layers # Should pass ``` ### Step 2: Open the Data Development File ```bash # Start from the dataloader module directory cd modules/dataloader/ # Convert to notebook if needed tito notebooks --module dataloader # Open the development notebook jupyter lab dataloader_dev.ipynb ``` ### Step 3: Work Through the Implementation The development file guides you through building: 1. **Dataset base class** - Abstract interface for all datasets 2. **CIFAR-10 implementation** - Real dataset with binary file parsing 3. **DataLoader** - Efficient batching and shuffling system 4. **Normalizer** - Data preprocessing for stable training 5. **Complete pipeline** - Integration of all components ### Step 4: Export and Test ```bash # Export your dataloader implementation tito sync --module dataloader # Test your implementation tito test --module dataloader ``` ## ๐Ÿ“š What You'll Implement ### Core Data Infrastructure You'll build a complete data loading system that supports: #### 1. Dataset Abstraction ```python # Abstract base class for all datasets class Dataset: def __getitem__(self, index): # Get single sample and label pass def __len__(self): # Get total number of samples pass def get_num_classes(self): # Get number of classes pass # Concrete implementation dataset = CIFAR10Dataset("data/cifar10/", train=True) image, label = dataset[0] # Get first sample ``` #### 2. Real Dataset Loading ```python # CIFAR-10 dataset with download and parsing dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True) print(f"Dataset size: {len(dataset)}") # 50,000 training samples print(f"Sample shape: {dataset.get_sample_shape()}") # (3, 32, 32) print(f"Classes: {dataset.get_num_classes()}") # 10 classes ``` #### 3. Efficient Data Loading ```python # DataLoader with batching and shuffling dataloader = DataLoader(dataset, batch_size=32, shuffle=True) for batch_images, batch_labels in dataloader: print(f"Batch shape: {batch_images.shape}") # (32, 3, 32, 32) print(f"Labels shape: {batch_labels.shape}") # (32,) # Ready for neural network training! ``` #### 4. Data Preprocessing ```python # Normalizer for stable training normalizer = Normalizer() normalizer.fit(training_data) # Compute statistics normalized_data = normalizer.transform(test_data) # Apply normalization ``` #### 5. Complete Pipeline ```python # One-function pipeline creation train_loader, test_loader, normalizer = create_data_pipeline( dataset_path="data/cifar10/", batch_size=32, normalize=True, shuffle=True ) ``` ### Technical Requirements Your data system must: - Handle multiple dataset types through common interface - Efficiently load and parse binary data files - Support batching with configurable batch sizes - Implement shuffling for training randomization - Provide data normalization for stable training - Export to `tinytorch.core.dataloader` ## ๐Ÿงช Testing Your Implementation ### Progressive Testing with Real Data The tests follow the **"Build โ†’ Use โ†’ Understand"** pattern with real CIFAR-10 data: ```bash # Run all tests (downloads real CIFAR-10 data) tito test --module dataloader # Run specific test categories python -m pytest tests/test_dataloader.py::TestDatasetInterface -v # Test abstract interface python -m pytest tests/test_dataloader.py::TestCIFAR10Dataset -v # Test real data loading python -m pytest tests/test_dataloader.py::TestDataLoader -v # Test batching real data python -m pytest tests/test_dataloader.py::TestNormalizer -v # Test normalizing real data python -m pytest tests/test_dataloader.py::TestDataPipeline -v # Test complete pipeline ``` ### Real Data Testing Flow Each test builds on the previous component using actual CIFAR-10 data: 1. **Build Dataset** โ†’ **Test**: Download and load real CIFAR-10 images (50,000 training, 10,000 test) 2. **Build DataLoader** โ†’ **Test**: Batch real images with proper shuffling and iteration 3. **Build Normalizer** โ†’ **Test**: Normalize real pixel values (0-255 range โ†’ standardized) 4. **Build Pipeline** โ†’ **Test**: Complete pipeline with real data flow and preprocessing ### Why Real Data Testing Matters - **Real-world validation**: Tests work with actual data students will use in training - **Immediate feedback**: See your pipeline working with real images, not fake data - **Systems thinking**: Understand I/O, memory, and performance with real data distributions - **Debugging**: Catch issues that only appear with real data (file formats, edge cases) **Note**: First test run downloads ~170MB CIFAR-10 dataset with progress bar. Subsequent runs use cached data. ### Interactive Testing with Visual Feedback ```python # Test in the notebook or Python REPL from tinytorch.core.dataloader import Dataset, DataLoader, CIFAR10Dataset # Create and test datasets with real data dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True) print(f"Loaded {len(dataset)} real CIFAR-10 samples") # Test data loading dataloader = DataLoader(dataset, batch_size=16) for batch_data, batch_labels in dataloader: print(f"Real batch shape: {batch_data.shape}") # (16, 3, 32, 32) print(f"Real labels: {batch_labels}") # Actual CIFAR-10 classes break ``` ### ๐ŸŽจ Development Visual Feedback The development notebook (`dataloader_dev.py`) includes **visual feedback** for learning: ```python # ๐Ÿ‘๏ธ SEE your data - Available in development notebook only show_cifar10_samples(dataset, num_samples=8, title="My CIFAR-10 Data") ``` ### ๐ŸŽจ Visual Feedback Features (Development Only) The development notebook includes **visual feedback** for learning and debugging: - **Download progress bar**: Visual progress indicator during CIFAR-10 download (~170MB) - **`show_cifar10_samples()`**: Display a grid of CIFAR-10 images with class labels - **Real image visualization**: See actual airplanes, cars, birds, cats, etc. - **Batch visualization**: View what your DataLoader is producing - **Pipeline visualization**: See the complete data flow in action **Why Visual Feedback Matters:** - **Build confidence**: See that your data pipeline is working correctly - **Debug issues**: Spot problems like incorrect normalization or corrupted images - **Understand data**: Build intuition about what your model will be learning from - **Immediate feedback**: Visual confirmation follows the "Build โ†’ Use โ†’ Understand" pattern **Note**: Visual feedback is available in the development notebook (`data_dev.py`) for learning purposes. The core package exports only the essential data loading components. ## ๐ŸŽฏ Success Criteria Your data module is complete when: 1. **All tests pass**: `tito test --module dataloader` 2. **Data classes import correctly**: `from tinytorch.core.dataloader import Dataset, DataLoader` 3. **Dataset loading works**: Can create datasets and access samples 4. **Batching works**: DataLoader produces correct batch shapes 5. **Preprocessing works**: Normalizer computes and applies statistics 6. **Pipeline works**: Complete pipeline creates train/test loaders ## ๐Ÿ’ก Implementation Tips ### Start with the Interface 1. **Dataset base class** - Define the abstract interface 2. **Simple test dataset** - Create mock data for testing 3. **Basic DataLoader** - Implement batching without shuffling 4. **Add shuffling** - Randomize sample order 5. **Test frequently** - Verify each component works ### Design Patterns ```python class Dataset: def __getitem__(self, index): # Return (data, label) tuple return data_tensor, label_tensor def __len__(self): # Return total number of samples return self.num_samples class DataLoader: def __iter__(self): # Yield batches of (batch_data, batch_labels) for batch in self._create_batches(): yield batch_data, batch_labels ``` ### Systems Thinking - **Memory management**: Don't load entire dataset into RAM - **I/O efficiency**: Batch file operations when possible - **Preprocessing**: Compute statistics once, apply many times - **Interface design**: Make components easily swappable ### Common Challenges - **Binary file parsing** - CIFAR-10 uses custom format - **Batch size handling** - Last batch may be smaller - **Data type consistency** - Convert to consistent types - **Error handling** - Provide helpful debugging messages ## ๐Ÿ”ง Advanced Features (Optional) If you finish early, try implementing: - **Data augmentation** - Random transformations for training - **Multi-worker loading** - Parallel data loading - **Caching** - Store processed data for faster access - **Different datasets** - MNIST, Fashion-MNIST, etc. ## ๐Ÿš€ Next Steps Once you complete the data module: 1. **Move to Autograd**: `cd modules/autograd/` 2. **Build automatic differentiation**: Enable gradient computation 3. **Combine with data**: Train models on real datasets 4. **Prepare for training**: Ready for the training module ## ๐Ÿ”— Why Data Engineering Matters Data engineering is the foundation of all ML systems: - **Training loops** need efficient data loading - **Model performance** depends on data quality - **Production systems** require scalable data pipelines - **Research** needs flexible data interfaces Your data implementation will power all TinyTorch training! ## ๐Ÿ“Š Real-World Connection The patterns you'll implement are used in: - **PyTorch DataLoader** - Same interface and concepts - **TensorFlow tf.data** - Similar pipeline architecture - **Production ML** - Scalable data processing systems - **Research** - Flexible experimentation frameworks ## ๐ŸŽ‰ Ready to Build? The data module is where TinyTorch becomes a real ML system. You're about to create the infrastructure that will feed neural networks, enable training loops, and power production ML pipelines. Focus on clean interfaces, efficient implementation, and systems thinking! ๐Ÿ”ฅ