mirror of
https://github.com/MLSysBook/TinyTorch.git
synced 2026-05-03 15:32:44 -05:00
# TinyTalks: A Conversational Q&A Dataset for Educational Transformers

**A carefully curated question-answering dataset designed for learning transformer architectures**

[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

---

## 📖 Overview

**TinyTalks** is a lightweight, pedagogically designed conversational dataset for training transformer models in educational settings. Unlike large-scale datasets that require hours of training, TinyTalks enables students to see their first transformer learn meaningful patterns in **under 5 minutes**.

### Why TinyTalks?

✅ **Fast Training** - Trains in 3-5 minutes on a laptop

✅ **Verifiable Learning** - Clear success metrics (correct vs. incorrect answers)

✅ **Progressive Difficulty** - 5 levels from greetings to reasoning

✅ **Educational Focus** - Designed for "aha!" moments, not benchmarks

✅ **Zero Dependencies** - Ships with TinyTorch, no downloads needed

✅ **Reproducible** - Deterministic generation, versioned releases

---

## 📊 Dataset Statistics

| Property | Value |
|----------|-------|
| **Total Q&A Pairs** | 350 |
| **File Size** | ~40 KB |
| **Vocabulary Size** | ~1,500 unique tokens |
| **Avg Question Length** | 8 words |
| **Avg Answer Length** | 10 words |
| **Training Split** | 245 pairs (70%) |
| **Validation Split** | 53 pairs (15%) |
| **Test Split** | 52 pairs (15%) |
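
The split sizes follow from the 70/15/15 ratios: 15% of 350 is 52.5, so one split receives the odd pair. One way to reproduce the published numbers (a hypothetical rounding scheme, not necessarily the one the generation script uses):

```python
# Derive the published split sizes from the 70/15/15 ratios.
total = 350
train = total * 70 // 100    # 245
remaining = total - train    # 105, shared between val and test
test = remaining // 2        # 52
val = remaining - test       # 53 (val receives the odd pair)

assert (train, val, test) == (245, 53, 52)
assert train + val + test == total
```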
---

## 🎯 Content Structure

TinyTalks is organized into **5 progressive difficulty levels**:

### **Level 1: Greetings & Identity (50 pairs)**

Basic conversational patterns and self-identification.

```
Q: Hello!
A: Hi there! How can I help you today?

Q: What is your name?
A: I am TinyBot, a simple AI assistant.
```

**Learning Goal:** Model learns conversation structure and identity.

---

### **Level 2: Simple Facts (100 pairs)**

Factual knowledge about the world (colors, animals, objects).

```
Q: What color is the sky?
A: The sky is blue during the day.

Q: How many legs does a dog have?
A: A dog has four legs.
```

**Learning Goal:** Model learns to retrieve and state facts.

---

### **Level 3: Basic Math (50 pairs)**

Simple arithmetic operations (single-digit addition, subtraction).

```
Q: What is 2 plus 3?
A: 2 plus 3 equals 5.

Q: What is 7 minus 4?
A: 7 minus 4 equals 3.
```

**Learning Goal:** Model learns basic numerical reasoning.

---

### **Level 4: Common Sense Reasoning (100 pairs)**

Purpose, usage, and logical relationships.

```
Q: What do you use an umbrella for?
A: You use an umbrella to stay dry in the rain.

Q: Where do fish live?
A: Fish live in water.
```

**Learning Goal:** Model learns causal and functional relationships.

---

### **Level 5: Multi-turn Context (50 pairs)**

Simple context tracking across exchanges.

```
Q: I like pizza.
A: Pizza is delicious! What toppings do you like?

Q: I like pepperoni.
A: Pepperoni is a popular choice!
```

**Learning Goal:** Model begins to track conversational context.

---

## 🚀 Quick Start

### Loading the Dataset

```python
# Load the full dataset
with open('datasets/tinytalks/tinytalks_v1.txt', 'r') as f:
    text = f.read()

# Or use the pre-split versions
with open('datasets/tinytalks/splits/train.txt', 'r') as f:
    train_text = f.read()
```

### Training a Transformer

```python
# See milestones/05_2017_transformer/tinybot_demo.py for the full example
from tinytorch.models.transformer import GPT
from tinytorch.text.tokenization import CharTokenizer

# Build the character-level vocabulary from the training text
tokenizer = CharTokenizer()
tokenizer.fit(train_text)

# Initialize the model
model = GPT(
    vocab_size=len(tokenizer),
    embed_dim=128,
    num_layers=4,
    num_heads=4,
    max_seq_len=64
)

# Train for ~5 minutes → see meaningful results!
```
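
Character-level tokenization is simple enough to sketch from scratch. The class below is an illustrative stand-in, not TinyTorch's actual `CharTokenizer` implementation, whose API may differ:

```python
class TinyCharTokenizer:
    """Minimal character-level tokenizer: one integer id per distinct character."""

    def fit(self, text):
        # Sorted so the vocabulary is deterministic and reproducible
        vocab = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(vocab)}
        self.itos = {i: ch for i, ch in enumerate(vocab)}

    def __len__(self):
        return len(self.stoi)

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = TinyCharTokenizer()
tok.fit("Q: Hello!\nA: Hi there!")
ids = tok.encode("Hello")
assert tok.decode(ids) == "Hello"
```

Sorting the character set before assigning ids keeps the vocabulary deterministic, which matches the dataset's reproducibility goal.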

### Expected Performance

After training for **10-20 epochs** (~3-5 minutes):

- ✅ Correctly answers Level 1-2 questions (~80% accuracy)
- ✅ Maintains grammatical structure
- ✅ Generates coherent (if not always correct) responses
- ⚠️ Levels 3-5 show partial understanding

This demonstrates that the transformer has **learned patterns**, not just memorized them.

---

## 📐 Dataset Format

**Simple, human-readable text format:**

```
Q: [Question text]
A: [Answer text]

Q: [Next question]
A: [Next answer]
```

**Rationale:**

- Character-level tokenization (no special tokenizers needed)
- Easy to inspect and validate
- Works with any text processing pipeline
- Human-readable for debugging

**Delimiter:** An empty line separates Q&A pairs.
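
Because the format is plain text with fixed `Q: `/`A: ` prefixes, parsing it takes a few lines of standard-library Python (the `parse_pairs` helper is illustrative, not shipped with the dataset):

```python
def parse_pairs(text):
    """Split the Q:/A: text format into (question, answer) tuples."""
    pairs = []
    # Empty lines delimit pairs, per the format above
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) >= 2 and lines[0].startswith("Q: ") and lines[1].startswith("A: "):
            pairs.append((lines[0][3:], lines[1][3:]))
    return pairs

sample = "Q: Hello!\nA: Hi there!\n\nQ: What is 2 plus 3?\nA: 2 plus 3 equals 5."
print(parse_pairs(sample))
```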

---

## 🔬 Dataset Creation Methodology

### Generation Process

1. **Manual Curation** - All Q&A pairs hand-written by TinyTorch maintainers
2. **Diversity Sampling** - Systematic coverage of topics within each level
3. **Quality Control** - Each pair reviewed for grammar, factual accuracy, appropriateness
4. **Balance Verification** - Ensured even distribution across levels
5. **Reproducibility** - Generation script (`scripts/generate_tinytalks.py`) produces identical output

### Quality Assurance

- ✅ Grammar check (automated + manual review)
- ✅ Factual accuracy verification
- ✅ No offensive or biased content
- ✅ No personally identifiable information
- ✅ Balanced topic distribution
- ✅ Appropriate for all ages

### Validation Script

```bash
python datasets/tinytalks/scripts/validate_dataset.py
```

Checks:

- Format consistency
- No duplicate pairs
- Balanced splits
- Character encoding (UTF-8)
- Line endings (Unix)
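
The format-consistency and duplicate checks can be sketched as follows; this illustrates the idea only and is not the actual `validate_dataset.py` implementation:

```python
def validate(text):
    """Check Q:/A: format consistency and flag duplicate pairs."""
    pairs, seen, problems = [], set(), []
    for i, block in enumerate(text.strip().split("\n\n")):
        lines = block.splitlines()
        # Each block must be exactly one question line and one answer line
        if len(lines) != 2 or not lines[0].startswith("Q: ") or not lines[1].startswith("A: "):
            problems.append(f"block {i}: malformed")
            continue
        pair = (lines[0], lines[1])
        if pair in seen:
            problems.append(f"block {i}: duplicate")
        seen.add(pair)
        pairs.append(pair)
    return pairs, problems

# A deliberately duplicated pair is reported
sample = "Q: Hello!\nA: Hi!\n\nQ: Hello!\nA: Hi!"
pairs, problems = validate(sample)
```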
---

## 📊 Statistics Script

Run `scripts/stats.py` to generate:

```bash
python datasets/tinytalks/scripts/stats.py
```

Output:

- Total pairs per level
- Vocabulary statistics
- Length distributions
- Split sizes
- Character frequency
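
Character frequency, for example, is a one-liner with the standard library:

```python
from collections import Counter

sample = "Q: Hello!\nA: Hi there!"
freq = Counter(sample)

# Spaces and 'e' are the most frequent characters in this sample
print(freq.most_common(2))
```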
---

## 🎓 Educational Use Cases

### Primary Use: Module 13 (Transformers)

TinyTalks is designed as the **canonical dataset** for TinyTorch's Transformer milestone:

- **milestones/05_2017_transformer/tinybot_demo.py** - Main training demo
- Students see their first transformer learn in under 5 minutes
- Clear success metric: can it answer questions?
- "Wow, I built this!" moment

### Secondary Uses

1. **Tokenization** (Module 10) - Character vs. BPE comparison
2. **Embeddings** (Module 11) - Visualize learned embeddings
3. **Attention** (Module 12) - Inspect attention patterns on Q&A
4. **Debugging** - Small enough to trace gradients manually
5. **Experimentation** - Test architecture changes quickly

---

## ⚖️ License

**Creative Commons Attribution 4.0 International (CC BY 4.0)**

You are free to:

- ✅ Share — copy and redistribute in any format
- ✅ Adapt — remix, transform, and build upon the material
- ✅ Use commercially

Under these terms:

- **Attribution** — Cite TinyTalks (see below)
- **No additional restrictions**

See [LICENSE](LICENSE) for the full text.

---

## 📚 Citation

If you use TinyTalks in your work, please cite:

```bibtex
@dataset{tinytalks2025,
  title={TinyTalks: A Conversational Q\&A Dataset for Educational Transformers},
  author={TinyTorch Contributors},
  year={2025},
  publisher={GitHub},
  url={https://github.com/VJ/TinyTorch/tree/main/datasets/tinytalks},
  version={1.0.0}
}
```

**Text citation:**

TinyTorch Contributors. (2025). TinyTalks: A Conversational Q&A Dataset for Educational Transformers (Version 1.0.0). https://github.com/VJ/TinyTorch

---

## 🔄 Versioning

**Version 1.0.0** (Current)

- Initial release: 350 Q&A pairs across 5 levels
- Character-level format
- 70/15/15 train/val/test split

**Planned:**

- v1.1 - Add 100 more Level 4-5 pairs for better reasoning
- v2.0 - Multi-language support (Spanish, French)
- v3.0 - Expanded to 1,000 pairs with more complex reasoning

See [CHANGELOG.md](CHANGELOG.md) for detailed history.

---

## 🤝 Contributing

We welcome contributions! Ways to help:

1. **Report Issues** - Found a factual error or typo? Open an issue.
2. **Suggest Q&A Pairs** - Submit ideas for new questions via PR.
3. **Translations** - Help translate TinyTalks to other languages.
4. **Validation** - Test on different models and report results.

**Guidelines:**

- Follow the existing format and style
- Ensure factual accuracy
- Keep language simple and clear
- No offensive or biased content
- Appropriate for all ages (G-rated)

See [CONTRIBUTING.md](../../CONTRIBUTING.md) for details.

---

## 📞 Contact & Support

- **Issues:** [GitHub Issues](https://github.com/VJ/TinyTorch/issues)
- **Discussions:** [GitHub Discussions](https://github.com/VJ/TinyTorch/discussions)
- **Email:** tinytorch@example.com (for sensitive issues)

---

## 🙏 Acknowledgments

**Inspired by:**

- bAbI Dataset (Facebook AI Research) - Reasoning tasks
- SQuAD - Question answering format
- TinyStories - Simplicity philosophy
- TinyTorch Community - Feedback and testing

**Created for:**

- Students learning transformer architectures
- Educators teaching NLP
- Researchers prototyping small models
- Developers testing implementations

---

## 📖 Additional Documentation

- **[DATASHEET.md](DATASHEET.md)** - Comprehensive dataset metadata (Gebru et al. format)
- **[examples/demo_usage.py](examples/demo_usage.py)** - Complete usage examples
- **[scripts/README.md](scripts/README.md)** - Scripts documentation

---

## 🌟 Why "TinyTalks"?

The name embodies our philosophy:

- **Tiny** - Small enough to train in minutes, not hours
- **Talks** - Conversational, accessible, human-like
- **Educational** - Designed for learning, not leaderboards

Just like TinyTorch makes deep learning accessible, TinyTalks makes conversational AI **immediate and tangible**.

---

*Built with ❤️ by the TinyTorch community*

*"The best way to understand transformers is to see them learn."*