# TinyTalks: A Conversational Q&A Dataset for Educational Transformers **A carefully curated question-answering dataset designed for learning transformer architectures** [![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) [![Size: ~50KB](https://img.shields.io/badge/Size-~50KB-blue.svg)]() [![Version: 1.0.0](https://img.shields.io/badge/Version-1.0.0-green.svg)]() --- ## 📖 Overview **TinyTalks** is a lightweight, pedagogically-designed conversational dataset for training transformer models in educational settings. Unlike large-scale datasets that require hours of training, TinyTalks enables students to see their first transformer learn meaningful patterns in **under 5 minutes**. ### Why TinyTalks? ✅ **Fast Training** - Trains in 3-5 minutes on a laptop ✅ **Verifiable Learning** - Clear success metrics (correct vs. incorrect answers) ✅ **Progressive Difficulty** - 5 levels from greetings to reasoning ✅ **Educational Focus** - Designed for "aha!" moments, not benchmarks ✅ **Zero Dependencies** - Ships with TinyTorch, no downloads needed ✅ **Reproducible** - Deterministic generation, versioned releases --- ## 📊 Dataset Statistics | Property | Value | |----------|-------| | **Total Q&A Pairs** | 350 | | **File Size** | ~40 KB | | **Vocabulary Size** | ~1,500 unique tokens (character-level) | | **Avg Question Length** | 8 words | | **Avg Answer Length** | 10 words | | **Training Split** | 245 pairs (70%) | | **Validation Split** | 53 pairs (15%) | | **Test Split** | 52 pairs (15%) | --- ## 🎯 Content Structure TinyTalks is organized into **5 progressive difficulty levels**: ### **Level 1: Greetings & Identity (50 pairs)** Basic conversational patterns and self-identification. ``` Q: Hello! A: Hi there! How can I help you today? Q: What is your name? A: I am TinyBot, a simple AI assistant. ``` **Learning Goal:** Model learns conversation structure and identity. --- ### **Level 2: Simple Facts (100 pairs)** Factual knowledge about the world (colors, animals, objects). ``` Q: What color is the sky? A: The sky is blue during the day. Q: How many legs does a dog have? A: A dog has four legs. ``` **Learning Goal:** Model learns to retrieve and state facts. --- ### **Level 3: Basic Math (50 pairs)** Simple arithmetic operations (single-digit addition, subtraction). ``` Q: What is 2 plus 3? A: 2 plus 3 equals 5. Q: What is 7 minus 4? A: 7 minus 4 equals 3. ``` **Learning Goal:** Model learns basic numerical reasoning. --- ### **Level 4: Common Sense Reasoning (100 pairs)** Purpose, usage, and logical relationships. ``` Q: What do you use an umbrella for? A: You use an umbrella to stay dry in the rain. Q: Where do fish live? A: Fish live in water. ``` **Learning Goal:** Model learns causal and functional relationships. --- ### **Level 5: Multi-turn Context (50 pairs)** Simple context tracking across exchanges. ``` Q: I like pizza. A: Pizza is delicious! What toppings do you like? Q: I like pepperoni. A: Pepperoni is a popular choice! ``` **Learning Goal:** Model begins to track conversational context. --- ## 🚀 Quick Start ### Loading the Dataset ```python # Load full dataset with open('datasets/tinytalks/tinytalks_v1.txt', 'r') as f: text = f.read() # Or use pre-split versions with open('datasets/tinytalks/splits/train.txt', 'r') as f: train_text = f.read() ``` ### Training a Transformer ```python # See milestones/05_2017_transformer/tinybot_demo.py for full example from tinytorch.models.transformer import GPT from tinytorch.text.tokenization import CharTokenizer # Initialize model tokenizer = CharTokenizer() tokenizer.fit(train_text) model = GPT( vocab_size=len(tokenizer), embed_dim=128, num_layers=4, num_heads=4, max_seq_len=64 ) # Train for 5 minutes → See meaningful results! ``` ### Expected Performance After training for **10-20 epochs** (~3-5 minutes): - ✅ Correctly answers Level 1-2 questions (~80% accuracy) - ✅ Maintains grammatical structure - ✅ Generates coherent (if not always correct) responses - ⚠️ Level 3-5 show partial understanding This demonstrates the transformer has **learned patterns**, not just memorized. --- ## 📐 Dataset Format **Simple, human-readable text format:** ``` Q: [Question text] A: [Answer text] Q: [Next question] A: [Next answer] ``` **Rationale:** - Character-level tokenization (no special tokenizers needed) - Easy to inspect and validate - Works with any text processing pipeline - Human-readable for debugging **Delimiter:** Empty line separates Q&A pairs. --- ## 🔬 Dataset Creation Methodology ### Generation Process 1. **Manual Curation** - All Q&A pairs hand-written by TinyTorch maintainers 2. **Diversity Sampling** - Systematic coverage of topics within each level 3. **Quality Control** - Each pair reviewed for grammar, factual accuracy, appropriateness 4. **Balance Verification** - Ensured even distribution across levels 5. **Reproducibility** - Generation script (`scripts/generate_tinytalks.py`) produces identical output ### Quality Assurance - ✅ Grammar check (automated + manual review) - ✅ Factual accuracy verification - ✅ No offensive or biased content - ✅ No personally identifiable information - ✅ Balanced topic distribution - ✅ Appropriate for all ages ### Validation Script ```bash python datasets/tinytalks/scripts/validate_dataset.py ``` Checks: - Format consistency - No duplicate pairs - Balanced splits - Character encoding (UTF-8) - Line endings (Unix) --- ## 📊 Dataset Statistics Run `scripts/stats.py` to generate: ```bash python datasets/tinytalks/scripts/stats.py ``` Output: - Total pairs per level - Vocabulary statistics - Length distributions - Split sizes - Character frequency --- ## 🎓 Educational Use Cases ### Primary Use: Module 13 (Transformers) TinyTalks is designed as the **canonical dataset** for TinyTorch's Transformer milestone: - **milestones/05_2017_transformer/tinybot_demo.py** - Main training demo - Students see their first transformer learn in < 5 minutes - Clear success metric: Can it answer questions? - "Wow, I built this!" moment ### Secondary Uses 1. **Tokenization** (Module 10) - Character vs. BPE comparison 2. **Embeddings** (Module 11) - Visualize learned embeddings 3. **Attention** (Module 12) - Inspect attention patterns on Q&A 4. **Debugging** - Small enough to trace gradients manually 5. **Experimentation** - Test architecture changes quickly --- ## ⚖️ License **Creative Commons Attribution 4.0 International (CC BY 4.0)** You are free to: - ✅ Share — copy and redistribute in any format - ✅ Adapt — remix, transform, and build upon the material - ✅ Commercial use allowed Under these terms: - **Attribution** — Cite TinyTalks (see below) - **No additional restrictions** See [LICENSE](LICENSE) for full text. --- ## 📚 Citation If you use TinyTalks in your work, please cite: ```bibtex @dataset{tinytalks2025, title={TinyTalks: A Conversational Q\&A Dataset for Educational Transformers}, author={TinyTorch Contributors}, year={2025}, publisher={GitHub}, url={https://github.com/VJ/TinyTorch/tree/main/datasets/tinytalks}, version={1.0.0} } ``` **Text citation:** TinyTorch Contributors. (2025). TinyTalks: A Conversational Q&A Dataset for Educational Transformers (Version 1.0.0). https://github.com/VJ/TinyTorch --- ## 🔄 Versioning **Version 1.0.0** (Current) - Initial release: 350 Q&A pairs across 5 levels - Character-level format - 70/15/15 train/val/test split **Planned:** - v1.1 - Add 100 more Level 4-5 pairs for better reasoning - v2.0 - Multi-language support (Spanish, French) - v3.0 - Expanded to 1,000 pairs with more complex reasoning See [CHANGELOG.md](CHANGELOG.md) for detailed history. --- ## 🤝 Contributing We welcome contributions! Ways to help: 1. **Report Issues** - Found a factual error or typo? Open an issue. 2. **Suggest Q&A Pairs** - Submit ideas for new questions via PR. 3. **Translations** - Help translate TinyTalks to other languages. 4. **Validation** - Test on different models and report results. **Guidelines:** - Follow existing format and style - Ensure factual accuracy - Keep language simple and clear - No offensive or biased content - Appropriate for all ages (G-rated) See [CONTRIBUTING.md](../../CONTRIBUTING.md) for details. --- ## 📞 Contact & Support - **Issues:** [GitHub Issues](https://github.com/VJ/TinyTorch/issues) - **Discussions:** [GitHub Discussions](https://github.com/VJ/TinyTorch/discussions) - **Email:** tinytorch@example.com (for sensitive issues) --- ## 🙏 Acknowledgments **Inspired by:** - bAbI Dataset (Facebook AI Research) - Reasoning tasks - SQuAD - Question answering format - TinyStories - Simplicity philosophy - TinyTorch Community - Feedback and testing **Created for:** - Students learning transformer architectures - Educators teaching NLP - Researchers prototyping small models - Developers testing implementations --- ## 📖 Additional Documentation - **[DATASHEET.md](DATASHEET.md)** - Comprehensive dataset metadata (Gebru et al. format) - **[examples/demo_usage.py](examples/demo_usage.py)** - Complete usage examples - **[scripts/README.md](scripts/README.md)** - Scripts documentation --- ## 🌟 Why "TinyTalks"? The name embodies our philosophy: - **Tiny** - Small enough to train in minutes, not hours - **Talks** - Conversational, accessible, human-like - **Educational** - Designed for learning, not leaderboards Just like TinyTorch makes deep learning accessible, TinyTalks makes conversational AI **immediate and tangible**. --- *Built with ❤️ by the TinyTorch community* *"The best way to understand transformers is to see them learn."*