Changelog
All notable changes to the TinyTalks dataset will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[1.0.0] - 2025-01-28
Added
- Initial release of TinyTalks dataset
- 301 Q&A pairs across 5 difficulty levels:
  - Level 1: Greetings & Identity (47 pairs)
  - Level 2: Simple Facts (82 pairs)
  - Level 3: Basic Math (45 pairs)
  - Level 4: Common Sense Reasoning (87 pairs)
  - Level 5: Multi-turn Context (40 pairs)
- Train/validation/test splits (70/15/15)
- Comprehensive README with usage examples
- DATASHEET.md following "Datasheets for Datasets" best practices
- CC BY 4.0 license
- Generation script (scripts/generate_tinytalks.py)
- Validation script (scripts/validate_dataset.py)
- Statistics script (scripts/stats.py)
- Example usage script (examples/demo_usage.py)
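As a rough illustration of the 70/15/15 split listed above, the following sketch shuffles pair indices and cuts them into three groups. It is not taken from scripts/generate_tinytalks.py: the helper name `split_indices`, the seed, and the rounding rule are all assumptions made for this example.

```python
import random

# Per-level pair counts as listed in this changelog.
LEVEL_COUNTS = {1: 47, 2: 82, 3: 45, 4: 87, 5: 40}


def split_indices(n, ratios=(0.70, 0.15, 0.15), seed=0):
    """Shuffle range(n) and cut it into train/validation/test index lists.

    Hypothetical helper: the seed and the rounding rule are assumptions,
    not the behaviour of the project's generation script.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for a given seed
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


total = sum(LEVEL_COUNTS.values())
train, val, test = split_indices(total)
print(total, len(train), len(val), len(test))
```

With this particular rounding rule, 301 pairs divide into 211/45/45; the split sizes in the released files may differ.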
Dataset Statistics
- Total size: ~17.5 KB
- Character vocabulary: 65 unique characters
- Word vocabulary: 865 unique words
- Average question length: 4.8 words (21.6 characters)
- Average answer length: 6.1 words (29.0 characters)
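Statistics of this kind could be computed along the following lines. This is only a sketch: the function `dataset_stats`, the lowercase whitespace tokenization, and the two sample pairs are assumptions for illustration, so the counts it produces need not match the figures reported by scripts/stats.py.

```python
def dataset_stats(pairs):
    """Compute vocabulary sizes and average lengths for (question, answer) pairs.

    Assumption: a "word" is a lowercase whitespace-separated token, so
    punctuation stays attached to the word it follows.
    """
    questions = [q for q, _ in pairs]
    answers = [a for _, a in pairs]
    all_text = questions + answers

    char_vocab = set("".join(all_text))
    word_vocab = {w for text in all_text for w in text.lower().split()}

    def mean(values):
        return sum(values) / len(values)

    return {
        "char_vocab": len(char_vocab),
        "word_vocab": len(word_vocab),
        "avg_question_words": mean([len(q.split()) for q in questions]),
        "avg_question_chars": mean([len(q) for q in questions]),
        "avg_answer_words": mean([len(a.split()) for a in answers]),
        "avg_answer_chars": mean([len(a) for a in answers]),
    }


# Made-up sample pairs, not taken from the dataset.
sample = [("Hello!", "Hi there!"), ("What is 2 + 2?", "2 + 2 is 4.")]
stats = dataset_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```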
Validation
- ✅ UTF-8 encoding
- ✅ Unix line endings (LF)
- ✅ No duplicate questions
- ✅ No empty questions or answers
- ✅ Proper punctuation
- ✅ Balanced splits with no overlap
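Most of the checks listed above can be expressed in a few lines each. The sketch below is an illustration, not the contents of scripts/validate_dataset.py: the function names are hypothetical, "proper punctuation" is assumed to mean ending in `.`, `?`, or `!`, and duplicates are assumed to be compared case-insensitively. It does not check whether the splits are balanced.

```python
def validate_pairs(pairs):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    seen = set()
    for i, (question, answer) in enumerate(pairs):
        q, a = question.strip(), answer.strip()
        if not q or not a:
            problems.append(f"pair {i}: empty question or answer")
            continue
        key = q.lower()  # assumption: duplicates are case-insensitive
        if key in seen:
            problems.append(f"pair {i}: duplicate question")
        seen.add(key)
        if q[-1] not in ".?!" or a[-1] not in ".?!":
            problems.append(f"pair {i}: missing terminal punctuation")
    return problems


def validate_bytes(raw):
    """Check raw file bytes for UTF-8 validity and Unix (LF) line endings."""
    problems = []
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    if b"\r" in raw:
        problems.append("non-Unix line endings (CR found)")
    return problems


def splits_overlap(train, val, test):
    """Return True if any question appears in more than one split."""
    t, v, s = ({q for q, _ in split} for split in (train, val, test))
    return bool(t & v or t & s or v & s)


# Made-up sample data, not taken from the dataset.
good = [("Hello!", "Hi there!"), ("What is 2 + 2?", "2 + 2 is 4.")]
bad = [("Hello!", "Hi there!"), ("hello!", "Hi"), ("", "Nothing.")]
print(validate_pairs(good))
print(validate_pairs(bad))
print(validate_bytes(b"Q: Hello!\r\nA: Hi there!\r\n"))
print(splits_overlap(good[:1], good[1:], []))
```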
[Unreleased]
Planned for v1.1.0
- Add 50 more Level 4-5 pairs for better reasoning
- Expand math questions to include simple multiplication tables
- Add more conversational context pairs
Planned for v2.0.0
- Add multi-language support (Spanish, French)
- Expand to 500+ pairs
- Add difficulty scores for each Q&A pair
Planned for v3.0.0
- Expand to 1,000+ pairs
- Add more complex reasoning tasks
- Include multi-hop questions
- Add entity recognition annotations
Version History
- 1.0.0 (2025-01-28) - Initial release