mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-03 08:08:51 -05:00
67 lines
1.9 KiB
Markdown
67 lines
1.9 KiB
Markdown
# Changelog
|
|
|
|
All notable changes to the TinyTalks dataset will be documented in this file.
|
|
|
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
|
|
|
## [1.0.0] - 2025-01-28
|
|
|
|
### Added
|
|
- Initial release of TinyTalks dataset
|
|
- 301 Q&A pairs across 5 difficulty levels:
|
|
- Level 1: Greetings & Identity (47 pairs)
|
|
- Level 2: Simple Facts (82 pairs)
|
|
- Level 3: Basic Math (45 pairs)
|
|
- Level 4: Common Sense Reasoning (87 pairs)
|
|
- Level 5: Multi-turn Context (40 pairs)
|
|
- Train/validation/test splits (70/15/15)
|
|
- Comprehensive README with usage examples
|
|
- DATASHEET.md following "Datasheets for Datasets" best practices
|
|
- CC BY 4.0 license
|
|
- Generation script (`scripts/generate_tinytalks.py`)
|
|
- Validation script (`scripts/validate_dataset.py`)
|
|
- Statistics script (`scripts/stats.py`)
|
|
- Example usage script (`examples/demo_usage.py`)
|
|
|
|
### Dataset Statistics
|
|
- Total size: ~17.5 KB
|
|
- Character vocabulary: 65 unique characters
|
|
- Word vocabulary: 865 unique words
|
|
- Average question length: 4.8 words (21.6 characters)
|
|
- Average answer length: 6.1 words (29.0 characters)
|
|
|
|
### Validation
|
|
- ✅ UTF-8 encoding
|
|
- ✅ Unix line endings (LF)
|
|
- ✅ No duplicate questions
|
|
- ✅ No empty questions or answers
|
|
- ✅ Proper punctuation
|
|
- ✅ Balanced splits with no overlap
|
|
|
|
---
|
|
|
|
## [Unreleased]
|
|
|
|
### Planned for v1.1.0
|
|
- Add 50 more Level 4-5 pairs for better reasoning
|
|
- Expand math questions to include simple multiplication tables
|
|
- Add more conversational context pairs
|
|
|
|
### Planned for v2.0.0
|
|
- Multi-language support (Spanish, French)
|
|
- Expanded to 500+ pairs
|
|
- Add difficulty scores for each Q&A pair
|
|
|
|
### Planned for v3.0.0
|
|
- Expand to 1,000+ pairs
|
|
- Add more complex reasoning tasks
|
|
- Include multi-hop questions
|
|
- Add entity recognition annotations
|
|
|
|
---
|
|
|
|
## Version History
|
|
|
|
- **1.0.0** (2025-01-28) - Initial release
|