github-starred/cs249r_book

mirror of https://github.com/harvard-edge/cs249r_book.git synced 2026-05-02 10:39:10 -05:00

Vijay Janapa Reddi 036075fb5a fix(tinytorch): fix trailing whitespace and EOF in tinytalks docs

2026-01-13 18:21:20 -05:00

Changelog

All notable changes to the TinyTalks dataset will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[1.0.0] - 2025-01-28

Added

Initial release of TinyTalks dataset
301 Q&A pairs across 5 difficulty levels:
- Level 1: Greetings & Identity (47 pairs)
- Level 2: Simple Facts (82 pairs)
- Level 3: Basic Math (45 pairs)
- Level 4: Common Sense Reasoning (87 pairs)
- Level 5: Multi-turn Context (40 pairs)
Train/validation/test splits (70%/15%/15%)
Comprehensive README with usage examples
DATASHEET.md following "Datasheets for Datasets" best practices
CC BY 4.0 license
Generation script (scripts/generate_tinytalks.py)
Validation script (scripts/validate_dataset.py)
Statistics script (scripts/stats.py)
Example usage script (examples/demo_usage.py)
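
As a rough illustration of how a 70/15/15 split of 301 pairs could be produced, here is a minimal sketch. It is not the logic in scripts/generate_tinytalks.py, which is not shown in this changelog. The helper name split_pairs, the fixed seed, the rounding rule (round down, remainder goes to test), and the placeholder pairs are all assumptions made for the example.

```python
import random


def split_pairs(pairs, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle pairs and cut them into train/validation/test lists.

    Hypothetical helper: sizes are rounded down for train and validation,
    and whatever remains goes to test.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder data standing in for the 301 real Q&A pairs.
pairs = [(f"question {i}?", f"answer {i}.") for i in range(301)]
train, val, test = split_pairs(pairs)
print(len(train), len(val), len(test))  # 210 45 46 under this rounding rule
```

The sizes printed here follow from this sketch's rounding rule only; the released split sizes may differ.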

Dataset Statistics

Total size: ~17.5 KB
Character vocabulary: 65 unique characters
Word vocabulary: 865 unique words
Average question length: 4.8 words (21.6 characters)
Average answer length: 6.1 words (29.0 characters)
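
Figures like the ones above could be computed along the following lines. This is a sketch, not the contents of scripts/stats.py: the function name compute_stats, the whitespace tokenization, the lowercasing of words, and the two sample pairs are assumptions, so its numbers need not match the reported statistics.

```python
def compute_stats(pairs):
    """Compute vocabulary sizes and average lengths for (question, answer) pairs.

    Hypothetical helper: words are whitespace-separated and lowercased,
    with punctuation left attached.
    """
    questions = [q for q, _ in pairs]
    answers = [a for _, a in pairs]
    texts = questions + answers

    def avg(values):
        return sum(values) / len(values)

    return {
        "char_vocab": len(set("".join(texts))),
        "word_vocab": len({w.lower() for t in texts for w in t.split()}),
        "avg_question_words": avg([len(q.split()) for q in questions]),
        "avg_question_chars": avg([len(q) for q in questions]),
        "avg_answer_words": avg([len(a.split()) for a in answers]),
        "avg_answer_chars": avg([len(a) for a in answers]),
    }


# Two made-up pairs, used only to exercise the function.
sample = [("Hi there!", "Hello!"), ("What is 2 + 2?", "2 + 2 is 4.")]
stats = compute_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```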

Validation

✅ UTF-8 encoding
✅ Unix line endings (LF)
✅ No duplicate questions
✅ No empty questions or answers
✅ Proper punctuation
✅ Balanced splits with no overlap
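
The checks listed above could be approximated as follows. This is a sketch under stated assumptions, not the implementation in scripts/validate_dataset.py: the function names are hypothetical, "proper punctuation" is read as "the answer ends with a period, exclamation mark, or question mark", and duplicate questions are compared case-insensitively.

```python
def validate_pairs(pairs):
    """Return a list of problem descriptions; an empty list means no issues found."""
    problems = []
    seen = set()
    for i, (question, answer) in enumerate(pairs):
        if not question.strip() or not answer.strip():
            problems.append(f"pair {i}: empty question or answer")
            continue
        key = question.strip().lower()
        if key in seen:
            problems.append(f"pair {i}: duplicate question")
        seen.add(key)
        # Assumed meaning of "proper punctuation" for this sketch.
        if not answer.strip().endswith((".", "!", "?")):
            problems.append(f"pair {i}: answer lacks terminal punctuation")
    return problems


def splits_overlap(*splits):
    """Return True if any question appears in more than one split."""
    seen = set()
    for split in splits:
        questions = {q for q, _ in split}
        if seen & questions:
            return True
        seen |= questions
    return False


def is_utf8_with_lf(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 and contains no carriage returns."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return b"\r" not in raw


# Made-up pairs: the second repeats the first question, the third lacks punctuation.
sample = [("Hi?", "Hello!"), ("hi?", "Hey."), ("Who are you?", "A small model")]
print(validate_pairs(sample))
```

Returning a list of problems rather than raising on the first failure lets a single run report every issue in the file.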

[Unreleased]

Planned for v1.1.0

Add 50 more Level 4 and Level 5 pairs to strengthen reasoning coverage
Expand math questions to include simple multiplication tables
Add more conversational context pairs

Planned for v2.0.0

Add multi-language support (Spanish, French)
Expand to 500+ pairs
Add difficulty scores for each Q&A pair

Planned for v3.0.0

Expand to 1,000+ pairs
Add more complex reasoning tasks
Include multi-hop questions
Add entity recognition annotations

Version History

1.0.0 (2025-01-28) - Initial release