mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-06 01:28:35 -05:00
Add curated educational datasets for TinyTorch milestones:

TinyDigits (~310 KB):
- 1000 train + 200 test samples of 8x8 digit images
- Balanced: 100 samples per digit class (0-9)
- Used by Milestones 03 (MLP) and 04 (CNN)
- Created from sklearn digits, normalized to [0,1]

TinyTalks (~40 KB):
- 350 Q&A pairs across 5 difficulty levels
- Character-level conversational dataset
- Used by Milestone 05 (Transformer)
- Designed for fast training (3-5 min on laptop)

Both datasets follow Karpathy's ~1K samples philosophy:
- Small enough to ship with repo
- Large enough for meaningful learning
- Fast training with instant feedback
- Works offline, no downloads needed
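The TinyDigits description above (a class-balanced 1000/200 split, pixels normalized to [0,1]) can be sketched roughly as follows. This is a minimal illustration, not the project's actual curation script: the helper names `normalize` and `balanced_split`, the 20-per-class test count (inferred from 200 test samples over 10 classes), the seed, and the synthetic stand-in data are all assumptions. The divisor 16 reflects the 0-16 intensity range of the sklearn digits images.

```python
import random
from collections import defaultdict


def normalize(image, max_value=16.0):
    """Scale raw pixel intensities (0..max_value) into [0, 1]."""
    return [pixel / max_value for pixel in image]


def balanced_split(images, labels, train_per_class=100, test_per_class=20, seed=0):
    """Hypothetical helper: draw a fixed number of samples per class.

    Returns two lists of (normalized_image, label) pairs.
    """
    rng = random.Random(seed)

    # Group the images by their class label.
    by_class = defaultdict(list)
    for image, label in zip(images, labels):
        by_class[label].append(image)

    needed = train_per_class + test_per_class
    train, test = [], []
    for label in sorted(by_class):
        pool = list(by_class[label])
        if len(pool) < needed:
            raise ValueError(f"class {label} has only {len(pool)} samples")
        rng.shuffle(pool)
        # Disjoint slices, so no image lands in both splits.
        train += [(normalize(img), label) for img in pool[:train_per_class]]
        test += [(normalize(img), label) for img in pool[train_per_class:needed]]
    return train, test


# Synthetic stand-in data (an assumption, used so the sketch runs offline):
# 130 random 8x8 "images" per digit class, flattened to 64 values in 0..16.
data_rng = random.Random(42)
labels = [digit for digit in range(10) for _ in range(130)]
images = [[data_rng.randint(0, 16) for _ in range(64)] for _ in labels]

train, test = balanced_split(images, labels)
print(len(train), len(test))  # 1000 200
```

With scikit-learn installed, the same function could instead be fed `load_digits().data` and `load_digits().target` from `sklearn.datasets`, which provide the 8x8 images as 64-value rows.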
BSD 3-Clause License

TinyDigits Dataset License
==========================

TinyDigits is a curated educational subset derived from the sklearn digits dataset.

Original Data Source:
---------------------
scikit-learn digits dataset (sklearn.datasets.load_digits)
- Derived from UCI ML hand-written digits datasets
- Copyright (c) 2007-2024 The scikit-learn developers
- License: BSD 3-Clause

TinyTorch Curation:
-------------------
Copyright (c) 2025 TinyTorch Project

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Attribution
-----------
When using TinyDigits in research or educational materials, please cite:

1. The original sklearn digits dataset:
   Pedregosa et al., "Scikit-learn: Machine Learning in Python",
   JMLR 12, pp. 2825-2830, 2011.

2. TinyTorch's educational curation:
   TinyTorch Project (2025). "TinyDigits: Curated Educational Dataset
   for ML Systems Learning". Available at: https://github.com/VJHack/TinyTorch