tinytorch Module 10: CharTokenizer, vocabulary initialization #528

New Issue

GiteaMirror · 2026-03-22T15:45:14-05:00

GiteaMirror commented

2026-03-22 15:45:14 -05:00

Originally created by @asgalon on GitHub (Mar 19, 2026).

In the CharTokenizer, the vocabulary is initialized with an optional vocabulary, but in the build_vocab method, these tokens are overwritten by the tokens coming from the corpus. I am not completely sure yet that this is an expected use case, but I have made a fix in a fork anyway. Also moved the unknown and end of word tokens to constants, while I was at it and mended the tests accordingly. The changes preserve the existing char to index mapping. Prepared a first tested version at https://github.com/asgalon/cs249r_book.git, basically making a few adjustments to tinytorch/src/10_tokenization/10_tokenization.py
Not a big deal, just using a very simple issue to ease into participation. Nor sure if the placement of the constants is already the best solution, but it is better than repeating strings verbatim over and over again through usage, I'd say? Also unified one or two repeated code snippets there in the process. Not much, but repetitions are always potential sources of errors in case of later refactorings, so the less remain the better.

Originally created by @asgalon on GitHub (Mar 19, 2026). Re: [TinyTorch Q&A question](https://github.com/harvard-edge/cs249r_book/discussions/1255#discussion-9692769) In the CharTokenizer, the vocabulary is initialized with an optional vocabulary, but in the build_vocab method, these tokens are overwritten by the tokens coming from the corpus. I am not completely sure yet that this is an expected use case, but I have made a fix in a fork anyway. Also moved the unknown and end of word tokens to constants, while I was at it and mended the tests accordingly. The changes preserve the existing char to index mapping. Prepared a first tested version at https://github.com/asgalon/cs249r_book.git, basically making a few adjustments to tinytorch/src/10_tokenization/10_tokenization.py Not a big deal, just using a very simple issue to ease into participation. Nor sure if the placement of the constants is already the best solution, but it is better than repeating strings verbatim over and over again through usage, I'd say? Also unified one or two repeated code snippets there in the process. Not much, but repetitions are always potential sources of errors in case of later refactorings, so the less remain the better.

GiteaMirror added the area: book type: improvement labels 2026-03-22 15:45:14 -05:00

GiteaMirror commented

2026-03-22 15:45:15 -05:00

@asgalon commented on GitHub (Mar 19, 2026):

Changed the commit message to include a prefix with the issue number and title, this always helps with understanding the history later. That is why there are 5 commits instead of 2. Next time: First open the issue, and then the commit with the right issue number... and perhaps a much shorter title (if at all) to better read the actual commit message in the git history view in the IDE. Edit: Putting the issue titles into the commit message prefix helps sometimes after migrating the codebase to a different provider when the actual issues are no longer accessible. Putting the issue number into the commit message in a standard format makes release management a bit easier in large projects.

Also, area:book is wrong, should be area:tinytorch, but looks like I can't give a hint for the github-actions labelling (Note: prefixing the Title with "tinytorch" when creating the issue seems to work, but not when updating)

Should I go ahead with a PR or refine it a bit more?

@asgalon commented on GitHub (Mar 19, 2026): Changed the commit message to include a prefix with the issue number and title, this always helps with understanding the history later. That is why there are 5 commits instead of 2. Next time: First open the issue, and then the commit with the right issue number... and perhaps a much shorter title (if at all) to better read the actual commit message in the git history view in the IDE. Edit: Putting the issue titles into the commit message prefix helps sometimes after migrating the codebase to a different provider when the actual issues are no longer accessible. Putting the issue number into the commit message in a standard format makes release management a bit easier in large projects. Also, area:book is wrong, should be area:tinytorch, but looks like I can't give a hint for the github-actions labelling (Note: prefixing the Title with "tinytorch" when creating the issue seems to work, but not when updating) Should I go ahead with a PR or refine it a bit more?

GiteaMirror commented

2026-03-22 15:45:15 -05:00

@asgalon commented on GitHub (Mar 19, 2026):

Oh, well, found the CONTRIBUTING.md file. So forking the dev tree and working there is not such a good idea. Mending the structure to work on feature branches instead...

@asgalon commented on GitHub (Mar 19, 2026): Oh, well, found the CONTRIBUTING.md file. So forking the dev tree and working there is not such a good idea. Mending the structure to work on feature branches instead...

GiteaMirror commented

2026-03-22 15:45:15 -05:00

@asgalon commented on GitHub (Mar 22, 2026):

I did a few more experiments with the code and it is not so trivial to change it in a way that maintains a good learning experience keeping focused on the main item while aiming for a good code structure.

The basic issue that caught my attention here is that there is no defined use case about what should happen when both the vocabulary list in init() as well as the build_vocab() method are used, or using build_vocab() twice for incrementally adding vocabulary. I can think of at least one use case where this would be useful: Initializing the tokenizer with a number of fixed meta-tokens, and then adding the actual vocabulary feeding document(s) into it. But using build_vocab repeatedly cannot achieve at the same time maintaining the existing ids and sorting the assembled vocabulary. Right now, it is not possible to add vocabulary in multiple steps, and maybe this should be the expected behaviour then. In that case, there should be a defined test case for that to make sure the defined behaviour works with the solution. Otherwise, depending on the student implementation, behaviour will differ in spite of passing all the test cases. So the real question here is: How should those use cases be set up? Accumulating added vocabulary or restart from scratch each time build_vocab is used? Undefined and undocumented edge case behaviours in code are very difficult to track and probably may confuse users at some point later on, I imagine. So I would like to hear some informed opinion about this. Is this something worth dealing with or can it be ignored?

@asgalon commented on GitHub (Mar 22, 2026): I did a few more experiments with the code and it is not so trivial to change it in a way that maintains a good learning experience keeping focused on the main item while aiming for a good code structure. The basic issue that caught my attention here is that there is no defined use case about what should happen when both the vocabulary list in __init__() as well as the build_vocab() method are used, or using build_vocab() twice for incrementally adding vocabulary. I can think of at least one use case where this would be useful: Initializing the tokenizer with a number of fixed meta-tokens, and then adding the actual vocabulary feeding document(s) into it. But using build_vocab repeatedly cannot achieve at the same time maintaining the existing ids and sorting the assembled vocabulary. Right now, it is not possible to add vocabulary in multiple steps, and maybe this should be the expected behaviour then. In that case, there should be a defined test case for that to make sure the defined behaviour works with the solution. Otherwise, depending on the student implementation, behaviour will differ in spite of passing all the test cases. So the real question here is: How should those use cases be set up? Accumulating added vocabulary or restart from scratch each time build_vocab is used? Undefined and undocumented edge case behaviours in code are very difficult to track and probably may confuse users at some point later on, I imagine. So I would like to hear some informed opinion about this. Is this something worth dealing with or can it be ignored?

GiteaMirror referenced this issue

2026-03-22 15:51:16 -05:00

[PR #528] [MERGED] Student Feedback Chapter 19 #789

GiteaMirror referenced this issue

2026-04-11 08:14:06 -05:00

[PR #528] [MERGED] Student Feedback Chapter 19 #2012

GiteaMirror referenced this issue

2026-04-13 13:14:20 -05:00

[PR #528] [MERGED] Student Feedback Chapter 19 #2734

GiteaMirror referenced this issue

2026-04-16 01:47:49 -05:00

[PR #528] [MERGED] Student Feedback Chapter 19 #3413