[GH-ISSUE #1256] tinytorch Module 10: CharTokenizer, vocabulary initialization #1744

Closed
opened 2026-04-11 08:06:38 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @asgalon on GitHub (Mar 19, 2026).
Original GitHub issue: https://github.com/harvard-edge/cs249r_book/issues/1256

Re: TinyTorch Q&A question

In the CharTokenizer, the vocabulary is initialized with an optional vocabulary, but in the build_vocab method, these tokens are overwritten by the tokens coming from the corpus. I am not completely sure yet that this is an expected use case, but I have made a fix in a fork anyway. Also moved the unknown and end of word tokens to constants, while I was at it and mended the tests accordingly. The changes preserve the existing char to index mapping. Prepared a first tested version at https://github.com/asgalon/cs249r_book.git, basically making a few adjustments to tinytorch/src/10_tokenization/10_tokenization.py
Not a big deal, just using a very simple issue to ease into participation. Nor sure if the placement of the constants is already the best solution, but it is better than repeating strings verbatim over and over again through usage, I'd say? Also unified one or two repeated code snippets there in the process. Not much, but repetitions are always potential sources of errors in case of later refactorings, so the less remain the better.

Originally created by @asgalon on GitHub (Mar 19, 2026). Original GitHub issue: https://github.com/harvard-edge/cs249r_book/issues/1256 Re: [TinyTorch Q&A question](https://github.com/harvard-edge/cs249r_book/discussions/1255#discussion-9692769) In the CharTokenizer, the vocabulary is initialized with an optional vocabulary, but in the build_vocab method, these tokens are overwritten by the tokens coming from the corpus. I am not completely sure yet that this is an expected use case, but I have made a fix in a fork anyway. Also moved the unknown and end of word tokens to constants, while I was at it and mended the tests accordingly. The changes preserve the existing char to index mapping. Prepared a first tested version at https://github.com/asgalon/cs249r_book.git, basically making a few adjustments to tinytorch/src/10_tokenization/10_tokenization.py Not a big deal, just using a very simple issue to ease into participation. Nor sure if the placement of the constants is already the best solution, but it is better than repeating strings verbatim over and over again through usage, I'd say? Also unified one or two repeated code snippets there in the process. Not much, but repetitions are always potential sources of errors in case of later refactorings, so the less remain the better.
GiteaMirror added the area: booktype: improvement labels 2026-04-11 08:06:38 -05:00
Author
Owner

@asgalon commented on GitHub (Mar 19, 2026):

Changed the commit message to include a prefix with the issue number and title, this always helps with understanding the history later. That is why there are 5 commits instead of 2. Next time: First open the issue, and then the commit with the right issue number... and perhaps a much shorter title (if at all) to better read the actual commit message in the git history view in the IDE. Edit: Putting the issue titles into the commit message prefix helps sometimes after migrating the codebase to a different provider when the actual issues are no longer accessible. Putting the issue number into the commit message in a standard format makes release management a bit easier in large projects.

Also, area:book is wrong, should be area:tinytorch, but looks like I can't give a hint for the github-actions labelling (Note: prefixing the Title with "tinytorch" when creating the issue seems to work, but not when updating)

Should I go ahead with a PR or refine it a bit more?

<!-- gh-comment-id:4093017695 --> @asgalon commented on GitHub (Mar 19, 2026): Changed the commit message to include a prefix with the issue number and title, this always helps with understanding the history later. That is why there are 5 commits instead of 2. Next time: First open the issue, and then the commit with the right issue number... and perhaps a much shorter title (if at all) to better read the actual commit message in the git history view in the IDE. Edit: Putting the issue titles into the commit message prefix helps sometimes after migrating the codebase to a different provider when the actual issues are no longer accessible. Putting the issue number into the commit message in a standard format makes release management a bit easier in large projects. Also, area:book is wrong, should be area:tinytorch, but looks like I can't give a hint for the github-actions labelling (Note: prefixing the Title with "tinytorch" when creating the issue seems to work, but not when updating) Should I go ahead with a PR or refine it a bit more?
Author
Owner

@asgalon commented on GitHub (Mar 19, 2026):

Oh, well, found the CONTRIBUTING.md file. So forking the dev tree and working there is not such a good idea. Mending the structure to work on feature branches instead...

<!-- gh-comment-id:4094164962 --> @asgalon commented on GitHub (Mar 19, 2026): Oh, well, found the CONTRIBUTING.md file. So forking the dev tree and working there is not such a good idea. Mending the structure to work on feature branches instead...
Author
Owner

@asgalon commented on GitHub (Mar 22, 2026):

I did a few more experiments with the code and it is not so trivial to change it in a way that maintains a good learning experience keeping focused on the main item while aiming for a good code structure.

The basic issue that caught my attention here is that there is no defined use case about what should happen when both the vocabulary list in init() as well as the build_vocab() method are used, or using build_vocab() twice for incrementally adding vocabulary. I can think of at least one use case where this would be useful: Initializing the tokenizer with a number of fixed meta-tokens, and then adding the actual vocabulary feeding document(s) into it. But using build_vocab repeatedly cannot achieve at the same time maintaining the existing ids and sorting the assembled vocabulary. Right now, it is not possible to add vocabulary in multiple steps, and maybe this should be the expected behaviour then. In that case, there should be a defined test case for that to make sure the defined behaviour works with the solution. Otherwise, depending on the student implementation, behaviour will differ in spite of passing all the test cases. So the real question here is: How should those use cases be set up? Accumulating added vocabulary or restart from scratch each time build_vocab is used? Undefined and undocumented edge case behaviours in code are very difficult to track and probably may confuse users at some point later on, I imagine. So I would like to hear some informed opinion about this. Is this something worth dealing with or can it be ignored?

<!-- gh-comment-id:4106246215 --> @asgalon commented on GitHub (Mar 22, 2026): I did a few more experiments with the code and it is not so trivial to change it in a way that maintains a good learning experience keeping focused on the main item while aiming for a good code structure. The basic issue that caught my attention here is that there is no defined use case about what should happen when both the vocabulary list in __init__() as well as the build_vocab() method are used, or using build_vocab() twice for incrementally adding vocabulary. I can think of at least one use case where this would be useful: Initializing the tokenizer with a number of fixed meta-tokens, and then adding the actual vocabulary feeding document(s) into it. But using build_vocab repeatedly cannot achieve at the same time maintaining the existing ids and sorting the assembled vocabulary. Right now, it is not possible to add vocabulary in multiple steps, and maybe this should be the expected behaviour then. In that case, there should be a defined test case for that to make sure the defined behaviour works with the solution. Otherwise, depending on the student implementation, behaviour will differ in spite of passing all the test cases. So the real question here is: How should those use cases be set up? Accumulating added vocabulary or restart from scratch each time build_vocab is used? Undefined and undocumented edge case behaviours in code are very difficult to track and probably may confuse users at some point later on, I imagine. So I would like to hear some informed opinion about this. Is this something worth dealing with or can it be ignored?
Author
Owner

@profvjreddi commented on GitHub (Mar 28, 2026):

Hey @asgalon — sorry for the late reply on this. I was just going through this and noticed that I merged code but didn't actually answer your question properly. Your question was a good one and deserved a direct answer rather than just a code merge. So let me get at it.

The design decision I landed on was to build_vocab() overwrite from scratch — it does not accumulate. If you call build_vocab() after __init__(vocab=...), the init vocab is replaced entirely. Same if you call build_vocab() twice.

The reasoning is that if we overwrite, semantics are simpler and more predictable. I figured students would find that a lot easier to understand. There's no hidden state carried over from prior calls, which makes the behavior easier to reason about — especially for students learning tokenization for the first time, based on my experience. I added an explicit test asserting this (checking that a character from the init vocab is absent after build_vocab() with a different corpus).

Your proposed use case to initialize with fixed meta-tokens (<PAD>, <BOS>, <EOS>) and then layer the corpus vocab on top is a legitimate pattern used in production tokenizers. We don't support it yet, but if it comes up as a need, the clean way would be a dedicated special_tokens parameter rather than making build_vocab() accumulate. That keeps the two concerns (special tokens vs. learned vocabulary) separated.

Also: the token constants refactoring you were working toward (TOK_UNKNOWN, TOK_EOW as class-level constants on the base Tokenizer) landed in the PR/code merge. Appreciate you flagging the maintainability issue that motivated it.

Let me know if you disagree and think we should do something different. Happy to consider alternatives.

Sound good?

If you are satisfied, please close the issue. Or we can discuss more 👍

<!-- gh-comment-id:4148636453 --> @profvjreddi commented on GitHub (Mar 28, 2026): Hey @asgalon — sorry for the late reply on this. I was just going through this and noticed that I merged code but didn't actually answer your question properly. Your question was a good one and deserved a direct answer rather than just a code merge. So let me get at it. The design decision I landed on was to **`build_vocab()` overwrite from scratch** — it does not accumulate. If you call `build_vocab()` after `__init__(vocab=...)`, the init vocab is replaced entirely. Same if you call `build_vocab()` twice. The reasoning is that if we overwrite, semantics are simpler and more predictable. I figured students would find that a lot easier to understand. There's no hidden state carried over from prior calls, which makes the behavior easier to reason about — especially for students learning tokenization for the first time, based on my experience. I added an explicit test asserting this (checking that a character from the init vocab is absent after `build_vocab()` with a different corpus). Your proposed use case to initialize with fixed meta-tokens (`<PAD>`, `<BOS>`, `<EOS>`) and then layer the corpus vocab on top is a legitimate pattern used in production tokenizers. We don't support it yet, but if it comes up as a need, the clean way would be a dedicated `special_tokens` parameter rather than making `build_vocab()` accumulate. That keeps the two concerns (special tokens vs. learned vocabulary) separated. Also: the token constants refactoring you were working toward (`TOK_UNKNOWN`, `TOK_EOW` as class-level constants on the base `Tokenizer`) landed in the PR/code merge. Appreciate you flagging the maintainability issue that motivated it. Let me know if you disagree and think we should do something different. Happy to consider alternatives. Sound good? If you are satisfied, please close the issue. Or we can discuss more 👍
Author
Owner

@asgalon commented on GitHub (Mar 28, 2026):

Yes, everything ok. I am not so sure yet how the workflow with discussion entries and issues should best be distributed, so there was a bit of parallel action going on with this, but it all worked out fine 👍 If some better idea comes up, I can open a new issue or discussion.

<!-- gh-comment-id:4148646883 --> @asgalon commented on GitHub (Mar 28, 2026): Yes, everything ok. I am not so sure yet how the workflow with discussion entries and issues should best be distributed, so there was a bit of parallel action going on with this, but it all worked out fine 👍 If some better idea comes up, I can open a new issue or discussion.
Author
Owner

@profvjreddi commented on GitHub (Apr 5, 2026):

Discussions are best for open-ended Q&A and Issues for something we might merge (bug/improvement with a concrete change). Either path is fine; linking the discussion from the issue (or vice versa) is enough to reduce duplication.

<!-- gh-comment-id:4189036366 --> @profvjreddi commented on GitHub (Apr 5, 2026): Discussions are best for open-ended Q&A and Issues for something we might merge (bug/improvement with a concrete change). Either path is fine; linking the discussion from the issue (or vice versa) is enough to reduce duplication.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/cs249r_book#1744