[GH-ISSUE #7171] Counting tokens in text before embedding #51063

Closed
opened 2026-04-28 18:13:20 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @DewiarQR on GitHub (Oct 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7171

When creating a vector database, we use embedding models such as bge-m3. The problem is that if the text sent for vectorization does not fit into the model's context window, the overflowing data is simply lost, and Ollama does not expose a single model or endpoint that would just count the tokens in a text before sending it! For example, bge-m3 uses a RoBERTa-style tokenizer; it would be very convenient if at least one model available via the API could report the exact number of tokens in a text. Is it possible to add this? Without it, working with embeddings is difficult.
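As a workaround until such an endpoint exists, token budgeting can be done client-side by chunking text before embedding. A minimal sketch follows; note that `count_tokens` below is a whitespace stand-in for illustration only — real counts for bge-m3 would require its actual XLM-RoBERTa tokenizer (e.g. via `transformers.AutoTokenizer.from_pretrained("BAAI/bge-m3")`), which is an assumption not stated in this issue:

```python
# Sketch: split text into chunks that fit a token budget before sending
# each chunk for embedding, so nothing is silently truncated.
# ASSUMPTION: count_tokens is a whitespace placeholder; swap in the
# model's real tokenizer for accurate counts.

def count_tokens(text: str) -> int:
    # Placeholder metric. bge-m3's true counts come from its
    # XLM-RoBERTa tokenizer, not from splitting on whitespace.
    return len(text.split())

def chunk_text(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whitespace-delimited words into chunks of
    at most max_tokens (by the placeholder metric above)."""
    chunks: list[str] = []
    current: list[str] = []
    for word in text.split():
        if len(current) + 1 > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

if __name__ == "__main__":
    doc = "one two three four five six seven"
    for chunk in chunk_text(doc, max_tokens=3):
        print(count_tokens(chunk), chunk)
```

Each chunk can then be embedded separately, guaranteeing no chunk exceeds the context window regardless of whether the server reports truncation.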

GiteaMirror added the model label 2026-04-28 18:13:20 -05:00
Author
Owner

@DewiarQR commented on GitHub (Oct 11, 2024):

Even better: if embedding models like mxbai-embed-large, nomic-embed-text, paraphrase-multilingual, and bge-m3 reported, alongside the vector representations in the response, how many tokens they received as input, that count could be used to confirm that everything went well and no data was lost.

<!-- gh-comment-id:2407098185 -->
Author
Owner

@rick-github commented on GitHub (Oct 11, 2024):

https://github.com/ollama/ollama/issues/3582

<!-- gh-comment-id:2407201548 -->
Author
Owner

@rick-github commented on GitHub (Dec 2, 2024):

dupe #3582

<!-- gh-comment-id:2511706464 -->

Reference: github-starred/ollama#51063