[GH-ISSUE #7376] Is there a way to track tokens/context window in real-time? #51200

Closed
opened 2026-04-28 18:54:32 -05:00 by GiteaMirror · 2 comments

Originally created by @robotom on GitHub (Oct 26, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7376

I'd like to implement a counter in a front end app to track the tokens used in order to see if I'm close to exceeding the context window.

This is useful to me because if I feed a large document into the model, I'd like to know when it's "too large" so that I can break it down or do something else.

GiteaMirror added the feature request label 2026-04-28 18:54:32 -05:00

@rick-github commented on GitHub (Oct 26, 2024):

prompt_eval_count, returned in the response, counts the tokens in the request. There's no way to incrementally update a prompt. There's an open issue, https://github.com/ollama/ollama/issues/3582, to add a tokenizer endpoint that would allow a client to count tokens before sending a completion.
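A minimal sketch of how a front end might turn these counts into a context-window gauge. It assumes the response also carries `eval_count` (the generated-token count reported alongside `prompt_eval_count`) and that the client knows the context size it requested through the `num_ctx` option. The helper name `context_usage` and the sample numbers are hypothetical.

```python
def context_usage(response: dict, num_ctx: int) -> dict:
    """Summarize how much of the context window one completed response used."""
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    # Treat a missing counter as 0 rather than failing.
    prompt_tokens = int(response.get("prompt_eval_count", 0))
    generated_tokens = int(response.get("eval_count", 0))
    used = prompt_tokens + generated_tokens
    return {
        "prompt_tokens": prompt_tokens,
        "generated_tokens": generated_tokens,
        "used": used,
        "remaining": max(num_ctx - used, 0),
        "fraction_used": used / num_ctx,
    }


# Hypothetical final response body, trimmed to the fields used here.
sample_response = {"done": True, "prompt_eval_count": 1500, "eval_count": 300}
usage = context_usage(sample_response, num_ctx=2048)
print(f"{usage['used']}/2048 tokens used, {usage['remaining']} remaining")
```

Because the counts are only known once a response has come back, a gauge built this way updates after each turn rather than while the prompt is being typed.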


@rick-github commented on GitHub (Oct 26, 2024):

Note that if your model was derived from one on HuggingFace, you can count the tokens in the client:

```python
#!/usr/bin/env python3

from transformers import AutoTokenizer

model = "CohereForAI/aya-expanse-8b"
messages = [ { "role":"user", "content":"hello world" } ]

# Load the tokenizer published with the HuggingFace model.
autotokenizer = AutoTokenizer.from_pretrained(model)
# Apply the model's chat template and tokenize the result.
autotokens = autotokenizer.apply_chat_template(messages)

print(autotokens)                        # token IDs
print(len(autotokens))                   # token count
print(autotokenizer.decode(autotokens))  # templated prompt text
```

```console
$ ./7376.py
[5, 255000, 255006, 34313, 3845, 255001]
6
<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>hello world<|END_OF_TURN_TOKEN|>
```

Be aware that the template applied by ollama may differ from the template in the HuggingFace model config, so the token counts from AutoTokenizer and ollama may differ slightly.
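Since a client-side count may be slightly off, a front end could reserve some headroom when deciding whether a document fits, and split the document when it does not. The following is only a sketch under assumptions: `fits`, `split_to_budget`, and the margin of 64 tokens are hypothetical, and `whitespace_count` is a stand-in counter so the example runs without downloading a model. In practice `count_tokens` could wrap something like `len(autotokenizer.encode(text))`.

```python
from typing import Callable, List


def fits(token_count: int, num_ctx: int, margin: int = 64) -> bool:
    """True if token_count leaves at least `margin` tokens of headroom."""
    return token_count + margin <= num_ctx


def split_to_budget(text: str, budget: int,
                    count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily pack paragraphs into chunks of at most `budget` tokens.

    A single paragraph larger than the budget is kept as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for paragraph in text.split("\n\n"):
        candidate = "\n\n".join(current + [paragraph])
        if current and count_tokens(candidate) > budget:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append("\n\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


document = "one two three\n\nfour five\n\nsix seven eight nine"
chunks = split_to_budget(document, budget=5, count_tokens=whitespace_count)
print(chunks)
```

Splitting on paragraph boundaries is just one choice; the margin should be large enough to absorb whatever difference the two templates produce for the model in use.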

Reference: github-starred/ollama#51200