[GH-ISSUE #1345] [WISH] API for token count? faster than embeddings vector length? #26463

Closed
opened 2026-04-22 02:45:43 -05:00 by GiteaMirror · 10 comments

Originally created by @kettoleon on GitHub (Dec 1, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1345

Hi, I've been using ollama for a few days, and I really like it.

However, I'm using it by making raw requests, i.e. I'm handling the context myself.

In this use case, the system needs to count tokens for many strings to decide what fits into the context and what is too much.

For now, I've been using the embedding API and taking the length of the embeddings vector as the token count.

But I understand an "only count tokens without computing embeddings" API would be much faster.

I assume something like that is possible? I was using exllama before ollama, and it had something like that, but I never went into the details to see how it was done.

It would be awesome if someone could make a PR for that, or point me in the right direction to do the PR myself 😜 (although my python knowledge is scarce).
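
For reference, the workaround described above looks roughly like this in Python. This is only a sketch against Ollama's `/api/embeddings` endpoint; whether the vector length actually tracks the number of input tokens (rather than the model's fixed embedding dimension) is worth verifying for your model and Ollama version:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def count_tokens_via_embeddings(model: str, text: str) -> int:
    """Workaround described above: request an embedding and use the
    vector length as a token count. Whether the length tracks the
    input token count (rather than the model's fixed embedding
    dimension) depends on the model and Ollama version, so verify
    before relying on it."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return len(resp.json()["embedding"])

print(count_tokens_via_embeddings("mistral:latest", "Why is the sky blue?"))
```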

GiteaMirror added the feature request label 2026-04-22 02:45:43 -05:00

@oliverbob commented on GitHub (Dec 3, 2023):

Yes, it would be very practical to see this implemented in this repo. Excited to see it in action.


@jukofyork commented on GitHub (Jan 28, 2024):

I was just thinking about this too - I hadn't thought of using the count from the embedding call, though - thanks!

It would still be better to have the token count returned, and possibly even the ability to return the tokenized text as well. Somebody linked this on Reddit the other day and it's quite interesting:

https://www.danieldemmel.me/tokenizer.html


@suvalaki commented on GitHub (Feb 4, 2024):

This would be pretty valuable, and useful for other libraries that call token-counting methods as part of their everyday flow. I want to integrate Ollama with some such flows.

I spent some time looking at what can be added: it seems simple enough. However, I noticed this discussion: https://github.com/ollama/ollama/pull/988

I believe that an encoding endpoint is still relevant because it enables a broader range of APIs. I'd love some clarity on whether I should complete my changes and open a PR.

I pretty much copy-pasted the generate script... https://github.com/ollama/ollama/compare/main...suvalaki:ollama:main (not very DRY, but I will await further comment before improving it)

Looks a bit like this at a request level:

Input

```json
{
  "model": "mistral:latest",
  "prompt": "Why is the sky blue?"
}
```

Output

```json
{
  "model": "mistral:latest",
  "created_at": "2024-02-05T21:49:44.472893Z",
  "total_duration": 8965307875,
  "load_duration": 8961889917,
  "context": [733, 16289, 28793, 28705, 4315, 349, 272, 7212, 5045, 28804, 733, 28748, 16289, 28793, 13],
  "prompt_eval_count": 15
}
```
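
A client-side sketch of how the proposed endpoint might be called from Python. The `/api/tokenize` path is a placeholder, not the confirmed route in the linked branch; the `context` and `prompt_eval_count` field names come from the example response above:

```python
import requests

OLLAMA_URL = "http://localhost:11434"

def tokenize(model: str, prompt: str) -> list[int]:
    """Call the tokenize endpoint proposed above. The /api/tokenize
    path is a placeholder for whatever route the branch registers."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/tokenize",
        json={"model": model, "prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    # "context" holds the token ids in the example response above.
    return resp.json()["context"]

tokens = tokenize("mistral:latest", "Why is the sky blue?")
print(len(tokens))  # should match prompt_eval_count in the response
```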

@oliverbob commented on GitHub (Feb 5, 2024):

There is no native support yet, but ollama-webui seems to have it.


@suvalaki commented on GitHub (Feb 5, 2024):

I mean, it being available in the webui doesn't really solve the issue, does it?


@oliverbob commented on GitHub (Feb 5, 2024):

![image](https://github.com/ollama/ollama/assets/23272429/c69267cf-7475-4f56-ac8e-3c7b3430ec94)

If I understand you correctly.


@suvalaki commented on GitHub (Feb 5, 2024):

I'd just read the modifications I made in my branch and you'll see the difference. It's the difference between a priori and a posteriori knowledge...

You just want access to the underlying tokenizer without needing to call generate (at the API layer).


@oliverbob commented on GitHub (Feb 6, 2024):

Sounds Greek to me. Wish you all the luck.


@chigkim commented on GitHub (May 2, 2024):

It seems like a lot of people want this.

https://github.com/ollama/ollama/issues/1716 and https://github.com/ollama/ollama/issues/3582

The llama.cpp server has POST /tokenize and POST /detokenize now, so hopefully Ollama can just expose the API.
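
For anyone running the llama.cpp server directly in the meantime, those endpoints can be called like this from Python (paths and field names as documented for llama.cpp's server; this assumes the default port 8080):

```python
import requests

LLAMA_SERVER = "http://localhost:8080"  # llama.cpp server default port

# Tokenize a string into model-specific token ids.
tokens = requests.post(
    f"{LLAMA_SERVER}/tokenize",
    json={"content": "Why is the sky blue?"},
    timeout=60,
).json()["tokens"]
print(len(tokens))  # token count, with no generation or embedding work

# Round-trip the ids back to text.
text = requests.post(
    f"{LLAMA_SERVER}/detokenize",
    json={"tokens": tokens},
    timeout=60,
).json()["content"]
print(text)
```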


@jmorganca commented on GitHub (Sep 4, 2024):

Thanks for the issue!

`/api/show` will now show the embedding length.

Regarding `/tokenize` and `/detokenize`, closing in favor of this issue: https://github.com/ollama/ollama/issues/3582
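
A sketch of reading that value from `/api/show`. The `model_info` keys are architecture-prefixed (e.g. `llama.embedding_length` for Llama-family models), so the exact key depends on the model; treat the key matching below as an assumption to verify against your Ollama version:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "mistral:latest"},
    timeout=60,
)
resp.raise_for_status()
info = resp.json().get("model_info", {})

# Keys are prefixed with the architecture, e.g. "llama.embedding_length".
for key, value in info.items():
    if key.endswith(".embedding_length"):
        print(key, value)
```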
