[GH-ISSUE #3582] Add Tokenize and Detokenize Endpoints to Ollama Server #2214

Closed
opened 2026-04-12 12:28:33 -05:00 by GiteaMirror · 17 comments
Owner

Originally created by @ParisNeo on GitHub (Apr 10, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3582

Originally assigned to: @ParthSareen on GitHub.

What are you trying to do?

I would like to propose the addition of tokenize and detokenize endpoints to the Ollama server. This feature is crucial for Ollama client interfaces (such as lollms) to effectively prepare prompts and accurately estimate the number of tokens for the LLMs. Currently, the client uses tiktoken for tokenization, which is not optimal since the token distribution depends on the model. While this can work with ChatGPT-compatible models, it may fail to correctly estimate the number of tokens, leading to suboptimal token budgeting and, in some cases, errors when the number of requested tokens exceeds the capacity of the LLM.

How should we solve this?

Introduce two new endpoints, one for tokenization and another for detokenization, to the Ollama server:

Tokenize Endpoint:

  • Input: Raw text, model name
  • Output: List of tokens

Detokenize Endpoint:

  • Input: List of tokens, model name
  • Output: Raw text

These endpoints should return the correct tokens or text for the model currently in use.

The tokenize endpoint should provide token counts tailored to the specific LLM being used. This ensures accurate token budgeting and helps avoid errors caused by exceeding the model's context capacity.
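For illustration, here is a client-side sketch of how such endpoints might be consumed. The route names and JSON field names below are assumptions made for the sake of the example (this is a proposal, not an existing Ollama API):

```python
# Hypothetical client-side usage of the proposed endpoints.
# The routes and JSON fields are assumptions for illustration only.
import requests

OLLAMA = "http://localhost:11434"

def tokenize(model: str, text: str) -> list[int]:
    """Ask the server for the model-specific token ids of `text`."""
    r = requests.post(f"{OLLAMA}/api/tokenize",
                      json={"model": model, "content": text})
    r.raise_for_status()
    return r.json()["tokens"]

def detokenize(model: str, tokens: list[int]) -> str:
    """Turn token ids back into text using the same model's vocabulary."""
    r = requests.post(f"{OLLAMA}/api/detokenize",
                      json={"model": model, "tokens": tokens})
    r.raise_for_status()
    return r.json()["content"]

if __name__ == "__main__":
    model = "mistral"  # any locally available model
    prompt = "Hello, world!"
    tokens = tokenize(model, prompt)
    # Accurate, model-specific count for context-window budgeting.
    print(f"{len(tokens)} tokens: {tokens}")
    print(repr(detokenize(model, tokens)))
```

A client like this could replace tiktoken-based estimates with counts that match the loaded model exactly.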

What is the impact of not solving this?

Without these endpoints, users might have to continue relying on inefficient or suboptimal solutions for tokenizing and detokenizing text data.

Anything else?

Include documentation and examples demonstrating how to use these new functionalities effectively. Providing comprehensive guidance will help users quickly adopt these features and enhance the overall user experience.

GiteaMirror added the feature request label 2026-04-12 12:28:33 -05:00

@chigkim commented on GitHub (May 2, 2024):

Related issues to keep an eye on: https://github.com/ollama/ollama/issues/1716


@rohitgr7 commented on GitHub (Mar 25, 2025):

hey guys! any update on this?


@ParthSareen commented on GitHub (Mar 25, 2025):

Hey @rohitgr7! Haven't forgotten about this, but with the new engine work we're still figuring out what support would look like across the new and old engines, as some of this depends on model loading. Will keep you all posted!


@raffaeler commented on GitHub (Jun 20, 2025):

I see 3 different unreviewed PRs about this.
Can someone clarify if this is going to happen or not?
I don't get the reasons behind the blocks.

Thanks


@icedmoca commented on GitHub (Jun 26, 2025):

Surprised this still isn’t implemented given how many issues and PRs have piled up around it. Tokenization isn’t optional anymore — it’s critical for logit biasing, context estimation, and any serious token-level intervention (e.g., memory, contradiction suppression).

Relying on HF tokenizers is a hack — GGUF models often diverge, and without a native /tokenize, we can’t trust logit-level control to be accurate.

Ollama handles generation, so why not expose the tokenizer driving it? This should’ve shipped a long time ago.


@ParthSareen commented on GitHub (Jun 28, 2025):

Hey @icedmoca, @raffaeler, I empathize and agree that we should have this, but it's a nontrivial implementation until more is cleared up.

I had recently taken another look at my PR: https://github.com/ollama/ollama/pull/8106

There are a few current issues which we just need more info on in order to have a good experience for the upcoming years.

  1. There's a divergence in how the old engine and the Ollama engine handle model loading. This would require a smarter model cache in the old engine, whereas the new engine wouldn't need one.
  2. While not too difficult to expose, there are long-tail issues in supporting two different ways of getting the tokens back.
  3. The API needs to extend beyond text input to images, and eventually other modalities. How this gets designed is pretty critical, as we'd have to continue supporting whatever goes out.

There's some other stuff in the works too which makes this a bit more complicated. Will keep you all posted but will continue tracking in my PR. Going to close this one out for now.


@raffaeler commented on GitHub (Jun 29, 2025):

@ParthSareen I understand and thank you and the other contributors for all the efforts in Ollama.
I am asking what is coming because I already had to migrate my talk demos to python transformers instead of Ollama because of this missing feature.
As others said, token counting is one of the many vital features.
Thanks again.


@icedmoca commented on GitHub (Jul 24, 2025):

My workaround for my project made me have to use vector-based semantic similarity and structured fact representation.. 🥀


@raffaeler commented on GitHub (Jul 24, 2025):

My workaround for my project made me have to use vector-based semantic similarity and structured fact representation.. 🥀

Could you please elaborate a bit more?
If you can't estimate the token count, you are forced to repeatedly try to tokenize to see whether there is an error.
Since my algorithm for chunking the documents is quite complex, this would make it take far too long.

Because of this missing feature, I had to migrate to the Hugging Face libraries to host the model.
Does your workaround solve this issue?


@icedmoca commented on GitHub (Jul 24, 2025):

My workaround for my project made me have to use vector-based semantic similarity and structured fact representation.. 🥀

Could you please elaborate a bit more? If you can't estimate the token count, you are forced to repeatedly try to tokenize to see whether there is an error. Since my algorithm for chunking the documents is quite complex, this would make it take far too long.

Because of this missing feature, I had to migrate to the Hugging Face libraries to host the model. Does your workaround solve this issue?

Yes, my workaround avoids token estimation entirely by operating at a higher semantic level. Instead of relying on tokenizer length constraints, I parse inputs into structured fact triplets and group them by semantic similarity using embedding vectors (like en_core_web_lg or MiniLM). This allows chunking and context window management to be driven by meaning rather than raw token count.

It’s not perfect for token-level logit biasing, but for document chunking, summarization, and reasoning, it avoids the need for a tokenizer altogether — and works with Ollama today. Let me know if you want a demo or outline.

You're absolutely right about the token estimation issue—especially when dealing with semantically dense or volatile input. That’s precisely why I moved away from dynamic token counting altogether and implemented a hybrid approach in MeRNSTA.

My system uses a dual-layer memory design:

Structured Fact Representation
Inputs are parsed into normalized (subject, predicate, object) triplets and stored in SQLite under the enhanced_facts table. Each entry is augmented with:

  • Contradiction and volatility flags
  • Confidence scores and change history
  • Temporal, session, and user-profile metadata
  • Optional vector embeddings for semantic operations

Contradictions are tracked via contradiction_records, enabling dynamic volatility scoring and automated contradiction resolution.

Vector-Based Semantic Similarity
Rather than attempting to tokenize large documents repeatedly (which as you've pointed out becomes untenable), I use sentence-transformers (MiniLM) to embed fact-level assertions at ingestion time. During recall, the system performs vector similarity search to surface contextually relevant memory without ever re-tokenizing the full memory base. This keeps latency predictable and scales much better under evolving memory state.

MeRNSTA avoids real-time token estimation altogether by decomposing input into discrete, vectorized facts. Only the semantically relevant subset is ever reassembled for LLM prompts. That design decision made token overflows a non-issue—even when working with complex contextual recall or highly entropic sessions.

I’ve only published portions of the system publicly so far.. like memory schemas, contradiction handling, and partial semantic indexing logic, because the full orchestration engine includes self-evolving code, real-time memory reinforcement, and internal scaffolding I’m not ready to open-source until I finalize the broader cognitive loop.
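As a rough illustration of the ingestion-time embedding and similarity-based recall described above, a generic sketch using the sentence-transformers package might look like this (this is not MeRNSTA's actual code):

```python
# Illustrative sketch of fact-level embedding and similarity recall.
# Uses the sentence-transformers package; not MeRNSTA's implementation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Facts would normally be (subject, predicate, object) triplets stored in SQLite;
# here they are flattened to strings purely for embedding.
facts = [
    "user prefers dark roast coffee",
    "user is allergic to peanuts",
    "project deadline is next Friday",
]
fact_vecs = model.encode(facts, convert_to_tensor=True)  # embed once at ingestion

def recall(query: str, top_k: int = 2):
    """Return the most semantically similar stored facts; no re-tokenization needed."""
    q_vec = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, fact_vecs, top_k=top_k)[0]
    return [(facts[h["corpus_id"]], float(h["score"])) for h in hits]

print(recall("what does the user like to drink?"))
```

Only the recalled subset would then be reassembled into the LLM prompt, which is what keeps token overflows out of the picture.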


@raffaeler commented on GitHub (Jul 25, 2025):

This strategy doesn't work in my case. Either the upper limit is very high (Qwen3 for example), or you need to iterate until you fit your threshold.
In my research (and on papers) I found that aggregation by vector similarity is not the best way to create the chunks. This is a bad strategy in many use-cases like the legal one where you have many phrases that are relevant for a context but they express different concepts (depositions for example). In other use-cases I found the same problem with quotes.

My analysis starts from the syntax tree of the original document (that must have been accurately filtered in the previous step). This is where I can visualize clusters of concepts inside the documents. And when I see well-separated clusters, I know the result is good.
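For readers curious what syntax-driven grouping can look like in code, here is a minimal, heavily simplified sketch using spaCy dependency parses. This is only an illustration of the general idea, not @raffaeler's actual (unpublished) pipeline:

```python
# Crude illustration: group sentences by shared subject/object nouns pulled
# from the dependency parse. NOT the pipeline described above.
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = (
    "The witness described the contract. The contract was signed in March. "
    "Separately, the court scheduled a hearing. The hearing concerns damages."
)

clusters = []  # list of (concept_set, sentences) pairs
for sent in nlp(text).sents:
    # Subject/object nouns act as a rough stand-in for the sentence's "concepts".
    concepts = {
        tok.lemma_.lower()
        for tok in sent
        if tok.dep_ in ("nsubj", "nsubjpass", "dobj", "pobj")
        and tok.pos_ in ("NOUN", "PROPN")
    }
    for concept_set, members in clusters:
        if concepts & concept_set:        # shares a concept with an existing cluster
            concept_set |= concepts
            members.append(sent.text)
            break
    else:
        clusters.append((concepts, [sent.text]))

for concept_set, members in clusters:
    print(sorted(concept_set), "->", members)
```

Well-separated clusters in the output are the analogue of the "well-separated clusters of concepts" mentioned above.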


@js402 commented on GitHub (Jul 25, 2025):

Hey Raffaele Rialdi,

I know this is off-topic, but I'm super interested in your research.
I would be very grateful if you could share where to find your papers.

This strategy doesn't work in my case. Either the upper limit is very high (Qwen3 for example), or you need to iterate until you fit your threshold. In my research (and on papers) I found that aggregation by vector similarity is not the best way to create the chunks. This is a bad strategy in many use-cases like the legal one where you have many phrases that are relevant for a context but they express different concepts (depositions for example). In other use-cases I found the same problem with quotes.

My analysis starts from the syntax tree of the original document (that must have been accurately filtered in the previous step). This is where I can visualize clusters of concepts inside the documents. And when I see well-separated clusters, I know the result is good.


@raffaeler commented on GitHub (Jul 25, 2025):

Unfortunately the papers are very generic on this topic.
I explain in detail what I am doing in the talks I am giving on the topic, but there are no published videos (yet).
If I find some time, I may decide to publish something, but do not expect it soon, as I am working hard on two large projects now.
Sorry


@icedmoca commented on GitHub (Jul 28, 2025):

This strategy doesn't work in my case. Either the upper limit is very high (Qwen3 for example), or you need to iterate until you fit your threshold. In my research (and on papers) I found that aggregation by vector similarity is not the best way to create the chunks. This is a bad strategy in many use-cases like the legal one where you have many phrases that are relevant for a context but they express different concepts (depositions for example). In other use-cases I found the same problem with quotes.

My analysis starts from the syntax tree of the original document (that must have been accurately filtered in the previous step). This is where I can visualize clusters of concepts inside the documents. And when I see well-separated clusters, I know the result is good.

It makes sense why vector similarity wouldn't cut it for your use case. Semantic clustering breaks down when the input spans multiple dense but semantically divergent regions: quotes, depositions, overlapping testimony, etc. In those contexts, syntax trees give you much tighter control over conceptual boundaries, and syntactic clustering is the right choice for preserving document logic and ensuring legal/technical fidelity.

For MeRNSTA, I had different constraints: I needed a system that could reason across evolving conversational facts, detect contradictions over time, and maintain stability without ever re-tokenizing huge memory bases. So instead of using token counts or syntax trees, I chunk based on concept recurrence and contradiction volatility. That works well for dynamic memory systems, but definitely not for high-precision document ingestion like yours.

I'd love to see your syntax-based clustering pipeline if you ever publish it, especially how you're managing ambiguity and cross-cluster references. I also agree that if Ollama ever exposes /tokenize, the hybrid approach is probably ideal: syntax or semantic chunking first, then real token-budget trimming as a final step (see the sketch below). That's the missing piece for a lot of us.

Also @js402 I would like to chat about your work if possible, I'm also working on an intent engine utilizing blockchain architecture with ai agents.
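A minimal sketch of that hybrid flow, assuming a /api/tokenize endpoint with the request shape discussed later in this thread (not part of Ollama's official API):

```python
# Hedged sketch of the hybrid approach: chunk by meaning first, then trim each
# chunk to a real token budget. Assumes a /api/tokenize endpoint like the one
# proposed in this thread; it does not exist in upstream Ollama.
import requests

OLLAMA = "http://localhost:11434"

def count_tokens(model: str, text: str) -> int:
    r = requests.post(f"{OLLAMA}/api/tokenize",
                      json={"model": model, "content": text})
    r.raise_for_status()
    return len(r.json()["tokens"])

def trim_to_budget(model: str, sentences: list[str], budget: int) -> list[str]:
    """Greedily keep whole sentences of a semantic chunk until the token budget is hit."""
    kept, used = [], 0
    for s in sentences:
        n = count_tokens(model, s)
        if used + n > budget:
            break
        kept.append(s)
        used += n
    return kept

# `semantic_chunks` would come from whatever syntax- or embedding-based chunker
# runs upstream; shown here as a placeholder.
semantic_chunks = [["First sentence of a chunk.", "Second sentence of the same chunk."]]
print([trim_to_budget("mistral", chunk, budget=512) for chunk in semantic_chunks])
```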


@icedmoca commented on GitHub (Jul 28, 2025):

@js402 @raffaeler @ParisNeo @ParthSareen

Full Implementation of /tokenize and /detokenize Endpoints for Ollama

Hi all, I went ahead and implemented the missing /tokenize and /detokenize endpoints for Ollama and validated them using mistral:latest. This directly resolves the original request in #3582 and related threads.

What I Built

The Ollama server currently exposes no way to perform model-accurate tokenization or detokenization.
This creates major problems for:

  • Logit biasing
  • Context window estimation
  • Chunking and prompt trimming
  • Memory systems, structured reasoning, and summarization
  • Developers trying to interoperate with GGUF or non-HF models

Most of us were forced to hack around this using tiktoken, which doesn't match GGUF or model-specific BPEs.

What I Added

I used the existing LlamaServer interface which already contains internal Tokenize() and Detokenize() methods. I exposed them via HTTP endpoints and integrated with the scheduler.

Endpoints:

POST /api/tokenize
  Request:  { "model": "mistral:latest", "content": "Hello, world!" }
  Response: { "tokens": [23325, 29493, 2294, 29576] }

POST /api/detokenize
  Request:  { "model": "mistral:latest", "tokens": [23325, 29493, 2294, 29576] }
  Response: { "content": " Hello, world!" }

✔ Fully model-aligned (works for GGUF and HF)
✔ Uses the same tokenization logic used in generation
✔ Returns timings and model info
✔ Full round-trip verification

Proof

⏱ Live Server Response

[GIN] 2025/07/28 - 16:28:25 | 200 | 3.33s | 127.0.0.1 | POST /api/tokenize

[GIN] 2025/07/28 - 16:28:41 | 200 | 5.40ms | 127.0.0.1 | POST /api/detokenize

CURL Test

Tokenize

input:

curl http://localhost:11434/api/tokenize \
  -H "Content-Type: application/json" \
  -d '{ "model": "mistral:latest", "content": "Hello, world!" }'

output:

{ "model": "mistral:latest", "tokens": [23325, 29493, 2294, 29576], "total_duration": 3333091020, "load_duration": 3332916624 }

Detokenize

input:

curl http://localhost:11434/api/detokenize \
  -H "Content-Type: application/json" \
  -d '{ "model": "mistral:latest", "tokens": [23325, 29493, 2294, 29576] }'

output:

{ "model": "mistral:latest", "content": " Hello, world!", "total_duration": 5376833, "load_duration": 5371033 }

The detokenized output matches exactly, including the expected leading space!
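For anyone trying that branch, a minimal round-trip check might look like the sketch below. It uses the request/response shapes shown above; note that these routes exist only in the linked fork, not in upstream Ollama:

```python
# Minimal round-trip check for the /api/tokenize and /api/detokenize endpoints
# described above. These routes exist only in the linked fork, not upstream.
import requests

BASE = "http://localhost:11434"
MODEL = "mistral:latest"
TEXT = "Hello, world!"

tok = requests.post(f"{BASE}/api/tokenize",
                    json={"model": MODEL, "content": TEXT}).json()
detok = requests.post(f"{BASE}/api/detokenize",
                      json={"model": MODEL, "tokens": tok["tokens"]}).json()

print("tokens:", tok["tokens"])
print("round-trip:", repr(detok["content"]))
# The detokenized text may carry a leading space added by the model's
# tokenizer, so compare after stripping.
assert detok["content"].strip() == TEXT.strip()
```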

Changes

  • api/types.go: Added TokenizeRequest, TokenizeResponse, DetokenizeRequest, DetokenizeResponse
  • server/routes.go: Added TokenizeHandler, DetokenizeHandler, registered both routes
  • api/client.go: Added Tokenize() and Detokenize() methods to the client
  • api/examples/tokenize/main.go: Round-trip demo
  • integration/api_test.go: Integration test for round-trip correctness
  • docs/api.md: Documented new endpoints
  • Uses scheduleRunner() to access models, supports keep_alive, consistent with other endpoints

View what I changed here: https://github.com/icedmoca/ollama/commit/aa855f2b156708e8388480f95586e083c01732a6

Also @ParthSareen I’ve implemented full /tokenize and /detokenize endpoints with support for media_type, keep_alive, and a modular TokenizerAdapter interface to future-proof for other engines and modalities. I validated round-trip integrity across a wide range of stress cases (multilingual, emojis, fuzzed Unicode, edge whitespace, markdown/math, RTL), and keep-alive drastically reduces cold start overhead. While this doesn’t yet handle multimodal input like images, the adapter pattern was explicitly designed to support it cleanly. I’d appreciate your review or suggestions on any blockers I may have missed.


@raffaeler commented on GitHub (Jul 29, 2025):

@icedmoca Exactly. There is certainly more than one winning strategy to chunk documents. This mostly depends on the document type and the use-case. Measuring them is the best way to decide what is better.
I'll probably publish my work once it is polished and tested across multiple scenarios, but I have to find the time.

About your strategy, I am not familiar with MeRNSTA, nor can I find anything with that precise name. What is that?

With regards to your new implementation, great work, but as I was mentioning above, there are already multiple open PRs and none of them has been reviewed or approved. I don't know the reason, but an "official" version is needed before this feature can actively be used and promoted.


@icedmoca commented on GitHub (Aug 2, 2025):

Hey @raffaeler, I agree on the PR backlog and the need for an official version first; hopefully something similar gets implemented.

As for MeRNSTA — it’s a custom neuro-symbolic "AGI" framework I’ve been developing that combines a contradiction-resolving memory graph, causal + temporal fact tracking, multi-agent debate/reflection layers, and recursive self-evolution (it mutates and tests its own agents).

I used the tokenizer/detokenizer endpoints to create an adaptive encoder that rewrites prompts based on past memory and contradiction history. And it actually works and doesn't hallucinate!! It’s crazy... I MIGHT open source once stable, but if you’re curious, happy to DM a breakdown or benchmark snippets. The mernsta repo on my profile is outdated but does reflect aspects of it. It's truly crazy what I've been able to accomplish with vibe planning and some vibe coding..

Edit: Also, I think this might interest you, given you're looking for something to chunk large documents: https://github.com/CaRniFeXeR/transformer-kernel-ranking

Reference: github-starred/ollama#2214