[GH-ISSUE #3582] Add Tokenize and Detokenize Endpoints to Ollama Server #2214

Closed
opened 2026-04-12 12:28:33 -05:00 by GiteaMirror · 17 comments
Owner

Originally created by @ParisNeo on GitHub (Apr 10, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3582

Originally assigned to: @ParthSareen on GitHub.

What are you trying to do?

I would like to propose the addition of tokenize and detokenize endpoints to the Ollama server. This feature is crucial for Ollama client interfaces (such as lollms) to effectively prepare prompts and accurately estimate the number of tokens for the LLMs. Currently, the client uses tiktoken for tokenization, which is not optimal since the token distribution depends on the model. While this can work with ChatGPT-compatible models, it may fail to correctly estimate the number of tokens, leading to suboptimal token budgeting and, in some cases, errors when the number of requested tokens exceeds the capacity of the LLM.

How should we solve this?

Introduce two new endpoints, one for tokenization and another for detokenization, to the Ollama server:

Tokenize Endpoint:

  • Input: Raw text, model name
  • Output: List of tokens

Detokenize Endpoint:

  • Input: List of tokens, model name
  • Output: Raw text

These endpoints should return the correct tokens or text for the model currently in use.

The tokenize endpoint should provide token counts tailored to the specific LLM being used. This ensures accurate token budgeting and helps avoid errors caused by exceeding the model's context capacity.
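For illustration, here is a client-side sketch of how such endpoints might be consumed. The route names and JSON field names below are assumptions made for the sake of the example (this is a proposal, not an existing Ollama API):

```python
# Hypothetical client-side usage of the proposed endpoints.
# The routes and JSON fields are assumptions for illustration only.
import requests

OLLAMA = "http://localhost:11434"

def tokenize(model: str, text: str) -> list[int]:
    """Ask the server for the model-specific token ids of `text`."""
    r = requests.post(f"{OLLAMA}/api/tokenize",
                      json={"model": model, "content": text})
    r.raise_for_status()
    return r.json()["tokens"]

def detokenize(model: str, tokens: list[int]) -> str:
    """Turn token ids back into text using the same model's vocabulary."""
    r = requests.post(f"{OLLAMA}/api/detokenize",
                      json={"model": model, "tokens": tokens})
    r.raise_for_status()
    return r.json()["content"]

if __name__ == "__main__":
    model = "mistral"  # any locally available model
    prompt = "Hello, world!"
    tokens = tokenize(model, prompt)
    # Accurate, model-specific count for context-window budgeting.
    print(f"{len(tokens)} tokens: {tokens}")
    print(repr(detokenize(model, tokens)))
```

A client like this could replace tiktoken-based estimates with counts that match the loaded model exactly.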

What is the impact of not solving this?

Without these endpoints, users might have to continue relying on inefficient or suboptimal solutions for tokenizing and detokenizing text data.

Anything else?

Include documentation and examples demonstrating how to use these new functionalities effectively. Providing comprehensive guidance will help users quickly adopt these features and enhance the overall user experience.

GiteaMirror added the feature request label 2026-04-12 12:28:33 -05:00

@chigkim commented on GitHub (May 2, 2024):

Related issues to keep an eye on: https://github.com/ollama/ollama/issues/1716


@rohitgr7 commented on GitHub (Mar 25, 2025):

hey guys! any update on this?


@ParthSareen commented on GitHub (Mar 25, 2025):

Hey @rohitgr7! Haven't forgotten about this, but with the new engine work we're still figuring out what support would look like across the new and old engines, as some of this depends on model loading. Will keep you all posted!


@raffaeler commented on GitHub (Jun 20, 2025):

I see 3 different unreviewed PRs about this.
Can someone clarify if this is going to happen or not?
I don't get the reasons behind the blocks.

Thanks


@icedmoca commented on GitHub (Jun 26, 2025):

Surprised this still isn’t implemented given how many issues and PRs have piled up around it. Tokenization isn’t optional anymore — it’s critical for logit biasing, context estimation, and any serious token-level intervention (e.g., memory, contradiction suppression).

Relying on HF tokenizers is a hack — GGUF models often diverge, and without a native /tokenize, we can’t trust logit-level control to be accurate.

Ollama handles generation, so why not expose the tokenizer driving it? This should’ve shipped a long time ago.


@ParthSareen commented on GitHub (Jun 28, 2025):

Hey @icedmoca, @raffaeler, I empathize and agree that we should have this, but it's a nontrivial implementation until more is cleared up.

I had recently taken another look at my PR: https://github.com/ollama/ollama/pull/8106

There are a few current issues which we just need more info on in order to have a good experience for the upcoming years.

  1. There's a divergence in how the old engine and the Ollama engine handle model loading. This would require a smarter model cache in the old engine, whereas the new engine wouldn't need one.
  2. While not too difficult to expose, there are long-tail issues in supporting two different ways of getting the tokens back.
  3. The API needs to extend beyond text input to images, and eventually other modalities. How this gets designed is pretty critical, as we'd have to continue supporting whatever goes out.

There's some other stuff in the works too which makes this a bit more complicated. Will keep you all posted but will continue tracking in my PR. Going to close this one out for now.


@raffaeler commented on GitHub (Jun 29, 2025):

@ParthSareen I understand and thank you and the other contributors for all the efforts in Ollama.
I am asking what is coming because I already had to migrate my talk demos to python transformers instead of Ollama because of this missing feature.
As others said, token counting is one of the many vital features.
Thanks again.


@icedmoca commented on GitHub (Jul 24, 2025):

My workaround for my project made me have to use vector-based semantic similarity and structured fact representation.. 🥀


@raffaeler commented on GitHub (Jul 24, 2025):

My workaround for my project made me have to use vector-based semantic similarity and structured fact representation.. 🥀

Could you please elaborate a bit more?
If you can't estimate the token count, you are forced to repeatedly try to tokenize to see whether there is an error.
Since my algorithm for chunking the documents is quite complex, this would make it take far too long.

Because of this missing feature, I had to migrate to the Hugging Face libraries to host the model.
Does your workaround solve this issue?


@icedmoca commented on GitHub (Jul 24, 2025):

My workaround for my project made me have to use vector-based semantic similarity and structured fact representation.. 🥀

Could you please elaborate a bit more? If you can't estimate the token count, you are forced to repeatedly try to tokenize to see whether there is an error. Since my algorithm for chunking the documents is quite complex, this would make it take far too long.

Because of this missing feature, I had to migrate to the Hugging Face libraries to host the model. Does your workaround solve this issue?

Yes, my workaround avoids token estimation entirely by operating at a higher semantic level. Instead of relying on tokenizer length constraints, I parse inputs into structured fact triplets and group them by semantic similarity using embedding vectors (like en_core_web_lg or MiniLM). This allows chunking and context window management to be driven by meaning rather than raw token count.

It’s not perfect for token-level logit biasing, but for document chunking, summarization, and reasoning, it avoids the need for a tokenizer altogether — and works with Ollama today. Let me know if you want a demo or outline.

You're absolutely right about the token estimation issue—especially when dealing with semantically dense or volatile input. That’s precisely why I moved away from dynamic token counting altogether and implemented a hybrid approach in MeRNSTA.

My system uses a dual-layer memory design:

Structured Fact Representation
Inputs are parsed into normalized (subject, predicate, object) triplets and stored in SQLite under the enhanced_facts table. Each entry is augmented with:

  • Contradiction and volatility flags
  • Confidence scores and change history
  • Temporal, session, and user-profile metadata
  • Optional vector embeddings for semantic operations

Contradictions are tracked via contradiction_records, enabling dynamic volatility scoring and automated contradiction resolution.

Vector-Based Semantic Similarity
Rather than attempting to tokenize large documents repeatedly (which as you've pointed out becomes untenable), I use sentence-transformers (MiniLM) to embed fact-level assertions at ingestion time. During recall, the system performs vector similarity search to surface contextually relevant memory without ever re-tokenizing the full memory base. This keeps latency predictable and scales much better under evolving memory state.

MeRNSTA avoids real-time token estimation altogether by decomposing input into discrete, vectorized facts. Only the semantically relevant subset is ever reassembled for LLM prompts. That design decision made token overflows a non-issue—even when working with complex contextual recall or highly entropic sessions.

I’ve only published portions of the system publicly so far.. like memory schemas, contradiction handling, and partial semantic indexing logic, because the full orchestration engine includes self-evolving code, real-time memory reinforcement, and internal scaffolding I’m not ready to open-source until I finalize the broader cognitive loop.
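As a rough illustration of the ingestion-time embedding and similarity-based recall described above, a generic sketch using the sentence-transformers package might look like this (this is not MeRNSTA's actual code):

```python
# Illustrative sketch of fact-level embedding and similarity recall.
# Uses the sentence-transformers package; not MeRNSTA's implementation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Facts would normally be (subject, predicate, object) triplets stored in SQLite;
# here they are flattened to strings purely for embedding.
facts = [
    "user prefers dark roast coffee",
    "user is allergic to peanuts",
    "project deadline is next Friday",
]
fact_vecs = model.encode(facts, convert_to_tensor=True)  # embed once at ingestion

def recall(query: str, top_k: int = 2):
    """Return the most semantically similar stored facts; no re-tokenization needed."""
    q_vec = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_vec, fact_vecs, top_k=top_k)[0]
    return [(facts[h["corpus_id"]], float(h["score"])) for h in hits]

print(recall("what does the user like to drink?"))
```

Only the recalled subset would then be reassembled into the LLM prompt, which is what keeps token overflows out of the picture.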


@raffaeler commented on GitHub (Jul 25, 2025):

This strategy doesn't work in my case. Either the upper limit is very high (Qwen3 for example), or you need to iterate until you fit your threshold.
In my research (and on papers) I found that aggregation by vector similarity is not the best way to create the chunks. This is a bad strategy in many use-cases like the legal one where you have many phrases that are relevant for a context but they express different concepts (depositions for example). In other use-cases I found the same problem with quotes.

My analysis starts from the syntax tree of the original document (that must have been accurately filtered in the previous step). This is where I can visualize clusters of concepts inside the documents. And when I see well-separated clusters, I know the result is good.
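For readers curious what syntax-driven grouping can look like in code, here is a minimal, heavily simplified sketch using spaCy dependency parses. This is only an illustration of the general idea, not @raffaeler's actual (unpublished) pipeline:

```python
# Crude illustration: group sentences by shared subject/object nouns pulled
# from the dependency parse. NOT the pipeline described above.
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = (
    "The witness described the contract. The contract was signed in March. "
    "Separately, the court scheduled a hearing. The hearing concerns damages."
)

clusters = []  # list of (concept_set, sentences) pairs
for sent in nlp(text).sents:
    # Subject/object nouns act as a rough stand-in for the sentence's "concepts".
    concepts = {
        tok.lemma_.lower()
        for tok in sent
        if tok.dep_ in ("nsubj", "nsubjpass", "dobj", "pobj")
        and tok.pos_ in ("NOUN", "PROPN")
    }
    for concept_set, members in clusters:
        if concepts & concept_set:        # shares a concept with an existing cluster
            concept_set |= concepts
            members.append(sent.text)
            break
    else:
        clusters.append((concepts, [sent.text]))

for concept_set, members in clusters:
    print(sorted(concept_set), "->", members)
```

Well-separated clusters in the output are the analogue of the "well-separated clusters of concepts" mentioned above.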


@js402 commented on GitHub (Jul 25, 2025):

Hey Raffaele Rialdi,

I know this is off-topic, but I'm super interested in your research.
I would be very grateful if you could share where to find your papers.

This strategy doesn't work in my case. Either the upper limit is very high (Qwen3 for example), or you need to iterate until you fit your threshold. In my research (and on papers) I found that aggregation by vector similarity is not the best way to create the chunks. This is a bad strategy in many use-cases like the legal one where you have many phrases that are relevant for a context but they express different concepts (depositions for example). In other use-cases I found the same problem with quotes.

My analysis starts from the syntax tree of the original document (that must have been accurately filtered in the previous step). This is where I can visualize clusters of concepts inside the documents. And when I see well-separated clusters, I know the result is good.


@raffaeler commented on GitHub (Jul 25, 2025):

Unfortunately the papers are very generic on this topic.
I explain in detail what I am doing in the talks I am giving on the topic, but there are no published videos (yet).
If I find some time, I may decide to publish something, but do not expect it soon, as I am working hard on two large projects now.
Sorry


@icedmoca commented on GitHub (Jul 28, 2025):

This strategy doesn't work in my case. Either the upper limit is very high (Qwen3 for example), or you need to iterate until you fit your threshold. In my research (and on papers) I found that aggregation by vector similarity is not the best way to create the chunks. This is a bad strategy in many use-cases like the legal one where you have many phrases that are relevant for a context but they express different concepts (depositions for example). In other use-cases I found the same problem with quotes.

My analysis starts from the syntax tree of the original document (that must have been accurately filtered in the previous step). This is where I can visualize clusters of concepts inside the documents. And when I see well-separated clusters, I know the result is good.

It makes sense why vector similarity wouldn't cut it for your use case. Semantic clustering breaks down when the input spans multiple dense but semantically divergent regions: quotes, depositions, overlapping testimony, etc. In those contexts, syntax trees give you much tighter control over conceptual boundaries, and syntactic clustering is the right choice for preserving document logic and ensuring legal/technical fidelity.

For MeRNSTA, I had different constraints: I needed a system that could reason across evolving conversational facts, detect contradictions over time, and maintain stability without ever re-tokenizing huge memory bases. So instead of using token counts or syntax trees, I chunk based on concept recurrence and contradiction volatility. That works well for dynamic memory systems, but definitely not for high-precision document ingestion like yours.

I'd love to see your syntax-based clustering pipeline if you ever publish it, especially how you're managing ambiguity and cross-cluster references. I also agree that if Ollama ever exposes /tokenize, the hybrid approach is probably ideal: syntax or semantic chunking first, then real token-budget trimming as a final step (see the sketch below). That's the missing piece for a lot of us.

Also @js402 I would like to chat about your work if possible, I'm also working on an intent engine utilizing blockchain architecture with ai agents.
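A minimal sketch of that hybrid flow, assuming a /api/tokenize endpoint with the request shape discussed later in this thread (not part of Ollama's official API):

```python
# Hedged sketch of the hybrid approach: chunk by meaning first, then trim each
# chunk to a real token budget. Assumes a /api/tokenize endpoint like the one
# proposed in this thread; it does not exist in upstream Ollama.
import requests

OLLAMA = "http://localhost:11434"

def count_tokens(model: str, text: str) -> int:
    r = requests.post(f"{OLLAMA}/api/tokenize",
                      json={"model": model, "content": text})
    r.raise_for_status()
    return len(r.json()["tokens"])

def trim_to_budget(model: str, sentences: list[str], budget: int) -> list[str]:
    """Greedily keep whole sentences of a semantic chunk until the token budget is hit."""
    kept, used = [], 0
    for s in sentences:
        n = count_tokens(model, s)
        if used + n > budget:
            break
        kept.append(s)
        used += n
    return kept

# `semantic_chunks` would come from whatever syntax- or embedding-based chunker
# runs upstream; shown here as a placeholder.
semantic_chunks = [["First sentence of a chunk.", "Second sentence of the same chunk."]]
print([trim_to_budget("mistral", chunk, budget=512) for chunk in semantic_chunks])
```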


@icedmoca commented on GitHub (Jul 28, 2025):

@js402 @raffaeler @ParisNeo @ParthSareen

Full Implementation of /tokenize and /detokenize Endpoints for Ollama

Hi all, I went ahead and implemented the missing /tokenize and /detokenize endpoints for Ollama and validated them using mistral:latest. This directly resolves the original request in #3582 and related threads.

What I Built

The Ollama server currently exposes no way to perform model-accurate tokenization or detokenization.
This creates major problems for:

  • Logit biasing
  • Context window estimation
  • Chunking and prompt trimming
  • Memory systems, structured reasoning, and summarization
  • Developers trying to interoperate with GGUF or non-HF models

Most of us were forced to hack around this using tiktoken, which doesn't match GGUF or model-specific BPEs.

What I Added

I used the existing LlamaServer interface which already contains internal Tokenize() and Detokenize() methods. I exposed them via HTTP endpoints and integrated with the scheduler.

Endpoints:

POST /api/tokenize
  Request:  { "model": "mistral:latest", "content": "Hello, world!" }
  Response: { "tokens": [23325, 29493, 2294, 29576] }

POST /api/detokenize
  Request:  { "model": "mistral:latest", "tokens": [23325, 29493, 2294, 29576] }
  Response: { "content": " Hello, world!" }

✔ Fully model-aligned (works for GGUF and HF)
✔ Uses the same tokenization logic used in generation
✔ Returns timings and model info
✔ Full round-trip verification

Proof

⏱ Live Server Response

[GIN] 2025/07/28 - 16:28:25 | 200 | 3.33s | 127.0.0.1 | POST /api/tokenize

[GIN] 2025/07/28 - 16:28:41 | 200 | 5.40ms | 127.0.0.1 | POST /api/detokenize

CURL Test

Tokenize

input:

curl http://localhost:11434/api/tokenize \
  -H "Content-Type: application/json" \
  -d '{ "model": "mistral:latest", "content": "Hello, world!" }'

output:

{ "model": "mistral:latest", "tokens": [23325, 29493, 2294, 29576], "total_duration": 3333091020, "load_duration": 3332916624 }

Detokenize

input:

curl http://localhost:11434/api/detokenize \
  -H "Content-Type: application/json" \
  -d '{ "model": "mistral:latest", "tokens": [23325, 29493, 2294, 29576] }'

output:

{ "model": "mistral:latest", "content": " Hello, world!", "total_duration": 5376833, "load_duration": 5371033 }

The detokenized output matches exactly, including the expected leading space!
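For anyone trying that branch, a minimal round-trip check might look like the sketch below. It uses the request/response shapes shown above; note that these routes exist only in the linked fork, not in upstream Ollama:

```python
# Minimal round-trip check for the /api/tokenize and /api/detokenize endpoints
# described above. These routes exist only in the linked fork, not upstream.
import requests

BASE = "http://localhost:11434"
MODEL = "mistral:latest"
TEXT = "Hello, world!"

tok = requests.post(f"{BASE}/api/tokenize",
                    json={"model": MODEL, "content": TEXT}).json()
detok = requests.post(f"{BASE}/api/detokenize",
                      json={"model": MODEL, "tokens": tok["tokens"]}).json()

print("tokens:", tok["tokens"])
print("round-trip:", repr(detok["content"]))
# The detokenized text may carry a leading space added by the model's
# tokenizer, so compare after stripping.
assert detok["content"].strip() == TEXT.strip()
```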

Changes

  • api/types.go: Added TokenizeRequest, TokenizeResponse, DetokenizeRequest, DetokenizeResponse
  • server/routes.go: Added TokenizeHandler, DetokenizeHandler, registered both routes
  • api/client.go: Added Tokenize() and Detokenize() methods to the client
  • api/examples/tokenize/main.go: Round-trip demo
  • integration/api_test.go: Integration test for round-trip correctness
  • docs/api.md: Documented new endpoints
  • Uses scheduleRunner() to access models, supports keep_alive, consistent with other endpoints

View what I changed here: https://github.com/icedmoca/ollama/commit/aa855f2b156708e8388480f95586e083c01732a6

Also @ParthSareen I’ve implemented full /tokenize and /detokenize endpoints with support for media_type, keep_alive, and a modular TokenizerAdapter interface to future-proof for other engines and modalities. I validated round-trip integrity across a wide range of stress cases (multilingual, emojis, fuzzed Unicode, edge whitespace, markdown/math, RTL), and keep-alive drastically reduces cold start overhead. While this doesn’t yet handle multimodal input like images, the adapter pattern was explicitly designed to support it cleanly. I’d appreciate your review or suggestions on any blockers I may have missed.


@raffaeler commented on GitHub (Jul 29, 2025):

@icedmoca Exactly. There is certainly more than one winning strategy to chunk documents. This mostly depends on the document type and the use-case. Measuring them is the best way to decide what is better.
I'll probably publish my work once it is polished and tested across multiple scenarios, but I have to find the time.

About your strategy, I am not familiar with MeRNSTA, nor can I find anything with that precise name. What is that?

With regards to your new implementation, great work, but as I was mentioning above, there are already multiple open PRs and none of them has been reviewed or approved. I don't know the reason, but an "official" version is needed before this feature can actively be used and promoted.


@icedmoca commented on GitHub (Aug 2, 2025):

Hey @raffaeler, I agree on the PR backlog and the need for an official version first; hopefully something similar gets implemented.

As for MeRNSTA — it’s a custom neuro-symbolic "AGI" framework I’ve been developing that combines a contradiction-resolving memory graph, causal + temporal fact tracking, multi-agent debate/reflection layers, and recursive self-evolution (it mutates and tests its own agents).

I used the tokenizer/detokenizer endpoints to create an adaptive encoder that rewrites prompts based on past memory and contradiction history. And it actually works and doesn't hallucinate!! It’s crazy... I MIGHT open source once stable, but if you’re curious, happy to DM a breakdown or benchmark snippets. The mernsta repo on my profile is outdated but does reflect aspects of it. It's truly crazy what I've been able to accomplish with vibe planning and some vibe coding..

Edit: Also, I think this might interest you, given you're looking for something to chunk large documents: https://github.com/CaRniFeXeR/transformer-kernel-ranking

Reference: github-starred/ollama#2214