[GH-ISSUE #12375] Improvement ideas regarding KV cache and Token ID. #54732

Closed
opened 2026-04-29 07:07:07 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @ghost on GitHub (Sep 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12375

  1. **Inconsistency between token IDs during generation and after re-tokenization**:
    When generating text, the token IDs emitted during generation may not match the token IDs obtained by re-tokenizing the generated text, especially for languages like Chinese. This invalidates the KV (key-value) cache.

    For example, if I ask the model to generate a long novel and then reply with a single character like "好" (good), the KV cache for the intermediate steps is recomputed:

    time=2025-09-22T22:51:46.064Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=630 prompt=631 used=180 remaining=451

    For a local model running on a mobile device, this wastes a significant amount of waiting time.
    **Proposed improvement**:
    Re-tokenize only the text segments that differ after an edit, and reuse the token IDs for the unchanged text (see the Go sketch after this list).
    However, although the `context` field returns the complete token ID sequence when `raw=false`, if the text is edited in the middle there is no way to know which token IDs correspond to which parts of the edited text.

  2. **Lack of token IDs in the response output**:
    Although the `response` field outputs each token's text, it does not include the corresponding token IDs, so there is no way to map a token's text back to its token ID.

    I've noticed that some models randomly mix English words into Chinese text, often generating tokens like " demo" (with a leading space). In most cases, detecting a token consisting of a space followed by English text and regenerating resolves the mixed-English problem.
    However, if the token text and its corresponding token ID were returned together, we could immediately set a `logit_bias` to suppress such space-plus-English tokens as they appear.
    Currently this relies on luck and repeated regeneration, which wastes time, since the same problematic tokens may have to be regenerated again and again.
    The `context` field is only returned after the entire generation completes, and it provides no mapping between text and token IDs, so it cannot satisfy this requirement.
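A minimal sketch of the proposed prefix reuse from item 1, in Go (Ollama itself is written in Go). The `Token` type and `tokenize` function are hypothetical stand-ins, not an existing Ollama API; the point is that matching is done on each token's decoded text, and the last matching token is dropped and re-tokenized so that an edit cannot silently merge with it (the caveat raised in the first comment below):

```go
package prefixreuse

import "strings"

// Token pairs a token ID with the exact text it decodes to. Exposing this
// pairing is what the issue asks for; the type is hypothetical here.
type Token struct {
	ID   int
	Text string
}

// tokenize stands in for a real tokenizer call (e.g. a server-side
// tokenize endpoint); it is a placeholder in this sketch.
func tokenize(text string) []Token {
	return nil // placeholder
}

// reuseTokens keeps the token IDs whose decoded text still matches the
// (edited) newText, then re-tokenizes only the remaining suffix. The last
// matching token is deliberately dropped, because text inserted right
// after it could merge with it into a different token.
func reuseTokens(oldTokens []Token, newText string) []Token {
	kept, offset := 0, 0
	for _, t := range oldTokens {
		if !strings.HasPrefix(newText[offset:], t.Text) {
			break
		}
		kept++
		offset += len(t.Text)
	}
	if kept > 0 { // back off one token to cover boundary-straddling merges
		kept--
		offset -= len(oldTokens[kept].Text)
	}
	return append(oldTokens[:kept:kept], tokenize(newText[offset:])...)
}
```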

GiteaMirror added the feature request label 2026-04-29 07:07:07 -05:00
Author
Owner

@ghost commented on GitHub (Sep 23, 2025):

**First issue**: It only requires finding the identical prefix text (excluding the last token of the matching prefix, since that token's text may extend beyond the shared prefix) and re-tokenizing only the differing suffix text.

**Second issue**: Besides returning token IDs along with the text, it would also be acceptable not to return them and instead provide a separate API that converts text to token IDs, or to allow setting token text directly as a parameter instead of only token IDs (sketched below).
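A client-side sketch in Go of what either option would enable: once a problematic token's text is known (e.g. " demo"), convert it to an ID and bias it out of the next request. The `tokenizeText` helper and the assumption that the backend honors a `logit_bias` map (as llama.cpp's server does) are both assumptions, not documented Ollama API:

```go
package logitbias

// tokenizeText is a hypothetical helper that converts text to token IDs,
// e.g. by calling a server-side tokenize endpoint.
func tokenizeText(text string) []int {
	return nil // placeholder
}

// buildLogitBias turns unwanted token texts (such as " demo", a
// leading-space English token appearing in Chinese output) into a
// logit-bias map for the next generation request.
func buildLogitBias(badTexts []string) map[int]float64 {
	bias := make(map[int]float64)
	for _, text := range badTexts {
		ids := tokenizeText(text)
		if len(ids) == 1 {
			// Only a text that round-trips to exactly one token can be
			// suppressed directly; multi-token texts need other handling.
			bias[ids[0]] = -100 // strongly negative: effectively never sampled
		}
	}
	return bias
}
```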

Author
Owner

@ghost commented on GitHub (Oct 21, 2025):

For the **second issue**, it seems the **API for converting token IDs already exists**; I just didn't find it when checking the API documentation, so I wasn't aware of it. As for the **first issue**, I've already **submitted a bug report to llama.cpp**.
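The comment does not name the API; if the backend in question is llama.cpp's `llama-server`, its `POST /tokenize` endpoint converts text to token IDs. A minimal Go sketch (the local address and port are assumptions):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// tokensOf asks a locally running llama-server (llama.cpp) to tokenize
// text via its POST /tokenize endpoint.
func tokensOf(baseURL, text string) ([]int, error) {
	body, err := json.Marshal(map[string]string{"content": text})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/tokenize", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out struct {
		Tokens []int `json:"tokens"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Tokens, nil
}

func main() {
	ids, err := tokensOf("http://localhost:8080", " demo")
	if err != nil {
		panic(err)
	}
	fmt.Println(ids) // a single ID if " demo" is one token in the model's vocabulary
}
```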
