[GH-ISSUE #1007] Context Shifting To Increase Speed Dramatically #490

Closed
opened 2026-04-12 10:10:12 -05:00 by GiteaMirror · 4 comments

Originally created by @chigkim on GitHub (Nov 5, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1007

koboldcpp v1.48.1 has now implemented this feature, and apparently it increases speed dramatically because it doesn't have to reprocess the previous context to generate a new response.

Could you look into implementing this feature in Ollama as well?

https://github.com/LostRuins/koboldcpp/releases/tag/v1.48.1

"Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext."

Thanks!

GiteaMirror added the feature request label 2026-04-12 10:10:12 -05:00

@jmorganca commented on GitHub (May 6, 2024):

Hi there, thank you so much for the issue. At this point we've seen quite a few quality concerns with context shifting, so we'll be focused on helping users understand context utilization (e.g. the API will return a field when the context limit is hit) rather than shifting the context.

We may re-introduce context shifting later on once we can do so between bos/eos tokens, but currently context shifting happens at arbitrary points in the context, which causes infinite and/or bad generations.
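
To illustrate the bos/eos point: one hypothetical way to shift only at message boundaries is to extend the discard window until it ends on an EOS token before shifting. This is purely an illustration of the idea, not Ollama's or llama.cpp's actual implementation, and it assumes a 2024-era llama.cpp API:

```cpp
#include "llama.h"
#include <vector>

// Illustration only: pick a discard count that ends on an EOS token, so the
// shifted context resumes at a message boundary rather than an arbitrary point.
int aligned_discard(const std::vector<llama_token> & cache_tokens,
                    const llama_model * model, int n_keep, int n_discard_min) {
    const llama_token eos = llama_token_eos(model);
    for (size_t i = n_keep + n_discard_min; i < cache_tokens.size(); i++) {
        if (cache_tokens[i] == eos) {
            return (int) i + 1 - n_keep;  // discard up to and including the EOS
        }
    }
    return n_discard_min;                 // no boundary found; fall back
}
```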


@chigkim commented on GitHub (May 7, 2024):

Could you expose the tokenize and detokenize features of the llama.cpp API? Then we could know the token count before feeding text for generation. It seems there is a lot of interest in this.
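
For context, counting tokens directly against llama.cpp might look roughly like this; the `llama_tokenize` signature below is the 2024-era one and has changed between releases, so treat it as a sketch:

```cpp
#include "llama.h"
#include <string>
#include <vector>

// Sketch: count how many tokens a prompt would occupy before sending it for
// generation. Assumes a 2024-era llama_tokenize signature.
int count_tokens(const llama_model * model, const std::string & text) {
    std::vector<llama_token> buf(text.size() + 8);  // ~1 token per byte is a safe upper bound
    int n = llama_tokenize(model, text.c_str(), (int) text.size(),
                           buf.data(), (int) buf.size(),
                           /*add_special=*/true, /*parse_special=*/true);
    if (n < 0) {   // buffer too small: -n is the required token count
        n = -n;
    }
    return n;
}
```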

@chigkim commented on GitHub (May 7, 2024):

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
https://github.com/ollama/ollama/issues/4186
https://github.com/ollama/ollama/issues/1716
https://github.com/ollama/ollama/issues/3582
https://github.com/ollama/ollama/issues/1345

@joewinke commented on GitHub (May 22, 2024):

> Could you expose the tokenize and detokenize features of the llama.cpp API? Then we could know the token count before feeding text for generation. It seems there is a lot of interest in this.

This would allow for UX that shows % of context window used, which would be helpful in many contexts.
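
As a trivial illustration (the function name here is made up for the example), the displayed value would just be the prompt's token count over the configured context length:

```cpp
// Illustration only: percentage of the context window a prompt would use,
// given a token count (e.g. from count_tokens above) and the configured n_ctx.
double context_used_pct(int n_prompt_tokens, int n_ctx) {
    return n_ctx > 0 ? 100.0 * n_prompt_tokens / n_ctx : 0.0;
}
```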
