[GH-ISSUE #12292] feat: Sliding window context to handle long contexts #32066

Closed
opened 2026-04-25 05:57:21 -05:00 by GiteaMirror · 4 comments

Originally created by @AlbertoSinigaglia on GitHub (Apr 1, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/12292

Check Existing Issues

  • I have searched the existing issues and discussions.

Problem Description

I've discussed in this issue https://github.com/ollama/ollama/issues/9890 how long contexts completely break the usability of models, due to the need to preload the whole context up front.

Desired Solution you'd like

One of the founders/maintainers of Ollama intelligently suggested a sliding window approach on the client side: https://github.com/ollama/ollama/issues/9890#issuecomment-2740319483.

I think it would be extremely interesting for Open WebUI to support this model-wide: preload a minimal context length and then increase it once the number of tokens in the chat gets close to that limit.
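
Roughly, the idea could look like the sketch below (not an existing Open WebUI feature; the power-of-two steps, the 1.5x headroom, and the ~4-characters-per-token estimate are all illustrative assumptions):

```python
# Sketch of the growth policy: estimate how many tokens the chat already
# uses and request the smallest context bucket that still leaves headroom,
# growing in power-of-two steps so the KV cache is only reallocated
# occasionally. All numbers here are illustrative assumptions.

def estimate_tokens(messages: list[dict]) -> int:
    """Very rough estimate: ~4 characters per token, text content only."""
    return sum(
        len(m["content"]) // 4
        for m in messages
        if isinstance(m.get("content"), str)
    )

def pick_num_ctx(messages: list[dict],
                 min_ctx: int = 2048,
                 max_ctx: int = 131072,
                 headroom: float = 1.5) -> int:
    """Return the smallest power-of-two context size that fits the chat."""
    needed = int(estimate_tokens(messages) * headroom)
    ctx = min_ctx
    while ctx < needed and ctx < max_ctx:
        ctx *= 2
    return min(ctx, max_ctx)
```

With `min_ctx=2048`, a chat of roughly 5,000 estimated tokens would request a 8,192-token context rather than the model's full 128k.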

Alternatives Considered

No response

Additional Context

With the introduction of the "Bypass Embedding and Retrieval" option, it is almost essential to support a 128k context length, let alone the 1M-token context served by Google with Gemini.


@Classic298 commented on GitHub (Apr 1, 2025):

This should be implemented with filters.

There are already some example filters available that you can use:

https://openwebui.com/f/hub/context_clip_filter

https://openwebui.com/f/houxin/token_clip_filter

Here is the search query: https://openwebui.com/functions?query=context
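
By way of illustration, a minimal sketch of what such a clip filter typically does (this is not the code of the linked filters; the field names and the ~4-characters-per-token estimate are assumptions):

```python
# Sketch of a history-clipping filter: drop the oldest messages until the
# chat fits a fixed token budget. Field names and the rough token estimate
# are assumptions, not the code of the filters linked above.
from pydantic import BaseModel


class Filter:
    class Valves(BaseModel):
        max_tokens: int = 8192  # rough budget the clipped history must fit

    def __init__(self):
        self.valves = self.Valves()

    def inlet(self, body: dict, __user__: dict | None = None) -> dict:
        messages = body.get("messages", [])

        def rough_tokens(msg: dict) -> int:
            content = msg.get("content")
            return len(content) // 4 if isinstance(content, str) else 0

        # Keep the newest messages, dropping the oldest until the budget fits.
        kept, total = [], 0
        for msg in reversed(messages):
            total += rough_tokens(msg)
            if total > self.valves.max_tokens and kept:
                break
            kept.append(msg)
        body["messages"] = list(reversed(kept))
        return body
```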


@AlbertoSinigaglia commented on GitHub (Apr 1, 2025):

> This should be implemented with filters.
>
> There are already some example filters available that you can use:
>
> https://openwebui.com/f/hub/context_clip_filter
>
> https://openwebui.com/f/houxin/token_clip_filter
>
> Here is the search query: https://openwebui.com/functions?query=context

These clip the history to fit into the context window. I'm looking for the opposite: gradually increasing the model's context size to fit the chat, instead of allocating a 128k-token KV cache in Ollama for a "hi there my dear LLM".
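
To make the contrast concrete, here is a minimal sketch of how that could look as a filter, reusing the growth policy sketched in the issue description. It assumes the inlet body carries the chat in `body["messages"]` and that an Ollama-style `options.num_ctx` set there is forwarded to the backend; both are assumptions to verify against your Open WebUI and Ollama versions, not confirmed behaviour.

```python
# Minimal sketch only: the "options"/"num_ctx" wiring and the inlet body
# layout are assumptions, not confirmed Open WebUI behaviour.
from pydantic import BaseModel


class Filter:
    class Valves(BaseModel):
        min_ctx: int = 2048      # context to preload for short chats
        max_ctx: int = 131072    # upper bound, e.g. the model's trained limit
        headroom: float = 1.5    # grow before the chat actually reaches the limit

    def __init__(self):
        self.valves = self.Valves()

    def inlet(self, body: dict, __user__: dict | None = None) -> dict:
        messages = body.get("messages", [])
        # Crude token estimate: ~4 characters per token, text content only.
        needed = sum(
            len(m["content"]) // 4
            for m in messages
            if isinstance(m.get("content"), str)
        )
        needed = int(needed * self.valves.headroom)

        # Grow in power-of-two steps so Ollama only reloads the model occasionally.
        ctx = self.valves.min_ctx
        while ctx < needed and ctx < self.valves.max_ctx:
            ctx *= 2

        # Assumed field: an Ollama-style option forwarded with the request.
        body.setdefault("options", {})["num_ctx"] = min(ctx, self.valves.max_ctx)
        return body
```

Growing in coarse steps rather than resizing on every message keeps the KV cache reallocation (and model reload) infrequent.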


@Classic298 commented on GitHub (Apr 1, 2025):

Ah, my bad. I misunderstood the request then!


@AlbertoSinigaglia commented on GitHub (Apr 1, 2025):

> Ah, my bad. I misunderstood the request then!

No problem! Still nice functions to have.

Reference: github-starred/open-webui#32066