[GH-ISSUE #12292] feat: Sliding window context to handle long contexts #32066

Closed
opened 2026-04-25 05:57:21 -05:00 by GiteaMirror · 4 comments

Originally created by @AlbertoSinigaglia on GitHub (Apr 1, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/12292

Check Existing Issues

  • I have searched the existing issues and discussions.

Problem Description

I've discussed in this issue https://github.com/ollama/ollama/issues/9890 how long contexts completely break the usability of models, due to the need to preload the whole context up front.

Desired Solution you'd like

One of the founders/maintainers of Ollama intelligently suggested a sliding window approach on the client side: https://github.com/ollama/ollama/issues/9890#issuecomment-2740319483.

I think it would be extremely interesting for Open WebUI to support this model-wide: preload a minimal context length and then increase it once the number of tokens in the chat gets close to that limit.
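
Roughly, the idea could look like the sketch below (not an existing Open WebUI feature; the power-of-two steps, the 1.5x headroom, and the ~4-characters-per-token estimate are all illustrative assumptions):

```python
# Sketch of the growth policy: estimate how many tokens the chat already
# uses and request the smallest context bucket that still leaves headroom,
# growing in power-of-two steps so the KV cache is only reallocated
# occasionally. All numbers here are illustrative assumptions.

def estimate_tokens(messages: list[dict]) -> int:
    """Very rough estimate: ~4 characters per token, text content only."""
    return sum(
        len(m["content"]) // 4
        for m in messages
        if isinstance(m.get("content"), str)
    )

def pick_num_ctx(messages: list[dict],
                 min_ctx: int = 2048,
                 max_ctx: int = 131072,
                 headroom: float = 1.5) -> int:
    """Return the smallest power-of-two context size that fits the chat."""
    needed = int(estimate_tokens(messages) * headroom)
    ctx = min_ctx
    while ctx < needed and ctx < max_ctx:
        ctx *= 2
    return min(ctx, max_ctx)
```

With `min_ctx=2048`, a chat of roughly 5,000 estimated tokens would request a 8,192-token context rather than the model's full 128k.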

Alternatives Considered

No response

Additional Context

With the introduction of the "Bypass Embedding and Retrieval" option, it is almost essential to support a 128k context length, let alone the 1M-token context served by Google with Gemini.


@Classic298 commented on GitHub (Apr 1, 2025):

This should be implemented with filters.

There are already some example filters available that you can use:

https://openwebui.com/f/hub/context_clip_filter

https://openwebui.com/f/houxin/token_clip_filter

Here is the search query: https://openwebui.com/functions?query=context
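
By way of illustration, a minimal sketch of what such a clip filter typically does (this is not the code of the linked filters; the field names and the ~4-characters-per-token estimate are assumptions):

```python
# Sketch of a history-clipping filter: drop the oldest messages until the
# chat fits a fixed token budget. Field names and the rough token estimate
# are assumptions, not the code of the filters linked above.
from pydantic import BaseModel


class Filter:
    class Valves(BaseModel):
        max_tokens: int = 8192  # rough budget the clipped history must fit

    def __init__(self):
        self.valves = self.Valves()

    def inlet(self, body: dict, __user__: dict | None = None) -> dict:
        messages = body.get("messages", [])

        def rough_tokens(msg: dict) -> int:
            content = msg.get("content")
            return len(content) // 4 if isinstance(content, str) else 0

        # Keep the newest messages, dropping the oldest until the budget fits.
        kept, total = [], 0
        for msg in reversed(messages):
            total += rough_tokens(msg)
            if total > self.valves.max_tokens and kept:
                break
            kept.append(msg)
        body["messages"] = list(reversed(kept))
        return body
```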


@AlbertoSinigaglia commented on GitHub (Apr 1, 2025):

> This should be implemented with filters.
>
> There are already some example filters available that you can use:
>
> https://openwebui.com/f/hub/context_clip_filter
>
> https://openwebui.com/f/houxin/token_clip_filter
>
> Here is the search query: https://openwebui.com/functions?query=context

These clip the history to fit into the context window. I'm looking for the opposite: gradually increasing the model's context size to fit the chat, instead of allocating a 128k-token KV cache in Ollama for a "hi there my dear LLM".
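
To make the contrast concrete, here is a minimal sketch of how that could look as a filter, reusing the growth policy sketched in the issue description. It assumes the inlet body carries the chat in `body["messages"]` and that an Ollama-style `options.num_ctx` set there is forwarded to the backend; both are assumptions to verify against your Open WebUI and Ollama versions, not confirmed behaviour.

```python
# Minimal sketch only: the "options"/"num_ctx" wiring and the inlet body
# layout are assumptions, not confirmed Open WebUI behaviour.
from pydantic import BaseModel


class Filter:
    class Valves(BaseModel):
        min_ctx: int = 2048      # context to preload for short chats
        max_ctx: int = 131072    # upper bound, e.g. the model's trained limit
        headroom: float = 1.5    # grow before the chat actually reaches the limit

    def __init__(self):
        self.valves = self.Valves()

    def inlet(self, body: dict, __user__: dict | None = None) -> dict:
        messages = body.get("messages", [])
        # Crude token estimate: ~4 characters per token, text content only.
        needed = sum(
            len(m["content"]) // 4
            for m in messages
            if isinstance(m.get("content"), str)
        )
        needed = int(needed * self.valves.headroom)

        # Grow in power-of-two steps so Ollama only reloads the model occasionally.
        ctx = self.valves.min_ctx
        while ctx < needed and ctx < self.valves.max_ctx:
            ctx *= 2

        # Assumed field: an Ollama-style option forwarded with the request.
        body.setdefault("options", {})["num_ctx"] = min(ctx, self.valves.max_ctx)
        return body
```

Growing in coarse steps rather than resizing on every message keeps the KV cache reallocation (and model reload) infrequent.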


@Classic298 commented on GitHub (Apr 1, 2025):

Ah, my bad. I misunderstood the request then!


@AlbertoSinigaglia commented on GitHub (Apr 1, 2025):

> Ah, my bad. I misunderstood the request then!

No problem! Still nice functions to have.

Reference: github-starred/open-webui#32066