[GH-ISSUE #23618] feat: Warn users if context is too large for Model #35559

Closed
opened 2026-04-25 09:45:16 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @TomTheWise on GitHub (Apr 12, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23618

Check Existing Issues

  • I have searched for all existing open AND closed issues and discussions for similar requests. I have found none that is comparable to my request.

Verify Feature Scope

  • I have read through and understood the scope definition for feature requests in the Issues section. I believe my feature request meets the definition and belongs in the Issues section instead of the Discussions.

Problem Description

  • Administrators usually know what context limits they have configured in ollama / llama.cpp / ik_llama.cpp and so on. You can additionally set a max token limit per model - but if it is larger than what llama.cpp (or similar) is configured for, it will not matter.
  • Since token counts are not comparable between different models, Open WebUI cannot exactly predict how many tokens the current chat will amount to for the specific LLM.

But if the chat runs into the model's cap - for example, your query results in 60K tokens but you have only configured llama.cpp / ollama for 40K - the statistics under the message clearly show that the tokens are capped.
The issue: Currently, everyday users won't know that their token count was capped and that they reached the limit. They might wonder why so much was ignored / forgotten, but they have no way to tell.

For example, here with llama.cpp (that model currently has a limit of 12K configured in llama.cpp) the chat is still within the limit, which is why the completion was successful, but it is very close. The next answer will be cut off and users won't know why - no matter how often they try.
[Image: https://github.com/user-attachments/assets/ac114f63-de76-4b2f-a575-7e4eb733d0ac]

After reaching 12K, llama.cpp cuts off and ends the chat completion:
[Image: https://github.com/user-attachments/assets/7718da98-2a4b-43f2-b5e4-59a18de3857f]
[Image: https://github.com/user-attachments/assets/0a901ade-4fa8-4115-a0ee-620e008cfad2]
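
For illustration only (not from the screenshots above): assuming Ollama-style response statistics, the cap becomes visible by comparing the reported prompt token count against the configured context size. A minimal sketch; the field names are assumptions and llama.cpp's OpenAI-compatible endpoint reports similar numbers under "usage" instead:

```python
# Illustration only: field names assume Ollama's chat response statistics
# (prompt_eval_count / eval_count).
configured_ctx = 12288           # example: the 12K limit configured in llama.cpp

stats = {
    "prompt_eval_count": 12288,  # prompt filled the whole context window
    "eval_count": 512,
}

if stats["prompt_eval_count"] >= configured_ctx:
    print("Warning: the prompt filled the entire context window; "
          "earlier messages were likely truncated.")
```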

Desired Solution you'd like

AFTER the LLM has answered, and if the API in use supports it, the backend sends statistics.
Ollama reports statistics; llama.cpp and ik_llama.cpp report them too (in a different format than Ollama), and I assume other API providers with their "flavors" of the APIs do so as well.

My desired solution would be to introduce two additional, optional parameters PER MODEL: a statistics max ctx warning key, and a statistics max ctx warning value at which Open WebUI should warn users that the quality of the current chat will deteriorate because the context limit has been reached. Ideally as a warning banner under the answer - NOT just a temporary notification in the top right corner!

Such warnings are known from cloud AI services like the Gemini web page / app.
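
To make the proposal concrete, a minimal sketch of how the two per-model parameters could be evaluated after a completion. This is an illustration, not an existing Open WebUI feature; the statistics dict and the field name used in the example are assumptions:

```python
from typing import Optional

def check_ctx_warning(stats: dict, warning_key: str, warning_value: int) -> Optional[str]:
    """Compare the backend-reported token count against the per-model threshold.

    warning_key / warning_value correspond to the two proposed optional
    per-model parameters; stats is whatever statistics the backend returned.
    """
    reported = stats.get(warning_key)
    if reported is not None and reported >= warning_value:
        return (f"Context limit reached ({reported} >= {warning_value} tokens): "
                "older parts of the chat may no longer be processed.")
    return None

# Example with assumed Ollama-style statistics:
banner = check_ctx_warning({"prompt_eval_count": 11900}, "prompt_eval_count", 11800)
if banner:
    print(banner)  # would be shown as a persistent banner under the answer
```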

Alternatives Considered

  1. Simply base the warning on the model's "max_tokens" parameter and don't introduce new options. However, to my understanding this parameter is already active - if Open WebUI detects that the context is too long, it cuts things off before sending to llama.cpp - and this is not very accurate, since it can't predict the exact token count the LLM will see. Still, this would be an acceptable solution that doesn't bloat the settings with yet more options (see the sketch after this list). Either way it is very important that users are warned that they can no longer expect the full chat context to be processed!
  2. Add only a ctx size warning value, without specifying which statistics key should be looked up / parsed. This would be bad: Ollama and llama.cpp alone already name their max-token fields differently in the statistics, so OWUI devs would have to constantly add new parsers for those statistics.
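
A rough sketch of alternative 1, assuming a client-side token estimate checked against the configured max_tokens. tiktoken is only a stand-in tokenizer here, so the count will not match the actual model - which is exactly the inaccuracy described above:

```python
# Warn based on a rough estimate versus the model's max_tokens instead of
# parsing backend statistics.
import tiktoken

def estimate_prompt_tokens(messages: list) -> int:
    enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(m["content"])) for m in messages)

def should_warn(messages: list, max_tokens: int, margin: float = 0.9) -> bool:
    # Warn once the rough estimate crosses e.g. 90% of the configured limit.
    return estimate_prompt_tokens(messages) >= max_tokens * margin

print(should_warn([{"role": "user", "content": "hello " * 20000}], 12288))
```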

Additional Context

No response

Author
Owner

@Classic298 commented on GitHub (Apr 12, 2026):

Can be done with a filter

Duplicate
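
A hedged sketch of what such a filter could look like, assuming Open WebUI's filter function interface with an outlet hook and Ollama-style statistics fields; the exact signature and where the backend statistics appear in `body` can differ between versions and backends, so treat this as a starting point rather than a drop-in solution:

```python
from typing import Optional
from pydantic import BaseModel

class Filter:
    class Valves(BaseModel):
        stats_key: str = "prompt_eval_count"  # assumption: Ollama-style field
        warn_at: int = 11800                  # e.g. just below a 12K limit

    def __init__(self):
        self.valves = self.Valves()

    def outlet(self, body: dict, __user__: Optional[dict] = None) -> dict:
        messages = body.get("messages") or []
        if not messages:
            return body
        # Assumption: usage/statistics are attached to the last message.
        stats = messages[-1].get("usage") or {}
        reported = stats.get(self.valves.stats_key)
        if reported is not None and reported >= self.valves.warn_at:
            # Append a visible notice to the response text.
            messages[-1]["content"] = (messages[-1].get("content") or "") + (
                "\n\n> ⚠️ Context limit reached: older messages may have been dropped."
            )
        return body
```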


Reference: github-starred/open-webui#35559