[GH-ISSUE #1952] feat: context warning message #12695
Originally created by @notasquid1938 on GitHub (May 3, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/1952
I decided to test Gradient's 1-million-context Llama 3 model by adjusting the context parameter accordingly. However, as can be seen in this server log, I ran out of memory trying to store all that context:
The issue I had was that Open WebUI didn't show or explain this error. It just tries to generate the response forever before erroring out with a message about connection issues to Ollama. It would be very helpful if the UI could pop up a message indicating that the context length caused an out-of-memory issue, preferably with the amount of memory it was trying to allocate, to make it easy for users to tune how much context their system can handle.
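
A minimal sketch of what the requested behaviour could look like, assuming the backend can see Ollama's raw error text; the substring patterns, the message wording, and the friendly_ollama_error function name are illustrative, not actual Open WebUI code:

```python
import re

# Substring patterns that commonly appear in out-of-memory errors from
# Ollama / llama.cpp; the exact set is an assumption, not taken from Open WebUI.
OOM_PATTERNS = [
    r"out of memory",
    r"failed to allocate",
    r"insufficient memory",
]

def friendly_ollama_error(raw_error: str) -> str:
    """Map a raw Ollama error string to a user-facing message."""
    for pattern in OOM_PATTERNS:
        if re.search(pattern, raw_error, re.IGNORECASE):
            # Pull the attempted allocation size out of the message when
            # present, e.g. "failed to allocate 495.0 GiB".
            size = re.search(r"\d+(?:\.\d+)?\s*(?:GiB|MiB)", raw_error)
            hint = f" (tried to allocate {size.group(0)})" if size else ""
            return (
                "The model ran out of memory while reserving the context"
                f"{hint}. Try lowering the Context Length (num_ctx) setting."
            )
    return f"Ollama error: {raw_error}"

# Example: the kind of line that ends up in the server log.
print(friendly_ollama_error(
    "llama runner: failed to allocate 495.0 GiB for KV cache: out of memory"
))
```
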
@edwardochoaphd commented on GitHub (May 9, 2024):
Could there be a bug with Settings > Advanced Parameters > Context Length?
I'm also getting errors, possibly from running out of GPU memory. In both of these examples, while testing my own LLM pipeline, an extra zero appears to have been added by OpenWebUI...?
Could an extra zero be added to the user's setting when it is sent from OpenWebUI? (i.e., not user error, but possibly a bug in OpenWebUI?)
Note: the user above wants to test with 1M context, but the extra zero changes their setting to 10M?...
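
One way to narrow down where the extra zero comes from is to bypass the UI and send num_ctx straight to Ollama's API, then compare the context size reported in the Ollama server log against what appears when the same value is set through the UI. The /api/generate endpoint and the options.num_ctx field are part of Ollama's public API; the model name is a placeholder:

```python
import requests

# Ollama's generate endpoint accepts per-request options, including num_ctx.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3",                   # placeholder; use the model under test
    "prompt": "Hello",
    "stream": False,
    "options": {"num_ctx": 1_000_000},   # the value you actually intend to set
}

resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
resp.raise_for_status()
print(resp.json().get("response", ""))

# If the Ollama server log now shows n_ctx=1000000 rather than 10000000,
# the inflation would have to be happening upstream, in the UI's request.
```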

@justinh-rahb commented on GitHub (May 9, 2024):
I am in favour of having an environment variable to disable the ability for users to change the num_ctx parameter from default.
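
A minimal sketch of how such a lockdown could work, assuming a hypothetical ALLOW_USER_NUM_CTX environment variable; no such setting exists in this thread, so the name and behaviour are illustrative only:

```python
import os

# Hypothetical switch: nothing with this name exists in Open WebUI; it only
# illustrates the proposal above.
ALLOW_USER_NUM_CTX = os.environ.get("ALLOW_USER_NUM_CTX", "true").lower() == "true"

def sanitize_options(user_options: dict) -> dict:
    """Drop a user-supplied num_ctx so the model's default is used instead."""
    if not ALLOW_USER_NUM_CTX:
        return {k: v for k, v in user_options.items() if k != "num_ctx"}
    return user_options

# With ALLOW_USER_NUM_CTX=false, the oversized num_ctx never reaches Ollama.
print(sanitize_options({"num_ctx": 10_000_000, "temperature": 0.7}))
```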

@Cdddo commented on GitHub (Sep 30, 2024):
The UI should definitely show how much context is available. Maybe as a percentage, if not in tokens.
@justinh-rahb commented on GitHub (Sep 30, 2024):
Not practical to just add, or it would have been done already. Every model uses a different tokenizer, and it's not always possible to reliably determine the tokenizer from the model name alone.
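
For the Ollama backend specifically, the non-streamed response reports prompt_eval_count and eval_count as token counts, so a rough usage percentage could be computed without a client-side tokenizer; whether that generalizes to other backends is exactly the problem raised above. A sketch, with placeholder model name and display format:

```python
import requests

NUM_CTX = 8192  # the context window requested for this conversation

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",  # placeholder
        "prompt": "Summarize the plot of Hamlet.",
        "stream": False,
        "options": {"num_ctx": NUM_CTX},
    },
    timeout=600,
).json()

# prompt_eval_count / eval_count are token counts reported by Ollama itself,
# so no tokenizer lookup is needed for this backend.
used = resp.get("prompt_eval_count", 0) + resp.get("eval_count", 0)
print(f"Context used: {used}/{NUM_CTX} tokens ({100 * used / NUM_CTX:.1f}%)")
```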