[GH-ISSUE #23703] issue: Notes feature not compatible with llama.cpp, enable_thinking is always injected? #20050

Closed
opened 2026-04-20 02:37:54 -05:00 by GiteaMirror · 5 comments

Originally created by @TomTheWise on GitHub (Apr 14, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23703

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Git Clone

Open WebUI Version

8.12

Ollama Version (if applicable)

No response

Operating System

Debian 13; llama.cpp on latest build

Browser (if applicable)

Edge latest

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

  • An AI model that fully works in a normal chat conversation should also work with the Notes feature (the assistant chat within Notes).
  • OpenWebUI should not inject anything beyond what is manually set by the admin (I don't really understand whether it truly does?).
  • OpenWebUI should be fully compatible with llama.cpp - probably the solution most people use for local LLMs when they want something better than Ollama but don't have the resources to run vLLM.

Actual Behavior

When using llama.cpp directly as the provider, ANY model - even mature ones like gemma3 or gpt-oss:20b, without ANY additional settings - works flawlessly in normal chat conversations.
But as soon as you try it within the Notes feature, it stops working. The browser IMMEDIATELY gets a 400 Bad Request error, while OpenWebUI shows nothing in its logs. llama.cpp, however, shows in its logs that the API call was apparently made as an "Assistant response prefill" type request, and that the OpenWebUI request carried "enable_thinking".

Steps to Reproduce

  1. Connect OWUI to llama.cpp via the V1 API.
  2. Verify that your model works as expected in a normal chat conversation.
  3. Go to Notes > AI > Chat and start chatting with that same model - it instantly stops processing, and in your browser console you can see a 400 Bad Request error towards the OWUI API endpoint.
  4. Check your llama.cpp log - it will show this error message:
    srv operator(): got exception: {"error":{"code":400,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"invalid_request_error"}}
    srv log_server_r: done request: POST /v1/chat/completions IP 400

Try the same model and the same steps with Ollama - and it will work.
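
For reference, the llama.cpp side of this can be reproduced outside of OWUI. The sketch below is an assumption about the payload shape (the model name and the empty prefill content are placeholders): a chat completion whose final message has role "assistant" is treated by llama-server as a response prefill, and with a thinking-enabled chat template the server returns the same 400.

    import requests

    # Assumes llama-server is listening on its default port 8080.
    payload = {
        "model": "gpt-oss-20b",  # placeholder; any model whose template enables thinking
        "messages": [
            {"role": "user", "content": "Summarize my note."},
            {"role": "assistant", "content": ""},  # trailing assistant turn = prefill
        ],
    }
    r = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
    print(r.status_code, r.text)  # expect 400 with the enable_thinking error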

Logs & Screenshots

llama.cpp log:
srv operator(): got exception: {"error":{"code":400,"message":"Assistant response prefill is incompatible with enable_thinking.","type":"invalid_request_error"}}
srv log_server_r: done request: POST /v1/chat/completions IP 400

Browser log:
manifest.json:1 Manifest: Enctype should be set to either application/x-www-form-urlencoded or multipart/form-data. It currently defaults to application/x-www-form-urlencoded
index.js:1733 [tiptap warn]: Duplicate extension names found: ['codeBlock', 'bulletList', 'listItem', 'listKeymap', 'orderedList']. This can lead to issues.
hf @ index.js:1733
index.js:1733 [tiptap warn]: Duplicate extension names found: ['codeBlock', 'bulletList', 'listItem', 'listKeymap', 'orderedList']. This can lead to issues.
hf @ index.js:1733
fetcher.js:76 POST https://OWUI-FQDN/api/chat/completions 400 (Bad Request)
window.fetch @ fetcher.js:76
m @ index.ts:341
we @ Chat.svelte:182
await in we
ve @ Chat.svelte:307
await in ve
mt @ MessageInput.svelte:547
keydown @ MessageInput.svelte:933
(anonymous) @ index-client.js:178
keydown @ RichTextInput.svelte:1088
(anonymous) @ index.js:3122
someProp @ index.js:5594
hl @ index.js:3120
t.dom.addEventListener.t.input.eventHandlers.<computed> @ index.js:3089

Additional Information

If I understand correctly, OWUI always sets enable_thinking - I have no clue why - and this would also explain why you can't disable reasoning in the model's Advanced Params in OWUI?
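
For comparison, recent llama-server builds accept a chat_template_kwargs field in the request body to feed variables into the model's chat template (an assumption that your build supports it; whether OWUI forwards such a field is exactly what's in question here). A minimal sketch of a direct request that turns thinking off:

    import requests

    # chat_template_kwargs is a llama.cpp server extension; enable_thinking is
    # the template variable that thinking-capable models (e.g. Qwen3) read.
    payload = {
        "messages": [{"role": "user", "content": "Hello"}],
        "chat_template_kwargs": {"enable_thinking": False},  # suppress the thinking block
    }
    r = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
    print(r.status_code)
    print(r.json()["choices"][0]["message"]["content"])  # reply without a thinking block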

GiteaMirror added the bug label 2026-04-20 02:37:54 -05:00

@Classic298 commented on GitHub (Apr 14, 2026):

is this still reproducible in dev?


@TomTheWise commented on GitHub (Apr 14, 2026):

Sorry, I currently have no way to check dev - I'll only be able to in a few hours.

But if it's already fixed in dev, that's great! :)


@Classic298 commented on GitHub (Apr 14, 2026):

Analyzed your issue

OWUI does not inject enable_thinking. I searched the entire codebase - zero occurrences of enable_thinking, chat_template_kwargs, prefill, or continue_final_message. The enable_thinking flag comes from the model's chat template inside llama.cpp (Qwen3, gpt-oss, etc. default to thinking enabled), OR you toggled/set it as an advanced parameter for the model.

The error in your case is a misreading of the llama.cpp message: the phrase "Assistant response prefill is incompatible with enable_thinking" is llama.cpp's way of saying "you gave me a trailing assistant message AND the template wants a thinking block; those two states can't coexist."

This is what actually needs a fix
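
For illustration only, a minimal sketch of the kind of sanitizing such a fix could perform (a hypothetical helper, not the actual patch in the PR linked below): drop a trailing assistant turn before forwarding the request to an OpenAI-compatible backend.

    def strip_trailing_assistant(messages: list[dict]) -> list[dict]:
        # llama.cpp interprets a trailing assistant message as a response
        # prefill, which thinking-enabled templates reject with a 400;
        # dropping that turn sidesteps the conflict.
        if messages and messages[-1].get("role") == "assistant":
            return messages[:-1]
        return messages

    messages = [
        {"role": "user", "content": "Summarize my note."},
        {"role": "assistant", "content": ""},
    ]
    print(strip_trailing_assistant(messages))  # trailing assistant turn removed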


@Classic298 commented on GitHub (Apr 14, 2026):

https://github.com/open-webui/open-webui/pull/23715


@Classic298 commented on GitHub (Apr 14, 2026):

https://github.com/open-webui/open-webui/pull/23318/changes/fd93bd3414a1725219e14561bc5640b62f9fd4a1

Reference: github-starred/open-webui#20050