[GH-ISSUE #21815] issue: [Bug] Reasoning model <think> tags stored as <details> HTML in DB, breaking KV cache on every new conversation turn #58243
Originally created by @japneet644 on GitHub (Feb 24, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/21815
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.8.5
Ollama Version (if applicable)
N/A (Using llama.cpp backend via OpenAI API compatible endpoint)
Operating System
Ubuntu 24.04.4 LTS
Browser (if applicable)
Chrome
Confirmation
Expected Behavior
When a model uses a tool (like Web Search), Open WebUI should preserve the original tool-call JSON/syntax generated by the model in the backend database. The UI formatting (the collapsible `<details ...><summary>Thought for X seconds</summary>` dropdown) should only be applied dynamically on the frontend. When a new user message is sent, the chat history array sent to the backend API must match the raw tokens the model originally generated so that backends like llama.cpp or vLLM can properly reuse their KV cache.
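A minimal sketch of the property described above (the message content is purely illustrative, not actual model output):

```python
# Turn 1: the assistant message exactly as the model generated it, stored verbatim.
assistant_turn_1 = {
    "role": "assistant",
    "content": "<think>Need current info, so call web_search.</think>Here is a summary ...",
}

# Turn 2: the history replayed to the backend should contain that message unchanged,
# so the new prompt is a strict extension of the cached one and the KV cache is reusable.
history_turn_2 = [
    {"role": "user", "content": "What happened today?"},
    assistant_turn_1,  # byte-for-byte what the model produced on turn 1
    {"role": "user", "content": "Tell me more about the second item."},
]
```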
Actual Behavior
Open WebUI's tool-handling logic (middleware.py) generates HTML tags to display tool calls in the UI: content = f'{content}<details type="reasoning" done="..."><summary>Thought...
However, this HTML is injected directly into the actual content of the assistant's message, saved to the database, and then sent back to the LLM on subsequent turns.
Because the context history now contains Open WebUI's injected HTML instead of the raw tool-call tokens that the model actually generated, the prompt prefix instantly diverges. The backend drops the entire KV cache from the moment the first tool was used, forcing a complete reprocessing of the prompt.
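In the simplest case a backend reuses KV entries only for the common token prefix between the cached prompt and the new one, so the first rewritten character caps reuse at that point. A sketch of that check (a generic illustration, not llama.cpp's actual code):

```python
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """How many KV-cache entries survive: the length of the longest common prefix."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# If the stored history replaces the model's raw reasoning tokens with <details> HTML
# starting at token index 2512, everything from 2512 onward must be re-evaluated on
# every turn, matching the "memory_seq_rm [2512, end)" lines in the logs below.
```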
Steps to Reproduce
Result: The backend prompt cache similarity drops to near-zero (or matches perfectly only up to the exact token where the first tool was called). Thousands of tokens must be re-evaluated from scratch because the injected <details> HTML caused a cache mismatch.
```
--cache-type-v q8_0 --jinja -np 1 --swa-full
--cache-reuse 256 --slot-prompt-similarity 0.95
--spec-type ngram-mod -nkvo
```
Logs & Screenshots
```
=== GLM 4.7 Flash Cache Eviction Logs ===

TURN 1 (Web Search Execution Processing):
[i] update_slots: id 1 | task 6869 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 11993
[i] update_slots: id 1 | task 6869 | n_tokens = 2512, memory_seq_rm [2512, end)

TURN 2 (User's Follow-up prompt, ~6 minutes later at 23:33):
[i] update_slots: id 1 | task 7026 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13108
[i] update_slots: id 1 | task 7026 | n_tokens = 2512, memory_seq_rm [2512, end)
```
ANALYSIS:
<details type="reasoning">tool call display tags into the message history instead of the tool call JSON.Database Evidence (Proof of Database-level Injection):
Querying the SQLite database (`/app/backend/data/webui.db`) for the exact test chat proves that the HTML is permanently written into the message content.
Script executed inside the container:
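A minimal sketch of such a query, assuming the default Open WebUI layout where each row of the `chat` table stores the full conversation JSON in a `chat` column (the table/column names and the `messages` key are assumptions for illustration, not taken from the report):

```python
import json
import sqlite3

# Open the same database file referenced above.
con = sqlite3.connect("/app/backend/data/webui.db")

for (chat_json,) in con.execute("SELECT chat FROM chat"):
    chat = json.loads(chat_json)
    for msg in chat.get("messages", []):
        content = msg.get("content", "")
        if '<details type="reasoning"' in content:
            # The injected HTML sits verbatim inside the stored message content.
            print(msg.get("role"), content[:200])

con.close()
```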
Actual Output returned from the DB:
This `details` structure was not generated by GLM 4.7. It was injected by `middleware.py` and then passed back to the model via the API payload.
Additional logs: browser logs and Open WebUI Docker logs are attached:
localhost-1771911321394.log
openwebui_docker_logs.txt
Additional Information
No response
@japneet644 commented on GitHub (Feb 24, 2026):
I dug into the source code and found the exact root cause of this cache eviction.
The issue actually happens in `middleware.py`, inside the `convert_output_to_messages()` function. When Open WebUI reconstructs the chat history from the database to send to the backend on Turn 2, this function silently strips out all `<think>...</think>` tags from the assistant's previous messages. Because the `<think>` block is suddenly missing from the history array, backend servers (like llama.cpp) instantly evict the prompt cache on Turn 2 at the exact token index where the reasoning block used to be.
The Fix:
Inside `process_messages_with_output()` (around line 1980 of `middleware.py`), `convert_output_to_messages` needs to be called with `raw=True` so it preserves the reasoning tags in the history array exactly as the model originally generated them. I applied this one-line change to my local container and cache hits immediately returned to 95%+ across multi-turn reasoning conversations.
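A sketch of the kind of call-site change described; only the `raw=True` flag comes from the report, and the surrounding variable and argument names are assumptions rather than the real code in `middleware.py`:

```python
# Inside process_messages_with_output() in middleware.py (sketch only; names
# other than raw=True are illustrative assumptions):
messages = convert_output_to_messages(
    response_output,   # hypothetical name for the stored assistant output
    raw=True,          # preserve <think>...</think> exactly as the model emitted it
)
```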
@tjbck commented on GitHub (Feb 24, 2026):
Addressed in dev!