[GH-ISSUE #21815] issue: [Bug] Reasoning model <think> tags stored as <details> HTML in DB, breaking KV cache on every new conversation turn #58243
Originally created by @japneet644 on GitHub (Feb 24, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/21815
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.8.5
Ollama Version (if applicable)
N/A (Using llama.cpp backend via OpenAI API compatible endpoint)
Operating System
Ubuntu 24.04.4 LTS
Browser (if applicable)
Chrome
Confirmation
Expected Behavior
When a model uses a tool (like Web Search), Open WebUI should preserve the original tool-call JSON/syntax generated by the model in the backend database. The UI formatting (the collapsible `<details ...><summary>Thought for X seconds</summary>` dropdown) should only be applied dynamically on the frontend. When a new user message is sent, the chat history array sent to the backend API must match the raw tokens the model originally generated so that backends like llama.cpp or vLLM can properly reuse their KV cache.
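A minimal sketch of the property described above (the message content is purely illustrative, not actual model output):

```python
# Turn 1: the assistant message exactly as the model generated it, stored verbatim.
assistant_turn_1 = {
    "role": "assistant",
    "content": "<think>Need current info, so call web_search.</think>Here is a summary ...",
}

# Turn 2: the history replayed to the backend should contain that message unchanged,
# so the new prompt is a strict extension of the cached one and the KV cache is reusable.
history_turn_2 = [
    {"role": "user", "content": "What happened today?"},
    assistant_turn_1,  # byte-for-byte what the model produced on turn 1
    {"role": "user", "content": "Tell me more about the second item."},
]
```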
Actual Behavior
Open WebUI's tool-handling logic (middleware.py) generates HTML tags to display tool calls in the UI: content = f'{content}<details type="reasoning" done="..."><summary>Thought...
However, this HTML is injected directly into the actual content of the assistant's message, saved to the database, and then sent back to the LLM on subsequent turns.
Because the context history now contains Open WebUI's injected HTML instead of the raw tool-call tokens that the model actually generated, the prompt prefix instantly diverges. The backend drops the entire KV cache from the moment the first tool was used, forcing a complete reprocessing of the prompt.
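In the simplest case a backend reuses KV entries only for the common token prefix between the cached prompt and the new one, so the first rewritten character caps reuse at that point. A sketch of that check (a generic illustration, not llama.cpp's actual code):

```python
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """How many KV-cache entries survive: the length of the longest common prefix."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# If the stored history replaces the model's raw reasoning tokens with <details> HTML
# starting at token index 2512, everything from 2512 onward must be re-evaluated on
# every turn, matching the "memory_seq_rm [2512, end)" lines in the logs below.
```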
Steps to Reproduce
Result: The backend prompt cache similarity drops to near-zero (or matches perfectly only up to the exact token where the first tool was called). Thousands of tokens must be re-evaluated from scratch because the injected <details> HTML caused a cache mismatch.
```
--cache-type-v q8_0 --jinja -np 1 --swa-full
--cache-reuse 256 --slot-prompt-similarity 0.95
--spec-type ngram-mod -nkvo
```
Logs & Screenshots
```
=== GLM 4.7 Flash Cache Eviction Logs ===

TURN 1 (Web Search Execution Processing):
[i] update_slots: id 1 | task 6869 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 11993
[i] update_slots: id 1 | task 6869 | n_tokens = 2512, memory_seq_rm [2512, end)

TURN 2 (User's Follow-up prompt, ~6 minutes later at 23:33):
[i] update_slots: id 1 | task 7026 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 13108
[i] update_slots: id 1 | task 7026 | n_tokens = 2512, memory_seq_rm [2512, end)
```
ANALYSIS:
<details type="reasoning">tool call display tags into the message history instead of the tool call JSON.Database Evidence (Proof of Database-level Injection):
Querying the SQLite database (`/app/backend/data/webui.db`) for the exact test chat proves that the HTML is permanently written into the message content.
Script executed inside the container:
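A minimal sketch of such a query, assuming the default Open WebUI layout where each row of the `chat` table stores the full conversation JSON in a `chat` column (the table/column names and the `messages` key are assumptions for illustration, not taken from the report):

```python
import json
import sqlite3

# Open the same database file referenced above.
con = sqlite3.connect("/app/backend/data/webui.db")

for (chat_json,) in con.execute("SELECT chat FROM chat"):
    chat = json.loads(chat_json)
    for msg in chat.get("messages", []):
        content = msg.get("content", "")
        if '<details type="reasoning"' in content:
            # The injected HTML sits verbatim inside the stored message content.
            print(msg.get("role"), content[:200])

con.close()
```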
Actual Output returned from the DB:
This `details` structure was not generated by GLM 4.7. It was injected by `middleware.py` and then passed back to the model via the API payload.
Additional logs: browser logs and Open WebUI Docker logs are attached:
localhost-1771911321394.log
openwebui_docker_logs.txt
Additional Information
No response
@japneet644 commented on GitHub (Feb 24, 2026):
I dug into the source code and found the exact root cause of this cache eviction.
The issue actually happens in `middleware.py`, inside the `convert_output_to_messages()` function. When Open WebUI reconstructs the chat history from the database to send to the backend on Turn 2, this function silently strips out all `<think>...</think>` tags from the assistant's previous messages. Because the `<think>` block is suddenly missing from the history array, backend servers (like llama.cpp) instantly evict the prompt cache on Turn 2 at the exact token index where the reasoning block used to be.
The Fix:
Inside `process_messages_with_output()` (around line 1980 of `middleware.py`), `convert_output_to_messages` needs to be called with `raw=True` so it preserves the reasoning tags in the history array exactly as the model originally generated them. I applied this one-line change to my local container and cache hits immediately returned to 95%+ across multi-turn reasoning conversations.
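A sketch of the kind of call-site change described; only the `raw=True` flag comes from the report, and the surrounding variable and argument names are assumptions rather than the real code in `middleware.py`:

```python
# Inside process_messages_with_output() in middleware.py (sketch only; names
# other than raw=True are illustrative assumptions):
messages = convert_output_to_messages(
    response_output,   # hypothetical name for the stored assistant output
    raw=True,          # preserve <think>...</think> exactly as the model emitted it
)
```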
@tjbck commented on GitHub (Feb 24, 2026):
Addressed in dev!