[GH-ISSUE #23339] feat: Add model-level toggle to disable reinjecting reasoning/thinking into prompts (prevents <think> tag imitation and rendering/parsing failures) #58620

Open
opened 2026-05-05 23:33:45 -05:00 by GiteaMirror · 28 comments

Originally created by @ShirasawaSama on GitHub (Apr 2, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/23339

Check Existing Issues

  • I have searched for all existing open AND closed issues and discussions for similar requests. I have found none that is comparable to my request.

Verify Feature Scope

  • I have read through and understood the scope definition for feature requests in the Issues section. I believe my feature request meets the definition and belongs in the Issues section instead of the Discussions.

Problem Description

Open WebUI reconstructs chat history from stored output items and may reinject reasoning/thinking (e.g. <think>...</think>) back into the next-turn messages (via convert_output_to_messages(..., raw=True)).

For some providers/models, once these tags appear in history, the model starts imitating/amplifying them (multiple/nested/variant thinking tags), which can eventually break frontend reasoning parsing/rendering (thinking blocks leak into normal text, duplicated blocks, malformed markdown/artifacts, etc.).
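
To make the failure mode concrete, here is a hedged illustration (invented values, not actual Open WebUI output) of what the model ends up seeing once reasoning is reinjected:

```python
# Hedged illustration only: a reconstructed next-turn history after reasoning
# has been reinjected as <think> text (all values are invented).
polluted_history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {
        "role": "assistant",
        "content": (
            "<think>\nThe user is asking a simple arithmetic question...\n</think>\n"
            "2 + 2 = 4."
        ),
    },
    # On the next turn the provider sees the <think> block as ordinary assistant
    # text, which is what tempts some models to imitate it.
    {"role": "user", "content": "And 3 + 3?"},
]
```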

Screenshots

The content has been incorporated into the thought block:

Screenshots: https://github.com/user-attachments/assets/1e0d71e4-254d-409b-9ed8-d06ed1464688 and https://github.com/user-attachments/assets/75e6199b-4834-4ff7-abd8-077123e1923f

The reason is: if you pass the thought process to the LLM with a <think> tag as part of the main text, the LLM will attempt to mimic it and also output a fake <think> tag.

Screenshot: https://github.com/user-attachments/assets/9562362f-8b00-4b21-bc90-293c0a8ac01f

Desired Solution you'd like

  • By default, Open WebUI should NOT reinject reasoning/thinking into the prompt history.
  • Users can opt-in per model if they explicitly need reasoning reinjection.

Alternatives Considered

Add a model-level toggle (advanced setting), e.g. reinject_reasoning / include_reasoning_in_history (sketched below):

  • Default: OFF
  • When OFF: keep tool call structure, but do not add reasoning/thinking blocks back into messages.
  • When ON: preserve current behavior for users/providers that require it.
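
A minimal sketch of that gating; the parameter name reinject_reasoning and the params lookup are assumptions for illustration, not the actual Open WebUI schema:

```python
def should_reinject_reasoning(model: dict) -> bool:
    # Hypothetical model-level advanced param; defaults to OFF as proposed above.
    params = model.get("params") or {}
    return bool(params.get("reinject_reasoning", False))


# At the history-building call site (assumed shape of the existing helper call):
# messages = convert_output_to_messages(output, raw=should_reinject_reasoning(model))
```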

Additional Context

Relevant code (for maintainers)

  • backend/open_webui/utils/middleware.py: process_messages_with_output() uses convert_output_to_messages(..., raw=True)
  • backend/open_webui/utils/misc.py: convert_output_to_messages(..., raw=True) appends reasoning as <think>...</think> (a simplified sketch of this behavior follows)
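
For orientation, a simplified sketch of the behavior those two references describe, based only on the description in this issue (not the real misc.py code), with the reinjection step marked:

```python
def convert_output_to_messages_sketch(output_items: list[dict], raw: bool = True) -> list[dict]:
    """Simplified approximation of the described behavior, for illustration only."""
    content = ""
    for item in output_items:
        item_type = item.get("type")
        if item_type == "reasoning":
            if raw:
                # The step this issue asks to make optional: previous-turn
                # reasoning is wrapped in <think> tags and folded into the
                # assistant content that goes back to the model.
                content += f"<think>\n{item.get('text', '')}\n</think>\n"
        elif item_type == "message":
            content += item.get("text", "")
    return [{"role": "assistant", "content": content}]
```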

@ShirasawaSama commented on GitHub (Apr 2, 2026):

Why we need feature flags here

Open WebUI converts structured output back into the next-turn LLM messages (often with raw=true). That makes “what the UI shows” and “what the model sees” tightly coupled. A small upstream change (or a provider leaking <think>) can pollute prompts, increase token cost, and re-inject sensitive tool data. Feature flags let us keep official defaults while giving downstream forks a safe way to harden privacy and stability.

What to gate (and why)

| Output block | Risk if fed back to LLM | Why a flag helps | Suggested default |
| --- | --- | --- | --- |
| Reasoning / <think>-style content | Prompt pollution, repeated "thinking tags", unstable behavior across providers | Some deployments want KV-cache benefits; others want clean chat text only | OFF in forks; keep ON only if you need it |
| Tool results (function_call_output) | Highest chance of leaking secrets/PII/internal URLs; often very long | Different orgs have different sensitivity and retention requirements | ON but minimized (prefer a summary over the raw payload) |
| Tool arguments (function_call.arguments) | Can include tokens, file paths, user data; also expands the attack surface for prompt injection | Lets you redact/whitelist fields without breaking tools | ON but redacted (or OFF for strict mode) |
| Code interpreter output | Large logs/data samples; can include paths and intermediate artifacts; heavy token cost | Some want reproducibility; others want only the final answer | OFF (or summary-only) |
| Attachments / embeds / extracted file text | "User file gets re-fed to the model" privacy risk plus huge context bloat | Needed in some flows, unacceptable in others | OFF by default unless explicitly required |
| Raw HTML / UI-rendered artifacts | Not useful to the model; can contain injected markup/text | Keeps the LLM context clean and consistent | OFF |

Minimal flag design (pragmatic)

  • Two knobs cover most needs:
    • Include reasoning in LLM context (on/off)
    • Include tool payloads in LLM context with modes: none / summary / full
  • Keep the UI rendering decision separate from the LLM context decision (so you can hide reasoning from users without changing official stateful behavior). A sketch of these two knobs follows.
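
A sketch of what those two knobs could look like as per-model settings; the names, the dataclass, and the truncation stand-in are illustrative, not an existing Open WebUI schema:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class LLMContextSettings:
    """Hypothetical per-model settings; defaults follow the table above."""
    include_reasoning: bool = False                                 # knob 1: on/off
    tool_payloads: Literal["none", "summary", "full"] = "summary"   # knob 2


def filter_output_item(item: dict, settings: LLMContextSettings) -> dict | None:
    """Decide whether (and how) an output item goes back into the LLM context."""
    item_type = item.get("type")
    if item_type == "reasoning" and not settings.include_reasoning:
        return None  # dropped from the LLM context, still rendered in the UI
    if item_type == "function_call_output":
        if settings.tool_payloads == "none":
            return None
        if settings.tool_payloads == "summary":
            # Naive truncation as a stand-in for a real summarization step.
            return {**item, "output": str(item.get("output", ""))[:500]}
    return item
```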

@visig9 commented on GitHub (Apr 4, 2026):

Also, Gemma 4 needs it per the official requirement:

  3. Multi-Turn Conversations

    • No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.

See: https://huggingface.co/google/gemma-4-E2B-it#3-multi-turn-conversations


@nekomiya-hinata commented on GitHub (Apr 4, 2026):

I would like to suggest a complementary approach to the toggle: placing the thinking content into a dedicated, separate field in the message payload (which could be configurable or fixed, e.g., reasoning or reasoning_content).

Instead of concatenating the reasoning block into the raw content string with <think>...</think> tags (which directly causes the prompt pollution and imitation issues mentioned), Open WebUI could structure the multi-turn history like this:

{
  "role": "assistant",
  "reasoning_content": "The user wants to...",
  "content": "Here is the final answer..."
}

Why this helps:

  1. Aligns with API Standards: This is exactly how the official DeepSeek API separates thoughts from the final answer.
  2. Prevents Imitation: The model's standard content remains pristine, drastically reducing the chance of the LLM generating fake tags or malformed markdown.
  3. Flexibility: Coupled with the feature toggle mentioned above, users can map this to specific providers that actually support/expect a separate reasoning field in their API schema, while safely dropping it for models like Gemma 4 that strictly prohibit it. (A sketch of this mapping follows.)
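
A hedged sketch of that mapping; the allowlist and the function are invented for illustration, and (as the next comment points out) unknown fields can make some providers error, hence the opt-in:

```python
SUPPORTS_SEPARATE_REASONING_FIELD = {"deepseek"}  # assumption for this sketch


def build_assistant_history_message(reasoning: str | None, answer: str, provider: str) -> dict:
    """Illustrative only: keep reasoning out of `content`, and only emit a
    separate reasoning field for providers known to accept one."""
    message = {"role": "assistant", "content": answer}
    if reasoning and provider in SUPPORTS_SEPARATE_REASONING_FIELD:
        # Dedicated field, so <think> tags never enter the prompt text.
        message["reasoning_content"] = reasoning
    # All other providers (e.g. models that forbid thoughts in history) get
    # only the final answer.
    return message
```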

@ShirasawaSama commented on GitHub (Apr 8, 2026):

I would like to suggest a complementary approach to the toggle: placing the thinking content into a dedicated, separate field in the message payload (which could be configurable or fixed, e.g., reasoning or reasoning_content).

Instead of concatenating the reasoning block into the raw content string with <think>...</think> tags (which directly causes the prompt pollution and imitation issues mentioned), Open WebUI could structure the multi-turn history like this:

{
  "role": "assistant",
  "reasoning_content": "The user wants to...",
  "content": "Here is the final answer..."
}

Why this helps:

  1. Aligns with API Standards: This is exactly how the official DeepSeek API separates thoughts from the final answer.
  2. Prevents Imitation: The model's standard content remains pristine, drastically reducing the chance of the LLM generating fake tags or malformed markdown.
  3. Flexibility: Coupled with the feature toggle mentioned above, users can map this to specific providers that actually support/expect a separate reasoning field in their API schema, while safely dropping it for models like Gemma 4 that strictly prohibit it.

I disagree; in practice, many models will simply throw an error if they encounter an unknown field.


@Classic298 commented on GitHub (Apr 8, 2026):

Also, wouldn't this be possible with a relatively simple filter if anyone wants this?


@Classic298 commented on GitHub (Apr 8, 2026):

Also, Gemma 4 need it for the official requirement:
Multi-Turn Conversations
No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.

That's nice and all, but other models DO need it; that's why Open WebUI sends the reasoning content.

You can build yourself a simple filter that strips the reasoning content before sending it to the AI


@Classic298 commented on GitHub (Apr 8, 2026):

@ShirasawaSama

By default, Open WebUI should NOT reinject reasoning/thinking into the prompt history.

Why? Some providers and models need it. And if you have a model that strictly doesn't need the thinking sent back to the provider, couldn't a filter handle this easily?


@ShirasawaSama commented on GitHub (Apr 8, 2026):

Also, Gemma 4 need it for the official requirement:
Multi-Turn Conversations
No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.

Thats nice and all, but other models DO need it, that's why Open WebUI sends the reasoning content.

You can build yourself a simple filter that strips the reasoning content before sending it to the AI

In fact, for most models, not passing the thinking process does not cause any issues, so there is no need to return it.

Only a small number of models, such as those that support "extended thinking" (e.g., Claude), require the thinking process to be returned.

However, even for models that require the thinking process to be returned, it is not returned via the <think> tag, but through a separate field.

Embedding the thinking process into the assistant’s body is incorrect under any circumstances, as it completely breaks the KV cache and causes token consumption to spike.

This is why a control option for different models is necessary.


@Classic298 commented on GitHub (Apr 8, 2026):

Why would it break the KV cache?

If I consistently send the same data and the same fields to the provider, it will cache that and reuse the cache.

Only modifications to those fields would break the KV cache.


Regardless, why can't a filter be used here? As you correctly said, some models need it. Yes, not all, but this is where a filter could step in and adjust the payload for the special needs of the model provider.

OpenAI also requires (for large companies) that the safety identifier be sent, otherwise you will quickly no longer have an account. That could be done by Open WebUI via a flag/toggle, but shouldn't be; a filter is the better option here.


@Classic298 commented on GitHub (Apr 8, 2026):

Btw UI-rendered artifacts are never sent to the model to begin with


@ShirasawaSama commented on GitHub (Apr 8, 2026):

why would it break KV cache?

If i consistently send the same data and the same fields to the provider it will cache that and reuse the cache.

Only modifications to those fields would break KV cache

Regardless, why cannot a filter be used here? As you correctly said some models need it. Yes, not all - but this is where perhaps a filter should take place and adjust the payload for special needs of the model provider.

OpenAI also (for large companies) needs the safety identifier to be sent otherwise you will quickly not have an account anymore. Can be done by Open WebUI via flag/toggle - but shouldn't. A Filter is the better option here

Can you give me an example of a model that requires the thinking process to be returned via the <think> tag? As far as I know, the Gemini, Claude, ChatGPT, DeepSeek, Kimi, Grok, and Qwen series should not have the thinking process returned to them.

Returning the thinking process via the <think> tag pollutes the context and causes the LLM to insert multiple <think> tags into the output to simulate thinking.


@Classic298 commented on GitHub (Apr 8, 2026):

Claude definitely requires returning the thinking to the provider across turns, and all providers that I know of require thinking to be sent back within the same turn at least (because if the model is only thinking and planning and then doing a tool call (end of generation), you need to send the thinking it just generated, otherwise the model doesn't know what it planned to do and wants to do next).

So DeepSeek, Claude, Qwen, Kimi: all models with transparent thinking content require it to be sent within the same turn so the model knows why it made that tool call, for example, and Anthropic in particular requires it across turns as well.


@ShirasawaSama commented on GitHub (Apr 8, 2026):

https://platform.claude.com/docs/en/build-with-claude/extended-thinking

Screenshots: https://github.com/user-attachments/assets/c5d045f5-3418-42f1-ad0b-5b84c00cd89a and https://github.com/user-attachments/assets/4bb41363-6e8e-4f56-8521-d98d6bd341ba

I checked Claude's documentation: even in the tool-call flow, the thinking passed back should keep the original "thinking block" format rather than being converted to a <think> tag.

I don't have much of an opinion on whether to return the thought process, but I do believe that the model's additional "thought" field should not be converted into a <think> tag, embedded within the assistant's content, and then sent back to the model.

https://github.com/open-webui/open-webui/blob/9bd84258d09eefe7bf975878fb0e31a5dadfe0f8/backend/open_webui/utils/misc.py#L244-L247
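
For comparison, an approximate (hedged) sketch of the block shape the linked Anthropic docs describe; treat the field details as an assumption and defer to the documentation:

```python
# Approximate shape of an assistant turn with extended thinking, preserved as
# structured content blocks rather than <think> text (per the linked docs;
# the signature value here is a placeholder).
assistant_turn = {
    "role": "assistant",
    "content": [
        {
            "type": "thinking",
            "thinking": "Let me reason about the tool result...",
            "signature": "PLACEHOLDER_SIGNATURE",
        },
        {"type": "text", "text": "Here is the final answer."},
    ],
}
```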


@ShirasawaSama commented on GitHub (Apr 8, 2026):

In fact, it’s currently nearly impossible to use the Claude model for native tool calls in OwUI, because OwUI inevitably generates messages that contain only the tool call but have empty content, causing Claude to throw an error immediately.

I used a filter to process the Claude model’s input, removing these empty blocks, which allowed it to use native tool calls.

In other words, the models that require the reasoning to be sent back are actually the rarer ones.


@ShirasawaSama commented on GitHub (Apr 8, 2026):

DeepSeek:

https://api-docs.deepseek.com/guides/thinking_mode

Screenshot: https://github.com/user-attachments/assets/8f4c8b65-62be-4eb3-bfb1-393ac8e44e62
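
A hedged illustration of the separation described in the linked guide (field names per the DeepSeek docs, values invented): reasoning comes back in its own field, and only content is meant to be passed back in subsequent messages.

```python
# What the API returns in thinking mode: reasoning in a dedicated field.
response_message = {
    "role": "assistant",
    "reasoning_content": "The user wants ...",  # thinking, kept out of `content`
    "content": "Final answer ...",
}

# What goes back into the next request: only `content`. Per the linked guide,
# reasoning_content should not be passed back in the input messages.
history_message = {"role": "assistant", "content": response_message["content"]}
```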

@ShirasawaSama commented on GitHub (Apr 8, 2026):

Gemini 3 Tool Calls:

Gemini requires that only the Thought Signatures (https://ai.google.dev/gemini-api/docs/thought-signatures) be returned, not the thought content.

Screenshots: https://github.com/user-attachments/assets/8c3ce334-fbd9-4a10-9297-667565013178, https://github.com/user-attachments/assets/ff1d24ff-0e6a-4c2b-bbf1-6eea2e651d21, https://github.com/user-attachments/assets/8db9723a-96be-40be-be3b-fbd22f95b2cb

@ShirasawaSama commented on GitHub (Apr 8, 2026):

ChatGPT:

https://developers.openai.com/api/docs/guides/reasoning?example=planning#keeping-reasoning-items-in-context

When using the Responses API, it is recommended to return the response as-is:

Screenshot: https://github.com/user-attachments/assets/8dea2866-d89a-42cb-8569-2d0eed3c48d1
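
A hedged sketch of the pattern the linked guide shows for the Responses API (the model name is a placeholder; consult the guide for exact usage): pass the previous output items, including reasoning items, back unchanged rather than re-serializing them as text.

```python
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

first = client.responses.create(model="o4-mini", input="Plan the refactor.")

follow_up = client.responses.create(
    model="o4-mini",
    # Feed the prior output items (reasoning items included) back as-is,
    # then append the new user turn.
    input=list(first.output) + [{"role": "user", "content": "Now do step 1."}],
)
```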

@ShirasawaSama commented on GitHub (Apr 8, 2026):

So I think the key point is: don’t inject the reasoning content into the assistant’s output using the <think> tag; instead, return the LLM’s data as-is as much as possible.


@Classic298 commented on GitHub (Apr 8, 2026):

Yes, I agree about how to inject it, but I think the current behavior on whether it is returned to the API at all can stay unchanged.


@8bit-coder commented on GitHub (Apr 9, 2026):

Adding to this that I'm experiencing bugs with multi-message Gemma 4 conversations as well. The thinking tag getting reinserted into the prompt along with the previous chain of thought causes huge issues and breaks the model's output. Is there a temporary fix for this?


@ShirasawaSama commented on GitHub (Apr 9, 2026):

Adding to this that I'm experiencing bugs with multi-message Gemma 4 conversations as well. The thinking tag getting reinserted into the prompt along with the previous chain of thought causes huge issues and breaks the model's output. Is there a temporary fix for this?

Add a "pass" between elif item_type == 'reasoning': and if raw:

https://github.com/open-webui/open-webui/blob/9bd84258d09eefe7bf975878fb0e31a5dadfe0f8/backend/open_webui/utils/misc.py#L233-L242


@8bit-coder commented on GitHub (Apr 13, 2026):

I added the pass statement and it still has the issue. I installed vim in the container, edited the file on line 233 and saved it, opened it again to validate, and it still results in the same behavior. I even tried stopping and starting the container again but that undoes the changes so I have to do the changes while the container is live.


@Classic298 commented on GitHub (Apr 13, 2026):

@8bit-coder editing in the container and then restarting it with down and up -d doesn't really persist the changes in my experience


@8bit-coder commented on GitHub (Apr 13, 2026):

Should I edit the container data itself and then bring it up?


@Classic298 commented on GitHub (Apr 13, 2026):

Yes, either a custom Dockerfile or a replacement script on startup.


@yrro commented on GitHub (Apr 22, 2026):

Here's a filter that strips <think>...</think> from assistant messages.

https://gist.github.com/yrro/b0f2765ea55ae3414e06b319dd07ae8e

Uncomment the print statement and look at your logs to check that it's working properly.
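
For context, a generic sketch of what such a filter can look like in Open WebUI's Filter/inlet style (this is an illustration, not a copy of the linked gist):

```python
import re


class Filter:
    def inlet(self, body: dict, __user__: dict | None = None) -> dict:
        """Strip <think>...</think> blocks from assistant messages before the
        request is sent to the model."""
        think_re = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
        for message in body.get("messages", []):
            content = message.get("content")
            if message.get("role") == "assistant" and isinstance(content, str):
                message["content"] = think_re.sub("", content)
                # print(message["content"])  # uncomment to verify via the logs
        return body
```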


@itsHenry35 commented on GitHub (Apr 26, 2026):

Yeah, this is indeed needed when DeepSeek reasoning is on with native tool calling (it can't be set to default, since in that case the model doesn't know there are tools to call); otherwise it returns an error.

Screenshot: https://github.com/user-attachments/assets/07f4206a-bf48-4035-84a7-6a3398514e2c

@JoeEnderman commented on GitHub (May 4, 2026):

Here's a filter that strips <think>...</think> from assistant messages.

https://gist.github.com/yrro/b0f2765ea55ae3414e06b319dd07ae8e

Uncomment the print statement and look at your logs to check that it's working properly.

That fixes it. Thank you! Now Gemma 4 26b a4b works amazingly.

Reference: github-starred/open-webui#58620