[GH-ISSUE #9348] Reasoning Models <thinking> Passed to Tag Generation #15465

Closed
opened 2026-04-19 21:39:10 -05:00 by GiteaMirror · 1 comment

Originally created by @csvance on GitHub (Feb 5, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/9348

Bug Report

Installation Method

Docker

Environment

  • Open WebUI Version: 0.5.7
  • Operating System: Ubuntu 24.04

Confirmation:

  • I have read and followed all the instructions provided in the README.md.
  • I am on the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided the exact steps to reproduce the bug in the "Steps to Reproduce" section below.

Expected Behavior:

Reasoning model "thinking" output should not be included for tag generation.

Actual Behavior:

Reasoning "thinking" is passed to the tag generation task model in-between <detail></detail> tags.

Description

Bug Summary:
Reasoning should not be passed to the tag generation model, but it is. This makes it harder to generate accurate tags, since the model's inner reasoning dialogue is often less aligned with the tone of the conversation. It is especially bad for smaller task models, because it stuffs the context with a large amount of content that is not useful for creating tags, whereas the non-reasoning part of the response is.
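
As a possible fix (a minimal sketch, not Open WebUI's actual code; the helper names are hypothetical), the reasoning <details> blocks could be stripped from assistant messages before the chat history is handed to the task model:

    import re

    # Matches the reasoning blocks wrapped around model "thinking",
    # e.g. <details type="reasoning" ...> ... </details>; DOTALL lets . span newlines.
    REASONING_RE = re.compile(r'<details\s+type="reasoning".*?</details>', re.DOTALL)

    def strip_reasoning(content: str) -> str:
        # Remove reasoning blocks so only the final answer remains.
        return REASONING_RE.sub("", content).strip()

    def clean_messages(messages: list[dict]) -> list[dict]:
        # Clean assistant messages only; user messages never contain reasoning blocks.
        return [
            {**m, "content": strip_reasoning(m["content"])}
            if m.get("role") == "assistant" else m
            for m in messages
        ]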

Reproduction Details

Steps to Reproduce:

  1. Use a separate task model for tag generation, e.g. llama3.2:1b-instruct.
  2. Start a new conversation with a model that supports reasoning, using the message "what is 1 + 1?", which triggers the reasoning model to think.
  3. The model thinks and finishes its response.
  4. Open WebUI serializes the message history and sends it to the task model for tag generation. The serialized history includes the model's reasoning/thinking inner dialogue (a simplified sketch of this step follows the list).
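
For illustration, here is a simplified sketch of that serialization step (hypothetical code, mirroring the prompt visible in the log below rather than Open WebUI's actual implementation):

    def build_chat_history(messages: list[dict]) -> str:
        # Flatten the message list into the <chat_history> block the task model sees.
        lines = [f"{m.get('role', 'user').upper()}: {m['content']}" for m in messages]
        # Without stripping, assistant content still carries the
        # <details type="reasoning"> block, so it leaks into the tag prompt.
        return "<chat_history>\n" + "\n".join(lines) + "\n</chat_history>"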

Logs and Screenshots

Here is what the OpenAI-compatible API receives; you can see the reasoning content is included inside <details></details> tags when it probably shouldn't be. Keep in mind this is a very small amount of reasoning; sometimes the reasoning can be thousands of tokens or more!

INFO 02-04 16:11:29 logger.py:37] Received request chatcmpl-7a9b154df79548bc9c3a1c077c595f09: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 04 Feb 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

### Task:
Generate 1-3 broad tags categorizing the main themes of the chat history, along with 1-3 more specific subtopic tags.

### Guidelines:
- Start with high-level domains (e.g. Science, Technology, Philosophy, Arts, Politics, Business, Health, Sports, Entertainment, Education)
- Consider including relevant subfields/subdomains if they are strongly represented throughout the conversation
- If content is too short (less than 3 messages) or too diverse, use only ["General"]
- Use the chat\'s primary language; default to English if multilingual
- Prioritize accuracy over specificity

### Output:
JSON format: { "tags": ["tag1", "tag2", "tag3"] }

### Chat History:
<chat_history>
USER: What is 1 + 1?
ASSISTANT: <details type="reasoning" done="true" duration="3">
<summary>Thought for 3 seconds</summary>
> First, I recognize that the user is asking for the sum of 1 and 1.
> 
> To solve this, I prepare two numbers: 1 and 1.
> 
> Then, I add these two numbers together.
> 
> Finally, the total is 2.
</details>


Sure, let\'s solve the problem step by step.

**Problem:** What is \\(1 + 1\\)?

**Solution:**

1. **Identify the numbers to add:**
   \\[
   1 \\quad \\text{and} \\quad 1
   \\]

2. **Perform the addition:**
   \\[
   1 + 1 = 2
   \\]

**Answer:** \\(\\boxed{2}\\)
</chat_history><|eot_id|><|start_header_id|>assistant<|end_header_id|>

', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=3703, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.

@csvance commented on GitHub (Feb 5, 2025):

If you use OpenResty as a reverse proxy for multiple vLLM instances, you can use the following config to work around the issue. Implementing the model as a pipe could work too, but OpenResty makes this incredibly easy, and if you are already working with a lot of vLLM instances you need some sort of reverse proxy to stay sane anyway. You will need to manually add your model IDs to the connection, because this proxy only handles /v1/chat/completions and does not expose a /v1/models endpoint.

        location /v1/chat/completions {
            # Backend chosen per-request by the Lua code below.
            set $proxy "";
            rewrite_by_lua '

                ngx.req.read_body()
                local body = ngx.req.get_body_data()

                local cjson = require("cjson")
                local req = cjson.decode(body)

                -- Route each model to its vLLM instance.
                local model_host_map = {}
                model_host_map["llama3.2:1b-instruct-q6"] = "vllm-1:8000"
                model_host_map["DeepSeek-R1-Distill-Llama-70B-AWQ"] = "vllm-2:8000"
                ngx.var.proxy = model_host_map[req["model"]]

                -- For the task model, strip reasoning <details>...</details>
                -- blocks from every message before forwarding the request.
                if req["model"] == "llama3.2:1b-instruct-q6" then
                    for i, message in ipairs(req["messages"]) do
                        local newstr, n, err = ngx.re.gsub(message["content"], "<details(\\n|.)*?details>", "", "")
                        req["messages"][i]["content"] = newstr
                    end
                end

                ngx.req.set_body_data(cjson.encode(req))

            ';
            proxy_pass http://$proxy$uri;
        }
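
For anyone not running OpenResty, the stripping regex used in the snippet above can be sanity-checked with an equivalent pattern in Python (illustrative only):

    import re

    sample = (
        '<details type="reasoning" done="true" duration="3">\n'
        "<summary>Thought for 3 seconds</summary>\n"
        "> First, I recognize that the user is asking for the sum of 1 and 1.\n"
        "</details>\n\n"
        "**Answer:** 2"
    )

    # Equivalent of ngx.re.gsub(content, "<details(\n|.)*?details>", "")
    print(re.sub(r"<details(\n|.)*?details>", "", sample).strip())  # -> **Answer:** 2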
