mirror of
https://github.com/open-webui/open-webui.git
synced 2026-06-08 10:13:22 -05:00
[GH-ISSUE #24913] issue: conversation abruptly stops across multiple models and backends with many tool calls (REPEATABLE) #123741
Originally created by @vektorprime on GitHub (May 19, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/24913
Check Existing Issues
Installation Method
Docker
Open WebUI Version
v0.9.5
Ollama Version (if applicable)
NA
Operating System
Ubuntu 24
Browser (if applicable)
Latest firefox
Confirmation
Expected Behavior
The model should continue generating and tool calling, but it abruptly stops only when interfacing through open-webui.
Actual Behavior
Generation just stops. I have to prompt it to continue, or something similar.
Here's an example of me prompting it to continue.
Steps to Reproduce
Quick summary:
I am using open-webui as the frontend to my locally hosted setup. I am consistently seeing conversations stop even though the model is supposed to continue generating. This occurs with both vLLM and llama-cpp as the backend. It also occurs with both Qwen3.6 and Gemma4 models.
System with ALL software up to date:
Ubuntu 24
Docker image of open-webui
How to reproduce:
Make sure native tool calling is enabled for your model
Disable web search and other tools for the conversation so they don't get in the way
Enable open-terminal (for file writing and access)
Use either llama-cpp or vLLM as the backend
Use any model. I first noticed this on Gemma 4 31B, and I mainly use Qwen3.7 27B Q8 (I tried many quants and chat templates)
Paste the following prompt, and you'll see the conversation just stop between tasks 10 and 18. For me it is almost always closer to the upper end of that range.
Here's how I paste my prompt:
The prompt:
The logs & screenshots section will show what it looks like.
If you try this with llama-cpp as the backend, it does the same thing. If you run that same model with the exact same settings and prompt but use the llama-server webui (with similar MCP), it works just fine.
Logs & Screenshots
Here's what it looks like when it stops:
Here's what vLLM shows at the end:
(APIServer pid=1) INFO 05-19 16:58:45 [logger.py:92] Generated response chatcmpl-82807bd2f5345ab6 (streaming complete): output: '\n\n\n\nT9: no\n\nTask 10: In beta.txt, replace yellow with gold. Print the full contents joined by commas.\n\n<tool_call>\n<function=run_command>\n<parameter=command>\npython3 -c "\nlines = open('/home/user/beta.txt').read().strip().split('\n')\nlines = [l for l in lines if l.strip()]\nlines = [l.replace('yellow','gold') if l == 'yellow' else l for l in lines]\nopen('/home/user/beta.txt','w').write('\n'.join(lines) + '\n')\nprint(','.join(lines))\n"\n\n\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
(APIServer pid=1) INFO 05-19 16:58:45 [logger.py:63] Received request chatcmpl-8418f4846e0da28f: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=1) INFO: 172.17.0.1:56966 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-19 16:58:45 [async_llm.py:415] Added request chatcmpl-8418f4846e0da28f-8cc2de91.
(APIServer pid=1) INFO 05-19 16:58:48 [logger.py:92] Generated response chatcmpl-8418f4846e0da28f (streaming complete): output: 'The task 10 command is running. Let me wait for it.\n\n\n<tool_call>\n<function=get_process_status>\n<parameter=process_id>\n20260519-165845-6531de\n\n<parameter=wait>\n3\n\n\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Here's ANOTHER run with a new conversation, same exact settings, model etc. In this one there's a function call that never seems to run or show up:
(APIServer pid=1) INFO 05-19 17:15:48 [logger.py:92] Generated response chatcmpl-883f6dde7c01e292 (streaming complete): output: 'beta.txt currently has 5 lines (blue, gold, orange, red, silver). So N=5.\n\n\n<tool_call>\n<function=get_process_status>\n<parameter=process_id>\n20260519-171546-21a6eb\n\n\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
(APIServer pid=1) INFO 05-19 17:15:49 [logger.py:63] Received request chatcmpl-a233d880ee7773ab: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=1) INFO: 172.17.0.1:52996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-19 17:15:49 [async_llm.py:415] Added request chatcmpl-a233d880ee7773ab-8930fb05.
(APIServer pid=1) INFO 05-19 17:15:51 [logger.py:92] Generated response chatcmpl-a233d880ee7773ab (streaming complete): output: 'beta.txt currently has 5 lines (blue, gold, orange, red, silver). So colors=5.\n\n\n<tool_call>\n<function=run_command>\n<parameter=command>\necho "colors=5" >> /home/user/beta.txt && tail -n 1 /home/user/beta.txt\n\n\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
(APIServer pid=1) INFO 05-19 17:15:51 [loggers.py:271] Engine 000: Avg prompt throughput: 112.2 tokens/s, Avg generation throughput: 35.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 79.5%
And here's the screenshot for the second run:
Additional Information
We are not hitting a token generation limit, and the finish_reason in vLLM shows streaming_complete. There's supposed to be another
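To line up the vLLM log lines above with what the UI shows, it can help to pull out the request ID, the functions the model tried to call, and the finish_reason from each "Generated response" line. The following is a minimal sketch, not part of the original report. The regular expressions are assumptions based only on the log lines pasted in this issue (not on vLLM's source), and `summarize_response_line` is a hypothetical helper name.

```python
import re

# Assumed layout of a vLLM "Generated response" log line, taken from the
# lines pasted in this issue. Other vLLM versions may log differently.
RESPONSE_RE = re.compile(
    r"Generated response (?P<request_id>chatcmpl-[0-9a-f]+) "
    r"\(streaming complete\): output: '(?P<output>.*)', "
    r"output_token_ids: .*?, finish_reason: (?P<finish_reason>\w+)"
)

# Tool calls appear in the logged output as <function=NAME> tags.
FUNCTION_RE = re.compile(r"<function=([A-Za-z0-9_]+)>")


def summarize_response_line(line):
    """Return request id, called function names, and finish_reason.

    Returns None when the line is not a "Generated response" line in the
    assumed format.
    """
    match = RESPONSE_RE.search(line)
    if match is None:
        return None
    return {
        "request_id": match.group("request_id"),
        "functions": FUNCTION_RE.findall(match.group("output")),
        "finish_reason": match.group("finish_reason"),
    }
```

Running this over a saved log and printing the last summary for a stalled conversation shows which tool call the backend emitted last, which can then be compared with the last tool call that open-webui actually displayed.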
@owui-terminator[bot] commented on GitHub (May 19, 2026):
🔍 Related Issues Found
I found some existing issues that might be related. Please check if any of these are duplicates or contain helpful solutions:
🟢 #20896 issue: Generation stops after tool call when routing Ollama through WebUI (GLM-4.7-Flash in OpenCode)
Very similar symptom: generation stops immediately after a tool call when using Open WebUI as the frontend, requiring manual continuation. It also involves local model backends and tool-calling behavior that halts mid-agent loop.
by HuysArthur · bug
🟣 #23466 issue: Random response stops after tool call
Matches the core failure mode of responses randomly stopping after a tool call in Open WebUI. Although this report is less deterministic, it points to the same class of post-tool-call continuation bug.
by trinhkvo · bug
🟣 #24607 issue: Incorrect tool parsing with several tool calls (specially provided with open-terminal)
Related because it describes problems once several tool calls have occurred, including raw tool output parsing and unexpected stopping. The new issue also appears after many sequential tool calls with open-terminal.
by N-point-N · bug
🟣 #21768 issue: OpenAI-compatible streaming: finish_reason incorrectly returned as "stop" after streaming tool_calls
Highly relevant if the new issue is actually caused by Open WebUI returning the wrong streaming finish_reason after tool-call chunks. That would make agent frameworks think generation is complete and stop the loop prematurely.
by Sechma · bug
🟣 #23863 issue: Tool calls with Gemma 4 requires default->native->default toggling of Function Calling
Relevant as another tool-calling regression with Gemma 4 in Open WebUI, specifically around native/default function-calling behavior. Since the new issue reproduces with Gemma models and tool calls, it may share the same underlying tool-calling path.
by gitfrederic · bug
💡 If your issue is a duplicate, please close it and add any additional details to the existing issue instead.
This comment was generated automatically. React with 👍 if helpful, 👎 if not.
@frenzybiscuit commented on GitHub (May 19, 2026):
Are you hitting the context limit? OWUI doesn't really tell you if you are. It just stops, like you're describing.
The only way to know is if your backend records how much context you're using. It won't show up in OWUI (even with usage enabled) if the request fails during a tool call.
@frenzybiscuit commented on GitHub (May 19, 2026):
For example, opening a single large file consumes 100k context for me.
@vektorprime commented on GitHub (May 19, 2026):
#23466 and #24607 - Not related, because my case doesn't show printed tool calls. In my case it just stops generating or won't continue.
#20896 - May be related, but their use case is a CLI coding agent using open-webui as the API backend, so their setup may make troubleshooting more difficult.
#21768 - May be related.
#23863 - Not related, switching to Default tool calling doesn't fix my issue.
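One way to check whether #21768 applies would be to capture the streamed chunks of a stalled turn and see whether tool-call deltas are followed by a finish_reason of "stop". The sketch below assumes the chunks have already been parsed into dicts in the OpenAI-compatible `chat.completion.chunk` shape (`choices[].delta.tool_calls`, `choices[].finish_reason`). `summarize_stream` is a hypothetical helper, not an Open WebUI or vLLM function.

```python
def summarize_stream(chunks):
    """Summarize already-parsed streaming chunks from one response.

    `chunks` is assumed to be a list of dicts in the OpenAI-compatible
    chat.completion.chunk shape.
    """
    saw_tool_calls = False
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("tool_calls"):
                saw_tool_calls = True
            # Only the final chunk normally carries a non-null finish_reason.
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return {
        "saw_tool_calls": saw_tool_calls,
        "finish_reason": finish_reason,
        # Tool-call deltas followed by "stop" is the pattern named in the
        # title of #21768.
        "suspicious": saw_tool_calls and finish_reason == "stop",
    }
```

If `suspicious` comes back true for the turn where the conversation stalls, that would point toward #21768. If it is false, this particular cause is less likely, though that alone does not rule out other frontend problems.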
@vektorprime commented on GitHub (May 19, 2026):
No, I am not. The context here is only 11k to 15k (when it stops), and my window size (KV cache size) is 160K+. Further, I am not hitting the per-generation limit either, as confirmed by my vLLM logs.
I even tried setting a VERY high (65k) token generation limit to see if it helps, and it did not.
(APIServer pid=1) INFO 05-19 17:19:08 [logger.py:63] Received request chatcmpl-a8d4c651970416da: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
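The arithmetic behind this reply can be checked mechanically. This is a minimal sketch: `max_tokens_from_log` and `fits_in_window` are hypothetical helper names, the regular expression assumes the `SamplingParams(...)` layout shown in the log line above, and 15,000 and 160,000 are round figures taken from the numbers quoted in this comment.

```python
import re

# Assumes the "max_tokens=<int>" field appears as in the SamplingParams
# log line pasted above. The \b keeps it from matching a longer field name.
MAX_TOKENS_RE = re.compile(r"\bmax_tokens=(\d+)")


def max_tokens_from_log(line):
    """Extract the per-request generation limit from a log line, or None."""
    match = MAX_TOKENS_RE.search(line)
    return int(match.group(1)) if match else None


def fits_in_window(context_tokens, max_tokens, window_tokens):
    """True if the current context plus a full-length generation still fits."""
    return context_tokens + max_tokens <= window_tokens


# Round figures from this thread: ~15k context at the stop, 160K window.
sample = "ignore_eos=False, max_tokens=65536, min_tokens=0"
limit = max_tokens_from_log(sample)
print(fits_in_window(15_000, limit, 160_000))
```

With these figures, even a generation that used the whole 65536-token limit would stay inside the window, which is consistent with the claim that neither limit is being reached.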
@vektorprime commented on GitHub (May 19, 2026):
These files I am working with are created by the prompt, they only contain like 10-30 characters each, and they are only modified by the steps, they don't get bigger.
@frenzybiscuit commented on GitHub (May 19, 2026):
Okay... I can't replicate this.
Maybe someone else can?
@Classic298 commented on GitHub (May 19, 2026):
I also cannot replicate. This has been reported a few times in the past, and every time it was a provider/upstream issue on the inference layer. Sending to discussions for now because it is absolutely not replicable here.