[GH-ISSUE #24913] issue: conversation abruptly stops across multiple models and backends with many tool calls (REPEATABLE) #123741

Closed
opened 2026-05-21 03:13:34 -05:00 by GiteaMirror · 8 comments

Originally created by @vektorprime on GitHub (May 19, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/24913

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.9.5

Ollama Version (if applicable)

NA

Operating System

Ubuntu 24

Browser (if applicable)

Latest firefox

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

The model should continue generating and tool calling, but it abruptly stops only when interfacing through open-webui.

Actual Behavior

Just stops. I have to prompt it to continue or something similar.
Here's an example of me prompting it to continue.

Image

Steps to Reproduce

Quick summary:
I am using open-webui as the frontend to my locally hosted setup. I am consistently seeing conversations stop even though the model is supposed to continue generating. This occurs with both vLLM and llama-cpp as the backend, and with both Qwen3.6 and Gemma4 models.

System with ALL software up to date:
Ubuntu 24
Docker image of open-webui

How to reproduce:
Make sure native tool calling is enabled for your model
Disable web search and other tools for the conversation so they don't get in the way
Enable open-terminal (for file writing and access)
Use either llama-CPP or vLLM as the backend
Use any model, but I first noticed on Gemma 4 31B, and I mainly use Qwen3.7 27B Q8 (I tried many quants and chat templates)

Paste the following prompt, and you'll see the conversation just stop somewhere between tasks 10 and 18. For me it is almost always closer to the upper end of that range.

Here's how I paste my prompt:

Image

The prompt:

Create these 3 files with these contents:
alpha.txt
apple:3
banana:5
cherry:2
date:7

beta.txt
red
blue
green
yellow


gamma.txt
status=draft
owner=lee
priority=medium



Next, complete the following tasks, do not group tool calls between tasks together:
1. Count the lines in alpha.txt. Print T1: followed by the count.
2. Append elderberry:4 to alpha.txt. Print the last line of alpha.txt.
3. Replace banana:5 with banana:6 in alpha.txt. Print the full banana line.
4. Sort the lines of beta.txt alphabetically. Print the full contents joined by commas.
5. Add a new line orange to the end of beta.txt. Print the line count of beta.txt.
6. Change status=draft to status=review in gamma.txt. Print the full status line.
7. Add reviewer=kim to gamma.txt. Print the full contents of gamma.txt joined by semicolons.
8. In alpha.txt, increase every numeric value by 1. Print the full contents joined by commas.
9. Move the line green from beta.txt to the end of gamma.txt as tag=green. Print whether green still appears in beta.txt: yes or no.
10. In beta.txt, replace yellow with gold. Print the full contents joined by commas.
11. Add a header line FRUITS to the top of alpha.txt. Print the first line.
12. Remove the line cherry:3 from alpha.txt. Print the line count of alpha.txt.
13. In gamma.txt, change priority=medium to priority=high. Print the full priority line.
14. Append silver to beta.txt, then sort beta.txt alphabetically. Print the full contents joined by pipes.
15. Add total_fruits=4 to gamma.txt, where 4 is the number of fruit entries in alpha.txt excluding the FRUITS header. Print the new line.
16. In alpha.txt, rename date to dragonfruit. Print the renamed line.
17. Create a summary line at the end of beta.txt in the format colors=N, where N is the number of color lines before the summary. Print the summary line.
18. In gamma.txt, alphabetize all lines by key name before the equals sign. Print the first line.
19. In alpha.txt, compute the sum of all numeric values. Print sum= followed by the result.
20. Print the final contents of all three files in this exact format: alpha=[...]; beta=[...]; gamma=[...], with each file's lines joined by commas.


At the end, cleanly list me the results from every step that you printed

The logs & screenshots section will show what it looks like.

If you try this with llama-cpp as the backend it does the same thing. If you run that same model with same exact settings and prompt but use the llama-server webui (with similar MCP) it works just fine.
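For anyone reproducing this, it can help to know what each task should print so the exact stopping point is easy to spot. The sketch below simulates the 20 tasks in memory under one reasonable reading of each instruction. `run_tasks` and its return shape are illustrative only and are not part of Open WebUI or open-terminal.

```python
def run_tasks():
    """Simulate the 20 tasks from the prompt in memory; no files are touched."""
    alpha = ["apple:3", "banana:5", "cherry:2", "date:7"]
    beta = ["red", "blue", "green", "yellow"]
    gamma = ["status=draft", "owner=lee", "priority=medium"]
    r = {}

    r[1] = f"T1: {len(alpha)}"
    alpha.append("elderberry:4")
    r[2] = alpha[-1]
    alpha[alpha.index("banana:5")] = "banana:6"
    r[3] = "banana:6"
    beta.sort()
    r[4] = ",".join(beta)
    beta.append("orange")
    r[5] = str(len(beta))
    gamma[gamma.index("status=draft")] = "status=review"
    r[6] = "status=review"
    gamma.append("reviewer=kim")
    r[7] = ";".join(gamma)
    # Task 8: bump every numeric value by one.
    alpha = [f"{name}:{int(value) + 1}"
             for name, value in (line.split(":") for line in alpha)]
    r[8] = ",".join(alpha)
    beta.remove("green")
    gamma.append("tag=green")
    r[9] = "yes" if "green" in beta else "no"
    beta[beta.index("yellow")] = "gold"
    r[10] = ",".join(beta)
    alpha.insert(0, "FRUITS")
    r[11] = alpha[0]
    alpha.remove("cherry:3")
    r[12] = str(len(alpha))
    gamma[gamma.index("priority=medium")] = "priority=high"
    r[13] = "priority=high"
    beta.append("silver")
    beta.sort()
    r[14] = "|".join(beta)
    gamma.append(f"total_fruits={len(alpha) - 1}")  # exclude the FRUITS header
    r[15] = gamma[-1]
    idx = next(i for i, line in enumerate(alpha) if line.startswith("date:"))
    alpha[idx] = "dragonfruit:" + alpha[idx].split(":", 1)[1]
    r[16] = alpha[idx]
    beta.append(f"colors={len(beta)}")
    r[17] = beta[-1]
    gamma.sort(key=lambda line: line.split("=")[0])
    r[18] = gamma[0]
    r[19] = f"sum={sum(int(line.split(':')[1]) for line in alpha[1:])}"
    r[20] = "alpha=[{}]; beta=[{}]; gamma=[{}]".format(
        ",".join(alpha), ",".join(beta), ",".join(gamma))
    return r


if __name__ == "__main__":
    for task, printed in run_tasks().items():
        print(task, printed)
```

Under this reading, task 17 yields `colors=5`, which matches what the model computes in the second vLLM log below.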

Logs & Screenshots

Here's what it looks like when it stops:

Image

Here's what vLLM shows at the end:

(APIServer pid=1) INFO 05-19 16:58:45 [logger.py:92] Generated response chatcmpl-82807bd2f5345ab6 (streaming complete): output: '\n\n</think>\n\n**T9: no**\n\n**Task 10: In beta.txt, replace yellow with gold. Print the full contents joined by commas.**\n\n<tool_call>\n<function=run_command>\n<parameter=command>\npython3 -c "\nlines = open(\'/home/user/beta.txt\').read().strip().split(\'\\n\')\nlines = [l for l in lines if l.strip()]\nlines = [l.replace(\'yellow\',\'gold\') if l == \'yellow\' else l for l in lines]\nopen(\'/home/user/beta.txt\',\'w\').write(\'\\n\'.join(lines) + \'\\n\')\nprint(\',\'.join(lines))\n"\n</parameter>\n</function>\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
(APIServer pid=1) INFO 05-19 16:58:45 [logger.py:63] Received request chatcmpl-8418f4846e0da28f: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=1) INFO: 172.17.0.1:56966 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-19 16:58:45 [async_llm.py:415] Added request chatcmpl-8418f4846e0da28f-8cc2de91.
(APIServer pid=1) INFO 05-19 16:58:48 [logger.py:92] Generated response chatcmpl-8418f4846e0da28f (streaming complete): output: 'The task 10 command is running. Let me wait for it.\n</think>\n\n<tool_call>\n<function=get_process_status>\n<parameter=process_id>\n20260519-165845-6531de\n</parameter>\n<parameter=wait>\n3\n</parameter>\n</function>\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Here's ANOTHER run with a new conversation, same exact settings, model etc. In this one there's a function call that never seems to run or show up:
(APIServer pid=1) INFO 05-19 17:15:48 [logger.py:92] Generated response chatcmpl-883f6dde7c01e292 (streaming complete): output: 'beta.txt currently has 5 lines (blue, gold, orange, red, silver). So N=5.\n</think>\n\n<tool_call>\n<function=get_process_status>\n<parameter=process_id>\n20260519-171546-21a6eb\n</parameter>\n</function>\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete
(APIServer pid=1) INFO 05-19 17:15:49 [logger.py:63] Received request chatcmpl-a233d880ee7773ab: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
(APIServer pid=1) INFO: 172.17.0.1:52996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-19 17:15:49 [async_llm.py:415] Added request chatcmpl-a233d880ee7773ab-8930fb05.
(APIServer pid=1) INFO 05-19 17:15:51 [logger.py:92] Generated response chatcmpl-a233d880ee7773ab (streaming complete): output: 'beta.txt currently has 5 lines (blue, gold, orange, red, silver). So colors=5.\n</think>\n\n<tool_call>\n<function=run_command>\n<parameter=command>\necho "colors=5" >> /home/user/beta.txt && tail -n 1 /home/user/beta.txt\n</parameter>\n</function>\n</tool_call>', output_token_ids: None, finish_reason: streaming_complete

(APIServer pid=1) INFO 05-19 17:15:51 [loggers.py:271] Engine 000: Avg prompt throughput: 112.2 tokens/s, Avg generation throughput: 35.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 79.5%
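To compare what the backend generated against what the UI actually executed, the tool calls can be pulled out of a logged `output` string. This is a rough sketch that assumes the XML-style `<tool_call>` / `<function=...>` / `<parameter=...>` layout seen in the vLLM logs above. `parse_tool_calls` is a hypothetical helper, not a vLLM or Open WebUI function, and `SAMPLE_OUTPUT` is the last logged response with its `\n` escapes decoded.

```python
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>")
# A parameter value runs until its closing tag, the next parameter, or the end.
PARAM_RE = re.compile(
    r"<parameter=([^>\s]+)>\n?(.*?)(?=</parameter>|<parameter=|</function>|\Z)",
    re.DOTALL,
)


def parse_tool_calls(output: str) -> list[dict]:
    """Return one dict per <tool_call> block found in a model output string."""
    calls = []
    for body in TOOL_CALL_RE.findall(output):
        function = FUNCTION_RE.search(body)
        if function is None:
            continue  # malformed block, skip it
        params = {name: value.strip() for name, value in PARAM_RE.findall(body)}
        calls.append({"function": function.group(1), "parameters": params})
    return calls


SAMPLE_OUTPUT = (
    "beta.txt currently has 5 lines (blue, gold, orange, red, silver). "
    "So colors=5.\n</think>\n\n<tool_call>\n<function=run_command>\n"
    "<parameter=command>\n"
    'echo "colors=5" >> /home/user/beta.txt && tail -n 1 /home/user/beta.txt\n'
    "</parameter>\n</function>\n</tool_call>"
)

if __name__ == "__main__":
    for call in parse_tool_calls(SAMPLE_OUTPUT):
        print(call["function"], call["parameters"])
```

If a call parsed from the final log entry has no matching tool result in the conversation, that would suggest the stop happens between generation and tool execution rather than in the model.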

And here's the screenshot for the second run:

Image

Additional Information

We are not hitting a token generation limit, and the finish_reason in vLLM shows streaming_complete. There's supposed to be another

GiteaMirror added the bug label 2026-05-21 03:13:34 -05:00

@owui-terminator[bot] commented on GitHub (May 19, 2026):

🔍 Related Issues Found

I found some existing issues that might be related. Please check if any of these are duplicates or contain helpful solutions:

  1. 🟢 #20896 issue: Generation stops after tool call when routing Ollama through WebUI (GLM-4.7-Flash in OpenCode)
    Very similar symptom: generation stops immediately after a tool call when using Open WebUI as the frontend, requiring manual continuation. It also involves local model backends and tool-calling behavior that halts mid-agent loop.
    by HuysArthur · bug

  2. 🟣 #23466 issue: Random response stops after tool call
    Matches the core failure mode of responses randomly stopping after a tool call in Open WebUI. Although this report is less deterministic, it points to the same class of post-tool-call continuation bug.
    by trinhkvo · bug

  3. 🟣 #24607 issue: Incorrect tool parsing with several tool calls (specially provided with open-terminal)
    Related because it describes problems once several tool calls have occurred, including raw tool output parsing and unexpected stopping. The new issue also appears after many sequential tool calls with open-terminal.
    by N-point-N · bug

  4. 🟣 #21768 issue: OpenAI-compatible streaming: finish_reason incorrectly returned as "stop" after streaming tool_calls
    Highly relevant if the new issue is actually caused by Open WebUI returning the wrong streaming finish_reason after tool-call chunks. That would make agent frameworks think generation is complete and stop the loop prematurely.
    by Sechma · bug

  5. 🟣 #23863 issue: Tool calls with Gemma 4 requires default -> native -> default toggling of Function Calling
    Relevant as another tool-calling regression with Gemma 4 in Open WebUI, specifically around native/default function-calling behavior. Since the new issue reproduces with Gemma models and tool calls, it may share the same underlying tool-calling path.
    by gitfrederic · bug


💡 If your issue is a duplicate, please close it and add any additional details to the existing issue instead.

This comment was generated automatically. React with 👍 if helpful, 👎 if not.


@frenzybiscuit commented on GitHub (May 19, 2026):

Are you hitting the context limit? OWUI doesn't really tell you if you are. It just stops, like you're describing.

The only way to know is if your backend records what context you're using. It won't show up under OWUI (even with usage enabled) on tool calls if it fails during it.


@frenzybiscuit commented on GitHub (May 19, 2026):

For example, opening a single large file consumes 100k context for me.


@vektorprime commented on GitHub (May 19, 2026):


#23466 and #24607 - Not related, because I don't see raw tool calls being printed. In my case it just stops generating or won't continue.

#20896 - May be related, but their use case is a CLI coding agent using Open WebUI as the API backend, so their setup may make troubleshooting more difficult.

#21768 - May be related.

#23863 - Not related, switching to Default tool calling doesn't fix my issue.
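To make the #21768 hypothesis concrete: if a client decides whether to keep looping based only on the final `finish_reason`, a stream that carried tool-call deltas but ended with `"stop"` would end the loop early. The sketch below is not Open WebUI's code. It uses the OpenAI-style streaming chunk shape (`choices[].delta.tool_calls`, `choices[].finish_reason`), and `summarize_stream`, `loop_should_continue`, and the `EXAMPLE_CHUNKS` data are hypothetical names and values for illustration.

```python
def summarize_stream(chunks):
    """Return (saw_tool_calls, last_finish_reason) for a list of chunk dicts."""
    saw_tool_calls = False
    finish_reason = None
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("tool_calls"):
                saw_tool_calls = True
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return saw_tool_calls, finish_reason


def loop_should_continue(chunks, trust_finish_reason):
    """Decide whether an agent loop would run the tools and call the model again."""
    saw_tool_calls, finish_reason = summarize_stream(chunks)
    if trust_finish_reason:
        # Fragile: relies on the server reporting "tool_calls" correctly.
        return finish_reason == "tool_calls"
    # More defensive: any streamed tool call means there is work to do.
    return saw_tool_calls


# Illustrative stream: a tool-call delta followed by a (wrong) "stop".
EXAMPLE_CHUNKS = [
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"name": "run_command", "arguments": "{}"}}
    ]}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]

if __name__ == "__main__":
    print(loop_should_continue(EXAMPLE_CHUNKS, trust_finish_reason=True))
    print(loop_should_continue(EXAMPLE_CHUNKS, trust_finish_reason=False))
```

Whether this is what actually happens in my case is unconfirmed. It only shows why a mismatched `finish_reason` would look exactly like "it just stops".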

<!-- gh-comment-id:4490341391 --> @vektorprime commented on GitHub (May 19, 2026): > 🔍 **Related Issues Found** > > I found some existing issues that might be related. Please check if any of these are duplicates or contain helpful solutions: > > 1. 🟢 [#20896](https://github.com/open-webui/open-webui/issues/20896) **issue: Generation stops after tool call when routing Ollama through WebUI (GLM-4.7-Flash in OpenCode)** > _Very similar symptom: generation stops immediately after a tool call when using Open WebUI as the frontend, requiring manual continuation. It also involves local model backends and tool-calling behavior that halts mid-agent loop._ > _by HuysArthur · `bug`_ > > 2. 🟣 [#23466](https://github.com/open-webui/open-webui/issues/23466) **issue: Random response stops after tool call** > _Matches the core failure mode of responses randomly stopping after a tool call in Open WebUI. Although this report is less deterministic, it points to the same class of post-tool-call continuation bug._ > _by trinhkvo · `bug`_ > > 3. 🟣 [#24607](https://github.com/open-webui/open-webui/issues/24607) **issue: Incorrect tool parsing with several tool calls (specially provided with open-terminal)** > _Related because it describes problems once several tool calls have occurred, including raw tool output parsing and unexpected stopping. The new issue also appears after many sequential tool calls with open-terminal._ > _by N-point-N · `bug`_ > > 4. 🟣 [#21768](https://github.com/open-webui/open-webui/issues/21768) **issue: OpenAI-compatible streaming: finish_reason incorrectly returned as "stop" after streaming tool_calls** > _Highly relevant if the new issue is actually caused by Open WebUI returning the wrong streaming finish_reason after tool-call chunks. That would make agent frameworks think generation is complete and stop the loop prematurely._ > _by Sechma · `bug`_ > > 5. 
🟣 [#23863](https://github.com/open-webui/open-webui/issues/23863) **issue: Tool calls with Gemma 4 requires `default` -> `native` -> `default` toggling of `Function Calling`** > _Relevant as another tool-calling regression with Gemma 4 in Open WebUI, specifically around native/default function-calling behavior. Since the new issue reproduces with Gemma models and tool calls, it may share the same underlying tool-calling path._ > _by gitfrederic · `bug`_ > > > 💡 If your issue is a duplicate, please close it and add any additional details to the existing issue instead. > > _This comment was generated automatically._ React with 👍 if helpful, 👎 if not. #23466 and #24607 - Not related because my experience doesn't show printing tool calls, mine experience is just stops generating or won't continue #20896 - May be related, but their use case is that cli coding agent uses openweb-ui as the backend for API. So their setup may make troubleshooting more difficult. #21768 - May be related. #23863 - Not related, switching to Default tool calling doesn't fix my issue.

@vektorprime commented on GitHub (May 19, 2026):

> Are you hitting the context limit? OWUI doesn't really tell you if you are. It just stops, like you're describing.
>
> The only way to know is if your backend records what context you're using. It won't show up under OWUI (even with usage enabled) on tool calls if it fails during it.

No, I am not. The context here is only 11k to 15k (when it stops), and my window size (KV cache size) is 160K+. Further, I am not hitting the PER generation limit either, as confirmed by my vLLM logs.

I even tried setting a VERY high (65k) token generation limit to see if it helps, and it did not.

(APIServer pid=1) INFO 05-19 17:19:08 [logger.py:63] Received request chatcmpl-a8d4c651970416da: params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], thinking_token_budget=None, include_stop_str_in_output=False, ignore_eos=False, max_tokens=65536, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None), lora_request: None.
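
To make the check above easy to repeat, here is a minimal sketch that pulls `max_tokens` out of a vLLM request log line like the one quoted and compares it against the window. The helper names `extract_sampling_param` and `fits_in_window` are hypothetical (they are not part of vLLM or Open WebUI), the log line is shortened, and the 15k prompt and 160k window are just the rough figures mentioned in this comment.

```python
import re
from typing import Optional

# Shortened copy of the vLLM request log line quoted above.
LOG_LINE = (
    "(APIServer pid=1) INFO 05-19 17:19:08 [logger.py:63] Received request "
    "chatcmpl-a8d4c651970416da: params: SamplingParams(n=1, temperature=0.6, "
    "top_p=0.95, top_k=20, max_tokens=65536, min_tokens=0), lora_request: None."
)


def extract_sampling_param(line: str, name: str) -> Optional[str]:
    """Return the raw text value of `name` from a logged SamplingParams(...).

    Hypothetical helper for illustration only. It reads up to the next comma
    or closing parenthesis, so it suits scalar values such as max_tokens,
    not list values that contain commas.
    """
    match = re.search(rf"\b{re.escape(name)}=([^,)]+)", line)
    return match.group(1) if match else None


def fits_in_window(prompt_tokens: int, max_tokens: int, window: int) -> bool:
    """True when the prompt plus the full generation budget fits the window."""
    return prompt_tokens + max_tokens <= window


max_tokens = int(extract_sampling_param(LOG_LINE, "max_tokens"))
# Assumed figures: ~15k tokens of context when it stops, ~160k window.
print(max_tokens, fits_in_window(15_000, max_tokens, 160_000))
```

With those assumed numbers the request stays well inside the window, which is consistent with the claim that neither the context limit nor the per-generation limit is being hit.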


@vektorprime commented on GitHub (May 19, 2026):

> For example, opening a single large file consumes 100k context for me.

The files I am working with are created by the prompt. They only contain about 10-30 characters each, and they are only modified by the steps; they don't get bigger.


@frenzybiscuit commented on GitHub (May 19, 2026):

Okay... I can't replicate this.

Maybe someone else can?


@Classic298 commented on GitHub (May 19, 2026):

I also cannot replicate. This has been reported several times in the past, and every time it was a provider/upstream issue in the inference layer. Sending to discussions for now because it is absolutely not replicable here.

Reference: github-starred/open-webui#123741