issue: When using llama.cpp as backend, pressing stop doesn't stop token generation #5950

Closed
opened 2025-11-11 16:39:55 -06:00 by GiteaMirror · 9 comments
Owner

Originally created by @OracleToes on GitHub (Aug 4, 2025).

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Pip Install

Open WebUI Version

v0.6.18

Ollama Version (if applicable)

No response

Operating System

Arch

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

Pressing the stop button should stop token generation when using llama.cpp's `llama-server` to serve models. Pressing the regenerate button should also work.

Actual Behavior

Pressing the stop button stops the WebUI from streaming tokens and hides the stop button, but the terminal running llama-server shows that the model is still generating tokens. New text generation requests can be submitted, and the terminal reports that they are being queued up.

Steps to Reproduce

  1. Fresh install of Open WebUI and llama.cpp; serve the WebUI and serve a model with llama-server.
  2. Add an OpenAI API connection in Connections, entering the host and port.
  3. Start a new chat, then stop generation.
  4. Submit a regeneration or new text generation request, and note that nothing happens.
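For context, the client side of this reproduction can be sketched in Python. This is a hypothetical sketch, not Open WebUI's actual code: the host/port are taken from the log below, `requests` is assumed installed, and `parse_sse_line` is an illustrative helper. Closing the streaming connection mid-response is the standard way an HTTP client signals cancellation, which is effectively what the stop button does; the bug report suggests llama-server keeps generating regardless.

```python
import json


def parse_sse_line(line: str):
    """Parse one Server-Sent-Events line from an OpenAI-compatible
    streaming response. Returns the decoded JSON payload, the string
    "DONE" for the terminal "[DONE]" sentinel, or None for non-data
    lines (comments, keep-alives, blanks)."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return "DONE"
    return json.loads(payload)


# Hypothetical manual reproduction against llama-server (address from
# the log below). We open a streaming chat completion and then close
# the connection after the first chunk, mimicking the stop button:
#
# import requests
# with requests.post(
#     "http://127.0.0.1:10100/v1/chat/completions",
#     json={"model": "any", "stream": True,
#           "messages": [{"role": "user", "content": "Write a long story"}]},
#     stream=True,
# ) as resp:
#     for raw in resp.iter_lines(decode_unicode=True):
#         if parse_sse_line(raw or "") is not None:
#             break  # exiting the `with` block closes the connection
```

After the `break`, one would expect llama-server to log a task cancellation; per this report, it instead keeps the slot busy until generation finishes naturally.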

Logs & Screenshots

Here is a log from llama.cpp's `llama-server`.
At the beginning I start a fresh text generation, then I stop it with the WebUI, but the terminal here shows no indication of stopping. I then queue up another text generation request, which you can see reflected here. I think I did this two more times, but before the third one was sent, the first task finally finished generating tokens and started on the second task.

main: server is listening on http://127.0.0.1:10100 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /models 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 770
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 770, n_tokens = 770, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 770, n_tokens = 770
slot      release: id  0 | task 0 | stop processing: n_past = 1044, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     311.26 ms /   770 tokens (    0.40 ms per token,  2473.84 tokens per second)
       eval time =    5371.34 ms /   275 tokens (   19.53 ms per token,    51.20 tokens per second)
      total time =    5682.60 ms /  1045 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
srv  log_server_r: request: GET /models 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 276 | processing task
slot update_slots: id  0 | task 276 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 1739
slot update_slots: id  0 | task 276 | kv cache rm [196, end)
slot update_slots: id  0 | task 276 | prompt processing progress, n_past = 1739, n_tokens = 1543, progress = 0.887292
slot update_slots: id  0 | task 276 | prompt done, n_past = 1739, n_tokens = 1543
srv  log_server_r: request: GET /models 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
srv  log_server_r: request: GET /models 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
srv  cancel_tasks: cancel task, id_task = 276
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
slot      release: id  0 | task 276 | stop processing: n_past = 2427, truncated = 0
slot launch_slot_: id  0 | task 588 | processing task
slot update_slots: id  0 | task 588 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 770
slot update_slots: id  0 | task 588 | kv cache rm [196, end)
slot update_slots: id  0 | task 588 | prompt processing progress, n_past = 770, n_tokens = 574, progress = 0.745455
slot update_slots: id  0 | task 588 | prompt done, n_past = 770, n_tokens = 574
srv  cancel_tasks: cancel task, id_task = 588
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
slot      release: id  0 | task 588 | stop processing: n_past = 771, truncated = 0
slot launch_slot_: id  0 | task 874 | processing task
slot update_slots: id  0 | task 874 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 770
slot update_slots: id  0 | task 874 | kv cache rm [231, end)
slot update_slots: id  0 | task 874 | prompt processing progress, n_past = 770, n_tokens = 539, progress = 0.700000
slot update_slots: id  0 | task 874 | prompt done, n_past = 770, n_tokens = 539
srv  cancel_tasks: cancel task, id_task = 874
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
slot      release: id  0 | task 874 | stop processing: n_past = 771, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /models 127.0.0.1 200
srv  params_from_: Chat format: Hermes 2 Pro
slot launch_slot_: id  0 | task 975 | processing task
slot update_slots: id  0 | task 975 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 770
slot update_slots: id  0 | task 975 | kv cache rm [230, end)
slot update_slots: id  0 | task 975 | prompt processing progress, n_past = 770, n_tokens = 540, progress = 0.701299
slot update_slots: id  0 | task 975 | prompt done, n_past = 770, n_tokens = 540
srv  cancel_tasks: cancel task, id_task = 975
srv  log_server_r: request: POST /chat/completions 127.0.0.1 200
slot      release: id  0 | task 975 | stop processing: n_past = 1033, truncated = 0
srv  update_slots: all slots are idle

Additional Information

The only way I know of to stop llama.cpp early is to Ctrl-C the server from the terminal. This is not an option if you don't have direct access to the device hosting the server (i.e. you're hosting your own WebUI on a PC and serving it to your phone).
I am aware of #1166, but it was specific to Ollama and has been closed; there is discussion there of reopening the issue, as it seems it's not quite solved for everyone. That aside, this is an issue specific to llama.cpp.
A similar issue, also specific to llama.cpp, appears to have been solved for oobabooga's text-generation-webui:
https://github.com/oobabooga/text-generation-webui/issues/6966
This issue is very reproducible and there are other examples of users hitting it, but there are no open issues on this project's issue tracker for this specifically.

GiteaMirror added the bug label 2025-11-11 16:39:55 -06:00

@rgaricano commented on GitHub (Aug 4, 2025):

I think it's an issue with llama.cpp (and the EOS/EOG/EOT tokens defined in the model's config) ...

You can try setting the model's Advanced Param - Stop Sequence, or editing the model's modelfile.
(search in https://github.com/ggml-org/llama.cpp for references)
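For reference, the Stop Sequence advanced parameter corresponds to the standard `stop` field of an OpenAI-style /chat/completions request body, which llama-server's OpenAI-compatible endpoint should honor. A minimal hypothetical sketch (`build_chat_request` is illustrative, and the stop strings below are just examples — check your model's metadata for its actual EOS/EOT strings):

```python
def build_chat_request(messages, stop_sequences):
    """Build an OpenAI-compatible /chat/completions request body that
    asks the backend to halt generation when any stop sequence is
    emitted. Field names follow the OpenAI chat API."""
    return {
        "model": "local-model",  # placeholder; llama-server serves whatever is loaded
        "messages": messages,
        "stream": True,
        "stop": list(stop_sequences),
    }


body = build_chat_request(
    [{"role": "user", "content": "Hello"}],
    ["<|im_end|>", "<|eot_id|>"],  # example stop strings, model-dependent
)
```

Note this only controls when the model stops on its own; it would not by itself make the stop button cancel an in-flight generation.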


@OracleToes commented on GitHub (Aug 4, 2025):

Ok, so I should make a modelfile and define the stop sequence in it? I don't see anything related to that in the README.md, though.


@tjbck commented on GitHub (Aug 4, 2025):

This most likely is an issue from llama-server, nothing much we can do from our end as far as I'm concerned.


@rgaricano commented on GitHub (Aug 4, 2025):

Really, I'm not sure what the possible workarounds are; it could also have other causes.
Searching the llama.cpp repo, I found https://github.com/ggml-org/llama.cpp/issues/14051, which seems related and involves the same model you are using.

It was just a suggestion; I'm only indicating that this isn't an Open WebUI issue, and the problem/solution should be redirected to the llama.cpp repo.


@OracleToes commented on GitHub (Aug 4, 2025):

That issue seems to be more related to a fatal error resulting in a crash; at a glance it doesn't seem related.
It seems like I'm experiencing the same issue with Ollama, though, and all of the models served by it were pulled from HF, so they all have modelfiles with proper stop sequences.
I'll stop generation with the stop button and try to generate new text; it gives me a network error, and (I think) when it's done generating text for the last task, it eventually starts generating for the new one. ( #16224 )

@rgaricano commented on GitHub (Aug 4, 2025):

Sure? None of the Hermes 2 Pro GGUFs that I saw on HF (https://huggingface.co/models?search=hermes%202%20pro%20gguf) have a modelfile:
https://huggingface.co/bartowski/Hermes-2-Pro-Mistral-7B-GGUF/tree/main
https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/tree/main
https://huggingface.co/Utkarsh55/Nous-Hermes-Llama-2-7B-GGUF-Profiles/tree/main
...

@OracleToes commented on GitHub (Aug 4, 2025):

Ah, I see what you mean. This is the model I'm using, which is a Qwen3-based merge:
https://huggingface.co/mradermacher/POIROT-ECE-1.0-GGUF

But the log from llama.cpp says it's using a Hermes chat format. I'll try some other models, like base Qwen and Llama, to see if they give the same issue.


@rgaricano commented on GitHub (Aug 4, 2025):

check the model config/metadata: https://huggingface.co/spaces/CISCai/gguf-editor
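As a sketch of what to look for in that editor: the relevant GGUF metadata is the set of tokenizer fields that define end-of-sequence behaviour. This is a hypothetical helper, not part of any tool mentioned above — `find_stop_tokens` and the sample dict are illustrative, though the key names follow standard GGUF tokenizer conventions:

```python
def find_stop_tokens(gguf_fields):
    """Given a flat dict of GGUF metadata fields (as displayed by a
    GGUF inspector), return the keys that govern when generation
    should end. Key names follow GGUF tokenizer conventions."""
    interesting = (
        "tokenizer.ggml.eos_token_id",
        "tokenizer.ggml.eot_token_id",
        "tokenizer.chat_template",
    )
    return {k: v for k, v in gguf_fields.items() if k in interesting}


# Illustrative metadata dump for a Qwen-style model:
sample = {
    "general.name": "example-model",
    "tokenizer.ggml.eos_token_id": 151645,
    "tokenizer.chat_template": "{# ... #}",
}
stops = find_stop_tokens(sample)
```

If those fields are present and sane, the model should stop on its own at end-of-turn; that is a separate question from whether the server cancels a task mid-generation, which is what this issue is about.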


@OracleToes commented on GitHub (Aug 4, 2025):

At this point I'm quite sure it's not a model-specific issue. I checked the POIROT-ECE model with the GGUF editor, and it has metadata, with stop sequences defined in that metadata.
I also tried this model (https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF), which correctly uses the Llama 3.x chat format, and asked it to generate a long story. I stopped the generation and hit regenerate, and same as before: llama.cpp reports a task being queued in the terminal, and no new text is generated in the WebUI. After a while the model generated some gibberish, which appeared to be the tail end of the 'story'. This is unexpected and still buggy.

I also tried a Josiefied (abliterated) Qwen3 finetune, and llama.cpp logs it as using the Hermes 2 Pro chat format, so I don't think the POIROT model has any malformed metadata.


Reference: github-starred/open-webui#5950