Mirror of https://github.com/open-webui/open-webui.git (synced 2026-03-09 23:35:09 -05:00)
issue: When using llama.cpp as backend, pressing stop doesn't stop token generation #5950
Originally created by @OracleToes on GitHub (Aug 4, 2025).
Check Existing Issues
Installation Method
Pip Install
Open WebUI Version
v0.6.18
Ollama Version (if applicable)
No response
Operating System
Arch
Browser (if applicable)
No response
Confirmation
Expected Behavior
Pressing the stop button should stop token generation when using llama.cpp's `llama-server` to serve models. Pressing the regenerate button should work.
Actual Behavior
Pressing the stop button stops WebUI from streaming tokens and hides the stop button, but the terminal running `llama-server` shows that the model is still generating tokens. You can submit new text generation requests, and the terminal reports that they are being queued up.
Steps to Reproduce
Logs & Screenshots
Here is a log from llama.cpp's `llama-server`. At the beginning I start a fresh text generation instance, then I stop it with WebUI, but the terminal shows no indication of stopping. Then I queue up another text generation request, and you can see that reflected here. I think I did this two more times, but before the third one was sent, the first task finally finished generating tokens and started on the second task.
Additional Information
The only way I know of to stop llama.cpp early is to Ctrl-C the server from the terminal. This is not an option if you don't have direct access to the machine hosting the server (i.e., you're hosting your own WebUI on a PC and serving it to your phone).
I am aware of #1166, but it was specific to Ollama and has been closed; there is discussion there of reopening it, as it seems it's not quite solved for everyone. That aside, this is an issue specific to llama.cpp.
A similar issue appears to have been solved for oobabooga's text-generation-webui, and that issue was also specific to llama.cpp:
https://github.com/oobabooga/text-generation-webui/issues/6966
This issue is very reproducible, and there are other examples of users hitting it, but there are no open issues for this specifically on this project's issue tracker.
@rgaricano commented on GitHub (Aug 4, 2025):
I think it's an issue of llama.cpp (and the EOS/EOG/EOT tokens defined in the model's config) ...
You can try setting the Advanced Param "Stop Sequence" for the model, or editing the model's modelfile.
(search https://github.com/ggml-org/llama.cpp for references)
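For models served through Ollama (rather than llama-server directly), stop sequences are declared in a Modelfile with the `PARAMETER stop` directive. A minimal sketch; the `FROM` path is hypothetical, and the stop strings depend on the model's actual chat template (the ones below are the ChatML-style tokens used by Hermes/Qwen-family templates):

```
# Hypothetical Modelfile for a local GGUF; adjust the path and the
# stop tokens to match your model's chat template.
FROM ./POIROT-ECE-1.0.Q4_K_M.gguf

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
```

Note that stop sequences only control where generation ends naturally; they do not by themselves make the stop button abort an in-flight request.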
@OracleToes commented on GitHub (Aug 4, 2025):
Ok so i should make a modelfile and define the stop sequence in it? I don't see anything related to that on the readme.md though.
@tjbck commented on GitHub (Aug 4, 2025):
This is most likely an issue with llama-server; there's not much we can do from our end, as far as I'm concerned.
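For context, the usual cancellation contract for streaming completions is connection-based: pressing stop closes the HTTP stream, and the server's generation loop is expected to notice the disconnect and abort. A minimal sketch of that contract, with a fake generator standing in for the server's token loop (all names here are illustrative, not Open WebUI or llama.cpp APIs):

```python
# Illustrative only: token_stream stands in for the server's generation
# loop, and Client.close() stands in for the browser/WebUI closing the
# streaming HTTP connection when the stop button is pressed.

def token_stream(n_tokens, cancelled):
    """Yield tokens until the cancellation flag reports True."""
    for i in range(n_tokens):
        if cancelled():
            return  # server-side abort once the client disconnects
        yield f"tok{i}"

class Client:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

client = Client()
received = []
for tok in token_stream(1000, lambda: client.closed):
    received.append(tok)
    if len(received) == 5:  # user presses "stop" after five tokens
        client.close()

print(len(received))  # prints 5: generation halts once the stream closes
```

The bug described above corresponds to the server side of this contract never firing: if llama-server keeps generating after the stream closes, the fix belongs there; if the upstream connection is never actually closed when stop is pressed, it belongs in the client.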
@rgaricano commented on GitHub (Aug 4, 2025):
Really, I'm not sure what the possible workarounds are; it could also have other causes.
Searching the llama.cpp repo, I found https://github.com/ggml-org/llama.cpp/issues/14051, which seems related and involves the same model you are using.
It's just a suggestion; I'm only indicating that this isn't an open-webui issue, and the problem/solution should be redirected to the llama.cpp repo.
@OracleToes commented on GitHub (Aug 4, 2025):
That issue seems to be more related to a fatal error resulting in a crash; at a glance it doesn't seem related.
It seems like I'm experiencing the same issue with Ollama, though, and all of the models served by it were pulled from HF, so they all have modelfiles with proper stop sequences.
I'll stop generation with the stop button and try to generate new text; it gives me a network error, and (I think) when it's done generating the text for the last task, it eventually starts generating for the new one. ( #16224 )
@rgaricano commented on GitHub (Aug 4, 2025):
Are you sure?
All the Hermes 2 Pro GGUFs that I saw on HF don't have a modelfile:
https://huggingface.co/bartowski/Hermes-2-Pro-Mistral-7B-GGUF/tree/main
https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/tree/main
https://huggingface.co/Utkarsh55/Nous-Hermes-Llama-2-7B-GGUF-Profiles/tree/main
...
@OracleToes commented on GitHub (Aug 4, 2025):
Ah, I see what you mean. This is the model I'm using, which is a Qwen3-based merge:
https://huggingface.co/mradermacher/POIROT-ECE-1.0-GGUF
But the log from llama.cpp says it's using a Hermes chat format. I'll try some other models, like base Qwen and Llama, to see if they give the same issue.
@rgaricano commented on GitHub (Aug 4, 2025):
Check the model's config/metadata: https://huggingface.co/spaces/CISCai/gguf-editor
@OracleToes commented on GitHub (Aug 4, 2025):
At this point I'm quite sure it's not a model-specific issue. I checked the POIROT-ECE model with the GGUF editor, and it has metadata, as well as stop sequences defined in that metadata.
I also just tried this model, which correctly has the Llama 3.x chat format, and asked it to generate a long story. I stopped the generation and hit regenerate; same as before, llama.cpp reported a task being queued in the terminal, and no new text was generated in WebUI. After a while the model generated some gibberish, which appeared to be the tail end of the 'story'. This is unexpected behavior and still buggy.
I also tried a Josiefied (abliterated) Qwen3 finetune, and llama.cpp logged it as using the Hermes 2 Pro chat format, so I don't think the POIROT model has any malformed metadata.