[GH-ISSUE #452] Switching between conversations kills response streaming #27592
Originally created by @robertvazan on GitHub (Jan 11, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/452
Bug Report
Description
Bug Summary:
Response streaming works only if you stay in one conversation.
Steps to Reproduce:
1. Send a prompt and wait for the response to start streaming.
2. While the response is still streaming, switch to another conversation.
3. Switch back to the original conversation.
Expected Behavior:
Response streaming should continue in the background. Switching back should show the current state of the response.
Actual Behavior:
The response is gone along with all controls.
Reproduction Details
Confirmation:
I will provide additional information if you cannot reproduce the bug with the information provided so far.
@tjbck commented on GitHub (Jan 11, 2024):
Hi, thanks for creating this issue! Unfortunately, because of the nature of the APIs provided by Ollama (and LLM providers in general), there isn't a quick fix for this. FYI, ChatGPT experiences this issue as well. I'll close this issue as not planned for now, but I'll see if I can do anything to accommodate your use case. Thanks!
@robertvazan commented on GitHub (Jan 11, 2024):
ChatGPT is much faster, though. It's surprising that it's hard to resume streaming when the backend keeps consuming the stream anyway, even after the client disconnects. I guess the best workaround here is to open a separate browser tab.
@robertvazan commented on GitHub (Jan 11, 2024):
BTW, ChatGPT does not discard the response but rather cancels it, keeping what was generated so far. That also makes it easier to regenerate the response. Ollama WebUI instead keeps consuming the response in the background, with no way to see what's going on and no way to stop the process.
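(For illustration: a minimal sketch of the cancel-but-keep-partial-text behaviour described above, written against Ollama's streaming `/api/chat` format. The `streamChat` function name is hypothetical, not Open WebUI code.)

```typescript
// Sketch: stream a chat response and, on cancellation, keep the partial text.
// Targets Ollama's /api/chat NDJSON stream; `streamChat` is a hypothetical name.
async function streamChat(prompt: string, controller: AbortController): Promise<string> {
  let partial = "";
  let buf = "";
  try {
    const res = await fetch("http://localhost:11434/api/chat", {
      method: "POST",
      signal: controller.signal, // abort() also closes the HTTP connection
      body: JSON.stringify({
        model: "llama2",
        messages: [{ role: "user", content: prompt }],
      }),
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      buf += decoder.decode(value, { stream: true });
      const lines = buf.split("\n");
      buf = lines.pop() ?? ""; // keep an incomplete trailing line for the next read
      for (const line of lines) {
        if (line.trim()) partial += JSON.parse(line).message?.content ?? "";
      }
    }
  } catch (err) {
    if ((err as Error).name !== "AbortError") throw err; // cancelled: fall through
  }
  return partial; // on abort, the caller still gets everything received so far
}
```

Calling `controller.abort()` mid-stream resolves the promise with the text received so far instead of discarding it, which is the ChatGPT behaviour described above.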
@justinh-rahb commented on GitHub (Jan 11, 2024):
@robertvazan it's important to note that ChatGPT runs on an infrastructure of 10,000+ GPUs, one of which is dedicated to generating your response, even if you think you've canceled the request from your end. In contrast, the Ollama WebUI operates using a single instance of Ollama, and does not support high concurrency or initiating a new generation while another is still being processed.
This limitation arises from the upstream project that provides the API utilized by this webui. As @tjbck suggested, there isn't a quick solution to this issue, as it originates from the core functionality of that project.
@robertvazan commented on GitHub (Jan 11, 2024):
@justinh-rahb Isn't the Ollama CLI using the same APIs? If I press Ctrl+C in the CLI, the response is cancelled immediately and CPU load drops to zero within a second or two. Ditto if I invoke the API directly via curl and press Ctrl+C to make curl close the connection. So as far as I can tell, closing the connection is an easy way to cancel the whole process on the Ollama side.
@tjbck commented on GitHub (Jan 12, 2024):
@robertvazan I'll see what can be done; it's just that I'm incredibly busy this week, so I don't have the capacity to take a look at the moment. If you manage to find a solution in the meantime, feel free to make a PR!
@robertvazan commented on GitHub (Jan 12, 2024):
@tjbck Just fixing #456 and stopping generation automatically when the user switches chats (as ChatGPT does) would be sufficient as a fix. Keeping the stream running in the background and showing an up-to-date response when the user navigates back would be a new feature, and a convenient and useful one.
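(A sketch of the automatic-stop part, assuming a single in-flight request tracked by the client; `activeController`, `startGeneration`, `switchChat`, and `loadChat` are illustrative names, not Open WebUI internals.)

```typescript
// Sketch: abort any in-flight generation whenever the user switches chats.
// All names here are illustrative; this is not Open WebUI's actual code.
let activeController: AbortController | null = null;

function startGeneration(prompt: string): Promise<string> {
  activeController?.abort(); // never run two generations at once
  activeController = new AbortController();
  return streamChat(prompt, activeController); // from the earlier sketch
}

function switchChat(chatId: string): void {
  // Stopping generation on navigation, as ChatGPT does: aborting closes the
  // connection, which (per the curl test below) makes Ollama stop generating.
  activeController?.abort();
  activeController = null;
  loadChat(chatId);
}

function loadChat(chatId: string): void {
  // Placeholder for rendering the selected conversation in the UI layer.
}
```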
@tjbck commented on GitHub (Jan 12, 2024):
@robertvazan If you could test that it works as expected using only the APIs provided by Ollama, instead of comparing it to the Ollama CLI, that would be tremendously helpful! Also, feel free to make a PR with a possible fix :)
@robertvazan commented on GitHub (Jan 12, 2024):
@tjbck I did test it with the API directly. I ran the first curl example from the Ollama chat API docs with a "Write a long blog post" prompt and pressed Ctrl+C while it was in the middle of the first paragraph. Curl exited, the connection was closed, and Ollama's CPU usage dropped to zero almost immediately.
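(The same test, reproduced without curl: a standalone sketch against the same `/api/chat` endpoint, aborting after roughly two seconds as a stand-in for Ctrl+C. Runnable as an ES module under Node 18+; the model name and timeout are arbitrary.)

```typescript
// Sketch: start a long generation, then abort mid-stream and watch Ollama's
// CPU usage drop once the socket closes.
const controller = new AbortController();
setTimeout(() => controller.abort(), 2000); // stand-in for pressing Ctrl+C

const res = await fetch("http://localhost:11434/api/chat", {
  method: "POST",
  signal: controller.signal,
  body: JSON.stringify({
    model: "llama2",
    messages: [{ role: "user", content: "Write a long blog post" }],
  }),
});

try {
  // Node's web ReadableStream is async-iterable; print raw NDJSON chunks as they arrive.
  for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
    process.stdout.write(chunk);
  }
} catch {
  console.log("\nConnection closed; Ollama should go idle within a second or two.");
}
```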