enh/issue: performance optimisation #1431

Closed
opened 2025-11-11 14:44:58 -06:00 by GiteaMirror · 31 comments

Originally created by @vertago1 on GitHub (Jul 4, 2024).

Originally assigned to: @tjbck on GitHub.

**Is your feature request related to a problem? Please describe.**
The performance of long chat threads is really bad. I don't know if it is a client-side browser issue or a server-side issue.

**Describe the solution you'd like**
A performance fix would be great, but even the ability to do a shallow clone where only the last context window or so of the chat is cloned would help.

**Describe alternatives you've considered**
I am resorting to starting a fresh chat with memory items added and trying to use the prompt to pick up where the last chat left off.

**Additional context**
I have tried a few different browsers, but the best fix so far has been to start a new chat and try my best to get it back on roughly the same track.

GiteaMirror added the enhancementgood first issuehelp wanted labels 2025-11-11 14:44:58 -06:00

@tjbck commented on GitHub (Jul 4, 2024):

If you could share an export of your chat so that we can diagnose which part of the UI is being the bottleneck, that would be sublime!

@vertago1 commented on GitHub (Jul 4, 2024):

I generated a chat large enough to trigger the same behavior, since it would be a pain to organically create one big enough that I'd be fine with sharing. I've included it along with the generator script (as a .txt, since uploading .py isn't allowed) in case you want to play around with different generated content and sizes.

[generated.json](https://github.com/user-attachments/files/16102826/generated.json)
[generate.py.txt](https://github.com/user-attachments/files/16102827/generate.py.txt)

To reproduce the behavior, I imported the chat and sent a message through the UI. The browser locked up for a bit on the client side, but after waiting it eventually continued, displayed the new message UI, and filled in the assistant message.

The performance is much, much better with a shorter chat. Let me know if that doesn't work for you, or you need more information.

@GrayXu commented on GitHub (Aug 22, 2024):

I encountered similar issues. I restricted token generation for each dialogue to come only from the cache, and cut the chat context length to 0, to rule out speed issues on the LLM serving side.
However, in long threads the interface lag was still very noticeable: while output was streaming to the page, the page became unresponsive.

![1724336557070](https://github.com/user-attachments/assets/51866922-08f6-4f5e-b13d-da2d8194fa2c)
I'm not very knowledgeable about front-end work, so I'm unsure whether this performance recording is helpful.

@tjbck commented on GitHub (Aug 22, 2024):

PR Welcome

@george-elphick-talieisin commented on GitHub (Sep 4, 2024):

Just to add to this: I don't know if this is browser-specific, but I see horrible lockups and slowdowns in Safari on macOS and not in Edge on the same machine.

![image](https://github.com/user-attachments/assets/9ef5b459-a61f-4908-8b39-cb396ca5aeb8)

Not sure if this really helps?

@ingmferrer commented on GitHub (Sep 4, 2024):

I'm experiencing performance issues in Edge, Chrome, and Arc Browser on macOS, specifically when the UI has long conversations. These issues occur particularly when the LLM is responding, which might be related to how the front end handles streaming chunks of data. Additionally, when using Arc Browser on iOS, the browser frequently crashes under similar circumstances, and I have to manually force close the app for it to work again.

@george-elphick-talieisin commented on GitHub (Sep 4, 2024):

In my case, one core sits at 100% for the Safari tab, seemingly consumed by a JavaScript process, but the profiler doesn't tell me what is using all that CPU time.

@quantarion commented on GitHub (Sep 4, 2024):

Same here, Chrome on ChromeOS Flex version 129. It gets very slow as the chat length grows; I often get a prompt from Chrome asking whether I want to kill the page or wait for it to finish.

I got this from the profiler in Chrome:

![Screenshot 2024-09-04 13 57 55](https://github.com/user-attachments/assets/e78f5633-87d2-4b0e-aab2-dd52a66ab79d)

However, if I reload the page and continue the chat, the lag is mostly gone.

@george-elphick-talieisin commented on GitHub (Sep 4, 2024):

I've carried on stress-testing in Edge on macOS, which has slightly better profiling, and found slowdown, though not quite as extreme as Safari. Note the following:

![image](https://github.com/user-attachments/assets/52089a5e-4e12-46a1-8032-9dfdfa9b8eca)

Note the "animation time", which is the UI's placeholder box after hitting the Send button. I'm using Groq here, so the thinking time is too long for it to be Groq's fault. Rendering the response uses up quite a lot of CPU, shown in more detail here:

![image](https://github.com/user-attachments/assets/7dfacfbc-b2b0-4456-ad3b-fc017b161e0f)

Hope this helps

@Keithsc commented on GitHub (Sep 19, 2024):

I completely agree with the previous comments regarding the lag in Open-WebUI, especially with long chat threads. The slowness and high CPU usage make the experience quite frustrating. I've found that restarting Firefox seems to help temporarily, but it's inconvenient. It's good to know this isn't just a Firefox issue. Hopefully, this will be resolved soon, as I sometimes even switch to using Firefox on my phone to continue chatting when the lag becomes too severe.

@tjbck commented on GitHub (Sep 19, 2024):

I'll start my thorough investigation shortly but I'd greatly appreciate some help from the community here!

@GrayXu commented on GitHub (Sep 20, 2024):

An interesting observation is that in the latest version 0.3.22, there isn't a noticeable lag or blocking in page interactions with long threads, but the model token output still significantly slows down to ~5 token/s.

@Keithsc commented on GitHub (Sep 20, 2024):

I upgraded to 0.3.22 also, but I still think long threads are a problem. I've looked at Firefox devtools console and there seems to be an awful lot of logging happening in there, but I am not a developer, so maybe that's normal.

What I have discovered is that if I go to the Open-WebUI settings and set "Stream Chat Response" to "off", I still get the reply, just in a single dump, and my CPU is now idle with no lockups. The only downside is that you don't see the response output in real time. This might be a workaround for some?

@quantarion commented on GitHub (Sep 20, 2024):

Hey @tjbck, just tell me what you need; I can compile, test, and send reports as much as you need.

@dncc89 commented on GitHub (Sep 21, 2024):

I've found it's parsing the whole chat history into markdown every time a new token arrives.
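
If that observation is right, caching the rendered HTML per message would mean only the message that is actually streaming gets re-parsed. A minimal sketch of the idea; `renderMarkdown` is a stand-in dummy for whatever parser the UI really uses, and the message shape is assumed:

```javascript
// Count parses so the caching effect is observable in this sketch.
let parses = 0;
const renderMarkdown = (src) => {
  parses++;
  return `<p>${src}</p>`;
};

const htmlCache = new Map(); // messageId -> { content, html }

function renderMessage(id, content) {
  const hit = htmlCache.get(id);
  if (hit && hit.content === content) return hit.html; // unchanged: skip re-parse
  const html = renderMarkdown(content);
  htmlCache.set(id, { content, html });
  return html;
}

function renderChat(messages) {
  // Only messages whose content changed since the last pass are re-parsed.
  return messages.map((m) => renderMessage(m.id, m.content)).join("\n");
}
```

With this shape, a 1000-message history costs one markdown parse per token instead of 1000.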

@GrayXu commented on GitHub (Sep 22, 2024):

Pagination/lazy loading may be a general optimization here.
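
The pagination idea can be sketched as keeping the full history in memory while handing the renderer only a window of the most recent messages, growing it when the user scrolls to the top. The function names here are illustrative, not Open WebUI's:

```javascript
// Return only the most recent `windowSize` messages for rendering.
function visibleWindow(messages, windowSize) {
  return messages.slice(Math.max(0, messages.length - windowSize));
}

// When the user scrolls to the top, grow the window by `step`,
// capped at the full history length.
function onScrollTop(windowSize, step, total) {
  return Math.min(total, windowSize + step);
}
```

This keeps the DOM and markdown-render cost proportional to the window size rather than to the whole chat.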

@thiswillbeyourgithub commented on GitHub (Sep 22, 2024):

> I've found it's parsing the whole chat history into markdown every time a new token arrives.

Hi, I'm coming from #4881 and wanted to add that I think it might actually be every time a token is rendered, not every time one is received: from my very limited testing, turning off "Fluidly stream large external response chunks" while mermaid charts are being rendered seems to make generation noticeably faster.
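
One way to make "per render" cheaper than "per token", regardless of which is the culprit, is to decouple the two: buffer incoming chunks and flush to the DOM at most once per interval. A sketch under that assumption; `applyToDom` and the clock parameter are stand-ins for testability:

```javascript
// Throttled streaming renderer: tokens accumulate in a buffer and the
// expensive DOM/markdown update runs at most once per `intervalMs`.
function makeThrottledRenderer(applyToDom, intervalMs = 50, now = Date.now) {
  let buffer = "";
  let lastFlush = 0;
  return {
    push(token) {
      buffer += token;
      if (now() - lastFlush >= intervalMs) this.flush();
    },
    flush() {
      if (buffer) {
        applyToDom(buffer);
        buffer = "";
      }
      lastFlush = now();
    },
  };
}
```

In a real UI the final `flush()` would run when the stream closes, so no trailing tokens are lost.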

@tjbck commented on GitHub (Sep 23, 2024):

Testing wanted in latest dev! Several optimisation techniques have been applied to our message rendering process (e.g. infinite scrolling), so significant performance gain is expected here.

@ingmferrer commented on GitHub (Sep 24, 2024):

After testing [git-822c47c](https://github.com/open-webui/open-webui/pkgs/container/open-webui/278205972?tag=git-822c47c), I noticed a significant improvement. The chat experience feels much smoother, especially in longer conversations. Considering I was using Arc Browser on iOS, the lag has almost completely disappeared. It was previously unusable, so this is a major enhancement.

@thiswillbeyourgithub commented on GitHub (Sep 24, 2024):

So this fixed my mermaid issue, as completions are now just as fast as regular text, but I suspect a regression with vision models. I've tested a local Ollama model as well as 4o from OpenAI and Claude Sonnet, and in all cases the generation is suspiciously slow, just like mermaid before the fix. This was on Brave Browser on 0.3.26. Am I the only one?

@kaiiiiiiiii commented on GitHub (Sep 24, 2024):

> After testing [git-822c47c](https://github.com/open-webui/open-webui/pkgs/container/open-webui/278205972?tag=git-822c47c), I noticed a significant improvement. The chat experience feels much smoother, especially in longer conversations. Considering I was using Arc Browser on iOS, the lag has almost completely disappeared. It was previously unusable, so this is a major enhancement.

Same for me. Experience in Arc on iOS improved x100 🥳

@kevinleeex commented on GitHub (Oct 14, 2024):

Hi @tjbck, thanks for sorting this out. It runs perfectly with Chromium-based browsers, but the issue still exists on Safari, for both Mac and iPhone.

@vlbosch commented on GitHub (Oct 15, 2024):

> Hi @tjbck, thanks for sorting this out. It runs perfectly with Chromium-based browsers, but the issue still exists on Safari, for both Mac and iPhone.

I can confirm the issue still arises on Safari. After long chats, or prompts with a lot of context, the page hangs until the answer has streamed. Switching between chats with a lot of context is very slow as well.

Please let me know what I can do to help triage the issue on Safari.

@kanubacode commented on GitHub (Nov 1, 2024):

I am experiencing this issue with very long chats on Firefox. What used to take a second or two now takes longer than a minute, and sometimes the connection drops; upon re-establishing it, a session-ID traceback occurs, the page must be reloaded, and the response has to be regenerated to continue the chat. The extremely long response times are the real frustration, though. If I export my chat to plain text, it is about 2.5 MB, which is the size at which this became more than a slowdown, though it was a large inconvenience much earlier.

@kanubacode commented on GitHub (Nov 3, 2024):

@tjbck This seems not to be an issue with Chromium: I was able to continue that chat with fast responses for a couple of days, and then ran into a different issue with long chats on Chromium:

```js
Uncaught (in promise) RangeError: Maximum call stack size exceeded
    at Ze (Chat.svelte:604:30)
    at Ze (Chat.svelte:611:15)
    at Ze (Chat.svelte:611:15)
    at Ze (Chat.svelte:611:15)
    at Ze (Chat.svelte:611:15)
    at Ze (Chat.svelte:611:15)
    at Ze (Chat.svelte:611:15)
    at Ze (Chat.svelte:611:15)
    at Ze (Chat.svelte:611:15)
    at Ze (Chat.svelte:611:15)
```

`createMessagesList` is overflowing. It's unfortunate that my 2-week-old chat is now at a standstill. If there's anything I can do to help address this, let me know.

@vertago1 commented on GitHub (Nov 3, 2024):

> @tjbck This seems to not be an issue with Chromium -- I was able to continue that chat with fast responses for a couple of days, and then... a different issue with long chats on Chromium:
>
> ```js
> Uncaught (in promise) RangeError: Maximum call stack size exceeded
>     at Ze (Chat.svelte:604:30)
>     at Ze (Chat.svelte:611:15)
>     at Ze (Chat.svelte:611:15)
>     at Ze (Chat.svelte:611:15)
>     at Ze (Chat.svelte:611:15)
>     at Ze (Chat.svelte:611:15)
>     at Ze (Chat.svelte:611:15)
>     at Ze (Chat.svelte:611:15)
>     at Ze (Chat.svelte:611:15)
>     at Ze (Chat.svelte:611:15)
> ```
>
> `createMessagesList` is overflowing. It's unfortunate that my 2 week old chat is now at a standstill. If there's anything I can do to help address this, let me know.

I wonder if the recursion could be replaced with a non-recursive function and a heap-allocated list.
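
That suggestion can be sketched as a simple loop: walk from the leaf message up to the root via parent pointers, pushing into an ordinary array, so chat depth can never overflow the call stack. The history shape assumed here (a map of messages keyed by id, each with a `parentId`) is an illustration, not necessarily the real `createMessagesList` input:

```javascript
// Iterative replacement sketch for a recursive message-list builder.
// Stack depth stays constant no matter how long the chat is.
function createMessagesListIterative(history, leafId) {
  const list = [];
  let id = leafId;
  while (id !== null && id !== undefined) {
    const message = history.messages[id];
    if (!message) break;
    list.push(message);
    id = message.parentId; // one hop toward the root per iteration
  }
  return list.reverse(); // root first, leaf last
}
```

A 100,000-message chain that would blow the stack recursively completes in a single pass here.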

@kanubacode commented on GitHub (Nov 3, 2024):

After testing some other browsers, I can confirm that my long chat is just as unusable with Midori and other Gecko-based browsers. Similarly, the stack overflow above occurs in other Blink-based browsers such as Brave and Vivaldi.

@kevinhq commented on GitHub (Feb 16, 2025):

Is this solved? I was about to give up on Open WebUI. Ollama responds fast, but Open WebUI does not. I am on 16 GB RAM, Ryzen 5 3600 6-core, with Debian 12.

@vertago1 commented on GitHub (Feb 16, 2025):

I haven't noticed performance issues with large chats anymore, but there are times when I need to adjust the context length down from the max so that it fits in VRAM and Ollama doesn't fall back to CPU.

@merrime-n commented on GitHub (Jun 1, 2025):

What is the current state of this issue? I am also experiencing serious performance issues with Open WebUI; for instance, old and relatively large chats load too slowly, and sometimes the admin panel tabs take forever to load. It's a pity, because Open WebUI has been a lifesaver for us and performance is its only problem.

@DocStatic97 commented on GitHub (Jun 5, 2025):

I too would like to know the state of this issue.
Large chats (or ones with one or two images) take ages to load, or simply never do.

Reference: github-starred/open-webui#1431