Mirror of https://github.com/open-webui/open-webui.git (synced 2026-05-06 10:58:17 -05:00)
[GH-ISSUE #8541] OpenWebUI-Ollama does not fully utilize NVIDIA GPU when context length or parallel sessions increase #15162
Originally created by @rpaGuyai on GitHub (Jan 14, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/8541
Bug Report
Important Notes
Before submitting a bug report: Please check the Issues or Discussions section to see if a similar issue or feature request has already been posted. It's likely we're already tracking it! If you’re unsure, start a discussion post first. This will help us efficiently focus on improving the project.
Collaborate respectfully: We value a constructive attitude, so please be mindful of your communication. If negativity is part of your approach, our capacity to engage may be limited. We’re here to help if you’re open to learning and communicating positively. Remember, Open WebUI is a volunteer-driven project managed by a single maintainer and supported by contributors who also have full-time jobs. We appreciate your time and ask that you respect ours.
Contributing: If you encounter an issue, we highly encourage you to submit a pull request or fork the project. We actively work to prevent contributor burnout to maintain the quality and continuity of Open WebUI.
Bug reproducibility: If a bug cannot be reproduced with a `:main` or `:dev` Docker setup, or a pip install with Python 3.11, it may require additional help from the community. In such cases, we will move it to the "issues" Discussions section due to our limited resources. We encourage the community to assist with these issues. Remember, it’s not that the issue doesn’t exist; we need your help!
Note: Please remove the notes above when submitting your post. Thank you for your understanding and support!
Installation Method
[Describe the method you used to install the project, e.g., git clone, Docker, pip, etc.]
Environment
Open WebUI Version: [e.g., v0.3.11]
Ollama (if applicable): [e.g., v0.2.0, v0.1.32-rc1]
Operating System: [e.g., Windows 10, macOS Big Sur, Ubuntu 20.04]
Browser (if applicable): [e.g., Chrome 100.0, Firefox 98.0]
Confirmation:
Expected Behavior:
[Describe what you expected to happen.]
Actual Behavior:
[Describe what actually happened.]
Description
Bug Summary:
I am hosting Open WebUI on my server (specs: AWS g4dn.12xlarge, 192 GB RAM, 4 × NVIDIA Tesla T4 GPUs, 16 GB each, 64 GB GPU memory in total).
Issues and scenarios:
I have found a sweet spot for optimized results: when the context length is set to 11000 and Environment="OLLAMA_NUM_PARALLEL=10" is set in the ollama.service file, it works well, utilizing all 4 GPUs with minimal CPU load.
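For reference, the parallel-session setting above is typically applied as a systemd drop-in rather than by editing ollama.service directly. A minimal sketch, assuming a systemd-managed Ollama install; the value 10 is the reporter's sweet spot, and `OLLAMA_SCHED_SPREAD` is an optional extra setting (assumption: supported by your Ollama version) that spreads the model across all GPUs instead of packing it into as few as possible:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# Create with: sudo systemctl edit ollama
[Service]
Environment="OLLAMA_NUM_PARALLEL=10"
# Optional (assumption): force spreading layers across all 4 GPUs.
Environment="OLLAMA_SCHED_SPREAD=1"
```

After editing, apply with `sudo systemctl daemon-reload && sudo systemctl restart ollama`.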
However, if I increase either the context length to, say, 15000 or num parallel to, say, 15, the speed drops drastically and the load is shared roughly 50/50 between CPU and GPU. The GPU is not fully utilized, causing slow responses with just 5-6 concurrent sessions.
If I further increase either the context length to 20K or num parallel to 20, then in such cases and beyond, Ollama stops using the GPU and the load shifts entirely to the CPU, which kills the speed completely.
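To confirm whether the model has been partially or fully offloaded to CPU at a given setting, the GPU/CPU split can be observed directly (a diagnostic sketch; column layout may vary by Ollama version):

```shell
# Show loaded models and how much of each is resident on GPU vs CPU.
# The PROCESSOR column reads e.g. "100% GPU" or "52%/48% CPU/GPU"
# when layers have spilled into system RAM.
ollama ps

# Watch per-GPU memory and utilization every 2 seconds while
# concurrent sessions are running.
nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu \
           --format=csv -l 2
```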
Need help from experts, please: is this due to some configuration in Ollama or Open WebUI, or is it because each T4 is only 16 GB? Do we need the entire GPU memory on a single GPU to fully utilize it, given that in my g4dn.12xlarge the 64 GB (16 × 4) is split across 4 GPUs?
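One plausible explanation for the behavior above is that KV-cache memory grows linearly with both context length and the number of parallel slots, so the two settings multiply. The sketch below is a back-of-envelope estimate for a hypothetical Llama-3-8B-class model (the actual model is not stated in the report); all architecture numbers (layers, KV heads, head dimension, fp16 cache) are assumptions, not measurements:

```python
# Rough KV-cache VRAM estimate for an assumed Llama-3-8B-style model:
# 32 layers, 8 KV heads, head dim 128, fp16 (2-byte) cache entries.
def kv_cache_bytes(num_ctx: int, num_parallel: int,
                   n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Per token the cache stores K and V (factor 2) for every layer and
    KV head; total batched context is num_ctx * num_parallel tokens."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * num_ctx * num_parallel

gib = 1024 ** 3
for ctx, par in [(11_000, 10), (15_000, 15), (20_000, 20)]:
    print(f"ctx={ctx:>6} parallel={par:>2} -> "
          f"KV cache ~{kv_cache_bytes(ctx, par) / gib:.1f} GiB")
```

Under these assumptions the reported sweet spot (11000 × 10) needs roughly 13 GiB of KV cache, while 20000 × 20 needs roughly 49 GiB before model weights and runtime overhead are counted, which approaches the 64 GB total split across 16 GB cards and could plausibly push Ollama into partial or full CPU offload.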
Reproduction Details
Steps to Reproduce:
[Outline the steps to reproduce the bug. Be as detailed as possible.]
Logs and Screenshots
Browser Console Logs:
[Include relevant browser console logs, if applicable]
Docker Container Logs:
[Include relevant Docker container logs, if applicable]
Screenshots/Screen Recordings (if applicable):
[Attach any relevant screenshots to help illustrate the issue]
Additional Information
[Include any additional details that may help in understanding and reproducing the issue. This could include specific configurations, error messages, or anything else relevant to the bug.]
Note
If the bug report is incomplete or does not follow the provided instructions, it may not be addressed. Please ensure that you have followed the steps outlined in the README.md and troubleshooting.md documents, and provide all necessary information for us to reproduce and address the issue. Thank you!