Mirror of https://github.com/open-webui/open-webui.git (synced 2026-05-08 04:16:03 -05:00)
num_gpu (Ollama) parameter in settings seems to be poorly explained by the hint. #3196
Originally created by @mario-mlc on GitHub (Jan 5, 2025).
I finally managed to make Open WebUI work perfectly through the official Docker container with CUDA support, using Ollama on the host.

I had no difficulties at all making this setup WORK on Ubuntu 22 Server, but it took a lot of time to make it WORK WELL (with fair performance).

At first it was using little to no GPU, A LOT of system RAM, swap to disk and, of course, all CPU cores, taking a huge amount of time just for the model to answer "Hello!" to a simple test greeting. My setup is a dual RTX 3090 machine (and yes, with Ollama I can seamlessly split my big 44 GB model across both GPUs' VRAM, works like a charm), but it has a 14th-gen i3 CPU and just 16 GB of RAM, neither of which is intended for LLM processing.
Reading a lot of posts on other channels, I found an issue on another system with a GPU-number slider whose answer gave me a SPARK, and applying the same idea to the Open WebUI settings WORKED, at least for me:

I had set the num_gpu (Ollama) parameter to 2, because I have 2 RTX 3090 boards. I don't know if I was too dumb to have made that inference, but it turns out this number is ACTUALLY the number of LAYERS OF YOUR MODEL to run on the GPU (which, in my case, is 64). Reading the hint the other way around had seemed absurd, since it's hard to imagine a machine with 64 GPU cards installed, hence my doubt. As soon as I changed the parameter to 64, it started working as expected: GPU usage soared to almost 100%, CPU usage became marginal compared to before, and generation was quite fast.
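To make the semantics concrete, here is a minimal sketch of how num_gpu travels to Ollama as a per-request option via its REST API. The model name, prompt, and layer count are example values, and the helper function is hypothetical; the point is that num_gpu counts offloaded layers, not GPU devices:

```python
import json

def build_generate_payload(model: str, prompt: str, gpu_layers: int) -> dict:
    """Build a request body for Ollama's /api/generate endpoint.

    num_gpu is the number of MODEL LAYERS to offload to the GPU(s),
    not the number of GPU devices installed in the machine.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            # 64 = all layers of this example model on GPU; 2 would mean
            # "only 2 layers on GPU, the rest on CPU" -- the slow setup above.
            "num_gpu": gpu_layers,
        },
    }

payload = build_generate_payload("llama3:70b", "Hello!", gpu_layers=64)
print(json.dumps(payload, indent=2))

# Sending it requires a running Ollama server (default http://localhost:11434):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:11434/api/generate",
#       data=json.dumps(payload).encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```

With num_gpu below the model's layer count, the remaining layers fall back to system RAM and CPU, which matches the behavior I saw.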
I think the hover hint for this setting (num_gpu (Ollama)) should be rewritten to reflect this reality. It would probably save a lot of hours and some headaches for many people, and it is something quick and easy to fix (for those who have access and know how to do it).

The current text is "Set the number of GPU devices for computation", plus some further explanation in that direction. I'd suggest something like: "Set the number of model layers to fit across all your GPUs combined. Setting this below your model's layer count will keep the remaining layers in system RAM and process them on the CPU."
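For reference, the same parameter can also be pinned outside Open WebUI in an Ollama Modelfile, which makes its meaning explicit per model. This is a sketch assuming the standard Modelfile PARAMETER syntax; the base model name is just an example:

```
FROM llama3:70b
# num_gpu = number of model layers to offload to the GPU(s), NOT the GPU count.
# Set it to the model's layer count (or higher) for full GPU offload.
PARAMETER num_gpu 64
```

After loading a model, `ollama ps` reports the resulting CPU/GPU split, which is a quick way to confirm the layers actually landed on the GPUs.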