[GH-ISSUE #3511] On Windows, launching ollama from the shortcut or by clicking the executable causes very slow token generation, but launching from the command line is fast #27923

Closed
opened 2026-04-22 05:34:42 -05:00 by GiteaMirror · 44 comments

Originally created by @lrq3000 on GitHub (Apr 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3511

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Since I installed ollama (v0.1.30) on Windows 11 Pro, I have run into a peculiar issue. When I launch ollama from the installed shortcut, which launches "ollama app.exe", or when I boot up my OS (which also starts the same shortcut, as configured by the ollama installer), ollama is extremely slow. If I do "ollama run deepseek-coder", the model startup takes a very long time, several minutes, and when I type any input (e.g., Hello), it again takes several minutes to generate each token (instead of 200-500ms/T with the workarounds).

However, I could fix the issue by simply closing the systray icon, and then either:

  • type ollama serve in a terminal, but then I need to keep this open and I don't get the ollama systray icon.
  • type ollama run deepseek-coder (or any other model), which will then also launch the ollama systray icon, just like launching ollama app.exe, but this time it works flawlessly, just like ollama serve.

I can confirm I can easily reproduce the bug simply by launching ollama app.exe manually, and the bug is not present with ollama serve and ollama run <model> (once ollama app.exe is first closed of course).
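
A sketch of this kill-and-relaunch workaround in PowerShell (the process names "ollama app" / "ollama" and ollama being on PATH are assumptions based on the default Windows install):

```powershell
# Close the tray-app instance that was started from the shortcut / at login,
# then relaunch from the console, which is the fast path described above.
Stop-Process -Name "ollama app" -ErrorAction SilentlyContinue
Stop-Process -Name "ollama" -ErrorAction SilentlyContinue
ollama run deepseek-coder
```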

I read the logs but I did not find anything particularly telling. I will post a trace soon.

/EDIT: Here are the logs for when I launch ollama app.exe and it's slower (I launched ollama app.exe from the Windows shortcut, then ollama run deepseek-coder:6.7b-instruct-q8_0, then typed Hello as a prompt, then CTRL-C to stop the generation, which was too slow, after 2 tokens):

app.log (https://github.com/ollama/ollama/files/14892997/app.log)
server.log (https://github.com/ollama/ollama/files/14892998/server.log)

Here are the logs for when I launch ollama run deepseek-coder:6.7b-instruct-q8_0 directly when ollama app.exe is killed:

app.log (https://github.com/ollama/ollama/files/14893003/app.log)
server.log (https://github.com/ollama/ollama/files/14893002/server.log)

What did you expect to see?

200-500ms/T generation speed and much faster model initialization, instead of several minutes for each.

Steps to reproduce

Launch ollama app.exe on Windows; this will be much slower than ollama serve or ollama run <model>.

Are there any recent changes that introduced the issue?

I don't know; I never used ollama before (it was not available on Windows until recently).

OS

Windows

Architecture

x86

Platform

No response

Ollama version

0.1.30

GPU

Nvidia

GPU info

Nvidia GeForce 3060 Laptop

CPU

Intel

Other software

Intel i7-12700h

GiteaMirror added the bug and windows labels 2026-04-22 05:34:43 -05:00

@dhiltgen commented on GitHub (Apr 12, 2024):

This definitely isn't expected behavior. Looking at the logs, they seem ~identical, so I'm not sure what's going wrong. Is it possible there's an AV involved which could be slowing things down? Can you fire up Task Manager, watch it in both scenarios, and see if there are any notable differences?


@lrq3000 commented on GitHub (Apr 13, 2024):

@dhiltgen Thank you for your attention and help with this issue. There is indeed an AV, but looking at the task manager it did not seem involved; I will try again with it disabled.


@lrq3000 commented on GitHub (Apr 13, 2024):

/TL;DR: the issue now happens systematically when double-clicking on the ollama app.exe executable (without even a shortcut), but not when launching it from cmd.exe or PowerShell. More precisely, launching by double-clicking makes ollama.exe use 3-4x as much CPU and also increases RAM usage, and hence causes models not to fit in memory/CPU when they should run fine on my system.

@dhiltgen Unfortunately, disabling all my protection systems (antivirus and firewall) did not help. The only thing that helps is closing the ollama app.exe that is launched at startup and relaunching it, or just typing ollama run <model>, and then it works fine, even with all my protection systems activated.

In the task manager I don't remember seeing anything in particular but I'll try again.

/EDIT: I retried several times, monitoring what happens in the task manager; there are differences (observed in both the native task manager and Process Explorer).

  • It uses 1100-1300 MB of memory when working fine (95% total memory usage, because I have my web browser open too) with 1-11% CPU usage, and my whole OS is responsive; but it uses 1450-1700 MB when it's buggy (99% memory usage), with 30-40% CPU usage and everything laggy.
  • When it works, it loads the model in a few seconds and generates a response almost instantly (or more accurately, it starts streaming right away and streams as fast as ChatGPT), whereas when it's slow it gets stuck for several minutes or dozens of minutes on each token.
  • It makes my computer laggy, including when I close the app, so whatever is happening seems to somehow affect closing too; maybe it's the memory assignments?

Additional observations not related to the task manager:

  • This issue happens even if I wait 20 min after my computer boots up before launching ollama, so I'm sure all background apps finished loading long ago.
  • The issue also happens if I /bye and then ollama run <model> again (the startup time is faster since the model seems to still be loaded in memory, but generation is just as slow).
  • The issue also happens if I relaunch ollama app.exe with the exact same shortcut as the one used at startup.

After I wrote the above observations, my ollama got updated today to 0.1.31, and although I still observe the points above after the update, it seems to have made the behavior more reproducible (or maybe I just did not notice before - /EDIT: actually I'm pretty sure this behavior happened in past releases too; I just did not notice because I did not expect it to be possible). Either way, hopefully this will make tracking down this issue much easier. Here are the new insights, which I think you will find very interesting:

  • I can now systematically reproduce the issue without restarting, by simply killing the currently running ollama.exe / ollama app.exe and then double-clicking on C:\Users\<username>\AppData\Local\Programs\Ollama\ollama app.exe (hence no need for a shortcut). Then simply run a model big enough for a 16 GB RAM system such as mine, like ollama run deepseek-coder:6.7b-instruct-q8_0, and it will start behaving badly, including on closing, when it makes the whole OS lag and become unresponsive (a sign of a RAM overflow).
  • I can systematically avoid the issue (i.e., get good performance) by first killing ollama.exe and then: either launching C:\Users\<username>\AppData\Local\Programs\Ollama\ollama app.exe in a terminal (I tried both the old terminal and PowerShell; it works in both cases) and then again ollama run deepseek-coder:6.7b-instruct-q8_0; or directly launching C:\Users\<username>\AppData\Local\Programs\Ollama\ollama app.exe without launching ollama app.exe beforehand (this latter behavior did not change).

Note that being administrator or not does not change anything: I did all my tests without being an administrator, but I also tested launching the icon as an administrator by creating a shortcut with admin mode checked, and it did not help; whereas launching ollama app.exe from the command line is always fast, even in user mode.

Hence, it seems the issue happens only when launching ollama app.exe by double-clicking it or at startup (which emulates double-clicking it via its shortcut). A simple fix is to launch ollama app.exe via a batch command (ollama could do this in its installer by placing a batch file in the Startup folder of the Start Menu instead of a shortcut, or by just prepending cmd.exe /k "path-to-ollama-app.exe" in the shortcut), but the real fix will come when we find what causes the different behavior between launching from the command line and from double-clicking that somehow increases both the memory and CPU usage of ollama app.exe.
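
To make the batch-file suggestion concrete, a minimal sketch (untested; the default per-user install path under $env:LOCALAPPDATA is an assumption), equivalent to prepending cmd.exe /k to the shortcut target:

```powershell
# Launch the tray app through a console host instead of via the bare shortcut.
Start-Process -FilePath "cmd.exe" -ArgumentList '/k', "`"$env:LOCALAPPDATA\Programs\Ollama\ollama app.exe`""
```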

(My first guess, since retracted: Windows treats the two cases differently because of the UI library used somehow, and it then bundles a bunch of additional UI-related libraries or other things ollama doesn't need, making the app just a bit heavier than my system RAM can support with the model I loaded. This would explain why the issue does not seem to happen with small enough models. But I really don't have enough data to ascertain this.)

My new guess is that some logic inside ollama app.exe, or some library it uses, detects the environment the app is launched from and behaves differently depending on it, pulling in additional unneeded libraries (UI libraries?) that snowball and make the whole app much more resource-hungry. My bet would be the UI/systray icon library.

If you have any idea for me to debug this further, please let me know, I'm eager to understand what's really happening here even though I already have a workaround!


@joubertdj commented on GitHub (Apr 14, 2024):

@lrq3000 : Thanks for finding a workaround ... and for highlighting it here ... I thought I broke something ... I am only playing with Ollama and Langchain at the moment ... and I thought I did something (as I am playing with my own indexing and metadata etc.) that suddenly made it SO slow ... but I am also experiencing this slowdown. After killing the "ollama app.exe" and starting it the way you mentioned, it then improved ... however ... to me it still feels far slower than it was previously (considering I am also using it on a laptop equivalent to yours, albeit Windows 10 Pro).


@joubertdj commented on GitHub (Apr 14, 2024):

> @lrq3000 : Thanks for finding a workaround ... and for highlighting it here ... I thought I broke something ... I am only playing with Ollama and Langchain at the moment ... and I thought I did something (as I am playing with my own indexing and metadata etc.) that suddenly made it SO slow ... but I am also experiencing this slowdown. After killing the "ollama app.exe" and starting it the way you mentioned, it then improved ... however ... to me it still feels far slower than it was previously (considering I am also using it on a laptop equivalent to yours, albeit Windows 10 Pro).

PS. I can confirm that the pre-release of 0.1.32 addresses the issue. It seems that the primary reason for it being slow was that it tried ONLY to use vRAM and not system RAM ...


@dhiltgen commented on GitHub (Apr 15, 2024):

@joubertdj just to confirm, you're seeing good/consistent performance with 0.1.32 regardless of how the app is started?

In the 0.1.32 release we've changed to using subprocesses to manage the GPU.

@lrq3000 could you give it a try as well and see if the problem goes away in 0.1.32 for you?

If that release doesn't resolve it, the next things we can look at are what Priority the processes are set to in the Task Manager Details view, and possibly resmon.exe's CPU/disk/memory usage in these two scenarios.
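
A quick way to capture the priority comparison without clicking through Task Manager might be the following sketch (the ollama* wildcard is assumed to match ollama.exe, "ollama app.exe", and any runner processes):

```powershell
# Run once in the slow scenario and once in the fast one, then compare.
Get-Process ollama* |
    Select-Object Name, Id, PriorityClass, @{n='WS(MB)'; e={[math]::Round($_.WorkingSet64/1MB)}}
```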


@joubertdj commented on GitHub (Apr 16, 2024):

> @joubertdj just to confirm, you're seeing good/consistent performance with 0.1.32 regardless of how the app is started?

That is correct. It looks like it solved the issue for me.


@lrq3000 commented on GitHub (Apr 17, 2024):

I just tried v0.1.32 stable (not pre-release) and unfortunately no change, the same behavior appears, but worsened.

I can see the new ollama_llama_server.exe process, and now instead of ollama app.exe being the culprit, it's the server.exe.

So now what is happening is that when I launch from the command line, which previously worked fine, it starts generating a sentence but then stops after a few tokens, and the same issue happens (it essentially freezes for a long time before generating the next token - like really, really long; it's a real freeze). I could wait the whole evening and I'm not sure I would get one whole response from the LLM.

In comparison, when I launch from the shortcut, this issue happens right away after the first generated token.

One thing I noticed is that when this freeze happens, the CPU gets overutilized, it shoots up from like 1-10% to 40% and stays there even if I CTRL-C halt the LLM generation and even after /bye. When the LLM generates the tokens correctly, the CPU is always much less used.


@joubertdj commented on GitHub (Apr 17, 2024):

I just upgraded to the stable release. Although I have to admit the upgrade via the system tray icon ("Update available, restart") didn't work at all; maybe that was a pre-release thing. I had to download OllamaSetup.exe manually, kill all the ollama processes (as even Quit Ollama didn't work), and just install it via the downloaded installer.

My performance was “stable”, meaning it is as I expect for this class of machine, and better than v0.1.31.

When you run your application and it is busy generating tokens, how much RAM does ollama_llama_server.exe consume? When I stress tested a big LLM in v0.1.31 it got stuck at a maximum of 2GB (or at least ollama.exe got stuck at that level; initially I thought they compiled the executable as a 32-bit exe). With v0.1.32 it consumes +-9GB (using deepseek-coder:6.7b-instruct-fp16). However, even 9GB is a tad “low” for the 6.7b fp16, is it not?

When I ran the “deepseek-coder:6.7b-instruct-q4_1” model, the memory utilization was again at +-9GB, which seems high for a q4_1 model … or am I missing something again?

In both instances the GPU was barely used … maybe once or twice during the entire token generation, and then it was never used again? Maybe the integrated RTX3050’s vRAM is not sufficient or something?
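
(For measuring this outside Task Manager, whose Memory column can under-read because of memory compression, as discussed later in the thread, one hedged sketch:)

```powershell
# Poll the runner's working set every 2 seconds while a prompt is generating.
while ($true) {
    Get-Process ollama_llama_server -ErrorAction SilentlyContinue |
        Select-Object Name, @{n='WorkingSet(MB)'; e={[math]::Round($_.WorkingSet64/1MB)}}
    Start-Sleep -Seconds 2
}
```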


@lrq3000 commented on GitHub (Apr 17, 2024):

So I ran additional tests to observe whether it's an issue with a lack of RAM. I uninstalled and reinstalled several times both versions from the exe installers on the GitHub releases page.

  • v0.1.31 command line: 1463 MB to 1616 MB RAM -- when reloading the model some time after /bye, it consumes only 1254-1360 MB RAM. In fact, without /bye, after waiting for some time, the RAM usage goes as low as 1090 MB. These figures do not change when sending prompts to the model. CPU usage: from 0% to ~60% while generating tokens (other processes cumulatively take only 3-4% CPU).
    Note that when lacking RAM, this indeed reproduces the issue. But when there is no lack of RAM, the issue persists only when launching by clicking on the executable or shortcut (see below).

  • v0.1.31 from clicking on the binary or shortcut: 1614 MB; stays at 25-30% CPU usage even after a CTRL-C break and /bye for a while, maybe a dozen seconds (until the next token generation?), then stops and falls back to 0%.
    But even so, 2.1 GB of RAM is still available, and despite this the program is still horribly slow. So it's not only caused by insufficient available RAM.
    After waiting, it goes down to 1531 MB, and it is still horribly slow. Some time after reloading, during generation, it can go as low as 1392 MB. Still very, very slow, despite 2.1 GB of free RAM.
    CPU usage: 30-40% while generating tokens (other processes cumulatively take only 3-4% CPU).

  • v0.1.32 command line: 1462 MB RAM right away, same RAM after waiting a while. Fast generation, as before. CPU usage: from 0% to ~60% while generating tokens (other processes cumulatively take only 3-4% CPU) (same as before).

  • v0.1.32 from clicking on the binary or shortcut: 1460 MB RAM right away, same RAM after waiting a while. Slow generation, as before. CPU usage: 30-40% while generating tokens (other processes cumulatively take only 3-4% CPU) (as before).


Key takeaways:

  • v0.1.32's new RAM handling and subprocess model, compared to v0.1.31, normalizes RAM usage: the first launch is smaller than before (2GB less!), but subsequent model reloads are bigger (1-2GB more).
  • Lacking RAM can mimic the issue of sluggish token generation.
  • But when there is plenty of RAM, the issue persists in both versions: launching by clicking on the executable or shortcut systematically produces sluggish token generation, despite equal RAM usage compared to a command-line launch (which gives fast generation).
  • There is one notable difference between sluggish and fast token generation: when it's sluggish (launched by clicking), CPU usage hovers around 30-40% but for some reason sometimes causes micro-freezes of the whole computer (especially when quitting ollama from the systray icon); whereas when fast, CPU usage goes from 0% to 60% and right back to 0% when finished.

Note that I am not only using ollama: I regularly use this exact same model (same quantization, same number of parameters, same instruct mode) with koboldcpp and gpt4all, and there is no such issue; I always get fast token generation there. This issue is specific to ollama, and only when launching by clicking on the executable/shortcut.

So indeed my issue seems to be different from @joubertdj's, and is still present in the latest v0.1.32.

Given that my system is pretty vanilla, I expect others will run into the same issue at some point, so even if we can't figure out the culprit right now, I think it's worth keeping this issue open for others to find it and contribute to the detective work.


@joubertdj commented on GitHub (Apr 18, 2024):

@lrq3000 : This morning I read your feedback, and without initially starting Ollama from the icon, I started it from the console via "ollama serve". Its performance was, at minimum, four times better/faster than when I started it via the icon!!! I thought maybe this was only due to a fresh start, so I stopped "ollama serve" and started it with the icon. It was noticeably slower!!!????

You are correct; although I may have had some different issue previously, there is definitely some performance issue here somewhere. I still however do not fully understand how a 7B model that is (almost) 4GB in file size only takes 2GB of RAM though ... that portion has me a bit "scratching-my-head" ...


@lrq3000 commented on GitHub (Apr 18, 2024):

@joubertdj I'm not sure about the difference either, but I think this figure is not the model but the size the app itself takes in memory, and Windows also uses memory compression. The figures I reported are from the native Windows task manager. When I use Process Explorer, I get the expected total memory size of ~8GB, but I do not report it because it does not change; this is simply the model's size, so there is no difference whether I launch via the icon or the command line, or between versions, etc.


@mmacphail commented on GitHub (Apr 21, 2024):

Thank you !!! I have exactly the same problem. I was wondering why after a fresh install of ollama it was fast, and then sometimes it was slow. Fucking hell !!


@papyr commented on GitHub (Apr 28, 2024):

This happens on AMD too, and the GUI/window does not open.

I can verify it's running in the task manager, but there is no open window, with the latest release...


@dhiltgen commented on GitHub (Apr 28, 2024):

We've reshuffled the packaging model a bit on Windows in the latest release. I still don't understand the root cause of this performance bug, but it's possible the reshuffling has an impact, so I'd suggest folks give 0.1.33 a try.


@qsdhj commented on GitHub (May 3, 2024):

I have the same issue now, after the update to ollama 0.1.33 on Windows 11 Pro.
To be honest, I am unsure if I have the same problem.

The first prompt I do works normally.
From the second prompt onwards, my GPU gets maxed out for 2-6 minutes, then I get the answer from the LLM.
I tried it with the CLI and with Langchain; I get the same issue with both methods.

This happened after updating from an older version of ollama to 0.1.33, and I had also installed torch with CUDA support.
Before that I got a GPU utilisation of around 40%-50% instead of max utilisation.

I now have CUDA installed both via the nvidia installer and with torch. Maybe that is my problem?
nvidia-smi:
[screenshot: nvidia-smi output]

torch.__version__: '2.3.0+cu118'


@dhiltgen commented on GitHub (May 11, 2024):

I still haven't managed to reproduce this or figure out what the culprit is for the strange slow-down in performance.

For folks who have seen this behavior, can you share a bit more about your system setup so that hopefully I can find a way to repro and ultimately fix whatever it is?

  • Windows 10 or 11, pro or home?
  • Personal system or work system that might have a GPO config, or any other add-on software that could explain it? (e.g., AV software, etc.)
  • Laptop or Desktop? If laptop, any difference plugged in vs on battery?
  • CPU model (e.g., is it a modern system with efficiency cores that maybe we're getting stuck on?) Get-WmiObject -Class Win32_Processor -ComputerName . | Select-Object -Property [a-z]*
  • System memory - are we paging when this happens? systeminfo | find "Virtual Memory"
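
Both checks can be run together in one PowerShell session, once per scenario (a sketch; the property subset is trimmed for readability):

```powershell
# CPU topology plus current load, then the paging counters.
Get-WmiObject -Class Win32_Processor -ComputerName . |
    Select-Object Name, NumberOfCores, NumberOfLogicalProcessors, LoadPercentage
systeminfo | Select-String "Virtual Memory"
```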

@lrq3000 commented on GitHub (May 13, 2024):

@dhiltgen

OS version

Windows 11 pro.

Personal system or work system

Personal system.
The AV is Avira, but I also used Avast in the past, and at some point in my tests I tried uninstalling it fully.

Laptop or Desktop?

Laptop. It is always plugged in and I am using an external fan stand (KLIM Cyclone).

CPU model

PSComputerName                          : NEOM
Availability                            : 3
CpuStatus                               : 1
CurrentVoltage                          : 13
DeviceID                                : CPU0
ErrorCleared                            :
ErrorDescription                        :
LastErrorCode                           :
LoadPercentage                          : 2
Status                                  : OK
StatusInfo                              : 3
AddressWidth                            : 64
DataWidth                               : 64
ExtClock                                : 100
L2CacheSize                             : 11776
L2CacheSpeed                            :
MaxClockSpeed                           : 2300
PowerManagementSupported                : False
ProcessorType                           : 3
Revision                                :
SocketDesignation                       : U3E1
Version                                 :
VoltageCaps                             :
Architecture                            : 9
AssetTag                                :
Caption                                 : Intel64 Family 6 Model 154 Stepping 3
Characteristics                         : 252
ConfigManagerErrorCode                  :
ConfigManagerUserConfig                 :
CreationClassName                       : Win32_Processor
CurrentClockSpeed                       : 2300
Description                             : Intel64 Family 6 Model 154 Stepping 3
Family                                  : 198
InstallDate                             :
L3CacheSize                             : 24576
L3CacheSpeed                            : 0
Level                                   : 6
Manufacturer                            : GenuineIntel
Name                                    : 12th Gen Intel(R) Core(TM) i7-12700H
NumberOfCores                           : 14
NumberOfEnabledCore                     : 14
NumberOfLogicalProcessors               : 20
OtherFamilyDescription                  :
PartNumber                              :
PNPDeviceID                             :
PowerManagementCapabilities             :
ProcessorId                             : BFEBFBFF000906A3
Role                                    : CPU
SecondLevelAddressTranslationExtensions : False
SerialNumber                            :
Stepping                                :
SystemCreationClassName                 : Win32_ComputerSystem
SystemName                              : NEOM
ThreadCount                             : 20
UniqueId                                :
UpgradeMethod                           : 1
VirtualizationFirmwareEnabled           : False
VMMonitorModeExtensions                 : False
Scope                                   : System.Management.ManagementScope
Path                                    : \\NEOM\root\cimv2:Win32_Processor.DeviceID="CPU0"
Options                                 : System.Management.ObjectGetOptions
ClassPath                               : \\NEOM\root\cimv2:Win32_Processor
Properties                              : {AddressWidth, Architecture, AssetTag, Availability...}
SystemProperties                        : {__GENUS, __CLASS, __SUPERCLASS, __DYNASTY...}
Qualifiers                              : {dynamic, Locale, provider, UUID}
Site                                    :
Container                               :

Additionally I used Intel's app (https://www.intel.com/content/www/us/en/support/articles/000097881/processors.html) to list the number of P-cores and E-cores:
[screenshots: Intel processor identification utility output]

System memory

/EDIT: done the tests (sorry for the delay).

  • When ollama is launched from the icon, model loaded (ollama run ...) but not generating:
Total physical memory:                      16,069 MB
Available physical memory:                  927 MB
Virtual memory: maximum size:               40,714 MB
Virtual memory: available:                  5,749 MB
Virtual memory: in use:                     34,965 MB
  • When ollama is launched from the icon and is generating tokens very slowly:
Total physical memory:                      16,069 MB
Available physical memory:                  1,092 MB
Virtual memory: maximum size:               40,714 MB
Virtual memory: available:                  4,806 MB
Virtual memory: in use:                     35,908 MB
  • When ollama is closed:
Total physical memory:                      16,069 MB
Available physical memory:                  4,967 MB
Virtual memory: maximum size:               35,305 MB
Virtual memory: available:                  6,961 MB
Virtual memory: in use:                     28,344 MB
  • When ollama is launched from the terminal and is generating tokens fast:
Total physical memory:                      16,069 MB
Available physical memory:                  533 MB
Virtual memory: maximum size:               40,493 MB
Virtual memory: available:                  5,318 MB
Virtual memory: in use:                     35,175 MB
  • When ollama is launched from the terminal and is generating tokens fast at first but then generates slowly:
Total physical memory:                      16,069 MB
Available physical memory:                  647 MB
Virtual memory: maximum size:               40,494 MB
Virtual memory: available:                  4,988 MB
Virtual memory: in use:                     35,506 MB

@qsdhj commented on GitHub (May 14, 2024):

OS version

Windows 11 Enterprise
Version: 23H2

System

  • Work system
  • Sophos, ZScaler installed and active
  • Laptop, only used plugged in

CPU

11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

RAM

32,0 GB

VRAM

6GB
NVIDIA RTX A3000

I think my problem is different from the one described in this issue. I use ollama together with a HuggingFace Sentence-Transformer in Langchain and LlamaIndex.
My problem is that if I install the torch version with CUDA, to use it with my embedding model, I get this weird behaviour in Ollama where the GPU runs at 100% load for a few minutes before the LLM responds.
Terminating my Python script and the ollama processes fixes it for the first call to an LLM; after that it's the same, until I restart Windows.
Installing the non-CUDA torch version fixed that, but now I can't use the GPU for creating embeddings, so it's not a usable solution for me.


@lowfatgeek commented on GitHub (May 23, 2024):

I am facing exactly the same issue. I have to run "ollama serve" in the terminal and keep the Terminal window open side by side with the browser (OpenWebUI), or else the performance becomes slow and sluggish. Even if I maximize the browser with the Terminal window behind it, the performance is still slow; both the Terminal and the browser have to be open side by side.

I'm on Intel i5-13450HX, 16GB RAM, and RTX4050 6GB GPU.


@dhiltgen commented on GitHub (May 31, 2024):

It seems so far that the common theme is recent CPUs on a laptop. Perhaps there's some quirk causing ollama to run on efficiency cores? Has anyone experienced this slowdown on a desktop or a system without efficiency cores?

Perhaps another experiment to try: open Task Manager and see if any of the ollama processes have a green leaf indicating they're being scheduled on the efficiency cores. If so, right-click on the process and uncheck `Efficiency mode`.

If this does turn out to be the underlying cause, there does appear to be an API we can call to programmatically adjust how the process is scheduled.
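
For reference, a minimal sketch of what such a call could look like in Go, assuming the documented Win32 `SetProcessInformation` / `PROCESS_POWER_THROTTLING_STATE` interface (an illustration, not ollama's actual code):

```go
// Minimal sketch (not ollama's actual fix): opt the current process out of
// Windows power throttling ("EcoQoS"), which is what the green leaf in Task
// Manager indicates. Uses only the documented Win32 SetProcessInformation call.
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// Values from processthreadsapi.h.
const (
	processPowerThrottling        = 4 // PROCESS_INFORMATION_CLASS: ProcessPowerThrottling
	powerThrottlingCurrentVersion = 1 // PROCESS_POWER_THROTTLING_CURRENT_VERSION
	powerThrottlingExecutionSpeed = 1 // PROCESS_POWER_THROTTLING_EXECUTION_SPEED
)

// Mirrors the PROCESS_POWER_THROTTLING_STATE struct.
type powerThrottlingState struct {
	Version     uint32
	ControlMask uint32
	StateMask   uint32
}

func main() {
	setProcessInformation := syscall.NewLazyDLL("kernel32.dll").NewProc("SetProcessInformation")

	state := powerThrottlingState{
		Version:     powerThrottlingCurrentVersion,
		ControlMask: powerThrottlingExecutionSpeed, // we are overriding this policy...
		StateMask:   0,                             // ...and turning throttling off
	}

	cur, _ := syscall.GetCurrentProcess()
	ret, _, err := setProcessInformation.Call(
		uintptr(cur),
		processPowerThrottling,
		uintptr(unsafe.Pointer(&state)),
		unsafe.Sizeof(state),
	)
	if ret == 0 {
		fmt.Println("SetProcessInformation failed:", err)
	}
}
```

The same structure with the `StateMask` bit set equal to the control bit would force throttling on instead of off.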


@wac81 commented on GitHub (Jun 3, 2024):

Check the power management settings, including the BIOS. In my case, if the screen is turned on it runs fast; if the screen is turned off it runs very slowly, and checking where the time goes, it is mainly spent on the CPU.


@chyok commented on GitHub (Jul 4, 2024):

Hi @dhiltgen ,
Same problem here, and I am working on a desktop computer. When I launch ollama from the desktop icon, the work is allocated to only the 4 small cores, while the other 8 big cores just sit idle. If I use `ollama serve`, everything is normal and fast.

(screenshot: Task Manager showing only the small cores under load)

My PC details are as follows:
Windows 11 23H2 version 22631.2861
i7-12700K
RTX 3060Ti

It feels like Windows is treating it as a background task and only allocating the small cores to it.


@dhiltgen commented on GitHub (Jul 22, 2024):

@chyok when you see it running on the "small cores" does it also show up with the green leaf icon in TaskManager? I haven't been able to reproduce this in my setups, but if someone can confirm definitively this is the case, then I should be able to code up a fix to force it to not go into efficiency mode.

(screenshot: the green leaf icon in Task Manager)

@chyok commented on GitHub (Jul 23, 2024):

> @chyok when you see it running on the "small cores" does it also show up with the green leaf icon in TaskManager? I haven't been able to reproduce this in my setups, but if someone can confirm definitively this is the case, then I should be able to code up a fix to force it to not go into efficiency mode.

It's quite strange: it doesn't enter efficiency mode, yet the problem is very easy to reproduce on my system. I had to use Process Lasso to force it to be scheduled on the big cores.


@dhiltgen commented on GitHub (Jul 30, 2024):

From what folks are seeing, it sounds like this may not be an efficiency setting, but a process priority setting.

In Task Manager, what is the Priority for the `ollama_llama_server.exe` process?

In Win 10, click the "Details" tab at the top. In Win 11, click the Details option on the left. Then right-click on `ollama_llama_server.exe`.

Win10 example: (screenshot: the Details view with the right-click Priority menu)
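
For folks who prefer to check this from a script rather than Task Manager, a hedged Go sketch using the documented `OpenProcess`/`GetPriorityClass` kernel32 calls (the helper name and its PID argument are illustrative, not part of ollama):

```go
// Print the scheduling priority class of a process by PID, matching what
// Task Manager's Details tab shows for ollama_llama_server.exe.
package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

const processQueryLimitedInformation = 0x1000 // PROCESS_QUERY_LIMITED_INFORMATION

// Documented priority class values from winbase.h.
var classNames = map[uintptr]string{
	0x0040: "IDLE",
	0x4000: "BELOW NORMAL",
	0x0020: "NORMAL",
	0x8000: "ABOVE NORMAL",
	0x0080: "HIGH",
	0x0100: "REALTIME",
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: prioquery <pid>")
		return
	}
	pid, _ := strconv.Atoi(os.Args[1]) // e.g. the PID of ollama_llama_server.exe

	h, err := syscall.OpenProcess(processQueryLimitedInformation, false, uint32(pid))
	if err != nil {
		fmt.Println("OpenProcess:", err)
		return
	}
	defer syscall.CloseHandle(h)

	getPriorityClass := syscall.NewLazyDLL("kernel32.dll").NewProc("GetPriorityClass")
	cls, _, _ := getPriorityClass.Call(uintptr(h)) // returns 0 on failure
	fmt.Printf("priority class: %s (0x%x)\n", classNames[cls], cls)
}
```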


@chyok commented on GitHub (Jul 31, 2024):

I'm traveling and unable to use my computer, but I'm quite certain that the priority is normal; I even tried to increase it, but that still didn't help. Setting the affinity and manually pinning it to the big cores does solve the issue, but the process is new each time, so I need to set it every time.
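
For reference, that workaround can be scripted with the documented Win32 `SetProcessAffinityMask` call; a sketch below, where the `0xFFFF` mask is only an assumption for a 12700K-style layout (logical CPUs 0-15 are the hyper-threaded P-cores) and must be adjusted to your topology:

```go
// Pin a process to a chosen set of cores, which is roughly what Process
// Lasso is doing here, via the documented SetProcessAffinityMask call.
package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

const processSetInformation = 0x0200 // PROCESS_SET_INFORMATION access right

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pinpcores <pid>")
		return
	}
	pid, _ := strconv.Atoi(os.Args[1]) // e.g. the PID of ollama_llama_server.exe

	h, err := syscall.OpenProcess(processSetInformation, false, uint32(pid))
	if err != nil {
		fmt.Println("OpenProcess:", err)
		return
	}
	defer syscall.CloseHandle(h)

	// Bit i of the mask allows the process on logical CPU i; 0xFFFF keeps it
	// off logical CPUs 16-19 (the E-cores on a 12700K). Adjust as needed.
	const pCoreMask = 0xFFFF

	setAffinity := syscall.NewLazyDLL("kernel32.dll").NewProc("SetProcessAffinityMask")
	if ret, _, callErr := setAffinity.Call(uintptr(h), pCoreMask); ret == 0 {
		fmt.Println("SetProcessAffinityMask failed:", callErr)
	}
}
```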


@lrq3000 commented on GitHub (Jul 31, 2024):

Yes, I also confirm that the priority is normal; I even tried to set it to high using Process Explorer.


@dhiltgen commented on GitHub (Aug 1, 2024):

Strange. So it's not the Priority, and it's not efficiency mode either.


@greggft commented on GitHub (Aug 12, 2024):

Under Windows, maybe try this:

1. Open Settings and type "GPU" in the search box.
2. Select Graphics Settings; this should take you to Graphics performance settings.
3. Click Browse.
4. Add each of the executables you want to assign to your GPU, i.e. the ollama executables.

See if that helps.
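
For what it's worth, the Graphics Settings dialog stores these choices in the registry, so the same preference can be scripted; a hedged Go sketch, where the exe paths are assumptions for a default per-user install and should be verified against your machine:

```go
// Write the "High performance" GPU preference for the ollama binaries under
// HKCU\Software\Microsoft\DirectX\UserGpuPreferences, which is where the
// Graphics performance settings page keeps its per-exe choices.
package main

import (
	"fmt"

	"golang.org/x/sys/windows/registry"
)

func main() {
	k, _, err := registry.CreateKey(registry.CURRENT_USER,
		`Software\Microsoft\DirectX\UserGpuPreferences`, registry.SET_VALUE)
	if err != nil {
		fmt.Println("CreateKey:", err)
		return
	}
	defer k.Close()

	// Hypothetical paths for a default per-user install; adjust to yours.
	exes := []string{
		`C:\Users\you\AppData\Local\Programs\Ollama\ollama.exe`,
		`C:\Users\you\AppData\Local\Programs\Ollama\ollama app.exe`,
	}
	for _, exe := range exes {
		// "GpuPreference=2;" selects high performance; "=1;" is power saving.
		if err := k.SetStringValue(exe, "GpuPreference=2;"); err != nil {
			fmt.Println("SetStringValue:", err)
		}
	}
}
```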


@ivanvengeruk commented on GitHub (Aug 18, 2024):

> Under Windows, maybe try this: Settings → type "GPU" in the search → Graphics Settings → Graphics performance settings → Browse → add the ollama executables. See if that helps.

No, it doesn't help.

I have the same issue with slow tokens.
ollama version: 0.3.6 (the latest at the time of this post)
OS: Windows 11 Pro
23H2 22631.4037, Windows Feature Experience Pack 1000.22700.1027.0
Configuration:
Laptop

CPU
12th Gen Intel(R) Core(TM) i7-12800HX
2.00 GHz
Cores: 16
Logical cores: 24
Virtualization: ON

RAM: 64GB

Graphics processor: 0
NVIDIA RTX A1000 Laptop GPU
Driver version: 32.0.15.6076
DirectX: 12 (FL 12.1)
Dedicated GPU memory: 0.9/4.0 GB
Shared GPU memory: 0.2/31.9 GB
GPU memory: 1.1/35.9 GB

To reproduce the problem:
start ollama app.exe from the GUI (main menu) or by double-clicking the app (this starts the processes Ollama and ollama.exe, using ~30 MB of memory)
then in a terminal run "ollama run llama3:8b --verbose" (this starts ollama.exe again, but with ~10 MB of memory, plus ollama_llama_server.exe)
then type "hello"
Result:
Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?

total duration: 2m29.5910962s
load duration: 67.9495ms
prompt eval count: 11 token(s)
prompt eval duration: 17.11587s
prompt eval rate: 0.64 tokens/s
eval count: 26 token(s)
eval duration: 2m12.403243s
eval rate: 0.20 tokens/s

Workaround:
kill all ollama processes in Task Manager
then in a terminal run "ollama run llama3:8b --verbose"
(this will start Ollama, ollama.exe twice (one with ~30 MB of memory, one with ~10 MB), and ollama_llama_server.exe)
then type "hello"
Result:
Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?

total duration: 3.0733124s
load duration: 24.1601ms
prompt eval count: 11 token(s)
prompt eval duration: 762.752ms
prompt eval rate: 14.42 tokens/s
eval count: 26 token(s)
eval duration: 2.284004s
eval rate: 11.38 tokens/s

In both cases the processes are the same. In both cases, if I try to kill the ollama.exe process (30 MB) it auto-restarts; in the slow case it continues to generate slowly, and in the fast case it continues fast. Only how Ollama was launched matters: from the GUI it's slow, from the terminal it's fast.
I made a dump of both cases via Task Manager; maybe it can help.
I tried to attach the dumps, but one DMP is ~80 MB and GitHub allows 25 MB. Compressed to .7z it is 20 MB, but GitHub only allows GZ or ZIP, which come out over 25 MB even at maximum compression. So just tell me if these dumps would be helpful and how I can send them, or you can create them locally the same way I described above.


@jayghoshrao commented on GitHub (Aug 31, 2024):

In my case, setting the process priority to high speeds up the output instantly. The difference between "normal" and high is quite stark, too.
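
For reference, that manual step can be scripted with the documented Win32 `SetPriorityClass` call; a sketch (illustrative only, not ollama code):

```go
// Raise a process to HIGH priority by PID, equivalent to the Task Manager
// right-click workaround described above.
package main

import (
	"fmt"
	"os"
	"strconv"
	"syscall"
)

const (
	processSetInformation = 0x0200 // access right required by SetPriorityClass
	highPriorityClass     = 0x0080 // HIGH_PRIORITY_CLASS
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: boost <pid>")
		return
	}
	pid, _ := strconv.Atoi(os.Args[1]) // e.g. the PID of ollama_llama_server.exe

	h, err := syscall.OpenProcess(processSetInformation, false, uint32(pid))
	if err != nil {
		fmt.Println("OpenProcess:", err)
		return
	}
	defer syscall.CloseHandle(h)

	setPriorityClass := syscall.NewLazyDLL("kernel32.dll").NewProc("SetPriorityClass")
	if ret, _, callErr := setPriorityClass.Call(uintptr(h), highPriorityClass); ret == 0 {
		fmt.Println("SetPriorityClass failed:", callErr)
	}
}
```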


@Fatalmemory commented on GitHub (Sep 18, 2024):

Old thread, but to chance a guess: could it be related to Windows Superfetch? Windows could either be caching files differently for files that load on startup or, in the worst case, paging the files into a hibernation file on the hard drive and caching them that way, which would be much slower than RAM until they're swapped into actual RAM.


@lrq3000 commented on GitHub (Sep 22, 2024):

Thank you for working on this, but I think it may be premature to close this issue: as I wrote, manually changing the task priority did not help in the past. I will try again when this is released in an update and will let you know whether the issue is fixed for me.


@ValleZ commented on GitHub (Oct 16, 2024):

It still doesn't run on Windows.


@dhiltgen commented on GitHub (Oct 17, 2024):

@ValleZ can you clarify? This issue was tracking a performance problem which we believe we've fixed, but if it completely doesn't run for you, that sounds like an unrelated defect. If you mean the performance problem wasn't fixed, please explain a bit more about your scenario.


@ValleZ commented on GitHub (Oct 17, 2024):

Yeah, maybe. I tried to run it on Windows by downloading the exe file; it spent some time setting up and then the setup closed itself with no confirmation or anything. I repeated the setup several times with the same result, and on one try I rebooted the PC just in case. Then, after setup, I tried to run it from the Start menu and it didn't start; nothing appeared. I then noticed there was a non-clickable ollama icon in the status bar. Maybe that's how it's expected to "run", but I just uninstalled it and then ran koboldcpp with no problems. Compiling and running plain llama.cpp also works fine.


@ValleZ commented on GitHub (Oct 17, 2024):

Googling why it won't start gave this ticket for some reason, which is weird. Yes, it's totally unrelated.


@dhiltgen commented on GitHub (Oct 17, 2024):

@ValleZ what happens if you open a fresh PowerShell terminal and type `ollama run llama3.1`?


@ValleZ commented on GitHub (Oct 17, 2024):

Idk, I uninstalled it. If it's a terminal-only solution it should not have a UI for the installer, or it should show a confirmation that the installation is complete and that you are now supposed to type something somewhere to proceed.



@dhiltgen commented on GitHub (Oct 17, 2024):

@ValleZ yes, Ollama is a terminal- and API-based tool for running LLMs. If you're looking for a GUI, there's a ton of community-built UIs listed here that you can explore: https://github.com/ollama/ollama?tab=readme-ov-file#web--desktop


@mimo-to commented on GitHub (Oct 31, 2025):

Yeah, the same happened here: the ollama GUI is not opening and you can't access it from the system tray, but in the CLI everything works fine.


@cdll commented on GitHub (Mar 18, 2026):

Same issue, still on `ollama@v0.18.1` 😭


@lrq3000 commented on GitHub (Mar 18, 2026):

I use [lemonade-server](https://github.com/lemonade-sdk/lemonade): the GUI and systray work fine and are very fast on Windows, and it exposes an API and port compliant with most of ollama's features (it can be used in local chatbot apps as an ollama server).

Reference: github-starred/ollama#27923