[GH-ISSUE #4486] Not compiled with GPU offload support #64842

Closed
opened 2026-05-03 18:56:42 -05:00 by GiteaMirror · 25 comments
Owner

Originally created by @oldgithubman on GitHub (May 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4486

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Trying to use ollama like normal with GPU. Worked before update. Now only using CPU.
$ journalctl -u ollama
reveals
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1

  1. I do not manually compile ollama. I use the standard install script.
  2. Main README.md contains no mention of BLAS

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38

GiteaMirror added the "bug" and "needs more info" labels 2026-05-03 18:56:42 -05:00
Author
Owner

@oldgithubman commented on GitHub (May 17, 2024):

Figured it out. Ollama seems to think the model is too big to fit in VRAM (it isn't - it worked fine before the update). There is a lack of any useful communication about this to the user. As mentioned above, digging in the log actually sends you in the *wrong* direction.

Author
Owner

@jmorganca commented on GitHub (May 17, 2024):

Hi @oldmanjk sorry about this. May I ask which model you are running? and on which GPU?

Author
Owner

@mroxso commented on GitHub (May 17, 2024):

I think I got the same issue.
Running llama2:latest and llama3:latest on my GTX 1660 SUPER.
Worked before, now I updated to the latest Ollama and it seems that it mostly uses CPU which is way slower.

// Update:
// Update:
For me it seems like I had another process blocking my VRAM (a Python process). I saw this with nvidia-smi.
I restarted my local PC and now it works again with GPU for me.

Author
Owner

@jukofyork commented on GitHub (May 17, 2024):

Has anybody an idea of the code we need to remove to stop it ignoring our num_gpu settings (again, sigh...)?

Author
Owner

@jukofyork commented on GitHub (May 17, 2024):

It's at the bottom of llm/memory.go:

        //if memoryRequiredPartial > memoryAvailable {
        //      slog.Debug("insufficient VRAM to load any model layers")
        //      return 0, 0, memoryRequiredTotal
        //}
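
For context, here is a minimal, self-contained sketch of what an all-or-nothing check like the one quoted above implies (the function name and the sizes are illustrative assumptions, not ollama's actual code): if a fixed overhead plus even a single layer doesn't fit, nothing is offloaded at all.

```go
package main

import "fmt"

// estimateLayers returns how many of layerCount layers fit in availableVRAM,
// given a fixed per-layer size and a graph/KV overhead that must always fit.
// It mirrors the all-or-nothing behavior discussed above: if the overhead
// plus one layer exceeds available VRAM, offload nothing at all.
func estimateLayers(availableVRAM, overhead, layerSize uint64, layerCount int) int {
	if overhead+layerSize > availableVRAM {
		return 0 // insufficient VRAM to load any model layers
	}
	n := int((availableVRAM - overhead) / layerSize)
	if n > layerCount {
		n = layerCount
	}
	return n
}

func main() {
	// 24 GiB card, 1.5 GiB overhead, 0.5 GiB per layer, 80-layer model:
	// (24 - 1.5) / 0.5 = 45 layers offloaded.
	gib := uint64(1 << 30)
	fmt.Println(estimateLayers(24*gib, 3*gib/2, gib/2, 80)) // prints 45
}
```

The quoted commented-out block is the `return 0` branch of a check like this one.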
Author
Owner

@oldgithubman commented on GitHub (May 17, 2024):

> Hi @oldmanjk sorry about this. May I ask which model you are running? and on which GPU?

llama3 on a 1080 Ti

Author
Owner

@oldgithubman commented on GitHub (May 17, 2024):

> I think I got the same issue. Running llama2:latest and llama3:latest on my GTX 1660 SUPER. Worked before, now I updated to the latest Ollama and it seems that it mostly uses CPU which is way slower.
>
> // Update: For me it seems like I had another process blocking my VRAM (a Python process). I saw this with nvidia-smi. I restarted my local PC and now it works again with GPU for me.

Definitely worth keeping an eye on your GPU memory (which I do - I keep a widget in view at all times - that wasn't the issue for me)

Author
Owner

@oldgithubman commented on GitHub (May 17, 2024):

> Has anybody an idea of the code we need to remove to stop it ignoring our `num_gpu` settings (again, sigh...)?

Also weird is how, if ollama thinks it can't fit the entire model in VRAM, it doesn't attempt to put any layers in VRAM. I actually like this behavior though because it makes it obvious something is wrong. Still, more communication to the user would be good

Author
Owner

@uncomfyhalomacro commented on GitHub (May 18, 2024):

Got the same issue here on openSUSE Tumbleweed. One thing I noticed: it uses the GPU for a moment, then it's gone...

[Screencast_20240518_221101.webm](https://github.com/ollama/ollama/assets/66054069/7ddafa42-b21f-402c-a77f-a3febc6754e6)

Author
Owner

@dhiltgen commented on GitHub (May 21, 2024):

We've recently introduced ollama ps which will help show how much of the model has loaded into VRAM.

We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.

@oldmanjk can you clarify your problem? Perhaps ollama ps output and server log can help us understand what's going on.
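
The three-way policy dhiltgen describes (full load / partial load / CPU fallback) can be sketched as follows; the function and constant names are illustrative, not ollama's actual identifiers:

```go
package main

import "fmt"

type placement string

const (
	fullGPU    placement = "100% GPU"
	partialGPU placement = "partial GPU"
	cpuOnly    placement = "100% CPU"
)

// decide models the scheduling policy described above: fall back to CPU if
// even the minimum footprint doesn't fit, partially offload if the minimum
// fits but the full model doesn't, otherwise load everything on the GPU.
// Sizes are in whatever consistent unit the caller uses (e.g. GiB).
func decide(availableVRAM, minRequired, fullRequired uint64) placement {
	switch {
	case minRequired > availableVRAM:
		return cpuOnly
	case fullRequired > availableVRAM:
		return partialGPU
	default:
		return fullGPU
	}
}

func main() {
	fmt.Println(decide(8, 3, 41))  // 41 GiB model, 8 GiB free: prints "partial GPU"
	fmt.Println(decide(8, 10, 41)) // can't even fit the minimum: prints "100% CPU"
}
```

Under this model, the `41 GB ... 100% CPU` output reported below would mean the minimum-footprint estimate, not just the full-model estimate, exceeded the estimated available VRAM.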

Author
Owner

@oldgithubman commented on GitHub (May 21, 2024):

> We've recently introduced `ollama ps` which will help show how much of the model has loaded into VRAM.
>
> We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.
>
> @oldmanjk can you clarify your problem? Perhaps `ollama ps` output and server log can help us understand what's going on.

I'm not at a terminal atm, but ollama refuses to load the same size models it used to and that other back ends will (like ooba with llama-cpp-python). Depending on the model/quant, I have to reduce num_gpu by a few layers compared to old ollama or ooba. When you've carefully optimized your quants like I have, this is the difference between fully-offloaded and not. On a repurposed mining rig, this destroys performance. Also, if I don't change the modelfile (which is a pain on a slow rig), ollama won't offload anything to GPU.

<!-- gh-comment-id:2123478173 --> @oldgithubman commented on GitHub (May 21, 2024): > We've recently introduced `ollama ps` which will help show how much of the model has loaded into VRAM. > > We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU. > > @oldmanjk can you clarify your problem? Perhaps `ollama ps` output and server log can help us understand what's going on. I'm not at a terminal atm, but ollama refuses to load the same size models it used to and that other back ends will (like ooba with llama-cpp-python). Depending on the model/quant, I have to reduce num_gpu by a few layers compared to old ollama or ooba. When you've carefully optimized your quants like i have, this is the difference between fully-offloaded and not. On a repurposed mining rig, this destroys performance. Also, if I don't change the modelfile (which is a pain on a slow rig), ollama won't offload anything to gpu
Author
Owner

@oldgithubman commented on GitHub (May 22, 2024):

Example walkthrough:

  1. Determine that ooba/llama-cpp-python can load my latest Meta-Llama-3-70B-Instruct-Q3_K_L quant and run over 1K of context (which usually means it will run full context without crashing) with 8K context and 48 layers offloaded to GPU.
  2. So now I'll go to the Modelfile and set num_ctx to 8192, num_gpu to 48, num_thread to 32 (to get ollama to use 24 threads (sigh)), and import the gguf into ollama.
  3. About a minute and 37.1 GB of wear and tear on my nvme later, success.
  4. Attempt to call model.
  5. ollama offloads nothing to GPU (or system RAM, for that matter (outside of a few GiB) - it's running it straight off the nvme? Why? I have over 90 GiB of RAM free)
    $ ollama ps
    NAME                                            ID              SIZE    PROCESSOR       UNTIL              
    Meta-Llama-3-70B-Instruct-Q3_K_L-8K:latest      b9345a582769    41 GB   100% CPU        4 minutes from now
    
    ollama_logs.txt attached - note the three locations where I've highlighted falsehoods claimed by ollama.
    [ollama_logs.txt](https://github.com/ollama/ollama/files/15397123/ollama_logs.txt)
  6. ollama rm Meta-Llama-3-70B-Instruct-Q3_K_L-8K (autocomplete would be nice)
  7. Go back to the Modelfile, set num_gpu to 47, and import the gguf into ollama again.
  8. Wait another minute or so (on this very-fast machine - on the old mining rig this can take upwards of ten minutes).
  9. Another 37.1 GB of wear and tear on my nvme later, success.
  10. Attempt to call model.
  11. Lucky! This time it works. Sometimes I have to repeat these steps a few times. 22.8 / 24.0 GiB used - that layer should have fit (in fact, we already know it does).

Edit - Now ollama is using all 32 threads (I want it to use 24 probably) and basically 0% GPU. I have no idea what's going on here.
Edit - Removing num_thread produces 20% CPU utilization, whereas before I was seeing 10%. I don't know what's going on here either. Assuming we want all physical cores utilized, it should be 24/32 or 75%
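
For reference, steps 2 and 7 of the walkthrough above amount to editing a Modelfile along these lines before re-importing (the `FROM` path is a placeholder; the parameter values are the ones quoted in the walkthrough):

```
FROM ./Meta-Llama-3-70B-Instruct-Q3_K_L.gguf
PARAMETER num_ctx 8192
PARAMETER num_gpu 48
PARAMETER num_thread 32
```

followed by `ollama create Meta-Llama-3-70B-Instruct-Q3_K_L-8K -f Modelfile`; step 7 just lowers `num_gpu` to 47 and repeats the import.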

Author
Owner

@dhiltgen commented on GitHub (May 31, 2024):

@oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the runners/cpu_avx2/ollama_llama_server CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?

    sudo systemctl stop ollama
    OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log
Author
Owner

@oldgithubman commented on GitHub (Jun 1, 2024):

> @oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the `runners/cpu_avx2/ollama_llama_server` CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?
>
>     sudo systemctl stop ollama
>     OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

[requested.log](https://github.com/user-attachments/files/15520305/requested.log)

What is clear, from *both* logs (as I already pointed out in the previous log), is that ollama is wrong about memory, both total and available. Ollama says my NVIDIA GeForce RTX 4090 (founder's edition - as standard as it gets) has 23.6 GiB total memory (*obviously* wrong) and 23.2 GiB available memory (also wrong). The true numbers, according to nvidia-smi, are 24564 MiB (24.0 GiB, of course) total memory and 55 MiB used (24564 MiB - 55 MiB = 24509 MiB = 23.9 GiB available memory). So ollama *thinks* I have less memory than I do, so it refuses to load models it used to load just fine. Hence why *not* offloading a layer or two to GPU causes it to work again.

I think you have all the information you need from me. You just need to figure out why ollama is incorrectly detecting memory. If I had to guess, it's probably a classic case of wrong units or conversions thereof (GiB vs GB). You know, that thing they beat into our heads to be careful about in high school science class. The thing that caused the Challenger disaster. Y'all need to slow down, be more careful, and put out good code. This would, paradoxically, give you *more* time because you wouldn't have to spend so much time putting out fires. Again, all of this information was already available, so this was an unnecessary waste of my time too. I've attached the requested log anyway.

Edit - I'm no software dev, but...maybe start here: https://github.com/ollama/ollama/pull/4328
If I'm right that that's the problem (that a dev arbitrarily decided to shave a layer of space off as a "buffer", breaking the existing workflows of countless users, and no one notifying the user base or even *all the other devs*, causing hours of wasted time and confusion)...well...that's pretty bone-headed. The obvious typo in the original comment (the one one would catch by reviewing one's pull request *even once*) illustrates my point (about slowing down) pretty spectacularly. Hell, a *spell checker* would have caught that. If I sound frustrated, it's because I am
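
The GiB-vs-GB guess above is easy to sanity-check with arithmetic. A quick sketch (the 24564 MiB figure is the nvidia-smi total quoted above; the helper names are mine):

```go
package main

import "fmt"

// miBToGiB and miBToGB make the binary-vs-decimal distinction explicit:
// 1 GiB = 1024 MiB = 2^30 bytes, while 1 GB = 10^9 bytes.
func miBToGiB(mib float64) float64 { return mib / 1024 }
func miBToGB(mib float64) float64  { return mib * 1024 * 1024 / 1e9 }

func main() {
	const total = 24564.0 // nvidia-smi total for the RTX 4090 above, in MiB
	fmt.Printf("%.1f GiB / %.1f GB\n", miBToGiB(total), miBToGB(total))
	// prints: 24.0 GiB / 25.8 GB
}
```

So 24564 MiB is about 24.0 GiB (binary) or 25.8 GB (decimal); the 23.6 GiB figure ollama reports matches neither exactly, which is what makes the discrepancy worth tracing.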

Author
Owner

@kriansa commented on GitHub (Jun 3, 2024):

> > @oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the `runners/cpu_avx2/ollama_llama_server` CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?
> >
> >     sudo systemctl stop ollama
> >     OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log
>
> [requested.log](https://github.com/user-attachments/files/15520305/requested.log)
>
> What is clear, from both logs (as I already pointed out in the previous log), is ollama is wrong about memory, both total and available. Ollama says my NVIDIA GeForce RTX 4090 (founder's edition - as standard as it gets) has 23.6 GiB total memory (obviously wrong) and 23.2 GiB available memory (also wrong). The true numbers, according to nvidia-smi, are 24564 MiB (24.0 GiB, of course) total memory and 55 MiB used (24564 MiB - 55 MiB = 24509 MiB = 23.9 GiB available memory). So ollama thinks I have less memory than I do, so it refuses to load models it used to load just fine. Hence why not offloading a layer or two to GPU causes it to work again. I think you have all the information you need from me. You just need to figure out why ollama is incorrectly detecting memory. If I had to guess, it's probably a classic case of wrong units or conversions thereof (GiB vs GB). You know, that thing they beat into our heads to be careful about in high school science class. The thing that caused the Challenger disaster. Y'all need to slow down, be more careful, and put out good code. This would, paradoxically, give you more time because you wouldn't have to spend so much time putting out fires. Again, all of this information was already available, so this was an unnecessary waste of my time too. I've attached the requested log anyway.
>
> Edit - I'm no software dev, but...maybe start here: #4328 If I'm right that that's the problem (that a dev arbitrarily decided to shave a layer of space off as a "buffer", breaking the existing workflows of countless users, and no one notifying the user base or even all the other devs, causing hours of wasted time and confusion)...well...that's pretty bone-headed. The obvious typo in the original comment (the one one would catch by reviewing one's pull request even once) illustrates my point (about slowing down) pretty spectacularly. Hell, a spell checker would have caught that. If I sound frustrated, it's because I am

I'm not affiliated with this project by any means; I'm just a peasant who happens to be facing this issue as well, and I appreciate your diagnostics so far. I'm also using a Pascal-based GPU, with no luck.

That said, while I understand your frustration (I'm affected by this issue too), there's no need to be snarky with the contributors. This is open source and no one is obligated to provide free support; most of us do it out of passion. Next time, avoid expressing your frustration like that toward other developers who owe you nothing - it will hurt more than you think. As more practical and constructive criticism: point out what you think is the cause of the issue and ask how you can help address it, perhaps even patching and recompiling if you know how.

Author
Owner

@oldgithubman commented on GitHub (Jun 3, 2024):

> I appreciate your diagnostics so far

Thank you and you're welcome.

> I understand your frustration as I'm also affected by this issue

Then you don't understand my frustration.

> no one is obligated to provide free support

Straw man.

> avoid expressing your frustration like that

Point considered and rejected.

> developers who owe you nothing

Straw man.

> it will hurt more than you think

You can't know this. If it would hurt *you*, that's a *you* problem and I would suggest recalling the ancient wisdom of "sticks and stones..."

> As a more practical and constructive criticism, you can point out what you think is the cause of the issue

Did you actually read what I wrote?

> and ask how you can help address it,

If the devs need help, they can ask. As they've been doing. And as I've been responding with their requests. You haven't actually read this thread, have you? All I've been doing is helping. You just think I'm mean. I prioritize actually helping over what people think about me. Why didn't you ask how you can help address it, since you think that's valuable advice?

> perhaps even patching and recompiling if you know how to

I don't know how to, but I'd learn if they asked. That would be consistent with my past behavior. At this point, I'm not even sure why I'm wasting time on responding to you. You suggest I help, *when that's what I'm doing here*. Yeah, I'm done. Peace.


@czrpb commented on GitHub (Jul 22, 2024):

Literally like 20min later: I'm an idiot! On Arch Linux, install `ollama-cuda`. Why it took me hours to find that is yet another bit of evidence I probably should be given a keyboard! hahaha!
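(For anyone else hitting this on Arch: the plain `ollama` repo package is a CPU-only build, while `ollama-cuda` is compiled with CUDA support. A minimal sketch of the fix, assuming the standard systemd service name `ollama` used by the install script:)

```shell
# Assumption: Arch Linux with the official repos; 'ollama' is the CPU-only
# build, 'ollama-cuda' is the CUDA-enabled one.
sudo pacman -S ollama-cuda        # replaces the CPU-only package
sudo systemctl restart ollama     # restart the service to pick up the new binary

# Sanity check: the CPU-only build logs the warning from this issue, while the
# CUDA build logs an "inference compute" line listing the detected GPU.
journalctl -u ollama --no-pager | grep -E 'inference compute|Not compiled with GPU offload'
```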

-----
Hi! Here is my equivalent issue and log file. Hope this helps!!

time=2024-07-21T18:55:31.136-07:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[<<<HOME>>/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-07-21T18:55:31.146-07:00 level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.555.58.02 /usr/lib64/libcuda.so.555.58.02]"
CUDA driver version: 12.5
time=2024-07-21T18:55:31.254-07:00 level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/libcuda.so.555.58.02
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA totalMem 7788 mb
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA freeMem 7369 mb
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] Compute Capability 7.5
time=2024-07-21T18:55:31.457-07:00 level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2024-07-21T18:55:31.457-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 library=cuda compute=7.5 driver=12.5 name="NVIDIA GeForce RTX 2070 SUPER" total="7.6 GiB" available="7.2 GiB"

  .  .  .

time=2024-07-21T18:55:37.564-07:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=<<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 gpu=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 parallel=4 available=7727087616 required="2.6 GiB"
time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=server.go:100 msg="system memory" total="31.3 GiB" free="29.4 GiB" free_swap="0 B"
time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.2 GiB]"
time=2024-07-21T18:55:37.564-07:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=19 layers.offload=19 layers.split="" memory.available="[7.2 GiB]" memory.required.full="2.6 GiB" memory.required.partial="2.6 GiB" memory.required.kv="144.0 MiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.2 GiB" memory.weights.repeating="797.2 MiB" memory.weights.nonrepeating="410.2 MiB" memory.graph.full="504.0 MiB" memory.graph.partial="914.2 MiB"

  .  .  .

time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama485515592/runners/cpu_avx2/ollama_llama_server --model <<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 19 --verbose --parallel 4 --port 34295"
time=2024-07-21T18:55:37.565-07:00 level=DEBUG source=server.go:398 msg=subprocess environment="[CUDA_PATH=/opt/cuda PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl LD_LIBRARY_PATH=/tmp/ollama485515592/runners/cpu_avx2:/tmp/ollama485515592/runners]"
time=2024-07-21T18:55:37.565-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="126402163820352" timestamp=1721613337
INFO [main] build info | build=3337 commit="a8db2a9ce" tid="126402163820352" timestamp=1721613337
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="126402163820352" timestamp=1721613337 total_threads=6
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="6" port="34295" tid="126402163820352" timestamp=1721613337

[ollama-cleaner.log](https://github.com/user-attachments/files/16327095/ollama-cleaner.log)
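For what it's worth, the `sched.go:701` line above records the scheduler's fit check: load everything on the GPU when the estimated requirement is at most the available VRAM. A toy re-check of that arithmetic using the numbers from this log (illustrative only, not Ollama's actual code):

```shell
available_bytes=7727087616   # "available=7727087616" from the sched.go:701 line
required_gib="2.6"           # required="2.6 GiB" from the same line

awk -v avail="$available_bytes" -v req="$required_gib" 'BEGIN {
    need = req * 1024 * 1024 * 1024    # GiB -> bytes
    if (need <= avail)
        print "fits: load all layers on GPU"
    else
        print "does not fit: partial offload or CPU fallback"
}'
# prints "fits: load all layers on GPU"
```

With these numbers the model comfortably fits, consistent with the scheduler's own decision; the later "Not compiled with GPU offload support" warning therefore points at a CPU-only runner being launched, not at a VRAM shortfall.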


@oldgithubman commented on GitHub (Jul 22, 2024):

> Literally like 20min later: I'm an idiot! On Arch Linux, install `ollama-cuda`. Why it took me hours to find that is yet another bit of evidence I probably should be given a keyboard! hahaha!
>
> Hi! Here is my equivalent issue and log file. Hope this helps!!
>
> time=2024-07-21T18:55:31.136-07:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
> time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
> time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
> time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[<<<HOME>>/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
> time=2024-07-21T18:55:31.146-07:00 level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.555.58.02 /usr/lib64/libcuda.so.555.58.02]"
> CUDA driver version: 12.5
> time=2024-07-21T18:55:31.254-07:00 level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/libcuda.so.555.58.02
> [GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA totalMem 7788 mb
> [GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA freeMem 7369 mb
> [GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] Compute Capability 7.5
> time=2024-07-21T18:55:31.457-07:00 level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
> releasing cuda driver library
> time=2024-07-21T18:55:31.457-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 library=cuda compute=7.5 driver=12.5 name="NVIDIA GeForce RTX 2070 SUPER" total="7.6 GiB" available="7.2 GiB"
>
>   .  .  .
>
> time=2024-07-21T18:55:37.564-07:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=<<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 gpu=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 parallel=4 available=7727087616 required="2.6 GiB"
> time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=server.go:100 msg="system memory" total="31.3 GiB" free="29.4 GiB" free_swap="0 B"
> time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.2 GiB]"
> time=2024-07-21T18:55:37.564-07:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=19 layers.offload=19 layers.split="" memory.available="[7.2 GiB]" memory.required.full="2.6 GiB" memory.required.partial="2.6 GiB" memory.required.kv="144.0 MiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.2 GiB" memory.weights.repeating="797.2 MiB" memory.weights.nonrepeating="410.2 MiB" memory.graph.full="504.0 MiB" memory.graph.partial="914.2 MiB"
>
>   .  .  .
>
> time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama485515592/runners/cpu_avx2/ollama_llama_server --model <<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 19 --verbose --parallel 4 --port 34295"
> time=2024-07-21T18:55:37.565-07:00 level=DEBUG source=server.go:398 msg=subprocess environment="[CUDA_PATH=/opt/cuda PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl LD_LIBRARY_PATH=/tmp/ollama485515592/runners/cpu_avx2:/tmp/ollama485515592/runners]"
> time=2024-07-21T18:55:37.565-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
> time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
> time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
> WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="126402163820352" timestamp=1721613337
> INFO [main] build info | build=3337 commit="a8db2a9ce" tid="126402163820352" timestamp=1721613337
> INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="126402163820352" timestamp=1721613337 total_threads=6
> INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="6" port="34295" tid="126402163820352" timestamp=1721613337
>
> [ollama-cleaner.log](https://github.com/user-attachments/files/16327095/ollama-cleaner.log)

I switched to llama.cpp. It's better.

Edit - I'm evaluating mistral.rs now. Excellent dev.


@dhiltgen commented on GitHub (Aug 9, 2024):

We've fixed quite a few prediction bugs since 0.1.38, so I'm going to close this one out. If you're still hitting OOMs on 0.3.4, please share what model you were trying to load and the server log, and I'll reopen.


@Shadowfita commented on GitHub (Sep 17, 2024):

I'm having this issue when trying to run llama3.1:70b with an RTX 3080 and 64 GB RAM. It seems the ollama_llama_server.exe shipped with the latest Windows download from the website doesn't have GPU offloading enabled? I get the same error described above.

WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="107816" timestamp=1726533930

@KaloyanGeorgiev99 commented on GitHub (Sep 19, 2024):

> I'm having this issue when trying to run llama3.1:70b with an RTX 3080 and 64 GB RAM. It seems the ollama_llama_server.exe shipped with the latest Windows download from the website doesn't have GPU offloading enabled? I get the same error described above.
>
> WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="107816" timestamp=1726533930

I also have the same issue :(


@Shadowfita commented on GitHub (Sep 19, 2024):

There is definitely an issue with the latest binaries, at least on Windows. My Ollama instance running on my Linux server with a GTX 1070 and 32 GB RAM is able to run Llama 3.1 with an Excel file, offloading accordingly and providing a response, but my Windows PC with an RTX 3080 and 64 GB RAM is giving me an out-of-memory error.


@dhiltgen commented on GitHub (Sep 24, 2024):

@Shadowfita @KaloyanGeorgiev99 can you share more complete server logs? This scenario most likely occurs when we try to recover from a prior crash: the GPU runner fails to start, we fall back to the CPU runner, but incorrectly pass a GPU-related flag. I'd like to see what the earlier error(s) were leading up to this.
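(When gathering those logs, it helps to capture the failure that precedes the CPU fallback, not just the final warning. A hedged sketch for the Linux systemd setup; the filename is hypothetical and the error pattern is only a guess at common failure markers:)

```shell
# Dump the full service log to a file (hypothetical filename).
journalctl -u ollama --no-pager > ollama-full.log

# Show the first failure-looking line with a few lines of context; adjust the
# pattern as needed. Setting OLLAMA_DEBUG=1 on the server makes the logs verbose.
grep -m1 -B3 -A3 -Ei 'error|exit status|failed' ollama-full.log
```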


@Shadowfita commented on GitHub (Sep 24, 2024):

> @Shadowfita @KaloyanGeorgiev99 can you share more complete server logs? This scenario most likely occurs when we're trying to recover from a prior crash failing to start the GPU runner, then fall back to the CPU runner but incorrectly pass a GPU related flag. I'd like to see what the earlier error(s) were leading up to this.

I'll try to send through some logs this afternoon, thanks @dhiltgen.


@dhiltgen commented on GitHub (Oct 16, 2024):

If you're still seeing the problem, please upgrade to the latest release. If that doesn't clear it up, share a more complete server log so I can see why the prior runner crashed, and I'll reopen the issue and investigate.

Reference: github-starred/ollama#64842