[GH-ISSUE #4486] Not compiled with GPU offload support #64842

Closed
opened 2026-05-03 18:56:42 -05:00 by GiteaMirror · 25 comments
Owner

Originally created by @oldgithubman on GitHub (May 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4486

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Trying to use ollama like normal with GPU. Worked before update. Now only using CPU.
$ journalctl -u ollama
reveals
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1

  1. I do not manually compile ollama. I use the standard install script.
  2. Main README.md contains no mention of BLAS

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38

GiteaMirror added the "bug" and "needs more info" labels 2026-05-03 18:56:42 -05:00
Author
Owner

@oldgithubman commented on GitHub (May 17, 2024):

Figured it out. Ollama seems to think the model is too big to fit in VRAM (it isn't - it worked fine before the update). There is a lack of any useful communication about this to the user. As mentioned above, digging in the log actually sends you in the *wrong* direction.

Author
Owner

@jmorganca commented on GitHub (May 17, 2024):

Hi @oldmanjk sorry about this. May I ask which model you are running? and on which GPU?

Author
Owner

@mroxso commented on GitHub (May 17, 2024):

I think I got the same issue.
Running llama2:latest and llama3:latest on my GTX 1660 SUPER.
Worked before, now I updated to the latest Ollama and it seems that it mostly uses CPU which is way slower.

// Update:
// Update:
For me it seems like I had another process blocking my VRAM (a Python process). I saw this with nvidia-smi.
I restarted my local PC and now it works again with GPU for me.

Author
Owner

@jukofyork commented on GitHub (May 17, 2024):

Has anybody an idea of the code we need to remove to stop it ignoring our num_gpu settings (again, sigh...)?

Author
Owner

@jukofyork commented on GitHub (May 17, 2024):

It's at the bottom of llm/memory.go:

        //if memoryRequiredPartial > memoryAvailable {
        //      slog.Debug("insufficient VRAM to load any model layers")
        //      return 0, 0, memoryRequiredTotal
        //}
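
For context, here is a minimal, self-contained sketch of what an all-or-nothing check like the one quoted above implies (the function name and the sizes are illustrative assumptions, not ollama's actual code): if a fixed overhead plus even a single layer doesn't fit, nothing is offloaded at all.

```go
package main

import "fmt"

// estimateLayers returns how many of layerCount layers fit in availableVRAM,
// given a fixed per-layer size and a graph/KV overhead that must always fit.
// It mirrors the all-or-nothing behavior discussed above: if the overhead
// plus one layer exceeds available VRAM, offload nothing at all.
func estimateLayers(availableVRAM, overhead, layerSize uint64, layerCount int) int {
	if overhead+layerSize > availableVRAM {
		return 0 // insufficient VRAM to load any model layers
	}
	n := int((availableVRAM - overhead) / layerSize)
	if n > layerCount {
		n = layerCount
	}
	return n
}

func main() {
	// 24 GiB card, 1.5 GiB overhead, 0.5 GiB per layer, 80-layer model:
	// (24 - 1.5) / 0.5 = 45 layers offloaded.
	gib := uint64(1 << 30)
	fmt.Println(estimateLayers(24*gib, 3*gib/2, gib/2, 80)) // prints 45
}
```

The quoted commented-out block is the `return 0` branch of a check like this one.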
Author
Owner

@oldgithubman commented on GitHub (May 17, 2024):

> Hi @oldmanjk sorry about this. May I ask which model you are running? and on which GPU?

llama3 on a 1080 Ti

Author
Owner

@oldgithubman commented on GitHub (May 17, 2024):

> I think I got the same issue. Running llama2:latest and llama3:latest on my GTX 1660 SUPER. Worked before, now I updated to the latest Ollama and it seems that it mostly uses CPU which is way slower.
>
> // Update: For me it seems like I had another process blocking my VRAM (a Python process). I saw this with nvidia-smi. I restarted my local PC and now it works again with GPU for me.

Definitely worth keeping an eye on your GPU memory (which I do - I keep a widget in view at all times - that wasn't the issue for me)

Author
Owner

@oldgithubman commented on GitHub (May 17, 2024):

> Has anybody an idea of the code we need to remove to stop it ignoring our `num_gpu` settings (again, sigh...)?

Also weird is how, if ollama thinks it can't fit the entire model in VRAM, it doesn't attempt to put any layers in VRAM. I actually like this behavior though because it makes it obvious something is wrong. Still, more communication to the user would be good

Author
Owner

@uncomfyhalomacro commented on GitHub (May 18, 2024):

Got the same issue here on openSUSE Tumbleweed. One thing I noticed: it uses the GPU for a moment, then it's gone...

[Screencast_20240518_221101.webm](https://github.com/ollama/ollama/assets/66054069/7ddafa42-b21f-402c-a77f-a3febc6754e6)

Author
Owner

@dhiltgen commented on GitHub (May 21, 2024):

We've recently introduced ollama ps which will help show how much of the model has loaded into VRAM.

We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.

@oldmanjk can you clarify your problem? Perhaps ollama ps output and server log can help us understand what's going on.
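
The three-way policy dhiltgen describes (full load / partial load / CPU fallback) can be sketched as follows; the function and constant names are illustrative, not ollama's actual identifiers:

```go
package main

import "fmt"

type placement string

const (
	fullGPU    placement = "100% GPU"
	partialGPU placement = "partial GPU"
	cpuOnly    placement = "100% CPU"
)

// decide models the scheduling policy described above: fall back to CPU if
// even the minimum footprint doesn't fit, partially offload if the minimum
// fits but the full model doesn't, otherwise load everything on the GPU.
// Sizes are in whatever consistent unit the caller uses (e.g. GiB).
func decide(availableVRAM, minRequired, fullRequired uint64) placement {
	switch {
	case minRequired > availableVRAM:
		return cpuOnly
	case fullRequired > availableVRAM:
		return partialGPU
	default:
		return fullGPU
	}
}

func main() {
	fmt.Println(decide(8, 3, 41))  // 41 GiB model, 8 GiB free: prints "partial GPU"
	fmt.Println(decide(8, 10, 41)) // can't even fit the minimum: prints "100% CPU"
}
```

Under this model, the `41 GB ... 100% CPU` output reported below would mean the minimum-footprint estimate, not just the full-model estimate, exceeded the estimated available VRAM.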

Author
Owner

@oldgithubman commented on GitHub (May 21, 2024):

> We've recently introduced `ollama ps` which will help show how much of the model has loaded into VRAM.
>
> We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.
>
> @oldmanjk can you clarify your problem? Perhaps `ollama ps` output and server log can help us understand what's going on.

I'm not at a terminal atm, but ollama refuses to load the same size models it used to and that other back ends will (like ooba with llama-cpp-python). Depending on the model/quant, I have to reduce num_gpu by a few layers compared to old ollama or ooba. When you've carefully optimized your quants like I have, this is the difference between fully-offloaded and not. On a repurposed mining rig, this destroys performance. Also, if I don't change the modelfile (which is a pain on a slow rig), ollama won't offload anything to GPU.

<!-- gh-comment-id:2123478173 --> @oldgithubman commented on GitHub (May 21, 2024): > We've recently introduced `ollama ps` which will help show how much of the model has loaded into VRAM. > > We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU. > > @oldmanjk can you clarify your problem? Perhaps `ollama ps` output and server log can help us understand what's going on. I'm not at a terminal atm, but ollama refuses to load the same size models it used to and that other back ends will (like ooba with llama-cpp-python). Depending on the model/quant, I have to reduce num_gpu by a few layers compared to old ollama or ooba. When you've carefully optimized your quants like i have, this is the difference between fully-offloaded and not. On a repurposed mining rig, this destroys performance. Also, if I don't change the modelfile (which is a pain on a slow rig), ollama won't offload anything to gpu
Author
Owner

@oldgithubman commented on GitHub (May 22, 2024):

Example walkthrough:

  1. Determine that ooba/llama-cpp-python can load my latest Meta-Llama-3-70B-Instruct-Q3_K_L quant and run over 1K of context (which usually means it will run full context without crashing) with 8K context and 48 layers offloaded to GPU.
  2. So now I'll go to the Modelfile and set num_ctx to 8192, num_gpu to 48, num_thread to 32 (to get ollama to use 24 threads (sigh)), and import the gguf into ollama.
  3. About a minute and 37.1 GB of wear and tear on my nvme later, success.
  4. Attempt to call model.
  5. ollama offloads nothing to GPU (or system RAM, for that matter (outside of a few GiB) - it's running it straight off the nvme? Why? I have over 90 GiB of RAM free)
    $ ollama ps
    NAME                                            ID              SIZE    PROCESSOR       UNTIL              
    Meta-Llama-3-70B-Instruct-Q3_K_L-8K:latest      b9345a582769    41 GB   100% CPU        4 minutes from now
    
    ollama_logs.txt attached - note the three locations where I've highlighted falsehoods claimed by ollama.
    [ollama_logs.txt](https://github.com/ollama/ollama/files/15397123/ollama_logs.txt)
  6. ollama rm Meta-Llama-3-70B-Instruct-Q3_K_L-8K (autocomplete would be nice)
  7. Go back to the Modelfile, set num_gpu to 47, and import the gguf into ollama again.
  8. Wait another minute or so (on this very-fast machine - on the old mining rig this can take upwards of ten minutes).
  9. Another 37.1 GB of wear and tear on my nvme later, success.
  10. Attempt to call model.
  11. Lucky! This time it works. Sometimes I have to repeat these steps a few times. 22.8 / 24.0 GiB used - that layer should have fit (in fact, we already know it does).

Edit - Now ollama is using all 32 threads (I want it to use 24 probably) and basically 0% GPU. I have no idea what's going on here.
Edit - Removing num_thread produces 20% CPU utilization, whereas before I was seeing 10%. I don't know what's going on here either. Assuming we want all physical cores utilized, it should be 24/32 or 75%
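
For reference, steps 2 and 7 of the walkthrough above amount to editing a Modelfile along these lines before re-importing (the `FROM` path is a placeholder; the parameter values are the ones quoted in the walkthrough):

```
FROM ./Meta-Llama-3-70B-Instruct-Q3_K_L.gguf
PARAMETER num_ctx 8192
PARAMETER num_gpu 48
PARAMETER num_thread 32
```

followed by `ollama create Meta-Llama-3-70B-Instruct-Q3_K_L-8K -f Modelfile`; step 7 just lowers `num_gpu` to 47 and repeats the import.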

Author
Owner

@dhiltgen commented on GitHub (May 31, 2024):

@oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the runners/cpu_avx2/ollama_llama_server CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?

    sudo systemctl stop ollama
    OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log
Author
Owner

@oldgithubman commented on GitHub (Jun 1, 2024):

> @oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the `runners/cpu_avx2/ollama_llama_server` CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?
>
>     sudo systemctl stop ollama
>     OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

[requested.log](https://github.com/user-attachments/files/15520305/requested.log)

What is clear, from *both* logs (as I already pointed out in the previous log), is that ollama is wrong about memory, both total and available. Ollama says my NVIDIA GeForce RTX 4090 (founder's edition - as standard as it gets) has 23.6 GiB total memory (*obviously* wrong) and 23.2 GiB available memory (also wrong). The true numbers, according to nvidia-smi, are 24564 MiB (24.0 GiB, of course) total memory and 55 MiB used (24564 MiB - 55 MiB = 24509 MiB = 23.9 GiB available memory). So ollama *thinks* I have less memory than I do, so it refuses to load models it used to load just fine. Hence why *not* offloading a layer or two to GPU causes it to work again.

I think you have all the information you need from me. You just need to figure out why ollama is incorrectly detecting memory. If I had to guess, it's probably a classic case of wrong units or conversions thereof (GiB vs GB). You know, that thing they beat into our heads to be careful about in high school science class. The thing that caused the Challenger disaster. Y'all need to slow down, be more careful, and put out good code. This would, paradoxically, give you *more* time because you wouldn't have to spend so much time putting out fires. Again, all of this information was already available, so this was an unnecessary waste of my time too. I've attached the requested log anyway.

Edit - I'm no software dev, but...maybe start here: https://github.com/ollama/ollama/pull/4328
If I'm right that that's the problem (that a dev arbitrarily decided to shave a layer of space off as a "buffer", breaking the existing workflows of countless users, and no one notifying the user base or even *all the other devs*, causing hours of wasted time and confusion)...well...that's pretty bone-headed. The obvious typo in the original comment (the one one would catch by reviewing one's pull request *even once*) illustrates my point (about slowing down) pretty spectacularly. Hell, a *spell checker* would have caught that. If I sound frustrated, it's because I am
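
The GiB-vs-GB guess above is easy to sanity-check with arithmetic. A quick sketch (the 24564 MiB figure is the nvidia-smi total quoted above; the helper names are mine):

```go
package main

import "fmt"

// miBToGiB and miBToGB make the binary-vs-decimal distinction explicit:
// 1 GiB = 1024 MiB = 2^30 bytes, while 1 GB = 10^9 bytes.
func miBToGiB(mib float64) float64 { return mib / 1024 }
func miBToGB(mib float64) float64  { return mib * 1024 * 1024 / 1e9 }

func main() {
	const total = 24564.0 // nvidia-smi total for the RTX 4090 above, in MiB
	fmt.Printf("%.1f GiB / %.1f GB\n", miBToGiB(total), miBToGB(total))
	// prints: 24.0 GiB / 25.8 GB
}
```

So 24564 MiB is about 24.0 GiB (binary) or 25.8 GB (decimal); the 23.6 GiB figure ollama reports matches neither exactly, which is what makes the discrepancy worth tracing.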

Author
Owner

@kriansa commented on GitHub (Jun 3, 2024):

> > @oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the `runners/cpu_avx2/ollama_llama_server` CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?
> >
> >     sudo systemctl stop ollama
> >     OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log
>
> [requested.log](https://github.com/user-attachments/files/15520305/requested.log)
>
> What is clear, from both logs (as I already pointed out in the previous log), is ollama is wrong about memory, both total and available. Ollama says my NVIDIA GeForce RTX 4090 (founder's edition - as standard as it gets) has 23.6 GiB total memory (obviously wrong) and 23.2 GiB available memory (also wrong). The true numbers, according to nvidia-smi, are 24564 MiB (24.0 GiB, of course) total memory and 55 MiB used (24564 MiB - 55 MiB = 24509 MiB = 23.9 GiB available memory). So ollama thinks I have less memory than I do, so it refuses to load models it used to load just fine. Hence why not offloading a layer or two to GPU causes it to work again. I think you have all the information you need from me. You just need to figure out why ollama is incorrectly detecting memory. If I had to guess, it's probably a classic case of wrong units or conversions thereof (GiB vs GB). You know, that thing they beat into our heads to be careful about in high school science class. The thing that caused the Challenger disaster. Y'all need to slow down, be more careful, and put out good code. This would, paradoxically, give you more time because you wouldn't have to spend so much time putting out fires. Again, all of this information was already available, so this was an unnecessary waste of my time too. I've attached the requested log anyway.
>
> Edit - I'm no software dev, but...maybe start here: #4328 If I'm right that that's the problem (that a dev arbitrarily decided to shave a layer of space off as a "buffer", breaking the existing workflows of countless users, and no one notifying the user base or even all the other devs, causing hours of wasted time and confusion)...well...that's pretty bone-headed. The obvious typo in the original comment (the one one would catch by reviewing one's pull request even once) illustrates my point (about slowing down) pretty spectacularly. Hell, a spell checker would have caught that. If I sound frustrated, it's because I am

I'm not affiliated with this project by any means; I'm just a peasant who happens to be facing this issue as well, and I appreciate your diagnostics so far. I'm also using a Pascal-based GPU, with no luck.

That said, while I understand your frustration (I'm affected by this issue too), there's no need to be snarky with the contributors. This is open source and no one is obligated to provide free support; most of us do it out of passion. Next time, avoid expressing your frustration like that toward other developers who owe you nothing - it will hurt more than you think. As more practical and constructive criticism: point out what you think is the cause of the issue and ask how you can help address it, perhaps even patching and recompiling if you know how.

Author
Owner

@oldgithubman commented on GitHub (Jun 3, 2024):

> I appreciate your diagnostics so far

Thank you and you're welcome.

> I understand your frustration as I'm also affected by this issue

Then you don't understand my frustration.

> no one is obligated to provide free support

Straw man.

> avoid expressing your frustration like that

Point considered and rejected.

> developers who owe you nothing

Straw man.

> it will hurt more than you think

You can't know this. If it would hurt *you*, that's a *you* problem and I would suggest recalling the ancient wisdom of "sticks and stones..."

> As a more practical and constructive criticism, you can point out what you think is the cause of the issue

Did you actually read what I wrote?

> and ask how you can help address it,

If the devs need help, they can ask. As they've been doing. And as I've been responding with their requests. You haven't actually read this thread, have you? All I've been doing is helping. You just think I'm mean. I prioritize actually helping over what people think about me. Why didn't you ask how you can help address it, since you think that's valuable advice?

> perhaps even patching and recompiling if you know how to

I don't know how to, but I'd learn if they asked. That would be consistent with my past behavior. At this point, I'm not even sure why I'm wasting time on responding to you. You suggest I help, *when that's what I'm doing here*. Yeah, I'm done. Peace.


@czrpb commented on GitHub (Jul 22, 2024):

Literally like 20min later: I'm an idiot! On Arch Linux, install `ollama-cuda`. Why it took me hours to find that is yet another bit of evidence I probably should be given a keyboard! hahaha!
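(For anyone else hitting this on Arch: the plain `ollama` repo package is a CPU-only build, while `ollama-cuda` is compiled with CUDA support. A minimal sketch of the fix, assuming the standard systemd service name `ollama` used by the install script:)

```shell
# Assumption: Arch Linux with the official repos; 'ollama' is the CPU-only
# build, 'ollama-cuda' is the CUDA-enabled one.
sudo pacman -S ollama-cuda        # replaces the CPU-only package
sudo systemctl restart ollama     # restart the service to pick up the new binary

# Sanity check: the CPU-only build logs the warning from this issue, while the
# CUDA build logs an "inference compute" line listing the detected GPU.
journalctl -u ollama --no-pager | grep -E 'inference compute|Not compiled with GPU offload'
```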

-----
Hi! Here is my equivalent issue and log file. Hope this helps!!

time=2024-07-21T18:55:31.136-07:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[<<<HOME>>/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-07-21T18:55:31.146-07:00 level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.555.58.02 /usr/lib64/libcuda.so.555.58.02]"
CUDA driver version: 12.5
time=2024-07-21T18:55:31.254-07:00 level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/libcuda.so.555.58.02
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA totalMem 7788 mb
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA freeMem 7369 mb
[GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] Compute Capability 7.5
time=2024-07-21T18:55:31.457-07:00 level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2024-07-21T18:55:31.457-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 library=cuda compute=7.5 driver=12.5 name="NVIDIA GeForce RTX 2070 SUPER" total="7.6 GiB" available="7.2 GiB"

  .  .  .

time=2024-07-21T18:55:37.564-07:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=<<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 gpu=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 parallel=4 available=7727087616 required="2.6 GiB"
time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=server.go:100 msg="system memory" total="31.3 GiB" free="29.4 GiB" free_swap="0 B"
time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.2 GiB]"
time=2024-07-21T18:55:37.564-07:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=19 layers.offload=19 layers.split="" memory.available="[7.2 GiB]" memory.required.full="2.6 GiB" memory.required.partial="2.6 GiB" memory.required.kv="144.0 MiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.2 GiB" memory.weights.repeating="797.2 MiB" memory.weights.nonrepeating="410.2 MiB" memory.graph.full="504.0 MiB" memory.graph.partial="914.2 MiB"

  .  .  .

time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama485515592/runners/cpu_avx2/ollama_llama_server --model <<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 19 --verbose --parallel 4 --port 34295"
time=2024-07-21T18:55:37.565-07:00 level=DEBUG source=server.go:398 msg=subprocess environment="[CUDA_PATH=/opt/cuda PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl LD_LIBRARY_PATH=/tmp/ollama485515592/runners/cpu_avx2:/tmp/ollama485515592/runners]"
time=2024-07-21T18:55:37.565-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="126402163820352" timestamp=1721613337
INFO [main] build info | build=3337 commit="a8db2a9ce" tid="126402163820352" timestamp=1721613337
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="126402163820352" timestamp=1721613337 total_threads=6
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="6" port="34295" tid="126402163820352" timestamp=1721613337

[ollama-cleaner.log](https://github.com/user-attachments/files/16327095/ollama-cleaner.log)
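For what it's worth, the `sched.go:701` line above records the scheduler's fit check: load everything on the GPU when the estimated requirement is at most the available VRAM. A toy re-check of that arithmetic using the numbers from this log (illustrative only, not Ollama's actual code):

```shell
available_bytes=7727087616   # "available=7727087616" from the sched.go:701 line
required_gib="2.6"           # required="2.6 GiB" from the same line

awk -v avail="$available_bytes" -v req="$required_gib" 'BEGIN {
    need = req * 1024 * 1024 * 1024    # GiB -> bytes
    if (need <= avail)
        print "fits: load all layers on GPU"
    else
        print "does not fit: partial offload or CPU fallback"
}'
# prints "fits: load all layers on GPU"
```

With these numbers the model comfortably fits, consistent with the scheduler's own decision; the later "Not compiled with GPU offload support" warning therefore points at a CPU-only runner being launched, not at a VRAM shortfall.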


@oldgithubman commented on GitHub (Jul 22, 2024):

> Literally like 20min later: I'm an idiot! On Arch Linux, install `ollama-cuda`. Why it took me hours to find that is yet another bit of evidence I probably should be given a keyboard! hahaha!
>
> Hi! Here is my equivalent issue and log file. Hope this helps!!
>
> time=2024-07-21T18:55:31.136-07:00 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
> time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
> time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
> time=2024-07-21T18:55:31.136-07:00 level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[<<<HOME>>/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
> time=2024-07-21T18:55:31.146-07:00 level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths="[/usr/lib/libcuda.so.555.58.02 /usr/lib64/libcuda.so.555.58.02]"
> CUDA driver version: 12.5
> time=2024-07-21T18:55:31.254-07:00 level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/libcuda.so.555.58.02
> [GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA totalMem 7788 mb
> [GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] CUDA freeMem 7369 mb
> [GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8] Compute Capability 7.5
> time=2024-07-21T18:55:31.457-07:00 level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
> releasing cuda driver library
> time=2024-07-21T18:55:31.457-07:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 library=cuda compute=7.5 driver=12.5 name="NVIDIA GeForce RTX 2070 SUPER" total="7.6 GiB" available="7.2 GiB"
>
>   .  .  .
>
> time=2024-07-21T18:55:37.564-07:00 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=<<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 gpu=GPU-ccd02350-1222-88d2-1eda-c906b3aff9a8 parallel=4 available=7727087616 required="2.6 GiB"
> time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=server.go:100 msg="system memory" total="31.3 GiB" free="29.4 GiB" free_swap="0 B"
> time=2024-07-21T18:55:37.564-07:00 level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[7.2 GiB]"
> time=2024-07-21T18:55:37.564-07:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=19 layers.offload=19 layers.split="" memory.available="[7.2 GiB]" memory.required.full="2.6 GiB" memory.required.partial="2.6 GiB" memory.required.kv="144.0 MiB" memory.required.allocations="[2.6 GiB]" memory.weights.total="1.2 GiB" memory.weights.repeating="797.2 MiB" memory.weights.nonrepeating="410.2 MiB" memory.graph.full="504.0 MiB" memory.graph.partial="914.2 MiB"
>
>   .  .  .
>
> time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama485515592/runners/cpu_avx2/ollama_llama_server --model <<<HOME>>/.ollama/models/blobs/sha256-dd0c6f2ea876e4c433325df3398386f24e00d321abf6cec197c1bc1fcf1e0025 --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 19 --verbose --parallel 4 --port 34295"
> time=2024-07-21T18:55:37.565-07:00 level=DEBUG source=server.go:398 msg=subprocess environment="[CUDA_PATH=/opt/cuda PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/opt/cuda/bin:/opt/cuda/nsight_compute:/opt/cuda/nsight_systems/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl LD_LIBRARY_PATH=/tmp/ollama485515592/runners/cpu_avx2:/tmp/ollama485515592/runners]"
> time=2024-07-21T18:55:37.565-07:00 level=INFO source=sched.go:437 msg="loaded runners" count=1
> time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
> time=2024-07-21T18:55:37.565-07:00 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
> WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="126402163820352" timestamp=1721613337
> INFO [main] build info | build=3337 commit="a8db2a9ce" tid="126402163820352" timestamp=1721613337
> INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="126402163820352" timestamp=1721613337 total_threads=6
> INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="6" port="34295" tid="126402163820352" timestamp=1721613337
>
> [ollama-cleaner.log](https://github.com/user-attachments/files/16327095/ollama-cleaner.log)

I switched to llama.cpp. It's better.

Edit - I'm evaluating mistral.rs now. Excellent dev.


@dhiltgen commented on GitHub (Aug 9, 2024):

We've fixed quite a few prediction bugs since 0.1.38, so I'm going to close this one out. If you're still hitting OOMs on 0.3.4, please share what model you were trying to load and the server log, and I'll reopen.


@Shadowfita commented on GitHub (Sep 17, 2024):

I'm having this issue when trying to run llama3.1:70b with an RTX 3080 and 64 GB RAM. It seems the ollama_llama_server.exe shipped with the latest Windows download from the website doesn't have GPU offloading enabled? I get the same error described above.

WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="107816" timestamp=1726533930

@KaloyanGeorgiev99 commented on GitHub (Sep 19, 2024):

> I'm having this issue when trying to run llama3.1:70b with an RTX 3080 and 64 GB RAM. It seems the ollama_llama_server.exe shipped with the latest Windows download from the website doesn't have GPU offloading enabled? I get the same error described above.
>
> WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1 tid="107816" timestamp=1726533930

I also have the same issue :(


@Shadowfita commented on GitHub (Sep 19, 2024):

There is definitely an issue with the latest binaries, at least on Windows. My Ollama instance running on my Linux server with a GTX 1070 and 32 GB RAM is able to run Llama 3.1 with an Excel file, offloading accordingly and providing a response, but my Windows PC with an RTX 3080 and 64 GB RAM is giving me an out-of-memory error.


@dhiltgen commented on GitHub (Sep 24, 2024):

@Shadowfita @KaloyanGeorgiev99 can you share more complete server logs? This scenario most likely occurs when we try to recover from a prior crash: the GPU runner fails to start, we fall back to the CPU runner, but incorrectly pass a GPU-related flag. I'd like to see what the earlier error(s) were leading up to this.
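(When gathering those logs, it helps to capture the failure that precedes the CPU fallback, not just the final warning. A hedged sketch for the Linux systemd setup; the filename is hypothetical and the error pattern is only a guess at common failure markers:)

```shell
# Dump the full service log to a file (hypothetical filename).
journalctl -u ollama --no-pager > ollama-full.log

# Show the first failure-looking line with a few lines of context; adjust the
# pattern as needed. Setting OLLAMA_DEBUG=1 on the server makes the logs verbose.
grep -m1 -B3 -A3 -Ei 'error|exit status|failed' ollama-full.log
```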


@Shadowfita commented on GitHub (Sep 24, 2024):

> @Shadowfita @KaloyanGeorgiev99 can you share more complete server logs? This scenario most likely occurs when we're trying to recover from a prior crash failing to start the GPU runner, then fall back to the CPU runner but incorrectly pass a GPU related flag. I'd like to see what the earlier error(s) were leading up to this.

I'll try to send through some logs this afternoon, thanks @dhiltgen.


@dhiltgen commented on GitHub (Oct 16, 2024):

If you're still seeing the problem, please upgrade to the latest release. If that doesn't clear it up, share a more complete server log so I can see why the prior runner crashed, and I'll reopen the issue and investigate.

Reference: github-starred/ollama#64842