[GH-ISSUE #2229] ollama-cuda using vram but not gpu #47790

Closed
opened 2026-04-28 05:20:53 -05:00 by GiteaMirror · 4 comments

Originally created by @Rabcor on GitHub (Jan 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2229

As the title says, I have a problem on EndeavourOS (Arch): if I try to run ollama-cuda it will happily eat my VRAM, but it won't really touch my GPU, instead running almost entirely on the CPU.

![24-01-56- kitty](https://github.com/ollama/ollama/assets/5684325/d7ff0b54-2b06-474a-a033-0bb75ccc8f5c)

Here are also the seemingly relevant terminal outputs:

```
2024/01/27 16:39:08 payload_common.go:145: INFO Dynamic LLM libraries [cuda_v12 cpu cpu_avx2 cpu_avx]
2024/01/27 16:39:08 gpu.go:91: INFO Detecting GPU type
2024/01/27 16:39:08 gpu.go:210: INFO Searching for GPU management library libnvidia-ml.so
2024/01/27 16:39:08 gpu.go:256: INFO Discovered GPU libraries: [/opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so /usr/lib/libnvidia-ml.so.545.29.06 /usr/lib32/libnvidia-ml.so.545.29.06 /usr/lib64/libnvidia-ml.so.545.29.06]

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Linked to libnvidia-ml library at wrong path : /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

2024/01/27 16:39:08 gpu.go:267: INFO Unable to load CUDA management library /opt/cuda/targets/x86_64-linux/lib/stubs/libnvidia-ml.so: nvml vram init failure: 9
2024/01/27 16:39:08 gpu.go:96: INFO Nvidia GPU detected
2024/01/27 16:39:08 gpu.go:137: INFO CUDA Compute Capability detected: 8.6
[GIN] 2024/01/27 - 16:39:08 | 200 |      72.506µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/01/27 - 16:39:08 | 200 |    2.094985ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/01/27 - 16:39:14 | 200 |      23.054µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/01/27 - 16:39:14 | 200 |     392.387µs |       127.0.0.1 | POST     "/api/show"
[GIN] 2024/01/27 - 16:39:14 | 200 |     253.937µs |       127.0.0.1 | POST     "/api/show"
2024/01/27 16:39:14 gpu.go:137: INFO CUDA Compute Capability detected: 8.6
2024/01/27 16:39:14 gpu.go:137: INFO CUDA Compute Capability detected: 8.6
2024/01/27 16:39:14 cpu_common.go:11: INFO CPU has AVX2
loading library /tmp/ollama1142629703/cuda_v12/libext_server.so
2024/01/27 16:39:14 dyn_ext_server.go:90: INFO Loading Dynamic llm server: /tmp/ollama1142629703/cuda_v12/libext_server.so
2024/01/27 16:39:14 dyn_ext_server.go:139: INFO Initializing llama server
system info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3070 Ti Laptop GPU, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 22 key-value pairs and 543 tensors from /home/rabcor/.ollama/models/blobs/sha256:5dec2af2b0468ea0ff2bbb7c79fb91b73a7346e0853717c9b7821f006854c9bb (version GGUF V3 (latest))
...
...
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  = 16760.45 MiB
llm_load_tensors: VRAM used           = 6433.44 MiB
llm_load_tensors: offloading 17 repeating layers to GPU
llm_load_tensors: offloaded 17/61 layers to GPU
```

(This might be a duplicate of https://github.com/ollama/ollama/issues/2064 and/or https://github.com/ollama/ollama/issues/2120; I mention 2120 in particular because I also hit the issue described there, with the ollama server crashing when CUDA runs out of VRAM, so the two might be related.)


@easp commented on GitHub (Jan 27, 2024):

Only 17 out of 61 layers are on the GPU, presumably because your GPU doesn't have enough VRAM for more. Your GPU looks idle because it's spending most of its time waiting for the CPU to process the roughly two-thirds of the model that doesn't fit in VRAM.
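
A minimal sketch of how one could pin the offload split explicitly, using ollama's `num_gpu` request option (the model name and the value 17 below are illustrative, not taken from this issue):

```
# Cap the number of layers ollama offloads to the GPU for one request.
# num_gpu is ollama's standard option for this; 17 just mirrors what the
# loader picked automatically in the log above ("offloaded 17/61 layers").
curl http://localhost:11434/api/generate -d '{
  "model": "mymodel",
  "prompt": "why is the sky blue?",
  "options": { "num_gpu": 17 }
}'
```

Note that raising `num_gpu` past what actually fits in VRAM tends to produce CUDA out-of-memory errors rather than faster inference.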


@remy415 commented on GitHub (Jan 31, 2024):

@Rabcor I would also fix your LD_LIBRARY_PATH so that /usr/lib64 comes before /opt/cuda/targets/etcetetc

`export LD_LIBRARY_PATH="/usr/lib64:$LD_LIBRARY_PATH"`

It probably loaded the correct library, but this will help eliminate potential edge cases.

Also, as easp said: the model you're loading is too large; try running Mistral 7B (`ollama run mistral`).
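
A hedged sketch of how one could check which NVML library is visible and make the path fix stick (the systemd drop-in assumes ollama runs as a service, which this issue doesn't confirm):

```
# List every libnvidia-ml.so the dynamic linker knows about; the driver's
# copy under /usr/lib* should win over the CUDA toolkit stub in /opt/cuda
ldconfig -p | grep libnvidia-ml

# For an interactive run, prepend the driver's library directory
export LD_LIBRARY_PATH="/usr/lib64:$LD_LIBRARY_PATH"
ollama serve

# If ollama runs as a systemd service instead, set it in a drop-in:
#   sudo systemctl edit ollama
#   [Service]
#   Environment="LD_LIBRARY_PATH=/usr/lib64:/usr/lib"
# then restart it: sudo systemctl restart ollama
```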


@Rabcor commented on GitHub (Feb 2, 2024):

> @Rabcor I would also fix your LD_LIBRARY_PATH so that /usr/lib64 comes before /opt/cuda/targets/etcetetc
>
> `export LD_LIBRARY_PATH="/usr/lib64:$LD_LIBRARY_PATH"`
>
> It probably loaded the correct library but this will help eliminate potential edge cases.
>
> Also as easp said: the model you're loading is too large, try running mistral 7b (`ollama run mistral`).

Yeah, I'll try with mistral, but I'm not sure how to handle the LD_LIBRARY_PATH thing you mentioned; it isn't set by default in my environment.

Also, I'm not quite sure what changed since I last tried to run ollama, but I am not getting the warnings from the OP anymore:

![image](https://github.com/ollama/ollama/assets/5684325/bb032f12-37a2-4cab-955b-acfde19c3237)

So I guess it should be fine?
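
One way to double-check which NVML library the running server actually mapped (a hedged check; assumes `lsof` and `pgrep` are installed):

```
# Show the files mapped by the running ollama process and filter for the
# NVML library; the path should point at the driver's copy under /usr/lib*,
# not the stub under /opt/cuda/targets/.../stubs
lsof -p "$(pgrep -x ollama)" | grep libnvidia-ml
```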


@Rabcor commented on GitHub (Feb 2, 2024):

After trying with mistral, which fits entirely in my VRAM, I found that it does indeed use my GPU for all the processing, so I guess I can just close this. Thanks for explaining this to me, guys!
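
For anyone landing here later, a quick hedged way to confirm a model is fully offloaded (exact layer counts vary by model):

```
# Run a model small enough for your VRAM...
ollama run mistral

# ...and in a second terminal, watch GPU utilization climb during generation
watch -n 1 nvidia-smi

# In the server log, a fully offloaded model shows matching counts, e.g.
#   llm_load_tensors: offloaded N/N layers to GPU
# while a partial offload, as in this issue, shows something like 17/61.
```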
