[GH-ISSUE #9924] can't run model in nvidia 4090d #53009

Open
opened 2026-04-29 01:39:18 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @OpenPie-DTXLab on GitHub (Mar 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9924

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

NVIDIA driver and CUDA versions: NVIDIA-Linux-x86_64-535.171.04.run, cuda_12.1.0_530.30.02_linux.run
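
For anyone triaging a report like this, it helps to confirm what the driver actually exposes. The CUDA version printed by `nvidia-smi` is the driver's supported runtime ceiling, not the installed toolkit, so a 535-series driver alongside a 12.1 toolkit is plausible. A minimal check, assuming `nvcc` is on PATH:

```shell
# Driver version, GPU name, and compute capability as the driver reports them
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv

# Installed CUDA toolkit version (may lag the driver's supported ceiling)
nvcc --version
```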

Screenshots (from the original issue):

- https://github.com/user-attachments/assets/758771d2-60e8-46b0-8c27-c7793750719e
- https://github.com/user-attachments/assets/db79efeb-2ce0-408d-a850-fce84cd14197
- https://github.com/user-attachments/assets/09591806-7f47-45b7-a90e-eafde8aded2d
- https://github.com/user-attachments/assets/a4096b70-456c-490f-b594-a37cdbe0c80b
- https://github.com/user-attachments/assets/3a4ecc86-b9ce-4d26-9590-e1601c3c9d78

Relevant log output

(none provided)
OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.6.2

GiteaMirror added the bug, needs more info, nvidia, windows labels 2026-04-29 01:39:19 -05:00
Author
Owner

@OpenPie-DTXLab commented on GitHub (Mar 21, 2025):

Any suggestion is appreciated.

Author
Owner

@gus147 commented on GitHub (Mar 22, 2025):

My 4090 suddenly isn't working either.

[gus147@Clevo gusAI]$ ollama serve
2025/03/22 10:23:12 routes.go:1230: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/gus147/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-03-22T10:23:12.710+05:45 level=INFO source=images.go:432 msg="total blobs: 105"
time=2025-03-22T10:23:12.711+05:45 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-22T10:23:12.711+05:45 level=INFO source=routes.go:1297 msg="Listening on 127.0.0.1:11434 (version 0.6.2)"
time=2025-03-22T10:23:12.712+05:45 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-22T10:23:12.811+05:45 level=INFO source=types.go:130 msg="inference compute" id=GPU-caec3a31-0206-a53d-2803-06c6288efa81 library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090 Laptop GPU" total="15.7 GiB" available="14.9 GiB"
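
A side note for anyone comparing against this log: successful detection at startup (the "inference compute" line above) does not guarantee inference actually runs on the GPU. A quick cross-check while a model is loaded, assuming a standard install:

```shell
# The PROCESSOR column shows the CPU/GPU split for each loaded model
ollama ps

# The runner process should also appear here with VRAM allocated
nvidia-smi
```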

Author
Owner

@FlyRaytheon commented on GitHub (Mar 24, 2025):

Have you ever seen a message like `msg="waiting for server to become available" status="llm server error"`? I encountered a similar problem. `ollama ps` showed the model fully loaded on the GPU, but after this error it was actually loaded on the CPU, resulting in slow inference. The following is an excerpt from my error log:

initializing /usr/lib/x86_64-linux-gnu/libcuda.so.535.183.01

msg="inference compute" id=GPU-25a04dbd-e249-8e05-20fd-6051811bf9cc library=cuda variant=v12 compute=9.0 driver=12.2 name="NVIDIA H800 PCIe" total="79.1 GiB" available="78.6 GiB"

msg="updating cuda memory data" gpu=GPU-25a04dbd-e249-8e05-20fd6051811bf9cc name="NVIDIA H800 PCIe" overhead="0 B" before.total="79.1 GiB" before.free="78.6 GiB" now.total="79.1 GiB" now.free="78.6 GiB" now.used="470.6 MiB"

msg=evaluating library=cuda gpu_count=1 available="[78.6 GiB]"

msg="new model will fit in available VRAM in single GPU, loading" model=/home/te/.ollama/models/blobs/sha256-7ccc6415b2c7cb61ff8e01fec069d6f2fd6e213c509824d642c8a15c3d002e73 gpu=GPU-25a04dbd-e249-8e05-20fd-6051811bf9cc parallel=4 available=84449558528 required="21.5 GiB"

msg="compatible gpu libraries" compatible=[cuda_v11]

msg="adding gpu library" path=/usr/local/lib/ollama/cuda_v11

msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /home/te/.ollama/models/blobs/sha256-
7ccc6415b2c7cb61ff8e01fec069d6f2fd6e213c509824d642c8a15c3d002e73 --ctxsize 8192 --batch-size 512 --n-gpu-layers 65 --verbose --threads 64 --parallel 4 --port 41195"

msg=subprocess environment="[PATH=/run/user/1011/fnm_multishells/3535267_1742817394919/bin:/usr/local/miniconda3/bin:/usr/local/miniconda3/condabin:/usr/local/miniconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/te/env LD_LIBRARY_PATH=/usr/local/lib/ollama/cuda_v11:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-25a04dbd-e249-8e05-20fd-6051811bf9cc]"

msg="loaded runners" count=1

msg="waiting for llama runner to start responding"

msg="waiting for server to become available" status="llm server error"

msg="starting go runner"

msg="ggml backend load all from path" path=/usr/local/lib/ollama/cuda_v11

msg="ggml backend load all from path" path=/usr/local/lib/ollama

msg=system info="CPU : LLAMAFILE = 1 | cgo(gcc)" threads=64

...
load_tensors: layer 64 assigned to device CPU
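
For failures like this, where the scheduler picks `cuda_v11` and the runner then silently falls back to CPU, the environment knobs visible in the server config dump earlier in this thread (`OLLAMA_DEBUG`, `OLLAMA_LLM_LIBRARY`) are the quickest way to narrow things down. A hedged sketch; whether pinning `cuda_v12` resolves this particular fallback is an assumption:

```shell
# Relaunch with verbose logging to see why the runner errored out
OLLAMA_DEBUG=1 ollama serve 2> ollama-debug.log

# Bypass library autodetection and pin a CUDA runtime variant
# (assumes the cuda_v12 libraries exist under /usr/local/lib/ollama)
OLLAMA_LLM_LIBRARY=cuda_v12 ollama serve
```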

Author
Owner

@dhiltgen commented on GitHub (Jul 5, 2025):

If you're still having trouble, please upgrade to the latest version of Ollama. If that doesn't resolve the failure to load on your GPU, please share a more complete server log covering startup and model loading so we can see why it's falling back to CPU only.
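
For a systemd-managed Linux install (the default when using the install script), a complete server log can be captured as follows; a sketch, assuming the service is named `ollama`:

```shell
# Full server log for a systemd-managed install
journalctl -u ollama --no-pager > server.log

# Or, when running the server by hand, capture stderr directly
ollama serve 2> server.log
```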

Reference: github-starred/ollama#53009