[GH-ISSUE #11220] Models no longer load to GPU in 0.9.3 #53905

Closed
opened 2026-04-29 04:56:26 -05:00 by GiteaMirror · 13 comments
Owner

Originally created by @jakehlee on GitHub (Jun 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11220

What is the issue?

The 0.9.3 container is unable to load models to GPU despite the 0.9.2 container being able to in the exact same deployment and configuration. This is the case for both gemma3:27b-it-qat and phi4:14b.

0_9_2.txt: https://github.com/user-attachments/files/20941078/0_9_2.txt

0_9_3.txt: https://github.com/user-attachments/files/20941098/0_9_3.txt

Relevant log output

Relevant log output from each version attached above.

OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.9.3

GiteaMirror added the bug label 2026-04-29 04:56:26 -05:00
Author
Owner

@LFd3v commented on GitHub (Jun 27, 2025):

It looks like this is related to #11211. If you are on Windows, you may want to check one of the comments there for a workaround that makes some of the models work until an update is released (or downgrade).

Author
Owner

@rick-github commented on GitHub (Jun 27, 2025):

time=2025-06-26T23:22:24.048-07:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-06-26T23:22:24.065-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)

No GPU-enabled backends found. How did you install ollama?
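(For context: assuming the host has the NVIDIA Container Toolkit set up, the documented way to give the stock image GPU access looks roughly like this; the volume and container name are placeholders.)

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:0.9.3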

Author
Owner

@jakehlee commented on GitHub (Jun 27, 2025):

@LFd3v thanks for the reference, I'm on Linux. Staying on 0.9.2.

@rick-github I am using the ollama/ollama:0.9.2 and ollama/ollama:0.9.3 docker images as-is (with identical configurations)

Author
Owner

@rick-github commented on GitHub (Jun 27, 2025):

Can you share your configuration?

What's the output if you run the following:

docker run --rm --gpus all --entrypoint bash ollama/ollama:0.9.3 -c 'ls -l /usr/lib/ollama'
Author
Owner

@jakehlee commented on GitHub (Jun 28, 2025):

Apologies, I should've mentioned that I'm running docker images via apptainer on an HPC system. I ran the following script completely from scratch; note that the --nv argument does the following (per https://apptainer.org/docs/user/1.0/gpu.html):

  • Ensure that the /dev/nvidiaX device entries are available inside the container, so that the GPU cards in the host are accessible.
  • Locate and bind the basic CUDA libraries from the host into the container, so that they are available to the container, and match the kernel GPU driver on the host.
  • Set the LD_LIBRARY_PATH inside the container so that the bound-in version of the CUDA libraries are used by applications run inside the container.

Just to emphasize, I am interested in what changed between 0.9.2 and 0.9.3 that might've caused this to break. Thank you!

apptainer pull --disable-cache docker://ollama/ollama:0.9.2
apptainer pull --disable-cache docker://ollama/ollama:0.9.3

echo "ollama 0.9.2"
apptainer instance start --nv "$SCRATCH_DIR/ollama_0.9.2.sif" ollama_instance_0_9_2
apptainer exec instance://ollama_instance_0_9_2 ls -l /usr/lib/ollama

echo "ollama 0.9.3"
apptainer instance start --nv "$SCRATCH_DIR/ollama_0.9.3.sif" ollama_instance_0_9_3
apptainer exec instance://ollama_instance_0_9_3 ls -l /usr/lib/ollama

Here's the output:

Pulling Ollama container...
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
2025/06/27 17:14:58  info unpack layer: sha256:13b7e930469f6d3575a320709035c6acf6f5485a76abcf03d1b92a64c09c2476
2025/06/27 17:14:58  info unpack layer: sha256:97ca0261c3138237b4262306382193974505ab6967eec51bbfeb7908fb12b034
2025/06/27 17:14:58  info unpack layer: sha256:e0fa0ad9f5bdc7d30b05be00c3663e4076d288995657ebe622a4c721031715b6
2025/06/27 17:14:58  info unpack layer: sha256:6574d84719207f59862dad06a34eec2b332afeccf4d51f5aae16de99fd72b8a7
INFO:    Creating SIF file...
Pulling Ollama container...
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
2025/06/27 17:15:46  info unpack layer: sha256:13b7e930469f6d3575a320709035c6acf6f5485a76abcf03d1b92a64c09c2476
2025/06/27 17:15:46  info unpack layer: sha256:97ca0261c3138237b4262306382193974505ab6967eec51bbfeb7908fb12b034
2025/06/27 17:15:46  info unpack layer: sha256:9131c6c1e7a830f8966b223a45754a46747427f7acc0fe62b1e7aefb7167fb90
2025/06/27 17:15:46  info unpack layer: sha256:7a18c6e5fb4829330f6fe29a29fac35a231ee89d75bb45c0534fc67e915dddee
INFO:    Creating SIF file...

ollama 0.9.2
INFO:    underlay of /etc/localtime required more than 50 (70) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (278) bind mounts

INFO:    instance started successfully
total 4737
drwxr-xr-x. 2 nobody nogroup      0 Jun 18 06:01 cuda_v11
drwxr-xr-x. 2 nobody nogroup      0 Jun 18 06:03 cuda_v12
-rwxr-xr-x. 1 nobody nogroup 595648 Jun 18 05:50 libggml-base.so
-rwxr-xr-x. 1 nobody nogroup 619280 Jun 18 05:50 libggml-cpu-alderlake.so
-rwxr-xr-x. 1 nobody nogroup 619280 Jun 18 05:50 libggml-cpu-haswell.so
-rwxr-xr-x. 1 nobody nogroup 725776 Jun 18 05:50 libggml-cpu-icelake.so
-rwxr-xr-x. 1 nobody nogroup 606992 Jun 18 05:50 libggml-cpu-sandybridge.so
-rwxr-xr-x. 1 nobody nogroup 729872 Jun 18 05:50 libggml-cpu-skylakex.so
-rwxr-xr-x. 1 nobody nogroup 480048 Jun 18 05:50 libggml-cpu-sse42.so
-rwxr-xr-x. 1 nobody nogroup 475952 Jun 18 05:50 libggml-cpu-x64.so

ollama 0.9.3
INFO:    underlay of /etc/localtime required more than 50 (70) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (278) bind mounts

INFO:    instance started successfully
total 2109647
lrwxrwxrwx. 1 nobody nogroup         21 Jun 26 00:38 libcublas.so.12 -> libcublas.so.12.8.4.1
-rwxr-xr-x. 1 nobody nogroup  116388640 Jul  7  2015 libcublas.so.12.8.4.1
lrwxrwxrwx. 1 nobody nogroup         23 Jun 26 00:38 libcublasLt.so.12 -> libcublasLt.so.12.8.4.1
-rwxr-xr-x. 1 nobody nogroup  751771728 Jul  7  2015 libcublasLt.so.12.8.4.1
lrwxrwxrwx. 1 nobody nogroup         20 Jun 26 00:38 libcudart.so.12 -> libcudart.so.12.8.90
-rwxr-xr-x. 1 nobody nogroup     728800 Jul  7  2015 libcudart.so.12.8.90
-rwxr-xr-x. 1 nobody nogroup     595648 Jun 26 00:28 libggml-base.so
-rwxr-xr-x. 1 nobody nogroup     619280 Jun 26 00:28 libggml-cpu-alderlake.so
-rwxr-xr-x. 1 nobody nogroup     619280 Jun 26 00:28 libggml-cpu-haswell.so
-rwxr-xr-x. 1 nobody nogroup     725776 Jun 26 00:28 libggml-cpu-icelake.so
-rwxr-xr-x. 1 nobody nogroup     606992 Jun 26 00:28 libggml-cpu-sandybridge.so
-rwxr-xr-x. 1 nobody nogroup     729872 Jun 26 00:28 libggml-cpu-skylakex.so
-rwxr-xr-x. 1 nobody nogroup     480048 Jun 26 00:28 libggml-cpu-sse42.so
-rwxr-xr-x. 1 nobody nogroup     475952 Jun 26 00:28 libggml-cpu-x64.so
-rwxr-xr-x. 1 nobody nogroup 1286539248 Jun 26 00:38 libggml-cuda.so
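A minimal follow-up check, assuming the instance names from the script above, is to confirm that libggml-cuda.so's own dependencies resolve inside the 0.9.3 instance (the LD_LIBRARY_PATH override is needed because the CUDA 12 runtime libraries now sit next to it in /usr/lib/ollama):

apptainer exec instance://ollama_instance_0_9_3 bash -c 'LD_LIBRARY_PATH=/usr/lib/ollama ldd /usr/lib/ollama/libggml-cuda.so | grep "not found"'

Any library reported as "not found" here could explain why ggml falls back to the CPU backend.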
Author
Owner

@filips123 commented on GitHub (Jun 28, 2025):

I have the same problem, also on HPC with Slurm, but running natively, without containers.

These are the outputs of the Slurm job that was running Ollama and the processing script:

stderr-58671049.log: https://github.com/user-attachments/files/20960158/stderr-58671049.log
stdout-58671049.log: https://github.com/user-attachments/files/20960157/stdout-58671049.log

I cancelled the job after some time, so this is also visible in logs at the end.

These are the outputs of nvidia-smi and nvidia-smi -q from the worker node while Ollama was running:

nvidia-smi.log: https://github.com/user-attachments/files/20960175/nvidia-smi.log
nvidia-smi-q.log: https://github.com/user-attachments/files/20960174/nvidia-smi-q.log

It seems Ollama 0.9.3 is using the CPU backend for some reason and not utilizing the GPU at all.

On Ollama 0.9.2, the same setup is working fine:

stderr-58671197.log: https://github.com/user-attachments/files/20960195/stderr-58671197.log
stdout-58671197.log: https://github.com/user-attachments/files/20960194/stdout-58671197.log
nvidia-smi.log: https://github.com/user-attachments/files/20960192/nvidia-smi.log
nvidia-smi-q.log: https://github.com/user-attachments/files/20960193/nvidia-smi-q.log

Author
Owner

@rick-github commented on GitHub (Jun 28, 2025):

> I am interested in what changed between 0.9.2 and 0.9.3

CUDA v11 is no longer supported and the CUDA libraries for v12 have been moved up one level to /usr/lib/ollama.
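This matches the listings above: in 0.9.2 the GPU backends sit in per-version subdirectories, while in 0.9.3 libggml-cuda.so and the CUDA 12 runtime libraries live directly in /usr/lib/ollama. The layout change can be seen with the stock images using the same pattern as the earlier docker command:

docker run --rm --entrypoint bash ollama/ollama:0.9.2 -c 'ls /usr/lib/ollama/cuda_v12'
docker run --rm --entrypoint bash ollama/ollama:0.9.3 -c 'ls /usr/lib/ollama'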

Author
Owner

@filips123 commented on GitHub (Jun 28, 2025):

Well, at least in my case, CUDA 12.9 is available and Ollama 0.9.3 detected the GPU, but still decided to use the CPU backend:

time=2025-06-28T12:25:20.188+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.1 GiB" before.free="246.9 GiB" before.free_swap="62.5 GiB" now.total="251.1 GiB" now.free="246.8 GiB" now.free_swap="62.5 GiB"
initializing /usr/lib64/libcuda.so.575.51.03
dlsym: cuInit - 0x14efae974790
dlsym: cuDriverGetVersion - 0x14efae974850
dlsym: cuDeviceGetCount - 0x14efae9749d0
dlsym: cuDeviceGet - 0x14efae974910
dlsym: cuDeviceGetAttribute - 0x14efae974f10
dlsym: cuDeviceGetUuid - 0x14efae974b50
dlsym: cuDeviceGetName - 0x14efae974a90
dlsym: cuCtxCreate_v3 - 0x14efae975a50
dlsym: cuMemGetInfo_v2 - 0x14efae9786f0
dlsym: cuCtxDestroy - 0x14efae9da710
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 1
time=2025-06-28T12:25:20.691+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6dbb5729-a924-a8d1-13c4-bbb2c9b67d8b name="NVIDIA H100 PCIe" overhead="0 B" before.total="79.2 GiB" before.free="78.7 GiB" now.total="79.2 GiB" now.free="78.7 GiB" now.used="456.8 MiB"
releasing cuda driver library
time=2025-06-28T12:25:20.691+02:00 level=DEBUG source=sched.go:185 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-06-28T12:25:20.714+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-06-28T12:25:20.760+02:00 level=DEBUG source=sched.go:228 msg="loading first model" model=/tmp/58671049/models/blobs/sha256-043a363c6ca35e3b1a29b8a5b0bbd28474820239bbc5ad943c9be18f0dc77b66
time=2025-06-28T12:25:20.760+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[78.7 GiB]"
time=2025-06-28T12:25:20.761+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.image_size default=0
time=2025-06-28T12:25:20.763+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-06-28T12:25:20.764+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.key_length default=128
time=2025-06-28T12:25:20.764+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.value_length default=128
time=2025-06-28T12:25:20.765+02:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/tmp/58671049/models/blobs/sha256-043a363c6ca35e3b1a29b8a5b0bbd28474820239bbc5ad943c9be18f0dc77b66 gpu=GPU-6dbb5729-a924-a8d1-13c4-bbb2c9b67d8b parallel=2 available=84539277312 required="25.3 GiB"
time=2025-06-28T12:25:20.766+02:00 level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.1 GiB" before.free="246.8 GiB" before.free_swap="62.5 GiB" now.total="251.1 GiB" now.free="246.7 GiB" now.free_swap="62.5 GiB"
initializing /usr/lib64/libcuda.so.575.51.03
dlsym: cuInit - 0x14efae974790
dlsym: cuDriverGetVersion - 0x14efae974850
dlsym: cuDeviceGetCount - 0x14efae9749d0
dlsym: cuDeviceGet - 0x14efae974910
dlsym: cuDeviceGetAttribute - 0x14efae974f10
dlsym: cuDeviceGetUuid - 0x14efae974b50
dlsym: cuDeviceGetName - 0x14efae974a90
dlsym: cuCtxCreate_v3 - 0x14efae975a50
dlsym: cuMemGetInfo_v2 - 0x14efae9786f0
dlsym: cuCtxDestroy - 0x14efae9da710
calling cuInit
calling cuDriverGetVersion
raw version 0x2f3a
CUDA driver version: 12.9
calling cuDeviceGetCount
device count 1
time=2025-06-28T12:25:21.000+02:00 level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-6dbb5729-a924-a8d1-13c4-bbb2c9b67d8b name="NVIDIA H100 PCIe" overhead="0 B" before.total="79.2 GiB" before.free="78.7 GiB" now.total="79.2 GiB" now.free="78.7 GiB" now.used="456.8 MiB"
releasing cuda driver library
time=2025-06-28T12:25:21.000+02:00 level=INFO source=server.go:135 msg="system memory" total="251.1 GiB" free="246.7 GiB" free_swap="62.5 GiB"
time=2025-06-28T12:25:21.000+02:00 level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=1 available="[78.7 GiB]"
time=2025-06-28T12:25:21.001+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.image_size default=0
time=2025-06-28T12:25:21.001+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-06-28T12:25:21.002+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.key_length default=128
time=2025-06-28T12:25:21.002+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.attention.value_length default=128
time=2025-06-28T12:25:21.002+02:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[78.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="25.3 GiB" memory.required.partial="25.3 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[25.3 GiB]" memory.weights.total="18.1 GiB" memory.weights.repeating="17.5 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="1.7 GiB" memory.graph.partial="1.7 GiB" projector.weights="1.2 GiB" projector.graph="1.6 GiB"
time=2025-06-28T12:25:21.003+02:00 level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
time=2025-06-28T12:25:21.039+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-06-28T12:25:21.042+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-28T12:25:21.042+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2025-06-28T12:25:21.042+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2025-06-28T12:25:21.043+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-28T12:25:21.043+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128
time=2025-06-28T12:25:21.043+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.freq_scale default=1
time=2025-06-28T12:25:21.043+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}"
time=2025-06-28T12:25:21.043+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-06-28T12:25:21.044+02:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/d/hpc/home/fs90700/medieval/ollama/bin/ollama runner --ollama-engine --model /tmp/58671049/models/blobs/sha256-043a363c6ca35e3b1a29b8a5b0bbd28474820239bbc5ad943c9be18f0dc77b66 --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --threads 32 --parallel 2 --port 42221"
time=2025-06-28T12:25:21.044+02:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_MODELS=/tmp/58671049/models ROCR_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=GPU-6dbb5729-a924-a8d1-13c4-bbb2c9b67d8b GPU_DEVICE_ORDINAL=0 OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/d/hpc/home/fs90700/medieval/ollama/lib/ollama:/cvmfs/sling.si/modules/el7/software/Python/3.12.3-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/OpenSSL/3/lib:/cvmfs/sling.si/modules/el7/software/libffi/3.4.5-GCCcore-13.3.0/lib64:/cvmfs/sling.si/modules/el7/software/XZ/5.4.5-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/SQLite/3.45.3-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/Tcl/8.6.14-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/libreadline/8.2-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/ncurses/6.5-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/bzip2/1.0.8-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/zlib/1.3.1-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/GCCcore/13.3.0/lib64:/d/hpc/home/fs90700/medieval/ollama/lib/ollama PATH=/cvmfs/sling.si/modules/el7/software/Python/3.12.3-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/OpenSSL/3/bin:/cvmfs/sling.si/modules/el7/software/XZ/5.4.5-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/SQLite/3.45.3-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/Tcl/8.6.14-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/ncurses/6.5-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/bzip2/1.0.8-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/binutils/2.42-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/GCCcore/13.3.0/bin:/d/hpc/home/fs90700/.local/bin:/d/hpc/home/fs90700/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin OLLAMA_KEEP_ALIVE=1h OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/d/hpc/home/fs90700/medieval/ollama/lib/ollama
time=2025-06-28T12:25:21.045+02:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-28T12:25:21.045+02:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-06-28T12:25:21.045+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-28T12:25:21.068+02:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-28T12:25:21.069+02:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:42221"
time=2025-06-28T12:25:21.106+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-06-28T12:25:21.109+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.name default=""
time=2025-06-28T12:25:21.109+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.description default=""
time=2025-06-28T12:25:21.109+02:00 level=INFO source=ggml.go:92 msg="" architecture=qwen25vl file_type=Q4_K_M name="" description="" num_tensors=1290 num_key_values=36
time=2025-06-28T12:25:21.109+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/d/hpc/home/fs90700/medieval/ollama/lib/ollama
time=2025-06-28T12:25:21.353+02:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load_backend: loaded CPU backend from /d/hpc/home/fs90700/medieval/ollama/lib/ollama/libggml-cpu-icelake.so
time=2025-06-28T12:25:21.568+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-06-28T12:25:21.577+02:00 level=INFO source=ggml.go:359 msg="model weights" buffer=CPU size="19.7 GiB"
time=2025-06-28T12:25:21.578+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-06-28T12:25:21.578+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=true
time=2025-06-28T12:25:21.578+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2025-06-28T12:25:21.578+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-28T12:25:21.578+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.dimension_count default=128
time=2025-06-28T12:25:21.578+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.rope.freq_scale default=1
time=2025-06-28T12:25:21.579+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.qwen25vl.vision.fullatt_block_indexes default="&{size:0 values:[7 15 23 31]}"
time=2025-06-28T12:25:21.579+02:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen25vl.vision.max_pixels default=1003520
time=2025-06-28T12:25:21.861+02:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-06-28T12:25:21.862+02:00 level=DEBUG source=ggml.go:630 msg="compute graph" nodes=1748 splits=1
time=2025-06-28T12:25:21.862+02:00 level=INFO source=ggml.go:648 msg="compute graph" backend=CPU buffer_type=CPU size="1.6 GiB"
time=2025-06-28T12:25:22.181+02:00 level=DEBUG source=ggml.go:630 msg="compute graph" nodes=2441 splits=1
time=2025-06-28T12:25:22.181+02:00 level=INFO source=ggml.go:648 msg="compute graph" backend=CPU buffer_type=CPU size="1.6 GiB"
time=2025-06-28T12:25:22.182+02:00 level=DEBUG source=runner.go:883 msg=memory allocated.InputWeights=437944320A allocated.CPU.Weights="[312184832A 312184832A 312184832A 312184832A 312184832A 312184832A 312184832A 312184832A 275689472A 275689472A 312184832A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 275689472A 274337792A 310833152A 312184832A 310833152A 310833152A 312184832A 310833152A 310833152A 312184832A 310833152A 1950071296A]" allocated.CPU.Cache="[33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 33554432A 0U]" allocated.CPU.Graph=1682268160A
time=2025-06-28T12:25:22.281+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.03"
time=2025-06-28T12:25:23.256+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.29"
time=2025-06-28T12:25:23.740+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.42"
time=2025-06-28T12:25:24.181+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.54"
time=2025-06-28T12:25:24.568+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.64"
time=2025-06-28T12:25:25.024+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.77"
time=2025-06-28T12:25:25.496+02:00 level=DEBUG source=server.go:643 msg="model load progress 0.90"
time=2025-06-28T12:25:25.914+02:00 level=DEBUG source=server.go:643 msg="model load progress 1.00"
time=2025-06-28T12:25:26.290+02:00 level=INFO source=server.go:637 msg="llama runner started in 5.25 seconds"
time=2025-06-28T12:25:26.496+02:00 level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen2.5vl:32b runner.inference=cuda runner.devices=1 runner.size="25.3 GiB" runner.vram="25.3 GiB" runner.parallel=2 runner.pid=1685622 runner.model=/tmp/58671049/models/blobs/sha256-043a363c6ca35e3b1a29b8a5b0bbd28474820239bbc5ad943c9be18f0dc77b66 runner.num_ctx=8192
time=2025-06-28T12:25:26.665+02:00 level=DEBUG source=server.go:736 msg="completion request" images=1 prompt=1043 format=""
time=2025-06-28T12:25:27.119+02:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-06-28T12:25:27.323+02:00 level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=[0]
time=2025-06-28T12:25:27.323+02:00 level=DEBUG source=cache.go:136 msg="loading cache slot" id=0 cache=0 prompt=1467 used=0 remaining=1467
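The lines that matter in a log like this are the backend-discovery messages; one way to pull them out of a saved server log (the filename is a placeholder) is:

grep -E 'compatible gpu libraries|ggml backend load|load_backend' server.log

Here that shows compatible=[] and only a CPU load_backend line: the scheduler planned a full CUDA offload, but the runner never loaded the CUDA backend.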
Author
Owner

@rick-github commented on GitHub (Jun 28, 2025):

time=2025-06-28T12:25:21.109+02:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/d/hpc/home/fs90700/medieval/ollama/lib/ollama
load_backend: loaded CPU backend from /d/hpc/home/fs90700/medieval/ollama/lib/ollama/libggml-cpu-icelake.so
time=2025-06-28T12:25:21.568+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)

It's using the CPU because the CUDA backend isn't being found. What's the contents of /d/hpc/home/fs90700/medieval/ollama/lib/ollama?

In both cases it looks like the HPC installation is interfering with ollama's ability to detect/load the CUDA backend. As I don't have access to such an environment, I can't duplicate the issue. Running ollama/ollama:0.9.3 in docker works fine in my server environments.
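The same kind of dependency check sketched earlier for the container also works for the tarball install; if this prints nothing, the library itself is loadable and the problem is more likely the job's environment:

LD_LIBRARY_PATH=/d/hpc/home/fs90700/medieval/ollama/lib/ollama ldd /d/hpc/home/fs90700/medieval/ollama/lib/ollama/libggml-cuda.so | grep 'not found'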

Author
Owner

@filips123 commented on GitHub (Jun 28, 2025):

> What's the contents of /d/hpc/home/fs90700/medieval/ollama/lib/ollama?

$ ls -al /d/hpc/home/fs90700/medieval/ollama/lib/ollama
total 2700426
drwxr-xr-x 2 fs90700 fs90700         16 Jun 26 10:10 .
drwxr-xr-x 3 fs90700 fs90700          1 Jun 26 10:10 ..
lrwxrwxrwx 1 fs90700 fs90700         23 Jun 26 09:38 libcublasLt.so.12 -> libcublasLt.so.12.8.4.1
-rwxr-xr-x 1 fs90700 fs90700  751771728 Jul  8  2015 libcublasLt.so.12.8.4.1
lrwxrwxrwx 1 fs90700 fs90700         21 Jun 26 09:38 libcublas.so.12 -> libcublas.so.12.8.4.1
-rwxr-xr-x 1 fs90700 fs90700  116388640 Jul  8  2015 libcublas.so.12.8.4.1
lrwxrwxrwx 1 fs90700 fs90700         20 Jun 26 09:38 libcudart.so.12 -> libcudart.so.12.8.90
-rwxr-xr-x 1 fs90700 fs90700     728800 Jul  8  2015 libcudart.so.12.8.90
-rwxr-xr-x 1 fs90700 fs90700     595648 Jun 26 09:27 libggml-base.so
-rwxr-xr-x 1 fs90700 fs90700     619280 Jun 26 09:27 libggml-cpu-alderlake.so
-rwxr-xr-x 1 fs90700 fs90700     619280 Jun 26 09:27 libggml-cpu-haswell.so
-rwxr-xr-x 1 fs90700 fs90700     725776 Jun 26 09:27 libggml-cpu-icelake.so
-rwxr-xr-x 1 fs90700 fs90700     606992 Jun 26 09:27 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 fs90700 fs90700     729872 Jun 26 09:27 libggml-cpu-skylakex.so
-rwxr-xr-x 1 fs90700 fs90700     480048 Jun 26 09:27 libggml-cpu-sse42.so
-rwxr-xr-x 1 fs90700 fs90700     475952 Jun 26 09:27 libggml-cpu-x64.so
-rwxr-xr-x 1 fs90700 fs90700 1286539248 Jun 26 09:38 libggml-cuda.so
-rwxr-xr-x 1 fs90700 fs90700  604949568 Jun 26 09:41 libggml-hip.so

It's just the extracted archive from https://github.com/ollama/ollama/releases/download/v0.9.3/ollama-linux-amd64.tgz.

And yeah, it's likely something to do with Slurm/HPC. If I run Ollama from a script started with sbatch, it doesn't detect the CUDA backend, but if I manually SSH into the worker node and run it there, it does detect it. However, running scripts with sbatch is the recommended way of starting Slurm jobs, so I would like to fix this somehow.

Here is the log of Ollama server started from a script after I pulled and ran the model (CUDA was not loaded):
ollama-server-1.log: https://github.com/user-attachments/files/20961858/ollama-server-1.log

And here is the log of Ollama server started manually from terminal of the same worker node (CUDA was loaded):
ollama-server-2.log: https://github.com/user-attachments/files/20961870/ollama-server-2.log

It seems that in the first case, Ollama searched for GPU libraries in more places, but /usr/lib64/libcuda.so.575.51.03 was loaded in both cases, so I don't know if this is relevant:

time=2025-06-28T18:45:41.707+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/d/hpc/home/fs90700/medieval/ollama/lib/ollama/libcuda.so* /cvmfs/sling.si/modules/el7/software/Python/3.12.3-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/OpenSSL/3/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/libffi/3.4.5-GCCcore-13.3.0/lib64/libcuda.so* /cvmfs/sling.si/modules/el7/software/XZ/5.4.5-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/SQLite/3.45.3-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/Tcl/8.6.14-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/libreadline/8.2-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/ncurses/6.5-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/bzip2/1.0.8-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/binutils/2.42-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/zlib/1.3.1-GCCcore-13.3.0/lib/libcuda.so* /cvmfs/sling.si/modules/el7/software/GCCcore/13.3.0/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-06-28T18:45:41.710+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[/usr/lib64/libcuda.so.575.51.03]
initializing /usr/lib64/libcuda.so.575.51.03
time=2025-06-28T18:54:17.216+02:00 level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/d/hpc/home/fs90700/medieval/ollama/lib/ollama/libcuda.so* /d/hpc/home/fs90700/medieval/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-06-28T18:54:17.218+02:00 level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[/usr/lib64/libcuda.so.575.51.03]
initializing /usr/lib64/libcuda.so.575.51.03

Another difference is that LD_LIBRARY_PATH and some other environment variables were different:

time=2025-06-28T18:49:19.056+02:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_MODELS=/tmp/58675550/models ROCR_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=GPU-95ef9cc9-a32d-0e44-1184-65b4a5a472b1 GPU_DEVICE_ORDINAL=0 OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/d/hpc/home/fs90700/medieval/ollama/lib/ollama:/cvmfs/sling.si/modules/el7/software/Python/3.12.3-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/OpenSSL/3/lib:/cvmfs/sling.si/modules/el7/software/libffi/3.4.5-GCCcore-13.3.0/lib64:/cvmfs/sling.si/modules/el7/software/XZ/5.4.5-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/SQLite/3.45.3-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/Tcl/8.6.14-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/libreadline/8.2-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/ncurses/6.5-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/bzip2/1.0.8-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/binutils/2.42-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/zlib/1.3.1-GCCcore-13.3.0/lib:/cvmfs/sling.si/modules/el7/software/GCCcore/13.3.0/lib64:/d/hpc/home/fs90700/medieval/ollama/lib/ollama PATH=/cvmfs/sling.si/modules/el7/software/Python/3.12.3-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/OpenSSL/3/bin:/cvmfs/sling.si/modules/el7/software/XZ/5.4.5-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/SQLite/3.45.3-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/Tcl/8.6.14-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/ncurses/6.5-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/bzip2/1.0.8-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/binutils/2.42-GCCcore-13.3.0/bin:/cvmfs/sling.si/modules/el7/software/GCCcore/13.3.0/bin:/d/hpc/home/fs90700/.local/bin:/d/hpc/home/fs90700/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin OLLAMA_KEEP_ALIVE=1h OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/d/hpc/home/fs90700/medieval/ollama/lib/ollama
time=2025-06-28T18:57:15.996+02:00 level=DEBUG source=server.go:439 msg=subprocess OLLAMA_MODELS=/tmp/58675550/models OLLAMA_DEBUG=1 PATH=/d/hpc/home/fs90700/.local/bin:/d/hpc/home/fs90700/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin OLLAMA_KEEP_ALIVE=1h OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/d/hpc/home/fs90700/medieval/ollama/lib/ollama LD_LIBRARY_PATH=/d/hpc/home/fs90700/medieval/ollama/lib/ollama:/d/hpc/home/fs90700/medieval/ollama/lib/ollama CUDA_VISIBLE_DEVICES=GPU-95ef9cc9-a32d-0e44-1184-65b4a5a472b1

Otherwise, the logs look pretty similar, apart from the fact that in the second case the CUDA backend was loaded:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA H100 PCIe, compute capability 9.0, VMM: yes
load_backend: loaded CUDA backend from /d/hpc/home/fs90700/medieval/ollama/lib/ollama/libggml-cuda.so
Author
Owner

@jakehlee commented on GitHub (Jun 29, 2025):

Tip

tl;dr SLURM incorrectly sets ROCR_VISIBLE_DEVICES=0; work around it with unset ROCR_VISIBLE_DEVICES in your job script or by adding Flags=nvidia_gpu_env to your gres.conf. Note that ROCR_VISIBLE_DEVICES="" does not work; the variable must be unset.

Taking a look at this (noted as temporary) block that was added in 1c6669e64c

https://github.com/ollama/ollama/blob/4129af9205763a113719c7ef102d5c6ff0f1e2e8/ml/backend/ggml/ggml/src/ggml-backend-reg.cpp#L577-L584
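
For context, my reading of that block from the behavior in this thread (not the actual C++): if an AMD device-visibility variable is present in the environment at all, even as an empty string, the CUDA backend is skipped. A rough shell paraphrase of the condition:

```sh
# Hypothetical shell paraphrase of the guard's behavior as observed in this
# thread, not the real ggml code: presence of an AMD visibility variable
# (even an empty one) is enough to skip the CUDA backend.
if [ -n "${HIP_VISIBLE_DEVICES+x}" ] || [ -n "${ROCR_VISIBLE_DEVICES+x}" ]; then
    echo "AMD visibility variable present -> CUDA backend would be skipped"
else
    echo "no AMD visibility variables -> CUDA backend eligible to load"
fi
```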

I just tried the following from my sbatch script:

echo "HIP $HIP_VISIBLE_DEVICES"
echo "ROCR $ROCR_VISIBLE_DEVICES"
# output
HIP 
ROCR 0

This is strange, considering our system only has Intel CPUs and Nvidia GPUs. @filips123 I also see these lines in your logs:

time=2025-06-28T18:49:19.056+02:00 [...] **ROCR_VISIBLE_DEVICES=0** CUDA_VISIBLE_DEVICES=GPU-95ef9cc9-a32d-0e44-1184-65b4a5a472b1 GPU_DEVICE_ORDINAL=0 OLLAMA_DEBUG=1 ...

It appears that an incorrect Slurm configuration or an older Slurm version can cause these GPU-related variables to be set regardless of the actual GPU vendor: https://support.schedmd.com/show_bug.cgi?id=11097

Slurm currently sets CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, and GPU_DEVICE_ORDINAL for tasks that request GPUs. We've run into an issue where this causes ROCm-based applications to fail.

Hey Matt, this is now in 21.08 rc1 with commits https://github.com/SchedMD/slurm/compare/0705ded00d4e...0c7fb08c4e7d.
We made some changes compared to v2: instead of a dedicated EnvVars field, we simply added on to the preexisting Flags field in gres.conf
Flags
Optional flags that can be specified to change configured behavior of the GRES.
Allowed values at present are:
...
nvidia_gpu_env
Set environment variable CUDA_VISIBLE_DEVICES for all GPUs on the specified node(s).
amd_gpu_env
Set environment variable ROCR_VISIBLE_DEVICES for all GPUs on the specified node(s).
opencl_env
Set environment variable GPU_DEVICE_ORDINAL for all GPUs on the specified node(s).
no_gpu_env
Set no GPU-specific environment variables.
We also added ROCR_VISIBLE_DEVICES to prolog/epilog, like CUDA_VISIBLE_DEVICES.
We are planning on tweaking things before 21.08 is released, but you can go ahead and play around with this.

I've confirmed that our own gres.conf (located in the same directory as slurm.conf, at the path given by SLURM_CONF) is missing Flags=nvidia_gpu_env, so Slurm is incorrectly setting ROCR_VISIBLE_DEVICES=[0,1], and that code block then prevents the CUDA backend from loading.
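
For anyone else hitting this on their cluster, a minimal gres.conf sketch with the flag added; the node name, GPU type, and device paths below are placeholders, not our actual config:

```
# gres.conf (lives next to slurm.conf, see SLURM_CONF) - hypothetical values.
# Flags=nvidia_gpu_env tells Slurm (21.08+, per the SchedMD text quoted above)
# to export only CUDA_VISIBLE_DEVICES for GPU jobs, instead of also setting
# ROCR_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL.
NodeName=gpu-node[01-04] Name=gpu Type=a100 File=/dev/nvidia[0-1] Flags=nvidia_gpu_env
```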

After adding unset ROCR_VISIBLE_DEVICES to the job script, it finally loads models to the GPU:

time=2025-06-28T19:30:06.013-07:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /home/jakelee/ollama_models/blobs/sha256-ccc0cddac56136ef0969cf2e3e9ac051124c937be42503b47ec570dead85ff87 --ctx-size 131072 --batch-size 512 --n-gpu-layers 63 --threads 64 --flash-attn --kv-cache-type q4_0 --parallel 4 --port 33027"
time=2025-06-28T19:30:06.013-07:00 level=DEBUG source=server.go:439 msg=subprocess CUDA_VISIBLE_DEVICES=GPU-d854155b-7f48-5e97-509d-ad2fb35f454b GPU_DEVICE_ORDINAL=0,1 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs:/usr/lib/ollama OLLAMA_DEBUG=1 OLLAMA_DIR=/home/jakelee/ollama_models OLLAMA_FLASH_ATTENTION=1 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_KEEP_ALIVE=24h OLLAMA_KV_CACHE_TYPE=q4_0 OLLAMA_LOG_LEVEL=DEBUG OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_MODELS=/home/jakelee/ollama_models OLLAMA_NUM_PARALLEL=4 OLLAMA_PORT=11434 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_LIBRARY_PATH=/usr/lib/ollama
time=2025-06-28T19:30:06.017-07:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-06-28T19:30:06.017-07:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-06-28T19:30:06.018-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-06-28T19:30:06.029-07:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-06-28T19:30:06.030-07:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:33027"
time=2025-06-28T19:30:06.060-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-06-28T19:30:06.061-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.name default=""
time=2025-06-28T19:30:06.061-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.description default=""
time=2025-06-28T19:30:06.061-07:00 level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_0 name="" description="" num_tensors=1247 num_key_values=40
time=2025-06-28T19:30:06.061-07:00 level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
time=2025-06-28T19:30:06.302-07:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-06-28T19:30:07.728-07:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-06-28T19:30:07.856-07:00 level=INFO source=ggml.go:359 msg="model weights" buffer=CUDA0 size="16.8 GiB"
time=2025-06-28T19:30:07.856-07:00 level=INFO source=ggml.go:359 msg="model weights" buffer=CPU size="2.6 GiB"
time=2025-06-28T19:30:07.856-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
time=2025-06-28T19:30:07.856-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
time=2025-06-28T19:30:07.858-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000
time=2025-06-28T19:30:07.858-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-06-28T19:30:07.858-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.rope.freq_scale default=1
time=2025-06-28T19:30:07.858-07:00 level=DEBUG source=ggml.go:206 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256
time=2025-06-28T19:30:08.026-07:00 level=DEBUG source=ggml.go:630 msg="compute graph" nodes=972 splits=1
time=2025-06-28T19:30:08.026-07:00 level=INFO source=ggml.go:648 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-06-28T19:30:08.026-07:00 level=INFO source=ggml.go:648 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-06-28T19:30:08.039-07:00 level=DEBUG source=ggml.go:630 msg="compute graph" nodes=2489 splits=2
time=2025-06-28T19:30:08.039-07:00 level=INFO source=ggml.go:648 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-06-28T19:30:08.039-07:00 level=INFO source=ggml.go:648 msg="compute graph" backend=CPU buffer_type=CPU size="10.5 MiB"

In closing, I don't think there's anything to be fixed here: ROCR_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES really shouldn't be set together. However, it would be nice if this case logged a debug message instead of silently falling back to the CPU backend.

Author
Owner

@jakehlee commented on GitHub (Jun 29, 2025):

@hxse I also see in your logs that you have ROCR_VISIBLE_DEVICES set to an empty value; I will update my PR so that it also handles empty strings instead of only checking whether the variable is completely unset.
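
To illustrate why an empty value is not enough: the variable still shows up in the environment that the runner subprocess inherits, so a presence check still trips. A quick generic shell demonstration (nothing ollama-specific):

```sh
export ROCR_VISIBLE_DEVICES=""
env | grep '^ROCR_VISIBLE_DEVICES'   # still printed: ROCR_VISIBLE_DEVICES=
unset ROCR_VISIBLE_DEVICES
env | grep '^ROCR_VISIBLE_DEVICES'   # no output: child processes no longer see it
```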

Author
Owner

@hideaki-t commented on GitHub (Jun 30, 2025):

I also confirmed that starting ollama after unsetting ROCR_VISIBLE_DEVICES in an interactive session (srun --pty) works as expected, so I will temporarily put unset ROCR_VISIBLE_DEVICES into my sbatch script.
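
In case it helps anyone else, this is roughly what that looks like in a job script; the SBATCH directives, model name, and sleep are placeholders for my setup, adjust to yours:

```sh
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Slurm's prolog may export ROCR_VISIBLE_DEVICES even on NVIDIA-only nodes;
# clear it so ollama keeps the CUDA backend (see the discussion above).
unset ROCR_VISIBLE_DEVICES

OLLAMA_DEBUG=1 ollama serve &
sleep 10                      # give the server a moment to come up
ollama run phi4:14b "hello"
```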

Reference: github-starred/ollama#53905