AMD RX 6900 XT - ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected #6855

Open
opened 2025-11-12 13:47:09 -06:00 by GiteaMirror · 4 comments

Originally created by @moophlo on GitHub (Apr 26, 2025).

What is the issue?

I'm running the ollama:rocm container in a Kubernetes cluster.
The worker node of the cluster has 2 GPUs:

GPU0: RX7900XT
GPU1: RX6900XT

In the deployment I'm setting all the relevant variables as per the documentation (see the sketch after the list):

HIP_VISIBLE_DEVICES:1
HSA_OVERRIDE_GFX_VERSION:10.3.0
HCC_AMDGPU_TARGET:gfx1030
ROCR_VISIBLE_DEVICES:1
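
For clarity, this is roughly how those variables are wired into the pod template (a minimal sketch; the container name and surrounding structure are illustrative, not my exact manifest):

```yaml
# Illustrative fragment of the Deployment's pod template (not the exact manifest)
containers:
  - name: ollama
    image: ollama/ollama:0.6.6-rocm
    env:
      - name: HIP_VISIBLE_DEVICES
        value: "1"
      - name: HSA_OVERRIDE_GFX_VERSION
        value: "10.3.0"
      - name: HCC_AMDGPU_TARGET
        value: "gfx1030"
      - name: ROCR_VISIBLE_DEVICES
        value: "1"
```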

During startup I can see the GPU is properly detected and the other one is filtered out as expected, but then at a certain point:

time=2025-04-26T15:49:31.376Z level=DEBUG source=memory.go:108 msg=evaluating library=rocm gpu_count=1 available="[16.0 GiB]"
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.vision.block_count default=0
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.key_length default=64
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.value_length default=64
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-04-26T15:49:31.376Z level=INFO source=sched.go:722 msg="new model will fit in available VRAM in single GPU, loading" model=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d gpu=GPU-7685bd29c5086f3d parallel=1 available=17145991168 required="1.1 GiB"
time=2025-04-26T15:49:31.376Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.5 GiB" before.free="31.5 GiB" before.free_swap="0 B" now.total="62.5 GiB" now.free="31.5 GiB" now.free_swap="0 B"
time=2025-04-26T15:49:31.376Z level=DEBUG source=amd_linux.go:488 msg="updating rocm free memory" gpu=GPU-7685bd29c5086f3d name=1002:73bf before="16.0 GiB" now="16.0 GiB"
time=2025-04-26T15:49:31.376Z level=INFO source=server.go:105 msg="system memory" total="62.5 GiB" free="31.5 GiB" free_swap="0 B"
time=2025-04-26T15:49:31.376Z level=DEBUG source=memory.go:108 msg=evaluating library=rocm gpu_count=1 available="[16.0 GiB]"
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.vision.block_count default=0
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.key_length default=64
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.value_length default=64
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-04-26T15:49:31.376Z level=INFO source=server.go:138 msg=offload library=rocm layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[16.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.1 GiB" memory.required.partial="1.1 GiB" memory.required.kv="3.0 MiB" memory.required.allocations="[1.1 GiB]" memory.weights.total="636.8 MiB" memory.weights.repeating="577.2 MiB" memory.weights.nonrepeating="59.6 MiB" memory.graph.full="8.0 MiB" memory.graph.partial="8.0 MiB"
time=2025-04-26T15:49:31.376Z level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[rocm]
llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest))

At the moment I'm running in debug mode to get more verbose output.

One more thing to add: if I instead switch to GPU0 and filter out the RX6900XT, everything works properly; the model loads on the GPU and does not fall back to the CPU backend.
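
For reference, the working GPU0 variant is just the device selection flipped (a hypothetical sketch of what "switch to GPU0" means above; whether the gfx override is kept for the 7900 XT is my assumption, since gfx1100 is supported natively and shouldn't need one):

```yaml
# Hypothetical env for the GPU0 (RX 7900 XT) case - assumed values, not verified
env:
  - name: HIP_VISIBLE_DEVICES
    value: "0"
  - name: ROCR_VISIBLE_DEVICES
    value: "0"
  # no HSA_OVERRIDE_GFX_VERSION here: gfx1100 (RDNA3) is supported out of the box
```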

Any help would be really appreciated.

Relevant log output

2025/04/26 15:48:54 routes.go:1232: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES:1 HSA_OVERRIDE_GFX_VERSION:10.3.0 HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:2048 OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/ollama/models/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:1 http_proxy: https_proxy: no_proxy:]"
time=2025-04-26T15:48:54.977Z level=INFO source=images.go:458 msg="total blobs: 52"
time=2025-04-26T15:48:54.978Z level=INFO source=images.go:465 msg="total unused blobs removed: 0"
time=2025-04-26T15:48:54.978Z level=INFO source=routes.go:1299 msg="Listening on [::]:11434 (version 0.6.6)"
time=2025-04-26T15:48:54.978Z level=DEBUG source=sched.go:107 msg="starting llm scheduler"
time=2025-04-26T15:48:54.978Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-04-26T15:48:54.979Z level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-04-26T15:48:54.979Z level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
time=2025-04-26T15:48:54.979Z level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/usr/lib/ollama/libcuda.so* /usr/local/nvidia/lib/libcuda.so* /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-04-26T15:48:55.048Z level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[]
time=2025-04-26T15:48:55.048Z level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcudart.so*
time=2025-04-26T15:48:55.048Z level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/usr/lib/ollama/libcudart.so* /usr/local/nvidia/lib/libcudart.so* /usr/local/nvidia/lib64/libcudart.so* /usr/lib/ollama/cuda_v*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
time=2025-04-26T15:48:55.049Z level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths=[]
time=2025-04-26T15:48:55.049Z level=DEBUG source=amd_linux.go:101 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/0/properties"
time=2025-04-26T15:48:55.049Z level=DEBUG source=amd_linux.go:121 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/0/properties"
time=2025-04-26T15:48:55.049Z level=DEBUG source=amd_linux.go:101 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/1/properties"
time=2025-04-26T15:48:55.049Z level=DEBUG source=amd_linux.go:206 msg="mapping amdgpu to drm sysfs nodes" amdgpu=/sys/class/kfd/kfd/topology/nodes/1/properties vendor=4098 device=29772 unique_id=7433242190493591676
time=2025-04-26T15:48:55.049Z level=DEBUG source=amd_linux.go:240 msg=matched amdgpu=/sys/class/kfd/kfd/topology/nodes/1/properties drm=/sys/class/drm/card1/device
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:318 msg="amdgpu memory" gpu=0 total="20.0 GiB"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:319 msg="amdgpu memory" gpu=0 available="20.0 GiB"
time=2025-04-26T15:48:55.050Z level=INFO source=amd_linux.go:332 msg="filtering out device per user request" id=GPU-67282e63a5d1287c visible_devices=[1]
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:101 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/2/properties"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:206 msg="mapping amdgpu to drm sysfs nodes" amdgpu=/sys/class/kfd/kfd/topology/nodes/2/properties vendor=4098 device=29631 unique_id=8540440255474986813
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:219 msg="failed to read sysfs node" file=/sys/class/drm/card1-DP-1/device/vendor error="open /sys/class/drm/card1-DP-1/device/vendor: no such file or directory"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:219 msg="failed to read sysfs node" file=/sys/class/drm/card1-DP-2/device/vendor error="open /sys/class/drm/card1-DP-2/device/vendor: no such file or directory"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:219 msg="failed to read sysfs node" file=/sys/class/drm/card1-DP-3/device/vendor error="open /sys/class/drm/card1-DP-3/device/vendor: no such file or directory"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:219 msg="failed to read sysfs node" file=/sys/class/drm/card1-HDMI-A-1/device/vendor error="open /sys/class/drm/card1-HDMI-A-1/device/vendor: no such file or directory"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:219 msg="failed to read sysfs node" file=/sys/class/drm/card1-Writeback-1/device/vendor error="open /sys/class/drm/card1-Writeback-1/device/vendor: no such file or directory"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:240 msg=matched amdgpu=/sys/class/kfd/kfd/topology/nodes/2/properties drm=/sys/class/drm/card2/device
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:318 msg="amdgpu memory" gpu=1 total="16.0 GiB"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_linux.go:319 msg="amdgpu memory" gpu=1 available="16.0 GiB"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_common.go:16 msg="evaluating potential rocm lib dir /usr/lib/ollama/rocm"
time=2025-04-26T15:48:55.050Z level=DEBUG source=amd_common.go:44 msg="detected ROCM next to ollama executable /usr/lib/ollama/rocm"
time=2025-04-26T15:48:55.050Z level=INFO source=amd_linux.go:389 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=10.3.0
time=2025-04-26T15:48:55.054Z level=INFO source=types.go:130 msg="inference compute" id=GPU-7685bd29c5086f3d library=rocm variant="" compute=gfx1030 driver=6.12 name=1002:73bf total="16.0 GiB" available="16.0 GiB"
[GIN] 2025/04/26 - 15:48:55 | 200 |      45.608µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:48:56 | 200 |      27.346µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:48:57 | 200 |      18.863µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:48:58 | 200 |      20.339µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:48:59 | 200 |      27.412µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:00 | 200 |      24.791µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:01 | 200 |      15.701µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:02 | 200 |      14.232µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:03 | 200 |      17.825µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:04 | 200 |      17.923µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:05 | 200 |      24.265µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:06 | 200 |      25.764µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:07 | 200 |      14.055µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:08 | 200 |      12.819µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:09 | 200 |      13.869µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:10 | 200 |      13.224µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:11 | 200 |      13.465µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:12 | 200 |      33.276µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:13 | 200 |      14.354µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:14 | 200 |      14.805µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:15 | 200 |      13.669µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:16 | 200 |      14.005µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:17 | 200 |      17.338µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:18 | 200 |      18.535µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:19 | 200 |      23.488µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:20 | 200 |      17.602µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:21 | 200 |      13.956µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:22 | 200 |      13.766µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:23 | 200 |      21.209µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:24 | 200 |       17.51µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:25 | 200 |      18.154µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:26 | 200 |     179.984µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:27 | 200 |       22.29µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:28 | 200 |      14.871µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:29 | 200 |       22.09µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:31 | 200 |      14.426µs |    192.168.10.1 | GET      "/"
time=2025-04-26T15:49:31.370Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-26T15:49:31.371Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.5 GiB" before.free="31.7 GiB" before.free_swap="0 B" now.total="62.5 GiB" now.free="31.5 GiB" now.free_swap="0 B"
time=2025-04-26T15:49:31.371Z level=DEBUG source=amd_linux.go:488 msg="updating rocm free memory" gpu=GPU-7685bd29c5086f3d name=1002:73bf before="16.0 GiB" now="16.0 GiB"
time=2025-04-26T15:49:31.371Z level=DEBUG source=sched.go:183 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2025-04-26T15:49:31.373Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-26T15:49:31.375Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-26T15:49:31.376Z level=DEBUG source=sched.go:226 msg="loading first model" model=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
time=2025-04-26T15:49:31.376Z level=DEBUG source=memory.go:108 msg=evaluating library=rocm gpu_count=1 available="[16.0 GiB]"
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.vision.block_count default=0
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.key_length default=64
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.value_length default=64
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-04-26T15:49:31.376Z level=INFO source=sched.go:722 msg="new model will fit in available VRAM in single GPU, loading" model=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d gpu=GPU-7685bd29c5086f3d parallel=1 available=17145991168 required="1.1 GiB"
time=2025-04-26T15:49:31.376Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="62.5 GiB" before.free="31.5 GiB" before.free_swap="0 B" now.total="62.5 GiB" now.free="31.5 GiB" now.free_swap="0 B"
time=2025-04-26T15:49:31.376Z level=DEBUG source=amd_linux.go:488 msg="updating rocm free memory" gpu=GPU-7685bd29c5086f3d name=1002:73bf before="16.0 GiB" now="16.0 GiB"
time=2025-04-26T15:49:31.376Z level=INFO source=server.go:105 msg="system memory" total="62.5 GiB" free="31.5 GiB" free_swap="0 B"
time=2025-04-26T15:49:31.376Z level=DEBUG source=memory.go:108 msg=evaluating library=rocm gpu_count=1 available="[16.0 GiB]"
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.vision.block_count default=0
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.key_length default=64
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.value_length default=64
time=2025-04-26T15:49:31.376Z level=WARN source=ggml.go:152 msg="key not found" key=bert.attention.head_count_kv default=1
time=2025-04-26T15:49:31.376Z level=INFO source=server.go:138 msg=offload library=rocm layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[16.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.1 GiB" memory.required.partial="1.1 GiB" memory.required.kv="3.0 MiB" memory.required.allocations="[1.1 GiB]" memory.weights.total="636.8 MiB" memory.weights.repeating="577.2 MiB" memory.weights.nonrepeating="59.6 MiB" memory.graph.full="8.0 MiB" memory.graph.partial="8.0 MiB"
time=2025-04-26T15:49:31.376Z level=DEBUG source=server.go:262 msg="compatible gpu libraries" compatible=[rocm]
llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = mxbai-embed-large-v1
llama_model_loader: - kv   2:                           bert.block_count u32              = 24
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 2
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:  243 tensors
llama_model_loader: - type  f16:  146 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 637.85 MiB (16.02 BPW) 
init_tokenizer: initializing tokenizer for type 3
load: control token:    101 '[CLS]' is not marked as EOG
load: control token:    103 '[MASK]' is not marked as EOG
load: control token:      0 '[PAD]' is not marked as EOG
load: control token:    100 '[UNK]' is not marked as EOG
load: control token:    102 '[SEP]' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 0.2032 MB
print_info: arch             = bert
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 334.09 M
print_info: general.name     = mxbai-embed-large-v1
print_info: vocab type       = WPM
print_info: n_vocab          = 30522
print_info: n_merges         = 0
print_info: BOS token        = 101 '[CLS]'
print_info: EOS token        = 102 '[SEP]'
print_info: UNK token        = 100 '[UNK]'
print_info: SEP token        = 102 '[SEP]'
print_info: PAD token        = 0 '[PAD]'
print_info: MASK token       = 103 '[MASK]'
print_info: LF token         = 0 '[PAD]'
print_info: EOG token        = 102 '[SEP]'
print_info: max token length = 21
llama_model_load: vocab only - skipping tensors
[GIN] 2025/04/26 - 15:49:32 | 200 |      23.128µs |    192.168.10.1 | GET      "/"
time=2025-04-26T15:49:32.367Z level=DEBUG source=server.go:335 msg="adding gpu library" path=/usr/lib/ollama/rocm
time=2025-04-26T15:49:32.367Z level=DEBUG source=server.go:343 msg="adding gpu dependency paths" paths=[/usr/lib/ollama/rocm]
time=2025-04-26T15:49:32.368Z level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/bin/ollama runner --model /home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d --ctx-size 512 --batch-size 512 --n-gpu-layers 25 --verbose --threads 8 --parallel 1 --port 21385"
time=2025-04-26T15:49:32.368Z level=DEBUG source=server.go:423 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/lib/ollama/rocm:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/ollama/rocm:/usr/lib/ollama HIP_VISIBLE_DEVICES=1 ROCR_VISIBLE_DEVICES=GPU-7685bd29c5086f3d ROCM_VISIBLE_DEVICES=1 HSA_OVERRIDE_GFX_VERSION=10.3.0]"
time=2025-04-26T15:49:32.368Z level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-04-26T15:49:32.368Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-04-26T15:49:32.368Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-04-26T15:49:32.377Z level=INFO source=runner.go:853 msg="starting go runner"
time=2025-04-26T15:49:32.377Z level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/lib/ollama/rocm
[GIN] 2025/04/26 - 15:49:33 | 200 |      19.852µs |    192.168.10.1 | GET      "/"
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
time=2025-04-26T15:49:33.155Z level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib
time=2025-04-26T15:49:33.155Z level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib64
time=2025-04-26T15:49:33.155Z level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/lib/ollama
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
time=2025-04-26T15:49:33.156Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-04-26T15:49:33.156Z level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:21385"
llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = mxbai-embed-large-v1
llama_model_loader: - kv   2:                           bert.block_count u32              = 24
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 2
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:  243 tensors
llama_model_loader: - type  f16:  146 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 637.85 MiB (16.02 BPW) 
init_tokenizer: initializing tokenizer for type 3
load: control token:    101 '[CLS]' is not marked as EOG
load: control token:    103 '[MASK]' is not marked as EOG
load: control token:      0 '[PAD]' is not marked as EOG
load: control token:    100 '[UNK]' is not marked as EOG
load: control token:    102 '[SEP]' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 0.2032 MB
print_info: arch             = bert
print_info: vocab_only       = 0
print_info: n_ctx_train      = 512
print_info: n_embd           = 1024
print_info: n_layer          = 24
print_info: n_head           = 16
print_info: n_head_kv        = 16
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 1.0e-12
print_info: f_norm_rms_eps   = 0.0e+00
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 4096
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 0
print_info: pooling type     = 2
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 512
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 335M
print_info: model params     = 334.09 M
print_info: general.name     = mxbai-embed-large-v1
print_info: vocab type       = WPM
print_info: n_vocab          = 30522
print_info: n_merges         = 0
print_info: BOS token        = 101 '[CLS]'
print_info: EOS token        = 102 '[SEP]'
print_info: UNK token        = 100 '[UNK]'
print_info: SEP token        = 102 '[SEP]'
print_info: PAD token        = 0 '[PAD]'
print_info: MASK token       = 103 '[MASK]'
print_info: LF token         = 0 '[PAD]'
print_info: EOG token        = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
time=2025-04-26T15:49:33.372Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model"
[GIN] 2025/04/26 - 15:49:34 | 200 |       16.28µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:35 | 200 |      12.119µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:36 | 200 |      12.501µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:37 | 200 |      16.986µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:38 | 200 |      18.783µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:39 | 200 |      13.436µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:40 | 200 |      19.586µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:41 | 200 |      23.479µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:42 | 200 |      23.868µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:43 | 200 |      13.501µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:44 | 200 |      14.025µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:45 | 200 |       13.98µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:46 | 200 |      22.553µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:47 | 200 |      13.092µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:48 | 200 |      19.752µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:49 | 200 |      12.943µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:50 | 200 |      13.455µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:51 | 200 |      12.638µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:52 | 200 |      21.598µs |    192.168.10.1 | GET      "/"
load_tensors:   CPU_Mapped model buffer size =   637.85 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_per_seq = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 0
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.00 MiB
llama_context: n_ctx = 512
llama_context: n_ctx = 512 (padded)
init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1
init: layer   0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer   9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init: layer  23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU
init:        CPU KV buffer size =    48.00 MiB
llama_context: KV self size  =   48.00 MiB, K (f16):   24.00 MiB, V (f16):   24.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:        CPU compute buffer size =    27.01 MiB
llama_context: graph nodes  = 825
llama_context: graph splits = 1
time=2025-04-26T15:49:53.446Z level=INFO source=server.go:619 msg="llama runner started in 21.08 seconds"
time=2025-04-26T15:49:53.446Z level=DEBUG source=sched.go:464 msg="finished setting up runner" model=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
time=2025-04-26T15:49:53.450Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32
time=2025-04-26T15:49:53.452Z level=DEBUG source=runner.go:686 msg="embedding request" content="tell me more"
time=2025-04-26T15:49:53.454Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5
[GIN] 2025/04/26 - 15:49:53 | 200 | 22.671213222s |  173.249.47.211 | POST     "/api/embed"
time=2025-04-26T15:49:53.477Z level=DEBUG source=sched.go:468 msg="context for request finished"
time=2025-04-26T15:49:53.477Z level=DEBUG source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=5m0s
time=2025-04-26T15:49:53.477Z level=DEBUG source=sched.go:359 msg="after processing request finished event" modelPath=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0
[GIN] 2025/04/26 - 15:49:53 | 200 |      15.649µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:54 | 200 |      19.318µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:55 | 200 |      13.768µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:56 | 200 |      18.993µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:57 | 200 |      15.537µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:58 | 200 |      14.674µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:49:59 | 200 |       14.12µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:00 | 200 |       13.63µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:01 | 200 |      13.562µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:02 | 200 |      14.125µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:03 | 200 |      14.033µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:04 | 200 |      13.703µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:05 | 200 |       14.48µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:06 | 200 |      16.471µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:07 | 200 |      15.057µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:08 | 200 |      14.551µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:09 | 200 |      13.898µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:10 | 200 |      18.579µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:12 | 200 |      14.124µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:13 | 200 |      13.713µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:14 | 200 |      16.543µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:15 | 200 |      22.086µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:16 | 200 |      13.266µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:17 | 200 |      13.639µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:18 | 200 |      14.333µs |    192.168.10.1 | GET      "/"
[GIN] 2025/04/26 - 15:50:19 | 200 |      14.326µs |    192.168.10.1 | GET      "/"

OS

OS: Linux Mint 22.1 x86_64

GPU

GPU0: AMD ATI Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M
GPU1: AMD ATI Radeon RX 6800/6800 XT / 6900 XT

CPU

CPU: Intel i9-14900K (32) @ 5.700GHz

Ollama version

ollama/ollama:0.6.6-rocm

512 --batch-size 512 --n-gpu-layers 25 --verbose --threads 8 --parallel 1 --port 21385" time=2025-04-26T15:49:32.368Z level=DEBUG source=server.go:423 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH=/usr/lib/ollama/rocm:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/ollama/rocm:/usr/lib/ollama HIP_VISIBLE_DEVICES=1 ROCR_VISIBLE_DEVICES=GPU-7685bd29c5086f3d ROCM_VISIBLE_DEVICES=1 HSA_OVERRIDE_GFX_VERSION=10.3.0]" time=2025-04-26T15:49:32.368Z level=INFO source=sched.go:451 msg="loaded runners" count=1 time=2025-04-26T15:49:32.368Z level=INFO source=server.go:580 msg="waiting for llama runner to start responding" time=2025-04-26T15:49:32.368Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error" time=2025-04-26T15:49:32.377Z level=INFO source=runner.go:853 msg="starting go runner" time=2025-04-26T15:49:32.377Z level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/lib/ollama/rocm [GIN] 2025/04/26 - 15:49:33 | 200 | 19.852µs | 192.168.10.1 | GET  "/" /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so time=2025-04-26T15:49:33.155Z level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib time=2025-04-26T15:49:33.155Z level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib64 time=2025-04-26T15:49:33.155Z level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/lib/ollama load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so time=2025-04-26T15:49:33.156Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc) time=2025-04-26T15:49:33.156Z level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:21385" llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = bert llama_model_loader: - kv 1: general.name str = mxbai-embed-large-v1 llama_model_loader: - kv 2: bert.block_count u32 = 24 llama_model_loader: - kv 3: bert.context_length u32 = 512 llama_model_loader: - kv 4: bert.embedding_length u32 = 1024 llama_model_loader: - kv 5: bert.feed_forward_length u32 = 4096 llama_model_loader: - kv 6: bert.attention.head_count u32 = 16 llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000 llama_model_loader: - kv 8: general.file_type u32 = 1 llama_model_loader: - kv 9: bert.attention.causal bool = false llama_model_loader: - kv 10: bert.pooling_type u32 = 2 llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101 llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102 llama_model_loader: - kv 14: tokenizer.ggml.model str = bert llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100 llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102 llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101 llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103 llama_model_loader: - type f32: 243 tensors llama_model_loader: - type f16: 146 tensors print_info: file format = GGUF V3 (latest) print_info: file type = F16 print_info: file size = 637.85 MiB (16.02 BPW) init_tokenizer: initializing tokenizer for type 3 load: control token: 101 '[CLS]' is not marked as EOG load: control token: 103 '[MASK]' is not marked as EOG load: control token: 0 '[PAD]' is not marked as EOG load: control token: 100 '[UNK]' is not marked as EOG load: control token: 102 '[SEP]' is not marked as EOG load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect load: special tokens cache size = 5 load: token to piece cache size = 0.2032 MB print_info: arch = bert print_info: vocab_only = 0 print_info: n_ctx_train = 512 print_info: n_embd = 1024 print_info: n_layer = 24 print_info: n_head = 16 print_info: n_head_kv = 16 print_info: n_rot = 64 print_info: n_swa = 0 print_info: n_swa_pattern = 1 print_info: n_embd_head_k = 64 print_info: n_embd_head_v = 64 print_info: n_gqa = 1 print_info: n_embd_k_gqa = 1024 print_info: n_embd_v_gqa = 1024 print_info: f_norm_eps = 1.0e-12 print_info: f_norm_rms_eps = 0.0e+00 print_info: f_clamp_kqv = 0.0e+00 print_info: f_max_alibi_bias = 0.0e+00 print_info: f_logit_scale = 0.0e+00 print_info: f_attn_scale = 0.0e+00 print_info: n_ff = 4096 print_info: n_expert = 0 print_info: n_expert_used = 0 print_info: causal attn = 0 print_info: pooling type = 2 print_info: rope type = 2 print_info: rope scaling = linear print_info: freq_base_train = 10000.0 print_info: freq_scale_train = 1 print_info: n_ctx_orig_yarn = 512 print_info: rope_finetuned = unknown print_info: ssm_d_conv = 0 print_info: ssm_d_inner = 0 print_info: ssm_d_state = 0 print_info: ssm_dt_rank = 0 print_info: ssm_dt_b_c_rms = 0 print_info: model type = 335M print_info: model params = 334.09 M print_info: general.name = mxbai-embed-large-v1 print_info: vocab type = WPM 
print_info: n_vocab = 30522 print_info: n_merges = 0 print_info: BOS token = 101 '[CLS]' print_info: EOS token = 102 '[SEP]' print_info: UNK token = 100 '[UNK]' print_info: SEP token = 102 '[SEP]' print_info: PAD token = 0 '[PAD]' print_info: MASK token = 103 '[MASK]' print_info: LF token = 0 '[PAD]' print_info: EOG token = 102 '[SEP]' print_info: max token length = 21 load_tensors: loading model tensors, this can take a while... (mmap = true) load_tensors: layer 0 assigned to device CPU, is_swa = 0 load_tensors: layer 1 assigned to device CPU, is_swa = 0 load_tensors: layer 2 assigned to device CPU, is_swa = 0 load_tensors: layer 3 assigned to device CPU, is_swa = 0 load_tensors: layer 4 assigned to device CPU, is_swa = 0 load_tensors: layer 5 assigned to device CPU, is_swa = 0 load_tensors: layer 6 assigned to device CPU, is_swa = 0 load_tensors: layer 7 assigned to device CPU, is_swa = 0 load_tensors: layer 8 assigned to device CPU, is_swa = 0 load_tensors: layer 9 assigned to device CPU, is_swa = 0 load_tensors: layer 10 assigned to device CPU, is_swa = 0 load_tensors: layer 11 assigned to device CPU, is_swa = 0 load_tensors: layer 12 assigned to device CPU, is_swa = 0 load_tensors: layer 13 assigned to device CPU, is_swa = 0 load_tensors: layer 14 assigned to device CPU, is_swa = 0 load_tensors: layer 15 assigned to device CPU, is_swa = 0 load_tensors: layer 16 assigned to device CPU, is_swa = 0 load_tensors: layer 17 assigned to device CPU, is_swa = 0 load_tensors: layer 18 assigned to device CPU, is_swa = 0 load_tensors: layer 19 assigned to device CPU, is_swa = 0 load_tensors: layer 20 assigned to device CPU, is_swa = 0 load_tensors: layer 21 assigned to device CPU, is_swa = 0 load_tensors: layer 22 assigned to device CPU, is_swa = 0 load_tensors: layer 23 assigned to device CPU, is_swa = 0 load_tensors: layer 24 assigned to device CPU, is_swa = 0 time=2025-04-26T15:49:33.372Z level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server loading model" [GIN] 2025/04/26 - 15:49:34 | 200 | 16.28µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:35 | 200 | 12.119µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:36 | 200 | 12.501µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:37 | 200 | 16.986µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:38 | 200 | 18.783µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:39 | 200 | 13.436µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:40 | 200 | 19.586µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:41 | 200 | 23.479µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:42 | 200 | 23.868µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:43 | 200 | 13.501µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:44 | 200 | 14.025µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:45 | 200 | 13.98µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:46 | 200 | 22.553µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:47 | 200 | 13.092µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:48 | 200 | 19.752µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:49 | 200 | 12.943µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:50 | 200 | 13.455µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:51 | 200 | 12.638µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:52 | 200 | 21.598µs | 192.168.10.1 | GET  "/" load_tensors: CPU_Mapped model buffer size = 637.85 MiB llama_context: constructing llama_context llama_context: n_seq_max = 1 llama_context: n_ctx = 512 llama_context: 
n_ctx_per_seq = 512 llama_context: n_batch = 512 llama_context: n_ubatch = 512 llama_context: causal_attn = 0 llama_context: flash_attn = 0 llama_context: freq_base = 10000.0 llama_context: freq_scale = 1 set_abort_callback: call llama_context: CPU output buffer size = 0.00 MiB llama_context: n_ctx = 512 llama_context: n_ctx = 512 (padded) init: kv_size = 512, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 24, can_shift = 1 init: layer 0: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 1: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 2: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 3: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 4: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 5: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 6: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 7: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 8: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 9: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 10: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 11: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 12: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 13: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 14: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 15: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 16: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 17: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 18: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 19: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 20: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 21: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 22: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: layer 23: n_embd_k_gqa = 1024, n_embd_v_gqa = 1024, dev = CPU init: CPU KV buffer size = 48.00 MiB llama_context: KV self size = 48.00 MiB, K (f16): 24.00 MiB, V (f16): 24.00 MiB llama_context: enumerating backends llama_context: backend_ptrs.size() = 1 llama_context: max_nodes = 65536 llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0 llama_context: reserving graph for n_tokens = 512, n_seqs = 1 llama_context: reserving graph for n_tokens = 1, n_seqs = 1 llama_context: reserving graph for n_tokens = 512, n_seqs = 1 llama_context: CPU compute buffer size = 27.01 MiB llama_context: graph nodes = 825 llama_context: graph splits = 1 time=2025-04-26T15:49:53.446Z level=INFO source=server.go:619 msg="llama runner started in 21.08 seconds" time=2025-04-26T15:49:53.446Z level=DEBUG source=sched.go:464 msg="finished setting up runner" model=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d time=2025-04-26T15:49:53.450Z level=WARN source=ggml.go:152 msg="key not found" key=general.alignment default=32 time=2025-04-26T15:49:53.452Z level=DEBUG source=runner.go:686 msg="embedding request" content="tell me more" time=2025-04-26T15:49:53.454Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5 [GIN] 2025/04/26 - 15:49:53 | 200 | 22.671213222s | 173.249.47.211 | POST  "/api/embed" time=2025-04-26T15:49:53.477Z level=DEBUG source=sched.go:468 msg="context for request finished" time=2025-04-26T15:49:53.477Z level=DEBUG 
source=sched.go:341 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d duration=5m0s time=2025-04-26T15:49:53.477Z level=DEBUG source=sched.go:359 msg="after processing request finished event" modelPath=/home/ollama/models/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d refCount=0 [GIN] 2025/04/26 - 15:49:53 | 200 | 15.649µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:54 | 200 | 19.318µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:55 | 200 | 13.768µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:56 | 200 | 18.993µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:57 | 200 | 15.537µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:58 | 200 | 14.674µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:49:59 | 200 | 14.12µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:00 | 200 | 13.63µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:01 | 200 | 13.562µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:02 | 200 | 14.125µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:03 | 200 | 14.033µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:04 | 200 | 13.703µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:05 | 200 | 14.48µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:06 | 200 | 16.471µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:07 | 200 | 15.057µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:08 | 200 | 14.551µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:09 | 200 | 13.898µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:10 | 200 | 18.579µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:12 | 200 | 14.124µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:13 | 200 | 13.713µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:14 | 200 | 16.543µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:15 | 200 | 22.086µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:16 | 200 | 13.266µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:17 | 200 | 13.639µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:18 | 200 | 14.333µs | 192.168.10.1 | GET  "/" [GIN] 2025/04/26 - 15:50:19 | 200 | 14.326µs | 192.168.10.1 | GET  "/" ``` ### OS OS: Linux Mint 22.1 x86_64 ### GPU GPU0: AMD ATI Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M GPU1: AMD ATI Radeon RX 6800/6800 XT / 6900 XT ### CPU CPU: Intel i9-14900K (32) @ 5.700GHz ### Ollama version ollama/ollama:0.6.6-rocm
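
One thing worth ruling out in the container case: the runner subprocess can only enumerate HSA agents if `/dev/kfd` and the `/dev/dri` render nodes are exposed inside the pod and writable by the container user; if they are not, the runner reports `hipErrorNoDevice` even though the server process can still see the GPU through sysfs. A minimal sketch of the check, using the plain Docker equivalent of this deployment (a Kubernetes pod would need the same device nodes via a device plugin or hostPath mounts):

```shell
# Run the ROCm image with the kernel device nodes the HIP runtime needs.
docker run -d --device /dev/kfd --device /dev/dri \
  -e HIP_VISIBLE_DEVICES=1 -e HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm

# From inside the container, confirm the devices are actually visible:
docker exec -it ollama ls -l /dev/kfd /dev/dri
```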
GiteaMirror added the bug label 2025-11-12 13:47:09 -06:00
@sunarowicz commented on GitHub (May 2, 2025):

Same problem here, I get
ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected

Ollama version: 0.6.6 (installed in system, not container by official install script)
ROCm installed
GPU: iGPU 780M in Ryzen 7 7800X3D
System variables used: HSA_OVERRIDE_GFX_VERSION=11.0.0 HCC_AMDGPU_TARGETS=gfx1103 CUDA_VISIBLE_DEVICES=-1

rocminfo (truncated):

```
*******
Agent 2
*******
  Name:                    gfx1036
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
```
Ollama messages (truncated):

```
llama_model_load: vocab only - skipping tensors
time=2025-05-02T16:00:01.289+02:00 level=INFO source=server.go:405 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /home/ai/.ollama/models/blobs/sha256-5ee4f07cdb9beadbbb293e85803c569b01bd37ed059d2715faa7bb405f31caa6 --ctx-size 8192 --batch-size 512 --n-gpu-layers 37 --threads 8 --parallel 4 --port 40579"
time=2025-05-02T16:00:01.290+02:00 level=INFO source=sched.go:451 msg="loaded runners" count=1
time=2025-05-02T16:00:01.290+02:00 level=INFO source=server.go:580 msg="waiting for llama runner to start responding"
time=2025-05-02T16:00:01.290+02:00 level=INFO source=server.go:614 msg="waiting for server to become available" status="llm server error"
time=2025-05-02T16:00:01.300+02:00 level=INFO source=runner.go:853 msg="starting go runner"
ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
load_backend: loaded ROCm backend from /usr/local/lib/ollama/rocm/libggml-hip.so
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
time=2025-05-02T16:00:01.344+02:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-05-02T16:00:01.345+02:00 level=INFO source=runner.go:913 msg="Server listening on 127.0.0.1:40579"
```

Any idea how to solve this?
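
Since this install came from the official script, Ollama runs as a systemd service under its own `ollama` user, and two things commonly break GPU enumeration there: variables exported in a login shell never reach the service, and the service user needs the `render`/`video` group membership that `/dev/kfd` access requires. A sketch of both checks, assuming the stock `ollama.service` unit:

```shell
# Ensure the service user can open the GPU device nodes.
sudo usermod -aG render,video ollama

# Environment variables must be set on the unit, not in your shell:
sudo systemctl edit ollama.service
# ...then add in the override file:
#   [Service]
#   Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
sudo systemctl daemon-reload && sudo systemctl restart ollama
```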

@moophlo commented on GitHub (May 2, 2025):

Not sure about iGPU but try with these:

HIP_VISIBLE_DEVICES=1
HSA_OVERRIDE_GFX_VERSION=11.0.0
HCC_AMDGPU_TARGET=gfx1100
ROCR_VISIBLE_DEVICES=1

Actually, if you only have the integrated GPU, you shouldn't need any variables at all; ideally it should be autodetected.
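
One caveat with copying those values verbatim: `HIP_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES` take zero-based device indices, so `=1` selects the *second* agent. On a machine whose only GPU is the iGPU, the only valid index is 0, or the selection variables can simply be dropped. A minimal sketch:

```shell
# Single-GPU (iGPU-only) box: either select index 0 explicitly...
HIP_VISIBLE_DEVICES=0 ROCR_VISIBLE_DEVICES=0 ollama serve

# ...or leave device selection unset and only override the gfx target:
unset HIP_VISIBLE_DEVICES ROCR_VISIBLE_DEVICES
HSA_OVERRIDE_GFX_VERSION=11.0.0 ollama serve
```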

@sunarowicz commented on GitHub (May 2, 2025):

When I start server with AMD_LOG_LEVEL=3 and OLLAMA_DEBUG=1 I get the following additional info:

```
time=2025-05-02T16:24:57.999+02:00 level=INFO source=runner.go:853 msg="starting go runner"
time=2025-05-02T16:24:58.000+02:00 level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/local/lib/ollama/rocm
:3:rocdevice.cpp            :469 : 145600849622d us:  Initializing HSA stack.
:3:rocdevice.cpp            :555 : 145600858892d us:  Enumerated GPU agents = 0
:3:hip_context.cpp          :49  : 145600858897d us:  Direct Dispatch: 1
:3:hip_device_runtime.cpp   :649 : 145600858914d us:  hipGetDeviceCount ( 0x72165e6c4310 )
:3:hip_device_runtime.cpp   :651 : 145600858917d us:  hipGetDeviceCount: Returned hipErrorNoDevice :
ggml_cuda_init: failed to initialize ROCm: no ROCm-capable device is detected
load_backend: loaded ROCm backend from /usr/local/lib/ollama/rocm/libggml-hip.so
```

But I still do not understand why I get "Enumerated GPU agents = 0" when rocminfo reports the GPU agent.
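
"Enumerated GPU agents = 0" from the HSA stack while rocminfo succeeds usually means the two processes run as different users with different permissions on the device nodes. A quick way to compare, assuming the service user is `ollama` (adjust if the server runs as someone else):

```shell
# rocminfo in your shell runs as your user; run it as the service user too:
sudo -u ollama rocminfo | grep -E "Agent|gfx"

# Check which groups may open the compute device nodes:
ls -l /dev/kfd /dev/dri/renderD*
groups ollama
```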

@sunarowicz commented on GitHub (May 2, 2025):

> Not sure about iGPU but try with these:
>
> HIP_VISIBLE_DEVICES=1 HSA_OVERRIDE_GFX_VERSION=11.0.0 HCC_AMDGPU_TARGET=gfx1100 ROCR_VISIBLE_DEVICES=1
>
> Actually if you only have the Integrated GPU you shouldn't need any variable at all, ideally it should autodetect

Thank you for replying to me.

Although your recommendation didn't help, it moved me forward a bit. I found that using HSA_OVERRIDE_GFX_VERSION=11.0.0 alone finally makes Ollama try to load layers on the iGPU. But then it crashes with this error:

```
Memory access fault by GPU node-1 (Agent handle: 0x5d8d5aeee0b0) on address 0x70cee5a30000. Reason: Page not present or supervisor privilege.
time=2025-05-02T18:12:35.513+02:00 level=ERROR source=sched.go:457 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped)"
```

But this seems to be a different story, already reported (though not yet answered) here: [#8851](https://github.com/ollama/ollama/issues/8851).
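
For that follow-on page fault, the kernel side usually records more detail than the runner log, which tends to help when reporting. A small sketch:

```shell
# The amdgpu kernel driver logs the faulting address/VMID for GPU page faults:
sudo dmesg | grep -iE -A2 "amdgpu|page fault"
```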

Reference: github-starred/ollama-ollama#6855