[GH-ISSUE #12908] ROCM Library not found, wrong location? - gfx1201 unsupported in current container image #55069

Closed
opened 2026-04-29 08:16:25 -05:00 by GiteaMirror · 15 comments

Originally created by @lu2 on GitHub (Nov 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12908

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Using the latest Docker image of ollama:rocm (https://hub.docker.com/layers/ollama/ollama/0.12.9-rocm/images/sha256-bc3c9f1744100d2937a9a4c2e3daafe61155a9165f67e0e7def22430a37ac944) with an AMD Radeon RX 9070 XT.

I get the following errors in the log when a model is loaded and prompted:

rocblaslt error: Cannot read /usr/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat: No such file or directory

rocblaslt error: Could not load /usr/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat
hipBLASLt error: Heuristic Fetch Failed!

Looking inside the container, the library does exist, but at a different location:

root@76da308225a5:/# find / | grep TensileLibrary_lazy_gfx1201.dat
/usr/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1201.dat

Should this be fixed in the image, or in Ollama itself?
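(As a quick check of which gfx targets each bundled library directory actually ships, one could run the following inside the container; the paths are the ones from the errors above, and the hipblaslt directory may not exist at all:)

```bash
# Compare the gfx targets present in the rocblas vs. hipblaslt library dirs.
# rocblas/library contains a gfx1201 .dat; hipblaslt/library apparently does not.
ls /usr/lib/ollama/rocm/rocblas/library/   2>/dev/null | grep -o 'gfx[0-9a-z]*' | sort -u
ls /usr/lib/ollama/rocm/hipblaslt/library/ 2>/dev/null | grep -o 'gfx[0-9a-z]*' | sort -u
```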

Relevant log output

time=2025-11-02T01:15:55.825Z level=INFO source=routes.go:1524 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-11-02T01:15:55.826Z level=INFO source=images.go:522 msg="total blobs: 18"
time=2025-11-02T01:15:55.826Z level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-11-02T01:15:55.826Z level=INFO source=routes.go:1577 msg="Listening on [::]:11434 (version 0.12.9)"
time=2025-11-02T01:15:55.827Z level=INFO source=runner.go:76 msg="discovering available GPUs..."
time=2025-11-02T01:15:55.828Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 46199"
time=2025-11-02T01:15:56.602Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44135"
time=2025-11-02T01:15:57.380Z level=INFO source=types.go:42 msg="inference compute" id=GPU-a9298820b4e99df1 filtered_id="" library=ROCm compute=gfx1201 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=60342.13 pci_id=0000:03:00.0 type=discrete total="15.9 GiB" available="15.8 GiB"
time=2025-11-02T01:15:57.380Z level=INFO source=routes.go:1618 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB"
time=2025-11-02T01:16:09.100Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 46381"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-11-02T01:16:10.030Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --port 45241"
time=2025-11-02T01:16:10.031Z level=INFO source=server.go:470 msg="system memory" total="62.6 GiB" free="55.6 GiB" free_swap="7.0 GiB"
time=2025-11-02T01:16:10.031Z level=INFO source=memory.go:37 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa library=ROCm parallel=1 required="5.4 GiB" gpus=1
time=2025-11-02T01:16:10.031Z level=INFO source=server.go:522 msg=offload library=ROCm layers.requested=-1 layers.model=33 layers.offload=33 layers.split=[33] memory.available="[15.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="512.0 MiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="296.0 MiB" memory.graph.partial="677.5 MiB"
time=2025-11-02T01:16:10.039Z level=INFO source=runner.go:910 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32, ID: GPU-a9298820b4e99df1
load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
time=2025-11-02T01:16:10.761Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-11-02T01:16:10.761Z level=INFO source=runner.go:946 msg="Server listening on 127.0.0.1:45241"
time=2025-11-02T01:16:10.769Z level=INFO source=runner.go:845 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:8 GPULayers:33[ID:GPU-a9298820b4e99df1 Layers:33(0..32)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:true}"
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) (0000:03:00.0) - 16222 MiB free
time=2025-11-02T01:16:10.769Z level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-11-02T01:16:10.769Z level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.33 GiB (4.64 BPW) 
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.8000 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Meta-Llama-3-8B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<|end_of_text|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        ROCm0 model buffer size =  4155.99 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.50 MiB
llama_kv_cache:      ROCm0 KV buffer size =   512.00 MiB
llama_kv_cache: size =  512.00 MiB (  4096 cells,  32 layers,  1/1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB

rocblaslt error: Cannot read /usr/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat: No such file or directory

rocblaslt error: Could not load /usr/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat
hipBLASLt error: Heuristic Fetch Failed!
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile. 
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
llama_context:      ROCm0 compute buffer size =   300.01 MiB
llama_context:  ROCm_Host compute buffer size =    20.01 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
time=2025-11-02T01:16:12.023Z level=INFO source=server.go:1289 msg="llama runner started in 1.99 seconds"
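(The log itself notes that these rocBLAS messages are printed only once; to surface every occurrence while debugging, the environment variables it names can be passed into the container. A sketch, assuming the standard device flags for the ROCm image; per the message, any non-empty value should do:)

```bash
docker run -d --device /dev/kfd --device /dev/dri \
  -e ROCBLAS_VERBOSE_HIPBLASLT_ERROR=1 \
  -e ROCBLAS_VERBOSE_TENSILE_ERROR=1 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm
```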

OS

Docker

GPU

AMD

CPU

Intel

Ollama version

0.12.9

GiteaMirror added the docker, bug, amd labels 2026-04-29 08:16:25 -05:00

@dhiltgen commented on GitHub (Nov 5, 2025):

It looks like this is specific to gfx12, which now brings in a dependency on hipBLASLt tensile files that we do not currently bundle. We've fallen a bit behind on ROCm, which complicates supporting the latest AMD GPUs. We're working to get Vulkan enabled soon, which will provide coverage for a broad set of GPUs and allow us to leapfrog to a much newer version of ROCm (likely based on TheRock), while still maintaining a broad set of supported GPUs between ROCm and Vulkan.

@andornaut commented on GitHub (Nov 6, 2025):

@dhiltgen

> specific to gfx12 which **now** brings in a dependency on hipBLASLt tensile files that we do not currently bundle. We've fallen a bit behind on ROCm, which complicates […]

So, is this a regression? Do you know the last working version of the Ollama container image for, e.g., the 9070 XT?

@tmench23 commented on GitHub (Nov 17, 2025):

I am currently facing the same issue with the 9070 (non-XT variant). Are there any known workarounds?

@xzxshmuner-boop commented on GitHub (Nov 19, 2025):

@dhiltgen

> It looks like this is specific to gfx12, which now brings in a dependency on hipBLASLt tensile files that we do not currently bundle. We've fallen a bit behind on ROCm, which complicates supporting the latest AMD GPUs. We're working to get Vulkan enabled soon, which will provide coverage for a broad set of GPUs and allow us to leapfrog to a much newer version of ROCm (likely based on TheRock), while still maintaining a broad set of supported GPUs between ROCm and Vulkan.

Are there any plans for when the latest AMD ROCm will be supported?

@NexGen-3D-Printing commented on GitHub (Nov 20, 2025):

I just hit this error with an RX 9060 XT 16 GB as well. I assume this is why I'm seeing underwhelming performance, as the same GPU under LM Studio is almost twice as responsive.

@dhiltgen commented on GitHub (Nov 21, 2025):

I have a draft PR to update to a newer ROCm (https://github.com/ollama/ollama/pull/10676), but we saw crashes on a few of our test systems that work fine on the version of ROCm we currently bundle, so we paused the update until we can get Vulkan enabled by default. Vulkan can then provide a fallback for any systems/GPUs where the newer ROCm would be a regression. Ultimately I plan to update the PR to ROCm v7 or potentially a nightly build of TheRock (https://github.com/ROCm/TheRock).

@NexGen-3D-Printing commented on GitHub (Nov 21, 2025):

I just tried to switch to Vulkan, but unfortunately under TrueNAS Scale, Vulkan is not available for Ollama; the logs say it's disabled.

The standard image is CPU-only, and I'm not able to set OLLAMA_VULKAN=1 under TrueNAS.
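(For reference, on a host where the container environment can be controlled directly, the variable mentioned above would be passed like this; a sketch that assumes the image build actually includes the Vulkan backend and uses the usual ROCm device flags:)

```bash
docker run -d --device /dev/kfd --device /dev/dri \
  -e OLLAMA_VULKAN=1 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm
```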

@MorrisLu-Taipei commented on GitHub (Dec 14, 2025):

Same issue; no workaround so far.

@hopperath commented on GitHub (Jan 7, 2026):

Similar issue with a 9060 XT 16 GB as well, on ROCm 7.1.1. If I link the file from the rocm directory, the "no such file" message goes away, but it still fails with:
rocblaslt error: Could not load /usr/local/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1200.dat
hipBLASLt error: Heuristic Fetch Failed!
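(A sketch of the kind of link being described, using the gfx1200 paths from this comment; as noted, it only removes the "cannot read" message, not the heuristic failure:)

```bash
# Hypothetical workaround attempt: satisfy the path hipBLASLt expects by
# linking the rocblas copy of the Tensile library into place.
mkdir -p /usr/local/lib/ollama/rocm/hipblaslt/library
ln -sf /usr/local/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1200.dat \
       /usr/local/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1200.dat
```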

@androiddrew commented on GitHub (Feb 5, 2026):

@lu2 I have a fork (https://github.com/androiddrew/ollama/tree/drew/rocm7p2) where I built Ollama with ROCm 7.2. I believe it resolves the errors you were seeing:

rocblaslt error: Cannot read /usr/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat: No such file or directory

rocblaslt error: Could not load /usr/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat
hipBLASLt error: Heuristic Fetch Failed!

I also built a Docker image to test with: https://hub.docker.com/repository/docker/androiddrew/ollama/tags/0.15.4-rocm-7.2/sha256-5de6c0238bc99642b54af76abcfcf1506cbf4c52e6a1ffe727ef0f8bfaeacff8

Let me know if it works for you.

Running

strings /usr/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1201.dat | egrep -i '\.co|Kernels|Manifest|TensileLibrary_Type' | head -n 200

will show, though, that there are no tuned kernels for gfx1201, only fallback kernels. I am only seeing 18 TPS for https://huggingface.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/blob/main/Devstral-Small-2-24B-Instruct-2512-UD-Q8_K_XL.gguf. That is a dense model, but I would have expected better performance than that.

@dhiltgen please take a look at my fork; if it meets expectations, I can create a PR.

@androiddrew commented on GitHub (Feb 5, 2026):

I just compared LM Studio and my Ollama build: effectively the same 33 TPS for the 4-bit Devstral 2 Small 24B model on my gfx1201. So the next step is figuring out how to squeeze more out of this, maybe with Tensile fine-tuning.

I am trying to figure out how these Tensile libraries are produced for the rocm/dev-almalinux-8:7.2-complete image that was used for those ROCm kernels.

@lu2 commented on GitHub (Feb 5, 2026):

Thanks @androiddrew. I just tried your Docker image androiddrew/ollama:0.15.4-rocm-7.2.
The error message is indeed gone, but performance-wise it is the same as the official ollama/ollama:rocm.
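(For anyone who wants to reproduce the comparison, a minimal sketch of the run step using the usual ROCm device flags; adjust volume and port mappings for your own setup:)

```bash
podman run -d --name ollama \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  docker.io/androiddrew/ollama:0.15.4-rocm-7.2
```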

I'll have to let more clever people compare the results:

androiddrew/ollama:0.15.4-rocm-7.2 results:

lu2@LU2PCHL:~$ podman logs ollama -f
time=2026-02-05T20:12:52.259Z level=INFO source=routes.go:1622 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:0 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-02-05T20:12:52.265Z level=INFO source=images.go:473 msg="total blobs: 88"
time=2026-02-05T20:12:52.266Z level=INFO source=images.go:480 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func3 (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func4 (5 handlers)
[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/me                   --> github.com/ollama/ollama/server.(*Server).WhoamiHandler-fm (5 handlers)
[GIN-debug] POST   /api/signout              --> github.com/ollama/ollama/server.(*Server).SignoutHandler-fm (5 handlers)
[GIN-debug] DELETE /api/user/keys/:encodedKey --> github.com/ollama/ollama/server.(*Server).SignoutHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] POST   /v1/responses             --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/images/generations    --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/images/edits          --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/messages              --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
time=2026-02-05T20:12:52.267Z level=INFO source=routes.go:1675 msg="Listening on [::]:11434 (version 0.0.0)"
time=2026-02-05T20:12:52.267Z level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-02-05T20:12:52.268Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44757"
time=2026-02-05T20:12:52.341Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42031"
time=2026-02-05T20:12:52.464Z level=INFO source=types.go:42 msg="inference compute" id=GPU-a9298820b4e99df1 filter_id="" library=ROCm compute=gfx1201 name=ROCm0 description="AMD Radeon RX 9070 XT" libdirs=ollama,rocm driver=70226.1 pci_id=0000:03:00.0 type=discrete total="15.9 GiB" available="14.9 GiB"
time=2026-02-05T20:12:52.464Z level=INFO source=routes.go:1725 msg="vram-based default context" total_vram="15.9 GiB" default_num_ctx=4096
[GIN] 2026/02/05 - 20:18:46 | 200 |      73.963µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/02/05 - 20:18:46 | 200 |   130.04483ms |       127.0.0.1 | POST     "/api/show"
time=2026-02-05T20:18:46.250Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44091"
time=2026-02-05T20:18:46.319Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-02-05T20:18:46.372Z level=INFO source=server.go:246 msg="enabling flash attention"
time=2026-02-05T20:18:46.372Z level=INFO source=server.go:430 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-c580819bed79c92d01a42227bd6d8fd66b9ec60d5329f5eb73f812f156af7807 --port 42701"
time=2026-02-05T20:18:46.372Z level=INFO source=sched.go:452 msg="system memory" total="62.6 GiB" free="62.5 GiB" free_swap="8.0 GiB"
time=2026-02-05T20:18:46.372Z level=INFO source=sched.go:459 msg="gpu memory" id=GPU-a9298820b4e99df1 library=ROCm available="14.5 GiB" free="15.0 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-05T20:18:46.372Z level=INFO source=server.go:756 msg="loading model" "model layers"=41 requested=-1
time=2026-02-05T20:18:46.380Z level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-02-05T20:18:46.380Z level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:42701"
time=2026-02-05T20:18:46.383Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:41[ID:GPU-a9298820b4e99df1 Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T20:18:46.427Z level=INFO source=ggml.go:136 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=51
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32, ID: GPU-a9298820b4e99df1
load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
time=2026-02-05T20:18:46.466Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-02-05T20:18:47.846Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:40[ID:GPU-a9298820b4e99df1 Layers:40(0..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T20:18:48.115Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:40[ID:GPU-a9298820b4e99df1 Layers:40(0..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T20:18:48.432Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:40[ID:GPU-a9298820b4e99df1 Layers:40(0..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T20:18:48.432Z level=INFO source=ggml.go:482 msg="offloading 40 repeating layers to GPU"
time=2026-02-05T20:18:48.432Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-05T20:18:48.432Z level=INFO source=ggml.go:494 msg="offloaded 40/41 layers to GPU"
time=2026-02-05T20:18:48.432Z level=INFO source=device.go:240 msg="model weights" device=ROCm0 size="12.5 GiB"
time=2026-02-05T20:18:48.432Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.7 GiB"
time=2026-02-05T20:18:48.432Z level=INFO source=device.go:251 msg="kv cache" device=ROCm0 size="640.0 MiB"
time=2026-02-05T20:18:48.432Z level=INFO source=device.go:262 msg="compute graph" device=ROCm0 size="951.7 MiB"
time=2026-02-05T20:18:48.432Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="10.0 MiB"
time=2026-02-05T20:18:48.432Z level=INFO source=device.go:272 msg="total memory" size="15.7 GiB"
time=2026-02-05T20:18:48.432Z level=INFO source=sched.go:526 msg="loaded runners" count=1
time=2026-02-05T20:18:48.432Z level=INFO source=server.go:1349 msg="waiting for llama runner to start responding"
time=2026-02-05T20:18:48.433Z level=INFO source=server.go:1383 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-05T20:18:49.942Z level=INFO source=server.go:1387 msg="llama runner started in 3.57 seconds"
[GIN] 2026/02/05 - 20:18:50 | 200 |  4.757222876s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/02/05 - 20:19:03 | 200 |      14.355µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/02/05 - 20:19:03 | 200 |   91.200279ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/02/05 - 20:19:04 | 200 |  584.092158ms |       127.0.0.1 | POST     "/api/generate"

root@d1c948104bd6:/# strings /usr/lib/ollama/rocm/rocblas/library/TensileLibrary_lazy_gfx1201.dat | egrep -i '\.co|Kernels|Manifest|TensileLibrary_Type' | head -n 200
ATensileLibrary_Type_SS_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
ATensileLibrary_Type_DD_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
ATensileLibrary_Type_CC_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
ATensileLibrary_Type_ZZ_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
ETensileLibrary_Type_HS_HPA_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
ATensileLibrary_Type_HH_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
ETensileLibrary_Type_HH_HPA_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
HTensileLibrary_Type_4xi8I_HPA_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
ETensileLibrary_Type_BS_HPA_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
ETensileLibrary_Type_BB_HPA_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
FTensileLibrary_Type_I8I_HPA_Contraction_l_Alik_Bljk_Cijk_Dijk_fallback
BTensileLibrary_Type_CC_Contraction_l_AlikC_Bjlk_Cijk_Dijk_fallback
BTensileLibrary_Type_ZZ_Contraction_l_AlikC_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_SS_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_DD_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_CC_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_ZZ_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
ETensileLibrary_Type_HS_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_HH_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
ETensileLibrary_Type_HH_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
HTensileLibrary_Type_4xi8I_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
ETensileLibrary_Type_BS_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
ETensileLibrary_Type_BB_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
FTensileLibrary_Type_I8I_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback
CTensileLibrary_Type_CC_Contraction_l_AlikC_BjlkC_Cijk_Dijk_fallback
CTensileLibrary_Type_ZZ_Contraction_l_AlikC_BjlkC_Cijk_Dijk_fallback
ATensileLibrary_Type_SS_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_DD_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_CC_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_ZZ_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ETensileLibrary_Type_HS_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_HH_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ETensileLibrary_Type_HH_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
HTensileLibrary_Type_4xi8I_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ETensileLibrary_Type_BS_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ETensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
FTensileLibrary_Type_I8I_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback
ATensileLibrary_Type_SS_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
ATensileLibrary_Type_DD_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
ATensileLibrary_Type_CC_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
ATensileLibrary_Type_ZZ_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
ETensileLibrary_Type_HS_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
ATensileLibrary_Type_HH_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
ETensileLibrary_Type_HH_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
HTensileLibrary_Type_4xi8I_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
ETensileLibrary_Type_BS_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
ETensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
FTensileLibrary_Type_I8I_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback
BTensileLibrary_Type_CC_Contraction_l_Alik_BjlkC_Cijk_Dijk_fallback
BTensileLibrary_Type_ZZ_Contraction_l_Alik_BjlkC_Cijk_Dijk_fallback
BTensileLibrary_Type_CC_Contraction_l_Ailk_BjlkC_Cijk_Dijk_fallback
BTensileLibrary_Type_ZZ_Contraction_l_Ailk_BjlkC_Cijk_Dijk_fallback
BTensileLibrary_Type_CC_Contraction_l_AlikC_Bljk_Cijk_Dijk_fallback
BTensileLibrary_Type_ZZ_Contraction_l_AlikC_Bljk_Cijk_Dijk_fallback
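The grep above only turns up the generic `_fallback` contraction entries from the rocblas copy of the library. As a quick sanity check (a minimal sketch; both paths are taken verbatim from the error messages earlier in this issue, and the hipblaslt directory may simply be absent from the image), one can compare what each library directory actually ships for gfx1201:

```
# Path hipBLASLt tries to read (missing in the official image):
ls -l /usr/lib/ollama/rocm/hipblaslt/library/ 2>/dev/null | grep -i gfx1201 \
  || echo "no gfx1201 hipBLASLt library"

# Path where the rocBLAS copy actually lives:
ls -l /usr/lib/ollama/rocm/rocblas/library/ | grep -i gfx1201
```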

lu2@LU2PCHL:~$ podman exec -it ollama ollama run devstral-small-2:24b "Good morning." --verbose
Good morning! How can I assist you today?

total duration:       583.005608ms
load duration:        89.950904ms
prompt eval count:    558 token(s)
prompt eval duration: 48.7875ms
prompt eval rate:     11437.36 tokens/s
eval count:           11 token(s)
eval duration:        439.923541ms
eval rate:            25.00 tokens/s

ollama/ollama:rocm (675519d3d134, 3 days ago) results:

lu2@LU2PCHL:~$ podman logs ollama -f
time=2026-02-05T20:05:16.417Z level=INFO source=routes.go:1631 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-02-05T20:05:16.422Z level=INFO source=images.go:473 msg="total blobs: 88"
time=2026-02-05T20:05:16.423Z level=INFO source=images.go:480 msg="total unused blobs removed: 0"
time=2026-02-05T20:05:16.423Z level=INFO source=routes.go:1684 msg="Listening on [::]:11434 (version 0.15.4)"
time=2026-02-05T20:05:16.424Z level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-02-05T20:05:16.425Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 33011"
time=2026-02-05T20:05:17.243Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 34601"
time=2026-02-05T20:05:17.988Z level=INFO source=types.go:42 msg="inference compute" id=GPU-a9298820b4e99df1 filter_id="" library=ROCm compute=gfx1201 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=60342.13 pci_id=0000:03:00.0 type=discrete total="15.9 GiB" available="14.9 GiB"
time=2026-02-05T20:05:17.988Z level=INFO source=routes.go:1725 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB"
[GIN] 2026/02/05 - 20:05:21 | 200 |      45.745µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/02/05 - 20:05:21 | 200 |  108.972067ms |       127.0.0.1 | POST     "/api/show"
time=2026-02-05T20:05:21.676Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 37165"
time=2026-02-05T20:05:22.486Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-02-05T20:05:22.550Z level=INFO source=server.go:245 msg="enabling flash attention"
time=2026-02-05T20:05:22.550Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-c580819bed79c92d01a42227bd6d8fd66b9ec60d5329f5eb73f812f156af7807 --port 32943"
time=2026-02-05T20:05:22.551Z level=INFO source=sched.go:452 msg="system memory" total="62.6 GiB" free="62.4 GiB" free_swap="8.0 GiB"
time=2026-02-05T20:05:22.551Z level=INFO source=sched.go:459 msg="gpu memory" id=GPU-a9298820b4e99df1 library=ROCm available="14.5 GiB" free="14.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-05T20:05:22.551Z level=INFO source=server.go:755 msg="loading model" "model layers"=41 requested=-1
time=2026-02-05T20:05:22.565Z level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-02-05T20:05:22.565Z level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:32943"
time=2026-02-05T20:05:22.574Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:41[ID:GPU-a9298820b4e99df1 Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T20:05:22.615Z level=INFO source=ggml.go:136 msg="" architecture=mistral3 file_type=Q4_K_M name="" description="" num_tensors=585 num_key_values=51
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32, ID: GPU-a9298820b4e99df1
load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
time=2026-02-05T20:05:23.321Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)

rocblaslt error: Cannot read /usr/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat: No such file or directory

rocblaslt error: Could not load /usr/lib/ollama/rocm/hipblaslt/library/TensileLibrary_lazy_gfx1201.dat
hipBLASLt error: Heuristic Fetch Failed!
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_HIPBLASLT_ERROR environment variable is set.

rocBLAS warning: hipBlasLT failed, falling back to tensile. 
This message will be only be displayed once, unless the ROCBLAS_VERBOSE_TENSILE_ERROR environment variable is set.
time=2026-02-05T20:05:23.799Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:40[ID:GPU-a9298820b4e99df1 Layers:40(0..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T20:05:24.080Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:40[ID:GPU-a9298820b4e99df1 Layers:40(0..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T20:05:24.344Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:40[ID:GPU-a9298820b4e99df1 Layers:40(0..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-05T20:05:24.344Z level=INFO source=ggml.go:482 msg="offloading 40 repeating layers to GPU"
time=2026-02-05T20:05:24.344Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-05T20:05:24.344Z level=INFO source=ggml.go:494 msg="offloaded 40/41 layers to GPU"
time=2026-02-05T20:05:24.344Z level=INFO source=device.go:240 msg="model weights" device=ROCm0 size="12.5 GiB"
time=2026-02-05T20:05:24.344Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.7 GiB"
time=2026-02-05T20:05:24.344Z level=INFO source=device.go:251 msg="kv cache" device=ROCm0 size="640.0 MiB"
time=2026-02-05T20:05:24.344Z level=INFO source=device.go:262 msg="compute graph" device=ROCm0 size="951.7 MiB"
time=2026-02-05T20:05:24.344Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="10.0 MiB"
time=2026-02-05T20:05:24.344Z level=INFO source=device.go:272 msg="total memory" size="15.7 GiB"
time=2026-02-05T20:05:24.344Z level=INFO source=sched.go:526 msg="loaded runners" count=1
time=2026-02-05T20:05:24.344Z level=INFO source=server.go:1347 msg="waiting for llama runner to start responding"
time=2026-02-05T20:05:24.344Z level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-05T20:05:26.108Z level=INFO source=server.go:1385 msg="llama runner started in 3.56 seconds"
[GIN] 2026/02/05 - 20:05:27 | 200 |  5.545776409s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/02/05 - 20:06:30 | 200 |      17.986µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/02/05 - 20:06:30 | 200 |   94.351682ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/02/05 - 20:06:30 | 200 |  590.841802ms |       127.0.0.1 | POST     "/api/generate"

lu2@LU2PCHL:~$ podman exec -it ollama ollama run devstral-small-2:24b "Good morning." --verbose
Good morning! How can I assist you today?

total duration:       589.745833ms
load duration:        97.75503ms
prompt eval count:    558 token(s)
prompt eval duration: 47.990647ms
prompt eval rate:     11627.27 tokens/s
eval count:           11 token(s)
eval duration:        438.511897ms
eval rate:            25.08 tokens/s
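Both rocBLAS warnings in the log above note that they are displayed only once unless the corresponding environment variable is set. To surface the full hipBLASLt/Tensile diagnostics on every occurrence (a sketch; the variable names are quoted verbatim from the log, and the remaining flags are the usual ROCm container setup):

```
podman run -d --name ollama \
  --device /dev/kfd --device /dev/dri \
  -e ROCBLAS_VERBOSE_HIPBLASLT_ERROR=1 \
  -e ROCBLAS_VERBOSE_TENSILE_ERROR=1 \
  -v ollama:/root/.ollama -p 11434:11434 \
  ollama/ollama:rocm
```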

@androiddrew commented on GitHub (Feb 5, 2026):

@lu2

time=2026-02-05T20:18:48.432Z level=INFO source=ggml.go:494 msg="offloaded 40/41 layers to GPU"

the Devstral2 Small model is pretty beefy and will exceed the 30 GiB of VRAM unless you use a low context length. Setting OLLAMA_CONTEXT_LENGTH to something low like 8192 should put the whole model on your GPU.
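
For example (a minimal sketch; the device flags are the standard ROCm container setup, and 8192 is just the value suggested above):

```
podman run -d --name ollama \
  --device /dev/kfd --device /dev/dri \
  -e OLLAMA_CONTEXT_LENGTH=8192 \
  -v ollama:/root/.ollama -p 11434:11434 \
  ollama/ollama:rocm
```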

The missing 2 GiB on these cards is apparently due to ECC. I hear you can turn it off to get the full 32 GiB on your card, but then you take on some different risks.


@dhiltgen commented on GitHub (Mar 11, 2026):

Release 0.17.8 updates Linux to ROCm v7, which covers support for this GPU. Please give the RC a try (https://github.com/ollama/ollama/blob/main/docs/linux.mdx#installing-specific-versions) and let us know if you run into any problems.
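
The linked page documents installing a pinned version; roughly (a sketch; substitute the actual RC version string from the releases page, and check Docker Hub for the matching image tag):

```
# Native Linux install pinned to a specific version:
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.17.8 sh

# Or pull a versioned ROCm image instead of the floating :rocm tag:
podman pull ollama/ollama:0.17.8-rocm
```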


@depuhitv commented on GitHub (Mar 28, 2026):

This error still shows up in the latest Docker ollama/ollama:rocm image, 0.19.0-rc0 (e6777885093e, https://hub.docker.com/layers/ollama/ollama/rocm/images/sha256-e6777885093ea1c91992523a6de8f3ac0c4a5d3d30e9be9b140a3f41cb3ea254).

time=2026-03-28T13:38:43.386Z level=INFO source=routes.go:1740 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES:0,1 HSA_OVERRIDE_GFX_VERSION:12.0.1 HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:10000 OLLAMA_DEBUG:INFO OLLAMA_DEBUG_LOG_REQUESTS:false OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:true OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2026-03-28T13:38:43.386Z level=INFO source=routes.go:1742 msg="Ollama cloud disabled: false"
time=2026-03-28T13:38:43.387Z level=INFO source=images.go:477 msg="total blobs: 22"
time=2026-03-28T13:38:43.388Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2026-03-28T13:38:43.388Z level=INFO source=routes.go:1798 msg="Listening on [::]:11434 (version 0.19.0-rc0)"
time=2026-03-28T13:38:43.388Z level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-03-28T13:38:43.388Z level=WARN source=runner.go:485 msg="user overrode visible devices" HIP_VISIBLE_DEVICES=0,1
time=2026-03-28T13:38:43.388Z level=WARN source=runner.go:485 msg="user overrode visible devices" HSA_OVERRIDE_GFX_VERSION=12.0.1
time=2026-03-28T13:38:43.388Z level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
time=2026-03-28T13:38:43.388Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 34259"
time=2026-03-28T13:38:43.829Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 43559"
time=2026-03-28T13:38:43.829Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 36787"
time=2026-03-28T13:38:43.959Z level=INFO source=types.go:42 msg="inference compute" id=GPU-73f0398c6bad186f filter_id="" library=ROCm compute=gfx1201 name=ROCm0 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=70226.1 pci_id=0000:03:00.0 type=discrete total="31.9 GiB" available="31.8 GiB"
time=2026-03-28T13:38:43.959Z level=INFO source=types.go:42 msg="inference compute" id=GPU-e38dd89909d5bf02 filter_id="" library=ROCm compute=gfx1201 name=ROCm1 description="AMD Radeon Graphics" libdirs=ollama,rocm driver=70226.1 pci_id=0000:07:00.0 type=discrete total="31.9 GiB" available="31.5 GiB"
time=2026-03-28T13:38:43.959Z level=INFO source=routes.go:1848 msg="vram-based default context" total_vram="63.7 GiB" default_num_ctx=262144
time=2026-03-28T13:39:07.796Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 32933"
time=2026-03-28T13:39:08.233Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-03-28T13:39:08.305Z level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-03-28T13:39:08.306Z level=INFO source=server.go:432 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb --port 38679"
time=2026-03-28T13:39:08.306Z level=INFO source=sched.go:484 msg="system memory" total="188.3 GiB" free="188.2 GiB" free_swap="8.0 GiB"
time=2026-03-28T13:39:08.306Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-73f0398c6bad186f library=ROCm available="31.4 GiB" free="31.8 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-28T13:39:08.306Z level=INFO source=sched.go:491 msg="gpu memory" id=GPU-e38dd89909d5bf02 library=ROCm available="31.1 GiB" free="31.5 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-28T13:39:08.306Z level=INFO source=server.go:759 msg="loading model" "model layers"=25 requested=-1
time=2026-03-28T13:39:08.312Z level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-03-28T13:39:08.312Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:38679"
time=2026-03-28T13:39:08.318Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:10000 KvCacheType:q8_0 NumThreads:16 GPULayers:25[ID:GPU-e38dd89909d5bf02 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-28T13:39:08.346Z level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32, ID: GPU-73f0398c6bad186f
  Device 1: AMD Radeon Graphics, gfx1201 (0x1201), VMM: no, Wave Size: 32, ID: GPU-e38dd89909d5bf02
load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
time=2026-03-28T13:39:08.386Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 ROCm.1.NO_VMM=1 ROCm.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)

rocblaslt error: Cannot read "TensileLibrary_lazy_gfx1201.dat": No such file or directory

rocblaslt error: Could not load "TensileLibrary_lazy_gfx1201.dat"
time=2026-03-28T13:39:08.603Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:10000 KvCacheType:q8_0 NumThreads:16 GPULayers:25[ID:GPU-73f0398c6bad186f Layers:13(0..12) ID:GPU-e38dd89909d5bf02 Layers:12(13..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-28T13:39:08.711Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:10000 KvCacheType:q8_0 NumThreads:16 GPULayers:25[ID:GPU-73f0398c6bad186f Layers:13(0..12) ID:GPU-e38dd89909d5bf02 Layers:12(13..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-28T13:39:08.778Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:10000 KvCacheType:q8_0 NumThreads:16 GPULayers:25[ID:GPU-73f0398c6bad186f Layers:13(0..12) ID:GPU-e38dd89909d5bf02 Layers:12(13..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-28T13:39:08.778Z level=INFO source=ggml.go:482 msg="offloading 24 repeating layers to GPU"
time=2026-03-28T13:39:08.778Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-03-28T13:39:08.778Z level=INFO source=ggml.go:494 msg="offloaded 25/25 layers to GPU"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:240 msg="model weights" device=ROCm0 size="5.8 GiB"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:240 msg="model weights" device=ROCm1 size="6.0 GiB"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:251 msg="kv cache" device=ROCm0 size="97.2 MiB"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:251 msg="kv cache" device=ROCm1 size="87.7 MiB"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:262 msg="compute graph" device=ROCm0 size="193.3 MiB"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:262 msg="compute graph" device=ROCm1 size="138.3 MiB"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-03-28T13:39:08.778Z level=INFO source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-03-28T13:39:08.778Z level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-03-28T13:39:08.778Z level=INFO source=server.go:1352 msg="waiting for llama runner to start responding"
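
Note the two WARN lines near the top of this log: HIP_VISIBLE_DEVICES and HSA_OVERRIDE_GFX_VERSION are user overrides, and the server itself suggests unsetting them if GPUs are not discovered correctly. With ROCm v7 covering gfx1201 natively, the gfx-version override should no longer be needed (a sketch; drop the variables from whatever launch command is in use):

```
# Let discovery report the native gfx1201 target:
podman run -d --name ollama \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  ollama/ollama:rocm
```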
Reference: github-starred/ollama#55069