[GH-ISSUE #11854] Can't use latest rocm (6.4.3) #54382

Open
opened 2026-04-29 05:51:49 -05:00 by GiteaMirror · 3 comments

Originally created by @deific on GitHub (Aug 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11854

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

The latest ROCm release is already 6.4.3, and the next major version, 7.0, has been previewed and may ship soon. I saw `ARG ROCMVERSION=6.3.3` in the Dockerfile, which means Ollama is still building against the older 6.3.3. How can I use the latest 6.4.3? I changed it to `ARG ROCMVERSION=6.4.3`, rebuilt the image, and ran it; the log shows the GPU being recognized, but the run ultimately fell back to the CPU (`load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so`).
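
For reference, the rebuild can be done without editing the Dockerfile by overriding the build arg (a minimal sketch; the `ROCMVERSION` arg is the one from the upstream Dockerfile, while the image tag here is just an example):

```shell
# Override the ROCm version declared as ARG ROCMVERSION in the Dockerfile
# without editing the file; the tag "ollama:rocm-6.4.3" is arbitrary.
docker build --build-arg ROCMVERSION=6.4.3 -t ollama:rocm-6.4.3 .
```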

OS: Ubuntu 24.04
Kernel: 6.14.0-27-generic
GPU: AMD Radeon 8060S (gfx1151)
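
To confirm the GPU is visible to the ROCm runtime at all, a hedged check with the standard ROCm tooling (`rocminfo` ships with ROCm):

```shell
# List the agents ROCm can see and their gfx targets; gfx1151 should
# appear here if the runtime recognizes the Radeon 8060S iGPU.
rocminfo | grep -i gfx
```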

ollama log:

time=2025-08-11T10:57:51.988Z level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=/root/.ollama/models/blobs/sha256-3d0b790534fe4b79525fc3692950408dca41171676ed7e21db57af5c65ef6ab6 gpu=0 parallel=1 available=68553601024 required="2.2 GiB"
time=2025-08-11T10:57:51.988Z level=INFO source=server.go:135 msg="system memory" total="62.4 GiB" free="46.7 GiB" free_swap="8.0 GiB"
time=2025-08-11T10:57:51.988Z level=INFO source=server.go:175 msg=offload library=rocm layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[63.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="2.2 GiB" memory.required.partial="2.2 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[2.2 GiB]" memory.weights.total="1.1 GiB" memory.weights.repeating="880.3 MiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="149.3 MiB" memory.graph.partial="149.3 MiB"
llama_model_loader: loaded meta data with 27 key-value pairs and 311 tensors from /root/.ollama/models/blobs/sha256-3d0b790534fe4b79525fc3692950408dca41171676ed7e21db57af5c65ef6ab6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 1.7B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 1.7B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 2048
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 6144
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 16
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type  f16:   28 tensors
llama_model_loader: - type q4_K:  155 tensors
llama_model_loader: - type q6_K:   15 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.26 GiB (5.33 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 2.03 B
print_info: general.name     = Qwen3 1.7B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-08-11T10:57:52.133Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --model /root/.ollama/models/blobs/sha256-3d0b790534fe4b79525fc3692950408dca41171676ed7e21db57af5c65ef6ab6 --ctx-size 4096 --batch-size 512 --n-gpu-layers 29 --threads 16 --parallel 1 --port 41903"
time=2025-08-11T10:57:52.134Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-11T10:57:52.134Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-11T10:57:52.134Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-11T10:57:52.143Z level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-08-11T10:57:52.147Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-08-11T10:57:52.147Z level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:41903"
llama_model_loader: loaded meta data with 27 key-value pairs and 311 tensors from /root/.ollama/models/blobs/sha256-3d0b790534fe4b79525fc3692950408dca41171676ed7e21db57af5c65ef6ab6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 1.7B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 1.7B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 2048
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 6144
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 16
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type  f16:   28 tensors
llama_model_loader: - type q4_K:  155 tensors
llama_model_loader: - type q6_K:   15 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.26 GiB (5.33 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 2048
print_info: n_layer          = 28
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 6144
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1.7B
print_info: model params     = 2.03 B
print_info: general.name     = Qwen3 1.7B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  1290.63 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.59 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1, padding = 32
time=2025-08-11T10:57:52.385Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
llama_kv_cache_unified:        CPU KV buffer size =   448.00 MiB
llama_kv_cache_unified: KV self size  =  448.00 MiB, K (f16):  224.00 MiB, V (f16):  224.00 MiB
llama_context:        CPU compute buffer size =   300.75 MiB
llama_context: graph nodes  = 1070
llama_context: graph splits = 1
time=2025-08-11T10:57:52.637Z level=INFO source=server.go:637 msg="llama runner started in 0.50 seconds"
[GIN] 2025/08/11 - 10:57:52 | 200 |  713.230321ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/08/11 - 10:58:34 | 200 |      37.148µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/11 - 10:58:34 | 200 |   32.567885ms |    

Relevant log output


OS

Linux, Docker

GPU

AMD

CPU

AMD

Ollama version

0.11.4

GiteaMirror added the bug, amd, linux labels 2026-04-29 05:51:50 -05:00

@kode54 commented on GitHub (Aug 12, 2025):

On Arch, using ROCm 6.4.3 from extra-testing, it outright segfaults when trying to use the ROCm backend on gfx1101.
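
One way to narrow this down (a hedged sketch; `OLLAMA_DEBUG=1` enables Ollama's verbose logging, and the `load_backend` lines are what the runner prints as it loads ggml backends) is to watch which backends actually load:

```shell
# A working ROCm build should log a load_backend line for a HIP/ROCm
# ggml library rather than only the CPU variants, and a segfault should
# surface right after that load.
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -iE 'load_backend|rocm|hip'
```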


@kode54 commented on GitHub (Aug 25, 2025):

Still broken with ollama 0.11.6.


@dhiltgen commented on GitHub (Nov 5, 2025):

As soon as we get Vulkan support enabled by default, we're planning to start updating ROCm versions more rapidly to keep up with the latest GPU support, while still maintaining support for older GPUs via Vulkan.

Reference: github-starred/ollama#54382