[GH-ISSUE #10784] error loading model: unable to allocate ROCm0 buffer #69142

Open
opened 2026-05-04 17:16:24 -05:00 by GiteaMirror · 9 comments

Originally created by @IMIEEET on GitHub (May 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10784

What is the issue?

Hello, I updated Ollama from 0.6.5 to 0.7.0 and now my model won't load. It's deepseek-r1:14b.
Whenever I run `ollama run deepseek-r1:14b`, after a few seconds it gives me this:

Error: llama runner process has terminated: error loading model: unable to allocate ROCm0 buffer
llama_model_load_from_file_impl: failed to load model

Relevant log output

May 21 00:30:22 imieeet ollama[1369]: [GIN] 2025/05/21 - 00:30:22 | 200 |      49.443µs |       127.0.0.1 | HEAD     "/"
May 21 00:30:22 imieeet ollama[1369]: [GIN] 2025/05/21 - 00:30:22 | 200 |   31.919152ms |       127.0.0.1 | POST     "/api/show"
May 21 00:30:22 imieeet ollama[1369]: time=2025-05-21T00:30:22.811+03:30 level=INFO source=server.go:135 msg="system memory" total="30.6 GiB" free="22.4 GiB" free_swap="8.0 GiB"
May 21 00:30:22 imieeet ollama[1369]: time=2025-05-21T00:30:22.811+03:30 level=INFO source=server.go:168 msg=offload library=rocm layers.requested=-1 layers.model=49 layers.offload=35 layers.split="" memory.available="[8.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.7 GiB" memory.required.partial="7.8 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[7.8 GiB]" memory.weights.total="8.0 GiB" memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="348.0 MiB" memory.graph.partial="916.1 MiB"
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   1:                               general.type str              = model
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 14B
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   4:                         general.size_label str              = 14B
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   5:                          qwen2.block_count u32              = 48
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 13824
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  13:                          general.file_type u32              = 15
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - kv  25:               general.quantization_version u32              = 2
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - type  f32:  241 tensors
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - type q4_K:  289 tensors
May 21 00:30:22 imieeet ollama[1369]: llama_model_loader: - type q6_K:   49 tensors
May 21 00:30:22 imieeet ollama[1369]: print_info: file format = GGUF V3 (latest)
May 21 00:30:22 imieeet ollama[1369]: print_info: file type   = Q4_K - Medium
May 21 00:30:22 imieeet ollama[1369]: print_info: file size   = 8.37 GiB (4.87 BPW)
May 21 00:30:22 imieeet ollama[1369]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
May 21 00:30:22 imieeet ollama[1369]: load: special tokens cache size = 22
May 21 00:30:23 imieeet ollama[1369]: load: token to piece cache size = 0.9310 MB
May 21 00:30:23 imieeet ollama[1369]: print_info: arch             = qwen2
May 21 00:30:23 imieeet ollama[1369]: print_info: vocab_only       = 1
May 21 00:30:23 imieeet ollama[1369]: print_info: model type       = ?B
May 21 00:30:23 imieeet ollama[1369]: print_info: model params     = 14.77 B
May 21 00:30:23 imieeet ollama[1369]: print_info: general.name     = DeepSeek R1 Distill Qwen 14B
May 21 00:30:23 imieeet ollama[1369]: print_info: vocab type       = BPE
May 21 00:30:23 imieeet ollama[1369]: print_info: n_vocab          = 152064
May 21 00:30:23 imieeet ollama[1369]: print_info: n_merges         = 151387
May 21 00:30:23 imieeet ollama[1369]: print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: LF token         = 198 'Ċ'
May 21 00:30:23 imieeet ollama[1369]: print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: FIM MID token    = 151660 '<|fim_middle|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: FIM PAD token    = 151662 '<|fim_pad|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: FIM REP token    = 151663 '<|repo_name|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: FIM SEP token    = 151664 '<|file_sep|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: EOG token        = 151662 '<|fim_pad|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: EOG token        = 151663 '<|repo_name|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: EOG token        = 151664 '<|file_sep|>'
May 21 00:30:23 imieeet ollama[1369]: print_info: max token length = 256
May 21 00:30:23 imieeet ollama[1369]: llama_model_load: vocab only - skipping tensors
May 21 00:30:23 imieeet ollama[1369]: time=2025-05-21T00:30:23.019+03:30 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e --ctx-size 4096 --batch-size 512 --n-gpu-layers 35 --threads 8 --parallel 1 --port 44291"
May 21 00:30:23 imieeet ollama[1369]: time=2025-05-21T00:30:23.019+03:30 level=INFO source=sched.go:472 msg="loaded runners" count=1
May 21 00:30:23 imieeet ollama[1369]: time=2025-05-21T00:30:23.019+03:30 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
May 21 00:30:23 imieeet ollama[1369]: time=2025-05-21T00:30:23.020+03:30 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
May 21 00:30:23 imieeet ollama[1369]: time=2025-05-21T00:30:23.029+03:30 level=INFO source=runner.go:815 msg="starting go runner"
May 21 00:30:23 imieeet ollama[1369]: load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
May 21 00:30:25 imieeet ollama[1369]: /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
May 21 00:30:25 imieeet ollama[1369]: /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
May 21 00:30:27 imieeet ollama[1369]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
May 21 00:30:27 imieeet ollama[1369]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 21 00:30:27 imieeet ollama[1369]: ggml_cuda_init: found 1 ROCm devices:
May 21 00:30:27 imieeet ollama[1369]:   Device 0: AMD Radeon Graphics, gfx1030 (0x1030), VMM: no, Wave Size: 32
May 21 00:30:27 imieeet ollama[1369]: load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
May 21 00:30:27 imieeet ollama[1369]: time=2025-05-21T00:30:27.377+03:30 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 21 00:30:27 imieeet ollama[1369]: llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 8152 MiB free
May 21 00:30:27 imieeet ollama[1369]: time=2025-05-21T00:30:27.378+03:30 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:44291"
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   1:                               general.type str              = model
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 14B
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   4:                         general.size_label str              = 14B
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   5:                          qwen2.block_count u32              = 48
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 13824
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  13:                          general.file_type u32              = 15
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - kv  25:               general.quantization_version u32              = 2
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - type  f32:  241 tensors
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - type q4_K:  289 tensors
May 21 00:30:27 imieeet ollama[1369]: llama_model_loader: - type q6_K:   49 tensors
May 21 00:30:27 imieeet ollama[1369]: print_info: file format = GGUF V3 (latest)
May 21 00:30:27 imieeet ollama[1369]: print_info: file type   = Q4_K - Medium
May 21 00:30:27 imieeet ollama[1369]: print_info: file size   = 8.37 GiB (4.87 BPW)
May 21 00:30:27 imieeet ollama[1369]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
May 21 00:30:27 imieeet ollama[1369]: load: special tokens cache size = 22
May 21 00:30:27 imieeet ollama[1369]: time=2025-05-21T00:30:27.533+03:30 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
May 21 00:30:27 imieeet ollama[1369]: load: token to piece cache size = 0.9310 MB
May 21 00:30:27 imieeet ollama[1369]: print_info: arch             = qwen2
May 21 00:30:27 imieeet ollama[1369]: print_info: vocab_only       = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: n_ctx_train      = 131072
May 21 00:30:27 imieeet ollama[1369]: print_info: n_embd           = 5120
May 21 00:30:27 imieeet ollama[1369]: print_info: n_layer          = 48
May 21 00:30:27 imieeet ollama[1369]: print_info: n_head           = 40
May 21 00:30:27 imieeet ollama[1369]: print_info: n_head_kv        = 8
May 21 00:30:27 imieeet ollama[1369]: print_info: n_rot            = 128
May 21 00:30:27 imieeet ollama[1369]: print_info: n_swa            = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: n_swa_pattern    = 1
May 21 00:30:27 imieeet ollama[1369]: print_info: n_embd_head_k    = 128
May 21 00:30:27 imieeet ollama[1369]: print_info: n_embd_head_v    = 128
May 21 00:30:27 imieeet ollama[1369]: print_info: n_gqa            = 5
May 21 00:30:27 imieeet ollama[1369]: print_info: n_embd_k_gqa     = 1024
May 21 00:30:27 imieeet ollama[1369]: print_info: n_embd_v_gqa     = 1024
May 21 00:30:27 imieeet ollama[1369]: print_info: f_norm_eps       = 0.0e+00
May 21 00:30:27 imieeet ollama[1369]: print_info: f_norm_rms_eps   = 1.0e-05
May 21 00:30:27 imieeet ollama[1369]: print_info: f_clamp_kqv      = 0.0e+00
May 21 00:30:27 imieeet ollama[1369]: print_info: f_max_alibi_bias = 0.0e+00
May 21 00:30:27 imieeet ollama[1369]: print_info: f_logit_scale    = 0.0e+00
May 21 00:30:27 imieeet ollama[1369]: print_info: f_attn_scale     = 0.0e+00
May 21 00:30:27 imieeet ollama[1369]: print_info: n_ff             = 13824
May 21 00:30:27 imieeet ollama[1369]: print_info: n_expert         = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: n_expert_used    = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: causal attn      = 1
May 21 00:30:27 imieeet ollama[1369]: print_info: pooling type     = -1
May 21 00:30:27 imieeet ollama[1369]: print_info: rope type        = 2
May 21 00:30:27 imieeet ollama[1369]: print_info: rope scaling     = linear
May 21 00:30:27 imieeet ollama[1369]: print_info: freq_base_train  = 1000000.0
May 21 00:30:27 imieeet ollama[1369]: print_info: freq_scale_train = 1
May 21 00:30:27 imieeet ollama[1369]: print_info: n_ctx_orig_yarn  = 131072
May 21 00:30:27 imieeet ollama[1369]: print_info: rope_finetuned   = unknown
May 21 00:30:27 imieeet ollama[1369]: print_info: ssm_d_conv       = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: ssm_d_inner      = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: ssm_d_state      = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: ssm_dt_rank      = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: ssm_dt_b_c_rms   = 0
May 21 00:30:27 imieeet ollama[1369]: print_info: model type       = 14B
May 21 00:30:27 imieeet ollama[1369]: print_info: model params     = 14.77 B
May 21 00:30:27 imieeet ollama[1369]: print_info: general.name     = DeepSeek R1 Distill Qwen 14B
May 21 00:30:27 imieeet ollama[1369]: print_info: vocab type       = BPE
May 21 00:30:27 imieeet ollama[1369]: print_info: n_vocab          = 152064
May 21 00:30:27 imieeet ollama[1369]: print_info: n_merges         = 151387
May 21 00:30:27 imieeet ollama[1369]: print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: LF token         = 198 'Ċ'
May 21 00:30:27 imieeet ollama[1369]: print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: FIM MID token    = 151660 '<|fim_middle|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: FIM PAD token    = 151662 '<|fim_pad|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: FIM REP token    = 151663 '<|repo_name|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: FIM SEP token    = 151664 '<|file_sep|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: EOG token        = 151662 '<|fim_pad|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: EOG token        = 151663 '<|repo_name|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: EOG token        = 151664 '<|file_sep|>'
May 21 00:30:27 imieeet ollama[1369]: print_info: max token length = 256
May 21 00:30:27 imieeet ollama[1369]: load_tensors: loading model tensors, this can take a while... (mmap = true)
May 21 00:30:31 imieeet ollama[1369]: alloc_tensor_range: failed to initialize tensor blk.13.attn_q.weight
May 21 00:30:31 imieeet ollama[1369]: llama_model_load: error loading model: unable to allocate ROCm0 buffer
May 21 00:30:31 imieeet ollama[1369]: llama_model_load_from_file_impl: failed to load model
May 21 00:30:31 imieeet ollama[1369]: panic: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 21 00:30:31 imieeet ollama[1369]: goroutine 50 [running]:
May 21 00:30:31 imieeet ollama[1369]: github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc00059a000, {0x23, 0x0, 0x1, {0x0, 0x0, 0x0}, 0xc0000427b0, 0x0}, {0x7fff3b76bc73, ...}, ...)
May 21 00:30:31 imieeet ollama[1369]:         github.com/ollama/ollama/runner/llamarunner/runner.go:751 +0x395
May 21 00:30:31 imieeet ollama[1369]: created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
May 21 00:30:31 imieeet ollama[1369]:         github.com/ollama/ollama/runner/llamarunner/runner.go:848 +0xb57
May 21 00:30:31 imieeet ollama[1369]: time=2025-05-21T00:30:31.834+03:30 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
May 21 00:30:32 imieeet ollama[1369]: time=2025-05-21T00:30:32.045+03:30 level=ERROR source=sched.go:478 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate ROCm0 buffer\nllama_model_load_from_file_impl: failed to load model"
May 21 00:30:32 imieeet ollama[1369]: [GIN] 2025/05/21 - 00:30:32 | 500 |  9.274238288s |       127.0.0.1 | POST     "/api/generate"
May 21 00:30:37 imieeet ollama[1369]: time=2025-05-21T00:30:37.046+03:30 level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001099854 runner.size="10.7 GiB" runner.vram="7.8 GiB" runner.parallel=1 runner.pid=40703 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 21 00:30:37 imieeet ollama[1369]: time=2025-05-21T00:30:37.296+03:30 level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250552187 runner.size="10.7 GiB" runner.vram="7.8 GiB" runner.parallel=1 runner.pid=40703 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 21 00:30:37 imieeet ollama[1369]: time=2025-05-21T00:30:37.546+03:30 level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501048397 runner.size="10.7 GiB" runner.vram="7.8 GiB" runner.parallel=1 runner.pid=40703 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 21 00:35:03 imieeet ollama[1369]: [GIN] 2025/05/21 - 00:35:03 | 200 |      24.887µs |       127.0.0.1 | HEAD     "/"
May 21 00:35:03 imieeet ollama[1369]: [GIN] 2025/05/21 - 00:35:03 | 200 |     468.013µs |       127.0.0.1 | GET      "/api/tags"

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.7.0

GiteaMirror added the bug label 2026-05-04 17:16:24 -05:00

@rick-github commented on GitHub (May 20, 2025):

May 21 00:30:22 imieeet ollama[1369]: time=2025-05-21T00:30:22.811+03:30 level=INFO source=server.go:168 msg=offload
 library=rocm layers.requested=-1 layers.model=49 layers.offload=35 layers.split="" memory.available="[8.0 GiB]"
 memory.gpu_overhead="0 B" memory.required.full="10.7 GiB" memory.required.partial="7.8 GiB"
 memory.required.kv="768.0 MiB" memory.required.allocations="[7.8 GiB]" memory.weights.total="8.0 GiB"
 memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="348.0 MiB"
 memory.graph.partial="916.1 MiB"

8 GiB is available and 7.8 GiB was allocated to host 35 of the model's 49 layers, leaving only ~200 MiB for transient allocations. OOMs can be mitigated with the steps shown here: https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288
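
For reference, a minimal sketch of that kind of mitigation: forcing fewer layers onto the GPU for a single session so more headroom is left for transient allocations (the value 25 is illustrative, not a recommendation):

```shell
# Interactively lower the number of offloaded layers for this session;
# the remaining layers run on the CPU backend from system RAM.
ollama run deepseek-r1:14b
>>> /set parameter num_gpu 25
>>> /save deepseek-r1-g25    # optionally persist the setting under a new tag
```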


@IMIEEET commented on GitHub (May 20, 2025):

@rick-github I did set the GPU overhead, FA wasn't working for me, I created copies with a lower num_gpu like 25, and I set parallel to 1; still the same error. The same hardware was working on 0.6.5.
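
For context, creating such a copy usually looks like the following sketch (the tag deepseek-r1-g25 and the file name are made up for illustration):

```shell
# Bake a lower num_gpu into a derived model so every load uses it.
cat > Modelfile.g25 <<'EOF'
FROM deepseek-r1:14b
PARAMETER num_gpu 25
EOF
ollama create deepseek-r1-g25 -f Modelfile.g25
ollama run deepseek-r1-g25
```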


@rick-github commented on GitHub (May 20, 2025):

Can you post logs from those changes?


@apt-install-coffee commented on GitHub (May 21, 2025):

memory.required.kv="768.0 MiB"

What OLLAMA_GPU_OVERHEAD did you set? Try setting it slightly larger than the value above.
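
On a systemd-based install, that might look like the sketch below. OLLAMA_GPU_OVERHEAD is specified in bytes; 1 GiB here is just an illustrative round-up of the 768 MiB KV figure:

```shell
# Reserve headroom on the GPU so transient allocations don't OOM.
# 768 MiB = 805306368 bytes; rounded up to 1 GiB = 1073741824 bytes.
sudo systemctl edit ollama
# In the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_GPU_OVERHEAD=1073741824"
sudo systemctl restart ollama
```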


@logandoo commented on GitHub (May 23, 2025):

Same here: I have 16 GB of VRAM and was trying to load a 12B model, and got the same error. However, when I switch to another device with the exact same configuration, it works fine. Here is the full log:

time=2025-05-23T11:51:09.682+08:00 level=INFO source=sched.go:777 msg="new model will fit in available VRAM in single GPU, loading" model=/home/biklimax/.ollama/models/blobs/sha256-cf4795c211f8a5740d5b6244448fb18f56645614616287c09756ab26303cdd33 gpu=0 parallel=2 available=16501870592 required="10.5 GiB"
time=2025-05-23T11:51:09.682+08:00 level=INFO source=server.go:135 msg="system memory" total="15.4 GiB" free="13.7 GiB" free_swap="1.5 GiB"
time=2025-05-23T11:51:09.683+08:00 level=INFO source=server.go:168 msg=offload library=rocm layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="10.5 GiB" memory.required.partial="10.5 GiB" memory.required.kv="1.2 GiB" memory.required.allocations="[10.5 GiB]" memory.weights.total="7.7 GiB" memory.weights.repeating="7.2 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="568.0 MiB" memory.graph.partial="801.0 MiB"
time=2025-05-23T11:51:09.683+08:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-05-23T11:51:09.683+08:00 level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-05-23T11:51:09.706+08:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /home/biklimax/.ollama/models/blobs/sha256-cf4795c211f8a5740d5b6244448fb18f56645614616287c09756ab26303cdd33 --ctx-size 8192 --batch-size 512 --n-gpu-layers 41 --threads 8 --flash-attn --parallel 2 --port 40981"
time=2025-05-23T11:51:09.706+08:00 level=INFO source=sched.go:472 msg="loaded runners" count=1
time=2025-05-23T11:51:09.706+08:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-05-23T11:51:09.707+08:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-05-23T11:51:09.715+08:00 level=INFO source=runner.go:836 msg="starting ollama engine"
time=2025-05-23T11:51:09.715+08:00 level=INFO source=runner.go:899 msg="Server listening on 127.0.0.1:40981"
time=2025-05-23T11:51:09.736+08:00 level=INFO source=ggml.go:73 msg="" architecture=llama file_type=Q5_K_M name="Mistral Nemo Instruct 2407" description="" num_tensors=363 num_key_values=36
load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
time=2025-05-23T11:51:09.958+08:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1100 (0x1100), VMM: no, Wave Size: 32
load_backend: loaded ROCm backend from /usr/local/lib/ollama/rocm/libggml-hip.so
time=2025-05-23T11:51:11.050+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7875.80 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate ROCm0 buffer of size 8258375680
panic: unable to allocate memory from device ROCm0 for model weights

goroutine 13 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).loadModel(0xc0001437a0, {0x6248ab123b10?, 0xc00012fd10?}, {0x7fff3424c0d6?, 0x0?}, {0xc0004d29e0, 0x8, 0x0, 0x29, {0x0, ...}, ...}, ...)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:777 +0x2d1
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:872 +0xa2b
time=2025-05-23T11:51:11.079+08:00 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
time=2025-05-23T11:51:11.213+08:00 level=ERROR source=sched.go:478 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nalloc_tensor_range: failed to allocate ROCm0 buffer of size 8258375680"

And this is my rocminfo output:

ROCk module version 6.8.5 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Eng Sample: 100-000000955-50_Y 
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Eng Sample: 100-000000955-50_Y 
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4971                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    16175700(0xf6d254) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16175700(0xf6d254) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16175700(0xf6d254) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1103                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5567(0x15bf)                       
  ASIC Revision:           5(0x5)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2500                               
  BDFID:                   25856                              
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 39                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8087848(0x7b6928) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8087848(0x7b6928) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1103         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***     

Here is my hipcc --version output:

HIP version: 6.2.41134-65d174c3e
AMD clang version 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.2.4 24392 1e2c94795ee0d6ab8e2ff3035965a6b74e11b475)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-6.2.4/lib/llvm/bin
Configuration file: /opt/rocm-6.2.4/lib/llvm/bin/clang++.cfg

@IMIEEET commented on GitHub (May 24, 2025):

@rick-github sorry for the long pause

May 24 04:26:26 imieeet ollama[1367]: [GIN] 2025/05/24 - 04:26:26 | 200 | 42.257µs | 127.0.0.1 | HEAD "/"
May 24 04:26:26 imieeet ollama[1367]: [GIN] 2025/05/24 - 04:26:26 | 200 | 1.053802ms | 127.0.0.1 | GET "/api/tags"
May 24 04:26:41 imieeet ollama[1367]: [GIN] 2025/05/24 - 04:26:41 | 200 | 24.936µs | 127.0.0.1 | HEAD "/"
May 24 04:26:42 imieeet ollama[1367]: [GIN] 2025/05/24 - 04:26:42 | 200 | 99.916225ms | 127.0.0.1 | POST "/api/show"
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.137+03:30 level=INFO source=sched.go:777 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e gpu=0 parallel=1 available=8556519424 required="3.7 GiB"
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.137+03:30 level=INFO source=server.go:135 msg="system memory" total="30.6 GiB" free="23.5 GiB" free_swap="8.0 GiB"
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.137+03:30 level=INFO source=server.go:168 msg=offload library=rocm layers.requested=10 layers.model=49 layers.offload=10 layers.split="" memory.available="[8.0 GiB]" memory.gpu_overhead="921.1 MiB" memory.required.full="10.1 GiB" memory.required.partial="3.7 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="8.0 GiB" memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="348.0 MiB" memory.graph.partial="916.1 MiB"
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.137+03:30 level=WARN source=server.go:222 msg="quantized kv cache requested but flash attention disabled" type=q4_0
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 0: general.architecture str = qwen2
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 1: general.type str = model
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 14B
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 4: general.size_label str = 14B
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 5: qwen2.block_count u32 = 48
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 13824
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 13: general.file_type u32 = 15
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - type f32: 241 tensors
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - type q4_K: 289 tensors
May 24 04:26:42 imieeet ollama[1367]: llama_model_loader: - type q6_K: 49 tensors
May 24 04:26:42 imieeet ollama[1367]: print_info: file format = GGUF V3 (latest)
May 24 04:26:42 imieeet ollama[1367]: print_info: file type = Q4_K - Medium
May 24 04:26:42 imieeet ollama[1367]: print_info: file size = 8.37 GiB (4.87 BPW)
May 24 04:26:42 imieeet ollama[1367]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
May 24 04:26:42 imieeet ollama[1367]: load: special tokens cache size = 22
May 24 04:26:42 imieeet ollama[1367]: load: token to piece cache size = 0.9310 MB
May 24 04:26:42 imieeet ollama[1367]: print_info: arch = qwen2
May 24 04:26:42 imieeet ollama[1367]: print_info: vocab_only = 1
May 24 04:26:42 imieeet ollama[1367]: print_info: model type = ?B
May 24 04:26:42 imieeet ollama[1367]: print_info: model params = 14.77 B
May 24 04:26:42 imieeet ollama[1367]: print_info: general.name = DeepSeek R1 Distill Qwen 14B
May 24 04:26:42 imieeet ollama[1367]: print_info: vocab type = BPE
May 24 04:26:42 imieeet ollama[1367]: print_info: n_vocab = 152064
May 24 04:26:42 imieeet ollama[1367]: print_info: n_merges = 151387
May 24 04:26:42 imieeet ollama[1367]: print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: EOS token = 151643 '<|end▁of▁sentence|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: EOT token = 151643 '<|end▁of▁sentence|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: PAD token = 151643 '<|end▁of▁sentence|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: LF token = 198 'Ċ'
May 24 04:26:42 imieeet ollama[1367]: print_info: FIM PRE token = 151659 '<|fim_prefix|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: FIM SUF token = 151661 '<|fim_suffix|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: FIM MID token = 151660 '<|fim_middle|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: FIM PAD token = 151662 '<|fim_pad|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: FIM REP token = 151663 '<|repo_name|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: FIM SEP token = 151664 '<|file_sep|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: EOG token = 151643 '<|end▁of▁sentence|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: EOG token = 151662 '<|fim_pad|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: EOG token = 151663 '<|repo_name|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: EOG token = 151664 '<|file_sep|>'
May 24 04:26:42 imieeet ollama[1367]: print_info: max token length = 256
May 24 04:26:42 imieeet ollama[1367]: llama_model_load: vocab only - skipping tensors
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.308+03:30 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e --ctx-size 4096 --batch-size 512 --n-gpu-layers 10 --threads 8 --parallel 1 --port 39195"
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.308+03:30 level=INFO source=sched.go:472 msg="loaded runners" count=1
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.308+03:30 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.309+03:30 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.318+03:30 level=INFO source=runner.go:815 msg="starting go runner"
May 24 04:26:42 imieeet ollama[1367]: load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
May 24 04:26:44 imieeet ollama[1367]: /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
May 24 04:26:44 imieeet ollama[1367]: /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
May 24 04:26:46 imieeet ollama[1367]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
May 24 04:26:46 imieeet ollama[1367]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 24 04:26:46 imieeet ollama[1367]: ggml_cuda_init: found 1 ROCm devices:
May 24 04:26:46 imieeet ollama[1367]: Device 0: AMD Radeon Graphics, gfx1030 (0x1030), VMM: no, Wave Size: 32
May 24 04:26:46 imieeet ollama[1367]: load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
May 24 04:26:46 imieeet ollama[1367]: time=2025-05-24T04:26:46.711+03:30 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 24 04:26:46 imieeet ollama[1367]: llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 8152 MiB free
May 24 04:26:46 imieeet ollama[1367]: time=2025-05-24T04:26:46.712+03:30 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:39195"
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 0: general.architecture str = qwen2
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 1: general.type str = model
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 14B
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 4: general.size_label str = 14B
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 5: qwen2.block_count u32 = 48
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 13824
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 13: general.file_type u32 = 15
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - type f32: 241 tensors
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - type q4_K: 289 tensors
May 24 04:26:46 imieeet ollama[1367]: llama_model_loader: - type q6_K: 49 tensors
May 24 04:26:46 imieeet ollama[1367]: print_info: file format = GGUF V3 (latest)
May 24 04:26:46 imieeet ollama[1367]: print_info: file type = Q4_K - Medium
May 24 04:26:46 imieeet ollama[1367]: print_info: file size = 8.37 GiB (4.87 BPW)
May 24 04:26:46 imieeet ollama[1367]: time=2025-05-24T04:26:46.825+03:30 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
May 24 04:26:46 imieeet ollama[1367]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
May 24 04:26:46 imieeet ollama[1367]: load: special tokens cache size = 22
May 24 04:26:46 imieeet ollama[1367]: load: token to piece cache size = 0.9310 MB
May 24 04:26:46 imieeet ollama[1367]: print_info: arch = qwen2
May 24 04:26:46 imieeet ollama[1367]: print_info: vocab_only = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: n_ctx_train = 131072
May 24 04:26:46 imieeet ollama[1367]: print_info: n_embd = 5120
May 24 04:26:46 imieeet ollama[1367]: print_info: n_layer = 48
May 24 04:26:46 imieeet ollama[1367]: print_info: n_head = 40
May 24 04:26:46 imieeet ollama[1367]: print_info: n_head_kv = 8
May 24 04:26:46 imieeet ollama[1367]: print_info: n_rot = 128
May 24 04:26:46 imieeet ollama[1367]: print_info: n_swa = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: n_swa_pattern = 1
May 24 04:26:46 imieeet ollama[1367]: print_info: n_embd_head_k = 128
May 24 04:26:46 imieeet ollama[1367]: print_info: n_embd_head_v = 128
May 24 04:26:46 imieeet ollama[1367]: print_info: n_gqa = 5
May 24 04:26:46 imieeet ollama[1367]: print_info: n_embd_k_gqa = 1024
May 24 04:26:46 imieeet ollama[1367]: print_info: n_embd_v_gqa = 1024
May 24 04:26:46 imieeet ollama[1367]: print_info: f_norm_eps = 0.0e+00
May 24 04:26:46 imieeet ollama[1367]: print_info: f_norm_rms_eps = 1.0e-05
May 24 04:26:46 imieeet ollama[1367]: print_info: f_clamp_kqv = 0.0e+00
May 24 04:26:46 imieeet ollama[1367]: print_info: f_max_alibi_bias = 0.0e+00
May 24 04:26:46 imieeet ollama[1367]: print_info: f_logit_scale = 0.0e+00
May 24 04:26:46 imieeet ollama[1367]: print_info: f_attn_scale = 0.0e+00
May 24 04:26:46 imieeet ollama[1367]: print_info: n_ff = 13824
May 24 04:26:46 imieeet ollama[1367]: print_info: n_expert = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: n_expert_used = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: causal attn = 1
May 24 04:26:46 imieeet ollama[1367]: print_info: pooling type = -1
May 24 04:26:46 imieeet ollama[1367]: print_info: rope type = 2
May 24 04:26:46 imieeet ollama[1367]: print_info: rope scaling = linear
May 24 04:26:46 imieeet ollama[1367]: print_info: freq_base_train = 1000000.0
May 24 04:26:46 imieeet ollama[1367]: print_info: freq_scale_train = 1
May 24 04:26:46 imieeet ollama[1367]: print_info: n_ctx_orig_yarn = 131072
May 24 04:26:46 imieeet ollama[1367]: print_info: rope_finetuned = unknown
May 24 04:26:46 imieeet ollama[1367]: print_info: ssm_d_conv = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: ssm_d_inner = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: ssm_d_state = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: ssm_dt_rank = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: ssm_dt_b_c_rms = 0
May 24 04:26:46 imieeet ollama[1367]: print_info: model type = 14B
May 24 04:26:46 imieeet ollama[1367]: print_info: model params = 14.77 B
May 24 04:26:46 imieeet ollama[1367]: print_info: general.name = DeepSeek R1 Distill Qwen 14B
May 24 04:26:46 imieeet ollama[1367]: print_info: vocab type = BPE
May 24 04:26:46 imieeet ollama[1367]: print_info: n_vocab = 152064
May 24 04:26:46 imieeet ollama[1367]: print_info: n_merges = 151387
May 24 04:26:46 imieeet ollama[1367]: print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: EOS token = 151643 '<|end▁of▁sentence|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: EOT token = 151643 '<|end▁of▁sentence|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: PAD token = 151643 '<|end▁of▁sentence|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: LF token = 198 'Ċ'
May 24 04:26:46 imieeet ollama[1367]: print_info: FIM PRE token = 151659 '<|fim_prefix|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: FIM SUF token = 151661 '<|fim_suffix|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: FIM MID token = 151660 '<|fim_middle|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: FIM PAD token = 151662 '<|fim_pad|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: FIM REP token = 151663 '<|repo_name|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: FIM SEP token = 151664 '<|file_sep|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: EOG token = 151643 '<|end▁of▁sentence|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: EOG token = 151662 '<|fim_pad|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: EOG token = 151663 '<|repo_name|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: EOG token = 151664 '<|file_sep|>'
May 24 04:26:46 imieeet ollama[1367]: print_info: max token length = 256
May 24 04:26:46 imieeet ollama[1367]: load_tensors: loading model tensors, this can take a while... (mmap = true)
May 24 04:26:51 imieeet ollama[1367]: alloc_tensor_range: failed to initialize tensor blk.38.attn_q.weight
May 24 04:26:51 imieeet ollama[1367]: llama_model_load: error loading model: unable to allocate ROCm0 buffer
May 24 04:26:51 imieeet ollama[1367]: llama_model_load_from_file_impl: failed to load model
May 24 04:26:51 imieeet ollama[1367]: panic: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 24 04:26:51 imieeet ollama[1367]: goroutine 36 [running]:
May 24 04:26:51 imieeet ollama[1367]: github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc00012e000, {0xa, 0x0, 0x1, {0x0, 0x0, 0x0}, 0xc0003d6190, 0x0}, {0x7ffcec3f3c25, ...}, ...)
May 24 04:26:51 imieeet ollama[1367]: github.com/ollama/ollama/runner/llamarunner/runner.go:751 +0x395
May 24 04:26:51 imieeet ollama[1367]: created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
May 24 04:26:51 imieeet ollama[1367]: github.com/ollama/ollama/runner/llamarunner/runner.go:848 +0xb57
May 24 04:26:51 imieeet ollama[1367]: time=2025-05-24T04:26:51.266+03:30 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
May 24 04:26:51 imieeet ollama[1367]: time=2025-05-24T04:26:51.342+03:30 level=ERROR source=sched.go:478 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate ROCm0 buffer\nllama_model_load_from_file_impl: failed to load model"
May 24 04:26:51 imieeet ollama[1367]: [GIN] 2025/05/24 - 04:26:51 | 500 | 9.241591566s | 127.0.0.1 | POST "/api/generate"
May 24 04:26:56 imieeet ollama[1367]: time=2025-05-24T04:26:56.344+03:30 level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001398221 runner.size="10.1 GiB" runner.vram="3.7 GiB" runner.parallel=1 runner.pid=41272 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 24 04:26:56 imieeet ollama[1367]: time=2025-05-24T04:26:56.593+03:30 level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250604741 runner.size="10.1 GiB" runner.vram="3.7 GiB" runner.parallel=1 runner.pid=41272 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 24 04:26:56 imieeet ollama[1367]: time=2025-05-24T04:26:56.843+03:30 level=WARN source=sched.go:676 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5007566919999995 runner.size="10.1 GiB" runner.vram="3.7 GiB" runner.parallel=1 runner.pid=41272 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e

I tried loading as few as 10 layers here, and, as @apt-install-coffee suggested, I also set the overhead higher than memory.required.kv.
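For context: the "overhead" referred to here is ollama's OLLAMA_GPU_OVERHEAD environment variable, which shows up as memory.gpu_overhead="921.1 MiB" in the log above. A minimal sketch of raising it on a systemd install, assuming the service is named ollama.service as these journal logs suggest:

```
# Add an environment override for the ollama systemd service.
sudo systemctl edit ollama.service

# In the drop-in, reserve extra VRAM on top of ollama's own estimate.
# The value is in bytes; 1 GiB here is an illustrative choice, not a
# value taken from this thread.
[Service]
Environment="OLLAMA_GPU_OVERHEAD=1073741824"

# Apply the change:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```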


@IMIEEET commented on GitHub (May 29, 2025):

I tried to track down where this became an issue: I can run deepseek-r1:14b on my 8 GB VRAM GPU plus system RAM on ollama 0.6.5, but this error appears in 0.6.6. I tested later versions, including 0.8.0, and got the same error.
I can't even run the new deepseek-r1:8b (based on qwen3) on 0.8.0, even though it's smaller than my VRAM.
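
One way to see whether actual VRAM usage matches ollama's estimates during a failed load is to watch GPU memory in a second terminal while the model loads. A minimal sketch, assuming the ROCm tools are installed:

```
# Refresh AMD GPU VRAM usage every second while ollama loads the model
watch -n 1 rocm-smi --showmeminfo vram
```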


@rick-github commented on GitHub (May 30, 2025):

```
May 24 04:26:42 imieeet ollama[1367]: time=2025-05-24T04:26:42.137+03:30 level=INFO source=server.go:168 msg=offload
 library=rocm layers.requested=10 layers.model=49 layers.offload=10 layers.split="" memory.available="[8.0 GiB]"
 memory.gpu_overhead="921.1 MiB" memory.required.full="10.1 GiB" memory.required.partial="3.7 GiB"
 memory.required.kv="768.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="8.0 GiB"
 memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="348.0 MiB"
 memory.graph.partial="916.1 MiB"
```

ollama offloaded 10 of 49 layers, estimating a usage of 3.7 GiB. The allocation failed in `blk.38.attn_q.weight`, which is in the tenth layer, so it looks like it nearly fit. It's not clear to me why the OOM occurred, though - according to the logs, the GPU has 8 GiB free, so there should be plenty (8 - 3.7 = 4.3 GiB) of spare room. Perhaps this is a ROCm thing? What device do you have?

0.6.6 and later have changes to memory estimation designed to minimize OOM issues, but it appears to have made the situation worse for you. What happens if you set `num_gpu` to 9?
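
For a quick test without editing a Modelfile, `num_gpu` can also be passed per request through ollama's REST API options (a sketch; the prompt is just a placeholder):

```
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "hello",
  "options": { "num_gpu": 9 }
}'
```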


@IMIEEET commented on GitHub (May 31, 2025):

You're right, it doesn't make sense. I also wondered whether updating ROCm or the kernel in the meantime was the problem, but yesterday I decided to run the same model with llama.cpp. It ran on the ROCm backend with almost as many layers as my 8 GB could handle, so it must be ollama.
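
For reference, a llama.cpp comparison run like the one described above would look roughly like this (a sketch: the GGUF filename and layer count are illustrative, and `llama-cli` must be built with the HIP/ROCm backend):

```
# Offload most layers to the gfx1030 GPU; lower -ngl if 8 GB is exceeded
./llama-cli -m DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -ngl 40 -c 8192 -p "hello"
```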
I set `num_gpu` to 9 with this Modelfile:

```
FROM deepseek-r1:14b
PARAMETER num_gpu 9
```
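
For anyone reproducing this, a Modelfile like the above is applied with `ollama create` (the model name below is just a placeholder):

```
ollama create deepseek-r1-gpu9 -f Modelfile
ollama run deepseek-r1-gpu9
```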

Still the same error on ollama 0.8.0. Logs:

May 31 14:48:02 imieeet ollama[1368]: [GIN] 2025/05/31 - 14:48:02 | 200 | 23.814µs | 127.0.0.1 | HEAD "/"
May 31 14:48:02 imieeet ollama[1368]: [GIN] 2025/05/31 - 14:48:02 | 200 | 24.946351ms | 127.0.0.1 | POST "/api/show"
May 31 14:48:02 imieeet ollama[1368]: time=2025-05-31T14:48:02.135+03:30 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e gpu=0 parallel=2 available=8556507136 required="3.2 GiB"
May 31 14:48:02 imieeet ollama[1368]: time=2025-05-31T14:48:02.136+03:30 level=INFO source=server.go:135 msg="system memory" total="30.6 GiB" free="23.5 GiB" free_swap="8.0 GiB"
May 31 14:48:02 imieeet ollama[1368]: time=2025-05-31T14:48:02.136+03:30 level=INFO source=server.go:168 msg=offload library=rocm layers.requested=9 layers.model=49 layers.offload=9 layers.split="" memory.available="[8.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="11.0 GiB" memory.required.partial="3.2 GiB" memory.required.kv="1.5 GiB" memory.required.allocations="[3.2 GiB]" memory.weights.total="8.0 GiB" memory.weights.repeating="7.4 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 0: general.architecture str = qwen2
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 1: general.type str = model
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 14B
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 4: general.size_label str = 14B
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 5: qwen2.block_count u32 = 48
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 13824
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 13: general.file_type u32 = 15
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - type f32: 241 tensors
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - type q4_K: 289 tensors
May 31 14:48:02 imieeet ollama[1368]: llama_model_loader: - type q6_K: 49 tensors
May 31 14:48:02 imieeet ollama[1368]: print_info: file format = GGUF V3 (latest)
May 31 14:48:02 imieeet ollama[1368]: print_info: file type = Q4_K - Medium
May 31 14:48:02 imieeet ollama[1368]: print_info: file size = 8.37 GiB (4.87 BPW)
May 31 14:48:02 imieeet ollama[1368]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
May 31 14:48:02 imieeet ollama[1368]: load: special tokens cache size = 22
May 31 14:48:02 imieeet ollama[1368]: load: token to piece cache size = 0.9310 MB
May 31 14:48:02 imieeet ollama[1368]: print_info: arch = qwen2
May 31 14:48:02 imieeet ollama[1368]: print_info: vocab_only = 1
May 31 14:48:02 imieeet ollama[1368]: print_info: model type = ?B
May 31 14:48:02 imieeet ollama[1368]: print_info: model params = 14.77 B
May 31 14:48:02 imieeet ollama[1368]: print_info: general.name = DeepSeek R1 Distill Qwen 14B
May 31 14:48:02 imieeet ollama[1368]: print_info: vocab type = BPE
May 31 14:48:02 imieeet ollama[1368]: print_info: n_vocab = 152064
May 31 14:48:02 imieeet ollama[1368]: print_info: n_merges = 151387
May 31 14:48:02 imieeet ollama[1368]: print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: EOS token = 151643 '<|end▁of▁sentence|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: EOT token = 151643 '<|end▁of▁sentence|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: PAD token = 151643 '<|end▁of▁sentence|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: LF token = 198 'Ċ'
May 31 14:48:02 imieeet ollama[1368]: print_info: FIM PRE token = 151659 '<|fim_prefix|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: FIM SUF token = 151661 '<|fim_suffix|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: FIM MID token = 151660 '<|fim_middle|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: FIM PAD token = 151662 '<|fim_pad|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: FIM REP token = 151663 '<|repo_name|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: FIM SEP token = 151664 '<|file_sep|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: EOG token = 151643 '<|end▁of▁sentence|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: EOG token = 151662 '<|fim_pad|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: EOG token = 151663 '<|repo_name|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: EOG token = 151664 '<|file_sep|>'
May 31 14:48:02 imieeet ollama[1368]: print_info: max token length = 256
May 31 14:48:02 imieeet ollama[1368]: llama_model_load: vocab only - skipping tensors
May 31 14:48:02 imieeet ollama[1368]: time=2025-05-31T14:48:02.294+03:30 level=INFO source=server.go:431 msg="starting llama server" cmd="/usr/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e --ctx-size 8192 --batch-size 512 --n-gpu-layers 9 --threads 8 --parallel 2 --port 46793"
May 31 14:48:02 imieeet ollama[1368]: time=2025-05-31T14:48:02.295+03:30 level=INFO source=sched.go:483 msg="loaded runners" count=1
May 31 14:48:02 imieeet ollama[1368]: time=2025-05-31T14:48:02.295+03:30 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
May 31 14:48:02 imieeet ollama[1368]: time=2025-05-31T14:48:02.295+03:30 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
May 31 14:48:02 imieeet ollama[1368]: time=2025-05-31T14:48:02.305+03:30 level=INFO source=runner.go:815 msg="starting go runner"
May 31 14:48:02 imieeet ollama[1368]: load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
May 31 14:48:02 imieeet ollama[1368]: /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
May 31 14:48:02 imieeet ollama[1368]: /opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
May 31 14:48:03 imieeet ollama[1368]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
May 31 14:48:03 imieeet ollama[1368]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
May 31 14:48:03 imieeet ollama[1368]: ggml_cuda_init: found 1 ROCm devices:
May 31 14:48:03 imieeet ollama[1368]: Device 0: AMD Radeon Graphics, gfx1030 (0x1030), VMM: no, Wave Size: 32
May 31 14:48:03 imieeet ollama[1368]: load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
May 31 14:48:03 imieeet ollama[1368]: time=2025-05-31T14:48:03.780+03:30 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
May 31 14:48:03 imieeet ollama[1368]: llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 8152 MiB free
May 31 14:48:03 imieeet ollama[1368]: time=2025-05-31T14:48:03.780+03:30 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:46793"
May 31 14:48:03 imieeet ollama[1368]: time=2025-05-31T14:48:03.802+03:30 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: loaded meta data with 26 key-value pairs and 579 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e (version GGUF V3 (latest))
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 0: general.architecture str = qwen2
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 1: general.type str = model
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 14B
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 4: general.size_label str = 14B
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 5: qwen2.block_count u32 = 48
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 13824
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 13: general.file_type u32 = 15
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 15: tokenizer.ggml.pre str = qwen2
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - type f32: 241 tensors
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - type q4_K: 289 tensors
May 31 14:48:03 imieeet ollama[1368]: llama_model_loader: - type q6_K: 49 tensors
May 31 14:48:03 imieeet ollama[1368]: print_info: file format = GGUF V3 (latest)
May 31 14:48:03 imieeet ollama[1368]: print_info: file type = Q4_K - Medium
May 31 14:48:03 imieeet ollama[1368]: print_info: file size = 8.37 GiB (4.87 BPW)
May 31 14:48:03 imieeet ollama[1368]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
May 31 14:48:03 imieeet ollama[1368]: load: special tokens cache size = 22
May 31 14:48:03 imieeet ollama[1368]: load: token to piece cache size = 0.9310 MB
May 31 14:48:03 imieeet ollama[1368]: print_info: arch = qwen2
May 31 14:48:03 imieeet ollama[1368]: print_info: vocab_only = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: n_ctx_train = 131072
May 31 14:48:03 imieeet ollama[1368]: print_info: n_embd = 5120
May 31 14:48:03 imieeet ollama[1368]: print_info: n_layer = 48
May 31 14:48:03 imieeet ollama[1368]: print_info: n_head = 40
May 31 14:48:03 imieeet ollama[1368]: print_info: n_head_kv = 8
May 31 14:48:03 imieeet ollama[1368]: print_info: n_rot = 128
May 31 14:48:03 imieeet ollama[1368]: print_info: n_swa = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: n_swa_pattern = 1
May 31 14:48:03 imieeet ollama[1368]: print_info: n_embd_head_k = 128
May 31 14:48:03 imieeet ollama[1368]: print_info: n_embd_head_v = 128
May 31 14:48:03 imieeet ollama[1368]: print_info: n_gqa = 5
May 31 14:48:03 imieeet ollama[1368]: print_info: n_embd_k_gqa = 1024
May 31 14:48:03 imieeet ollama[1368]: print_info: n_embd_v_gqa = 1024
May 31 14:48:03 imieeet ollama[1368]: print_info: f_norm_eps = 0.0e+00
May 31 14:48:03 imieeet ollama[1368]: print_info: f_norm_rms_eps = 1.0e-05
May 31 14:48:03 imieeet ollama[1368]: print_info: f_clamp_kqv = 0.0e+00
May 31 14:48:03 imieeet ollama[1368]: print_info: f_max_alibi_bias = 0.0e+00
May 31 14:48:03 imieeet ollama[1368]: print_info: f_logit_scale = 0.0e+00
May 31 14:48:03 imieeet ollama[1368]: print_info: f_attn_scale = 0.0e+00
May 31 14:48:03 imieeet ollama[1368]: print_info: n_ff = 13824
May 31 14:48:03 imieeet ollama[1368]: print_info: n_expert = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: n_expert_used = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: causal attn = 1
May 31 14:48:03 imieeet ollama[1368]: print_info: pooling type = -1
May 31 14:48:03 imieeet ollama[1368]: print_info: rope type = 2
May 31 14:48:03 imieeet ollama[1368]: print_info: rope scaling = linear
May 31 14:48:03 imieeet ollama[1368]: print_info: freq_base_train = 1000000.0
May 31 14:48:03 imieeet ollama[1368]: print_info: freq_scale_train = 1
May 31 14:48:03 imieeet ollama[1368]: print_info: n_ctx_orig_yarn = 131072
May 31 14:48:03 imieeet ollama[1368]: print_info: rope_finetuned = unknown
May 31 14:48:03 imieeet ollama[1368]: print_info: ssm_d_conv = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: ssm_d_inner = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: ssm_d_state = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: ssm_dt_rank = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: ssm_dt_b_c_rms = 0
May 31 14:48:03 imieeet ollama[1368]: print_info: model type = 14B
May 31 14:48:03 imieeet ollama[1368]: print_info: model params = 14.77 B
May 31 14:48:03 imieeet ollama[1368]: print_info: general.name = DeepSeek R1 Distill Qwen 14B
May 31 14:48:03 imieeet ollama[1368]: print_info: vocab type = BPE
May 31 14:48:03 imieeet ollama[1368]: print_info: n_vocab = 152064
May 31 14:48:03 imieeet ollama[1368]: print_info: n_merges = 151387
May 31 14:48:03 imieeet ollama[1368]: print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: EOS token = 151643 '<|end▁of▁sentence|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: EOT token = 151643 '<|end▁of▁sentence|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: PAD token = 151643 '<|end▁of▁sentence|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: LF token = 198 'Ċ'
May 31 14:48:03 imieeet ollama[1368]: print_info: FIM PRE token = 151659 '<|fim_prefix|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: FIM SUF token = 151661 '<|fim_suffix|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: FIM MID token = 151660 '<|fim_middle|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: FIM PAD token = 151662 '<|fim_pad|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: FIM REP token = 151663 '<|repo_name|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: FIM SEP token = 151664 '<|file_sep|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: EOG token = 151643 '<|end▁of▁sentence|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: EOG token = 151662 '<|fim_pad|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: EOG token = 151663 '<|repo_name|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: EOG token = 151664 '<|file_sep|>'
May 31 14:48:03 imieeet ollama[1368]: print_info: max token length = 256
May 31 14:48:03 imieeet ollama[1368]: load_tensors: loading model tensors, this can take a while... (mmap = true)
May 31 14:48:04 imieeet ollama[1368]: alloc_tensor_range: failed to initialize tensor blk.39.attn_q.weight
May 31 14:48:04 imieeet ollama[1368]: llama_model_load: error loading model: unable to allocate ROCm0 buffer
May 31 14:48:04 imieeet ollama[1368]: llama_model_load_from_file_impl: failed to load model
May 31 14:48:04 imieeet ollama[1368]: panic: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 31 14:48:04 imieeet ollama[1368]: goroutine 14 [running]:
May 31 14:48:04 imieeet ollama[1368]: github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc0004c0360, {0x9, 0x0, 0x1, {0x0, 0x0, 0x0}, 0xc0005a3740, 0x0}, {0x7ffe701a8c75, ...}, ...)
May 31 14:48:04 imieeet ollama[1368]: github.com/ollama/ollama/runner/llamarunner/runner.go:751 +0x395
May 31 14:48:04 imieeet ollama[1368]: created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
May 31 14:48:04 imieeet ollama[1368]: github.com/ollama/ollama/runner/llamarunner/runner.go:848 +0xb57
May 31 14:48:04 imieeet ollama[1368]: time=2025-05-31T14:48:04.787+03:30 level=ERROR source=server.go:457 msg="llama runner terminated" error="exit status 2"
May 31 14:48:04 imieeet ollama[1368]: time=2025-05-31T14:48:04.805+03:30 level=ERROR source=sched.go:489 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate ROCm0 buffer\nllama_model_load_from_file_impl: failed to load model"
May 31 14:48:04 imieeet ollama[1368]: [GIN] 2025/05/31 - 14:48:04 | 500 | 2.70692255s | 127.0.0.1 | POST "/api/generate"
May 31 14:48:09 imieeet ollama[1368]: time=2025-05-31T14:48:09.805+03:30 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.00039831 runner.size="11.0 GiB" runner.vram="3.2 GiB" runner.parallel=2 runner.pid=24230 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 31 14:48:10 imieeet ollama[1368]: time=2025-05-31T14:48:10.055+03:30 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.250212572 runner.size="11.0 GiB" runner.vram="3.2 GiB" runner.parallel=2 runner.pid=24230 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e
May 31 14:48:10 imieeet ollama[1368]: time=2025-05-31T14:48:10.305+03:30 level=WARN source=sched.go:687 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5009139959999995 runner.size="11.0 GiB" runner.vram="3.2 GiB" runner.parallel=2 runner.pid=24230 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e

Reference: github-starred/ollama#69142