[GH-ISSUE #8843] Ollama current/stable (or built from source) appears broken on AMD MI300x ROCm gfx942 #67786

Open
opened 2026-05-04 11:40:37 -05:00 by GiteaMirror · 30 comments
Owner

Originally created by @zebrax0r on GitHub (Feb 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8843

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I have found an issue involving the current (stable) ollama, AMD ROCm 6.3.1 and AMD gfx942 (MI300x).

I have tested with multiple operating systems, including Rocky 8 and current Ubuntu Server (24.x.x).

Setup:

Linux host 4.18.0-553.8.1.el8_10.x86_64 #1 SMP Tue Jul 2 17:10:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Rocky Linux release 8.10 (Green Obsidian)

me@host:~$ rocm-smi

============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
==========================================================================================================================
0       20    0x74a1,   28851  42.0°C      123.0W    NPS1, SPX, 0        131Mhz  900Mhz  0%   auto  750.0W  0%     0%
1       17    0x74a1,   43178  40.0°C      121.0W    NPS1, SPX, 0        131Mhz  900Mhz  0%   auto  750.0W  0%     0%
2       16    0x74a1,   32898  44.0°C      121.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
3       19    0x74a1,   22683  39.0°C      119.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
4       22    0x74a1,   53458  42.0°C      126.0W    NPS1, SPX, 0        131Mhz  900Mhz  0%   auto  750.0W  0%     0%
5       23    0x74a1,   2251   39.0°C      119.0W    NPS1, SPX, 0        131Mhz  900Mhz  0%   auto  750.0W  0%     0%
6       21    0x74a1,   8419   42.0°C      123.0W    NPS1, SPX, 0        131Mhz  900Mhz  0%   auto  750.0W  0%     0%
7       18    0x74a1,   63738  39.0°C      120.0W    NPS1, SPX, 0        132Mhz  900Mhz  0%   auto  750.0W  0%     0%
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================

Notes:

  • Bare metal. No hypervisors or containers here.
  • Running SLURM to control the node. cgroups have been checked for issues (roughly the checks sketched below) - no apparent constraints.

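Roughly the sort of checks run (illustrative commands only; the exact cgroup layout differs between v1 and v2, and assumes the SLURM node name matches the hostname):

me@host:~$ cat /proc/self/cgroup                          # which cgroup the SLURM job is running under
me@host:~$ scontrol show node $(hostname) | grep -i gres  # the GPUs/gres SLURM believes this node has
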
Whenever any model is run, ollama now outputs $$$$$$$$ or GGGGGGGG or 4444444 and so forth. I believe this was not the case in previous versions of ollama, and I have verified the same behaviour in llama.cpp with other front ends/back ends. We do not observe this behaviour with NVIDIA GPUs.

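For completeness, the same thing reproduces straight against the HTTP API (illustrative request; any model and any prompt behaves the same):

me@host:~$ curl -s http://127.0.0.1:11434/api/generate \
    -d '{"model": "deepseek-r1:32b", "prompt": "Why is the sky blue?", "stream": false}'
# the "response" field comes back as a long run of repeated characters instead of text
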
ollama appears to start up correctly...

me@host:~$ ollama serve
2025/02/05 20:34:17 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1,2,3,4,5,6,7 GPU_DEVICE_ORDINAL:0,1,2,3,4,5,6,7 HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/me/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:0,1,2,3,4,5,6,7 http_proxy: https_proxy: no_proxy:]"
time=2025-02-05T20:34:17.893+10:00 level=INFO source=images.go:432 msg="total blobs: 30"
time=2025-02-05T20:34:17.894+10:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:	export GIN_MODE=release
 - using code:	gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2025-02-05T20:34:17.895+10:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.7)"
time=2025-02-05T20:34:17.897+10:00 level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]"
time=2025-02-05T20:34:17.898+10:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
time=2025-02-05T20:34:17.924+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-bdc863557120278e gpu_type=gfx942
time=2025-02-05T20:34:17.926+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-7726b37bb420e44d gpu_type=gfx942
time=2025-02-05T20:34:17.928+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-0518a62b7fc63c75 gpu_type=gfx942
time=2025-02-05T20:34:17.930+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-00f0af4bd21e3c60 gpu_type=gfx942
time=2025-02-05T20:34:17.931+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-60617e0e49ca7200 gpu_type=gfx942
time=2025-02-05T20:34:17.933+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-d534c0b85b7c20e8 gpu_type=gfx942
time=2025-02-05T20:34:17.935+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-cd30cff04bf7d13f gpu_type=gfx942
time=2025-02-05T20:34:17.937+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-17cf8ca10dd02af6 gpu_type=gfx942
time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-bdc863557120278e library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB"
time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7726b37bb420e44d library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB"
time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-0518a62b7fc63c75 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB"
time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-00f0af4bd21e3c60 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB"
time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-60617e0e49ca7200 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB"
time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-d534c0b85b7c20e8 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB"
time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-cd30cff04bf7d13f library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB"
time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-17cf8ca10dd02af6 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB"

...and we even appear to be able to load a model:

[GIN] 2025/02/05 - 20:35:42 | 200 | 169.804µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/05 - 20:35:42 | 200 | 1.79409ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/02/05 - 20:35:52 | 200 | 21.853µs | 127.0.0.1 | HEAD "/"
[GIN] 2025/02/05 - 20:35:52 | 200 | 29.667829ms | 127.0.0.1 | POST "/api/show"
time=2025-02-05T20:35:52.772+10:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/home/me/.ollama/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 gpu=GPU-bdc863557120278e parallel=4 available=205843881984 required="21.5 GiB"
time=2025-02-05T20:35:52.772+10:00 level=INFO source=server.go:104 msg="system memory" total="2267.0 GiB" free="2155.1 GiB" free_swap="4.0 GiB"
time=2025-02-05T20:35:52.773+10:00 level=INFO source=memory.go:356 msg="offload to rocm" layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[191.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-02-05T20:35:52.774+10:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/scratch/project_mnt/S0156/ollama/lib/ollama/runners/rocm_avx/ollama_llama_server runner --model /home/me/.ollama/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --threads 128 --parallel 4 --port 43799"
time=2025-02-05T20:35:52.779+10:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2025-02-05T20:35:52.779+10:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2025-02-05T20:35:52.783+10:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2025-02-05T20:35:52.837+10:00 level=INFO source=runner.go:936 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: , compute capability 9.4, VMM: no
time=2025-02-05T20:35:56.249+10:00 level=INFO source=runner.go:937 msg=system info="ROCm : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=128
time=2025-02-05T20:35:56.249+10:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:43799"
llama_load_model_from_file: using device ROCm0 () - 195148 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 771 tensors from /home/me/.ollama/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 4: general.size_label str = 32B
llama_model_loader: - kv 5: qwen2.block_count u32 = 64
llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 13: general.file_type u32 = 15
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
time=2025-02-05T20:35:56.328+10:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 5
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 27648
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 32B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 32.76 B
llm_load_print_meta: model size = 18.48 GiB (4.85 BPW)
llm_load_print_meta: general.name = DeepSeek R1 Distill Qwen 32B
llm_load_print_meta: BOS token = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256

...but if we try to interact with the model:

me@host:~$ ollama run deepseek-r1:32b
>>> Tell me about why the sky is blue
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

>>> Send a message (/? for help)

Splat...

llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size =   417.66 MiB
llm_load_tensors:        ROCm0 model buffer size = 18508.35 MiB
llama_new_context_with_model: n_seq_max     = 4
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     2.40 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   696.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 2
time=2025-02-05T20:36:24.638+10:00 level=INFO source=server.go:594 msg="llama runner started in 31.86 seconds"
[GIN] 2025/02/05 - 20:36:24 | 200 | 31.911820747s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/02/05 - 20:37:22 | 200 |  1.082842864s |       127.0.0.1 | POST     "/api/chat"

I've tried various models, recompiles, different ROCm versions (6.3.0, and will test soon with 6.3.2 as well), different OSes and different BIOS/UEFI-level settings, all to no avail. Different iommu=pt variants at boot time too - still the same behaviour.

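For reference, the boot-time change was of this general shape (illustrative; the exact parameter combinations varied between attempts):

me@host:~$ sudo grubby --update-kernel=ALL --args="iommu=pt"   # also tried amd_iommu variants
me@host:~$ sudo reboot
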
The only clue I've got is in the kernel buffer output, and it may be spurious or an unhelpful lead:

[547815.107451] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.114438] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.127493] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.134476] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.150316] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.157324] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.170387] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.177364] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.185523] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.192492] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.199741] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.206709] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.214083] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.221056] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.229336] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.236308] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.249435] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.256437] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.272280] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.279310] amdgpu: init_user_pages: Failed to get user pages: -1

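If it's relevant: init_user_pages presumably means the driver failed to pin user pages, so one guess (purely speculative on my part) is to compare the locked-memory limit inside and outside the SLURM allocation:

me@host:~$ ulimit -l                   # locked-memory limit in the login shell
me@host:~$ srun bash -c 'ulimit -l'    # the same limit inside a SLURM step (run from within an allocation)
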
I turned AMD_LOG_LEVEL right up so we could see all of the HIP backend status too as we loaded a model:
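The invocation was roughly the following (the exact level may have differed; AMD_LOG_LEVEL is the HIP runtime's logging verbosity knob):

me@host:~$ AMD_LOG_LEVEL=3 ollama serve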

llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
:3:hip_device_runtime.cpp   :634 : 550213171495d us:   hipGetDevice ( 0x7ffca34532ec )
:3:hip_device_runtime.cpp   :642 : 550213171503d us:  hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp   :634 : 550213171506d us:   hipGetDevice ( 0x7ffca34532ec )
:3:hip_device_runtime.cpp   :642 : 550213171508d us:  hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp           :674 : 550213171514d us:   hipMalloc ( 0x7ffca3453368, 2147483648 )
:3:rocdevice.cpp            :2425: 550213171870d us:  Device=0x561a2c38bd80, freeMem_ = 0x2afa3a6000
:3:hip_memory.cpp           :676 : 550213171878d us:  hipMalloc: Returned hipSuccess : 0x7faf1a800000: duration: 364d us
:3:hip_device_runtime.cpp   :634 : 550213171888d us:   hipGetDevice ( 0x7ffca345345c )
:3:hip_device_runtime.cpp   :642 : 550213171890d us:  hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp   :618 : 550213171892d us:   hipDeviceSynchronize (  )
:3:hip_device_runtime.cpp   :622 : 550213171901d us:  hipDeviceSynchronize: Returned hipSuccess :
:3:hip_memory.cpp           :3281: 550213171910d us:   hipMemset ( 0x7faf1a800000, 0, 2147483648 )
:3:rocdevice.cpp            :3064: 550213171922d us:  Number of allocated hardware queues with low priority: 0, with normal priority: 1, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :3140: 550213193990d us:  Created SWq=0x7fb8199ec000 to map on HWq=0x7faf1a600000 with size 16384 with priority 1, cooperative: 0
:3:rocdevice.cpp            :3232: 550213194033d us:  acquireQueue refCount: 0x7faf1a600000 (1)
:3:rocvirtual.cpp           :731 : 550213194253d us:  Arg0: void* buf = ptr:0x7faf1a800000 obj:[0x7faf1a800000-0x7faf9a800000]
:3:rocvirtual.cpp           :807 : 550213194257d us:  Arg2: uint pattern_size = val:1
:3:rocvirtual.cpp           :807 : 550213194259d us:  Arg3: uint alignment = val:16
:3:rocvirtual.cpp           :807 : 550213194261d us:  Arg4: ulong end_ptr = val:140392188084224
:3:rocvirtual.cpp           :807 : 550213194263d us:  Arg5: uint next_chunk = val:77824
:3:rocvirtual.cpp           :3056: 550213194266d us:  ShaderName : __amd_rocclr_fillBufferAligned
:3:hip_memory.cpp           :3282: 550213194275d us:  hipMemset: Returned hipSuccess :
:3:hip_device_runtime.cpp   :618 : 550213194279d us:   hipDeviceSynchronize (  )
:3:rocvirtual.cpp           :480 : 550213194286d us:  Set Handler: handle(0x7fb81abfd280), timestamp(0x561a69bed8d0)
:3:rocvirtual.hpp           :67  : 550213194288d us:  Host active wait for Signal = (0x7fb81abfd280) for -1 ns
:3:hip_device_runtime.cpp   :622 : 550213194683d us:  hipDeviceSynchronize: Returned hipSuccess :
llama_kv_cache_init:      ROCm0 KV buffer size =  2048.00 MiB
:3:rocvirtual.cpp           :227 : 550213194708d us:  Handler: value(0), timestamp(0x561a6ab11860), handle(0x7fb81abfd280)
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
:3:hip_memory.cpp           :680 : 550213194755d us:   hipHostMalloc ( 0x7ffca3453550, 2514944, 0 )
:3:hip_memory.cpp           :686 : 550213195021d us:  hipHostMalloc: Returned hipSuccess : 0x7faf0f000000: duration: 266d us
llama_new_context_with_model:  ROCm_Host  output buffer size =     2.40 MiB
:3:hip_device_runtime.cpp   :634 : 550213196052d us:   hipGetDevice ( 0x7ffca345354c )
:3:hip_device_runtime.cpp   :642 : 550213196056d us:  hipGetDevice: Returned hipSuccess :
:3:hip_stream.cpp           :256 : 550213196063d us:   hipStreamCreateWithFlags ( 0x561a6edfe6a8, 1 )
:3:rocdevice.cpp            :3064: 550213196068d us:  Number of allocated hardware queues with low priority: 0, with normal priority: 2, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp            :3140: 550213216828d us:  Created SWq=0x7fb8199be000 to map on HWq=0x7faf0ba00000 with size 16384 with priority 1, cooperative: 0
:3:rocdevice.cpp            :3232: 550213216838d us:  acquireQueue refCount: 0x7faf0ba00000 (1)
:3:hip_stream.cpp           :262 : 550213217077d us:  hipStreamCreateWithFlags: Returned hipSuccess : stream:0x561a6e594ea0
:3:hip_stream.cpp           :374 : 550213217084d us:   hipStreamSynchronize ( stream:0x561a6e594ea0 )
:3:hip_stream.cpp           :375 : 550213217090d us:  hipStreamSynchronize: Returned hipSuccess :
:3:hip_device_runtime.cpp   :634 : 550213217409d us:   hipGetDevice ( 0x7ffca345344c )
:3:hip_device_runtime.cpp   :642 : 550213217412d us:  hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp   :634 : 550213217414d us:   hipGetDevice ( 0x7ffca345344c )
:3:hip_device_runtime.cpp   :642 : 550213217416d us:  hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp           :674 : 550213217420d us:   hipMalloc ( 0x7ffca34534c8, 729812992 )
:3:rocdevice.cpp            :2425: 550213217482d us:  Device=0x561a2c38bd80, freeMem_ = 0x2aceba5000
:3:hip_memory.cpp           :676 : 550213217487d us:  hipMalloc: Returned hipSuccess : 0x7faed4800000: duration: 67d us
:3:hip_memory.cpp           :680 : 550213217496d us:   hipHostMalloc ( 0x7ffca34534d0, 27269120, 0 )
:3:hip_memory.cpp           :686 : 550213220223d us:  hipHostMalloc: Returned hipSuccess : 0x7faed2c00000: duration: 2727d us
:3:hip_stream.cpp           :374 : 550213220720d us:   hipStreamSynchronize ( stream:0x561a6e594ea0 )
:3:hip_stream.cpp           :375 : 550213220725d us:  hipStreamSynchronize: Returned hipSuccess :
:3:hip_stream.cpp           :374 : 550213221356d us:   hipStreamSynchronize ( stream:0x561a6e594ea0 )
:3:hip_stream.cpp           :375 : 550213221360d us:  hipStreamSynchronize: Returned hipSuccess :
llama_new_context_with_model:      ROCm0 compute buffer size =   696.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 2
time=2025-02-05T21:16:23.691+10:00 level=INFO source=server.go:594 msg="llama runner started in 16.26 seconds"
[GIN] 2025/02/05 - 21:16:23 | 200 | 16.309228075s |       127.0.0.1 | POST     "/api/generate"

And the trace output while watching the "GGGGGGGGG" slowly show up:

:3:rocvirtual.cpp           :731 : 550318994888d us:  Arg0:   = ptr:0x7faed4800000 obj:[0x7faed4800000-0x7faf00001000]
:3:rocvirtual.cpp           :731 : 550318994890d us:  Arg1:   = ptr:0x7fa676800000 obj:[0x7fa676800000-0x7faafb45a000]
:3:rocvirtual.cpp           :731 : 550318994891d us:  Arg2:   = ptr:0x7faed4800000 obj:[0x7faed4800000-0x7faf00001000]
:3:rocvirtual.cpp           :807 : 550318994893d us:  Arg3:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994894d us:  Arg4:   = val:1
:3:rocvirtual.cpp           :807 : 550318994896d us:  Arg5:   = val:1
:3:rocvirtual.cpp           :807 : 550318994897d us:  Arg6:   = val:1
:3:rocvirtual.cpp           :807 : 550318994899d us:  Arg7:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994900d us:  Arg8:   = val:1
:3:rocvirtual.cpp           :807 : 550318994902d us:  Arg9:   = val:1
:3:rocvirtual.cpp           :807 : 550318994903d us:  Arg10:   = val:1
:3:rocvirtual.cpp           :807 : 550318994905d us:  Arg11:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994906d us:  Arg12:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994908d us:  Arg13:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994909d us:  Arg14:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994911d us:  Arg15:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994912d us:  Arg16:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994914d us:  Arg17:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994915d us:  Arg18:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994917d us:  Arg19:   = val:5120
:3:rocvirtual.cpp           :3056: 550318994918d us:  ShaderName : _ZL11k_bin_bcastIXadL_ZL6op_mulffEEfffEvPKT0_PKT1_PT2_iiiiiiiiiiiiiiiii
:3:hip_module.cpp           :687 : 550318994922d us:  hipLaunchKernel: Returned hipSuccess :
:3:hip_error.cpp            :36  : 550318994924d us:   hipGetLastError (  )
:3:hip_device_runtime.cpp   :634 : 550318994926d us:   hipGetDevice ( 0x7fb818ff5bac )
:3:hip_device_runtime.cpp   :642 : 550318994928d us:  hipGetDevice: Returned hipSuccess :
:3:hip_platform.cpp         :230 : 550318994930d us:   __hipPushCallConfiguration ( {20,1,1}, {256,1,1}, 0, stream:0x561a6e594ea0 )
:3:hip_platform.cpp         :234 : 550318994932d us:  __hipPushCallConfiguration: Returned hipSuccess :
:3:hip_platform.cpp         :239 : 550318994935d us:   __hipPopCallConfiguration ( {0,0,4024810072}, {419391248,32696,0}, 0x7fb818ff5b18, 0x7fb818ff5b10 )
:3:hip_platform.cpp         :248 : 550318994937d us:  __hipPopCallConfiguration: Returned hipSuccess :
:3:hip_module.cpp           :686 : 550318994940d us:   hipLaunchKernel ( 0x7fb8f4a2a438, {20,1,1}, {256,1,1}, 0x7fb818ff5b60, 0, stream:0x561a6e594ea0 )
:3:rocvirtual.cpp           :731 : 550318994942d us:  Arg0:   = ptr:0x7faed4800000 obj:[0x7faed4800000-0x7faf00001000]
:3:rocvirtual.cpp           :731 : 550318994944d us:  Arg1:   = ptr:0x7faec7200000 obj:[0x7faec7200000-0x7faec721b800]
:3:rocvirtual.cpp           :807 : 550318994946d us:  Arg2:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994947d us:  Arg3:   = val:5120
:3:rocvirtual.cpp           :3056: 550318994948d us:  ShaderName : _ZL13quantize_q8_1PKfPvll
:3:hip_module.cpp           :687 : 550318994952d us:  hipLaunchKernel: Returned hipSuccess :
:3:hip_error.cpp            :36  : 550318994954d us:   hipGetLastError (  )
:3:hip_device_runtime.cpp   :634 : 550318994955d us:   hipGetDevice ( 0x7fb818ff5d08 )
:3:hip_device_runtime.cpp   :642 : 550318994957d us:  hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp   :634 : 550318994959d us:   hipGetDevice ( 0x7fb818ff5a94 )
:3:hip_device_runtime.cpp   :642 : 550318994961d us:  hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp   :634 : 550318994963d us:   hipGetDevice ( 0x7fb818ff5a94 )
:3:hip_device_runtime.cpp   :642 : 550318994964d us:  hipGetDevice: Returned hipSuccess :
:3:hip_platform.cpp         :230 : 550318994967d us:   __hipPushCallConfiguration ( {152064,1,1}, {32,1,1}, 0, stream:0x561a6e594ea0 )
:3:hip_platform.cpp         :234 : 550318994969d us:  __hipPushCallConfiguration: Returned hipSuccess :
:3:hip_platform.cpp         :239 : 550318994972d us:   __hipPopCallConfiguration ( {0,0,5120}, {5120,0,3565158400}, 0x7fb818ff5ac0, 0x7fb818ff5ab8 )
:3:hip_platform.cpp         :248 : 550318994973d us:  __hipPopCallConfiguration: Returned hipSuccess :
:3:hip_module.cpp           :686 : 550318994977d us:   hipLaunchKernel ( 0x7fb8f4a2a1b8, {152064,1,1}, {32,1,1}, 0x7fb818ff5b00, 0, stream:0x561a6e594ea0 )
:3:rocvirtual.cpp           :731 : 550318994980d us:  Arg0:   = ptr:0x7fa676805000 obj:[0x7fa676800000-0x7faafb45a000]
:3:rocvirtual.cpp           :731 : 550318994982d us:  Arg1:   = ptr:0x7faec7200000 obj:[0x7faec7200000-0x7faec721b800]
:3:rocvirtual.cpp           :731 : 550318994983d us:  Arg2:   = ptr:0x7faed5200000 obj:[0x7faed4800000-0x7faf00001000]
:3:rocvirtual.cpp           :807 : 550318994985d us:  Arg3:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994986d us:  Arg4:   = val:152064
:3:rocvirtual.cpp           :807 : 550318994988d us:  Arg5:   = val:5120
:3:rocvirtual.cpp           :807 : 550318994989d us:  Arg6:   = val:152064
:3:rocvirtual.cpp           :3056: 550318994991d us:  ShaderName : _ZL13mul_mat_vec_qIL9ggml_type14ELi1EEvPKvS2_Pfiiii
:3:hip_module.cpp           :687 : 550318994994d us:  hipLaunchKernel: Returned hipSuccess :
:3:hip_error.cpp            :36  : 550318994996d us:   hipGetLastError (  )
:3:hip_error.cpp            :36  : 550318994998d us:   hipGetLastError (  )
:3:hip_memory.cpp           :1543: 550318995009d us:   hipMemcpyAsync ( 0x7faf0f000000, 0x7faed5200000, 608256, hipMemcpyDeviceToHost, stream:0x561a6e594ea0 )
:3:rocvirtual.hpp           :67  : 550318995021d us:  Host active wait for Signal = (0x7fb81abfaa80) for 10000 ns
:3:hip_memory.cpp           :1544: 550318995042d us:  hipMemcpyAsync: Returned hipSuccess : : duration: 33d us
:3:hip_memory.cpp           :1543: 550318995045d us:   hipMemcpyAsync ( 0x7faf0f252000, 0x7faed4800000, 20480, hipMemcpyDeviceToHost, stream:0x561a6e594ea0 )
:3:hip_memory.cpp           :1544: 550318995050d us:  hipMemcpyAsync: Returned hipSuccess : : duration: 5d us
:3:hip_stream.cpp           :374 : 550318995107d us:   hipStreamSynchronize ( stream:0x561a6e594ea0 )
:3:rocvirtual.cpp           :480 : 550318995123d us:  Set Handler: handle(0x7fb81abfa900), timestamp(0x561a6e29ebf0)
:3:rocvirtual.hpp           :67  : 550318995126d us:  Host active wait for Signal = (0x7fb81abfa900) for -1 ns
:3:hip_stream.cpp           :375 : 550318995209d us:  hipStreamSynchronize: Returned hipSuccess :
:3:rocvirtual.cpp           :227 : 550318995213d us:  Handler: value(0), timestamp(0x7fafefea3570), handle(0x7fb81abfa900)
  • And yes, GPU access has been set up appropriately in cgroups/groups/passwd for the video and render groups, as instructed by the ROCm install documentation (checked roughly as sketched below).

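For reference, the quick confirmation looks like this:

me@host:~$ groups                             # should include video and render
me@host:~$ ls -l /dev/kfd /dev/dri/renderD*   # device nodes present and accessible to the user
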
Any help would be incredibly appreciated. Thank you.

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.5.7

Originally created by @zebrax0r on GitHub (Feb 5, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/8843 Originally assigned to: @dhiltgen on GitHub. ### What is the issue? I have found an issue that exists between the current (stable) ollama, AMD ROCm 6.3.1 and AMD gfx942 (MI300x). I have tested with multiple operating systems including Rocky8 and current Ubuntu Server (24.x.x). Setup: `Linux host 4.18.0-553.8.1.el8_10.x86_64 #1 SMP Tue Jul 2 17:10:26 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux` `Rocky Linux release 8.10 (Green Obsidian)` ``` me@host:~$ rocm-smi ============================================ ROCm System Management Interface ============================================ ====================================================== Concise Info ====================================================== Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% (DID, GUID) (Junction) (Socket) (Mem, Compute, ID) ========================================================================================================================== 0 20 0x74a1, 28851 42.0°C 123.0W NPS1, SPX, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0% 1 17 0x74a1, 43178 40.0°C 121.0W NPS1, SPX, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0% 2 16 0x74a1, 32898 44.0°C 121.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0% 3 19 0x74a1, 22683 39.0°C 119.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0% 4 22 0x74a1, 53458 42.0°C 126.0W NPS1, SPX, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0% 5 23 0x74a1, 2251 39.0°C 119.0W NPS1, SPX, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0% 6 21 0x74a1, 8419 42.0°C 123.0W NPS1, SPX, 0 131Mhz 900Mhz 0% auto 750.0W 0% 0% 7 18 0x74a1, 63738 39.0°C 120.0W NPS1, SPX, 0 132Mhz 900Mhz 0% auto 750.0W 0% 0% ========================================================================================================================== ================================================== End of ROCm SMI Log =================================================== ``` Notes: * Bare metal. No hypervisors or containers here. * Running SLURM to control the node. cgroups have been checked for issues - no apparent constraints. Whenever *any* model is run, ollama now outputs $$$$$$$$ or GGGGGGGG or 4444444 and so forth. I believe this was not the case in previous versions of ollama - and I have verified the same behaviour in llama.cpp with other front ends/back-ends. We do not observe this behaviour with nvidia GPUs. ollama appears to start up correctly... 
``` me@host:~$ ollama serve 2025/02/05 20:34:17 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES:0,1,2,3,4,5,6,7 GPU_DEVICE_ORDINAL:0,1,2,3,4,5,6,7 HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/me/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:0,1,2,3,4,5,6,7 http_proxy: https_proxy: no_proxy:]" time=2025-02-05T20:34:17.893+10:00 level=INFO source=images.go:432 msg="total blobs: 30" time=2025-02-05T20:34:17.894+10:00 level=INFO source=images.go:439 msg="total unused blobs removed: 0" [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached. [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production. - using env: export GIN_MODE=release - using code: gin.SetMode(gin.ReleaseMode) [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers) [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers) [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers) [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers) [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers) [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers) [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers) [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers) [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers) [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers) [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers) [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers) [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers) [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers) [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers) [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers) [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers) [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers) [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] GET /api/tags --> 
github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers) [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers) [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers) [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers) time=2025-02-05T20:34:17.895+10:00 level=INFO source=routes.go:1238 msg="Listening on 127.0.0.1:11434 (version 0.5.7)" time=2025-02-05T20:34:17.897+10:00 level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx rocm_avx]" time=2025-02-05T20:34:17.898+10:00 level=INFO source=gpu.go:226 msg="looking for compatible GPUs" time=2025-02-05T20:34:17.924+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-bdc863557120278e gpu_type=gfx942 time=2025-02-05T20:34:17.926+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-7726b37bb420e44d gpu_type=gfx942 time=2025-02-05T20:34:17.928+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-0518a62b7fc63c75 gpu_type=gfx942 time=2025-02-05T20:34:17.930+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-00f0af4bd21e3c60 gpu_type=gfx942 time=2025-02-05T20:34:17.931+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-60617e0e49ca7200 gpu_type=gfx942 time=2025-02-05T20:34:17.933+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-d534c0b85b7c20e8 gpu_type=gfx942 time=2025-02-05T20:34:17.935+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-cd30cff04bf7d13f gpu_type=gfx942 time=2025-02-05T20:34:17.937+10:00 level=INFO source=amd_linux.go:388 msg="amdgpu is supported" gpu=GPU-17cf8ca10dd02af6 gpu_type=gfx942 time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-bdc863557120278e library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB" time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-7726b37bb420e44d library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB" time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-0518a62b7fc63c75 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB" time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-00f0af4bd21e3c60 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB" time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-60617e0e49ca7200 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB" time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-d534c0b85b7c20e8 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB" time=2025-02-05T20:34:17.937+10:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-cd30cff04bf7d13f library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB" time=2025-02-05T20:34:17.937+10:00 level=INFO 
source=types.go:131 msg="inference compute" id=GPU-17cf8ca10dd02af6 library=rocm variant="" compute=gfx942 driver=6.8 name=1002:74a1 total="192.0 GiB" available="191.7 GiB" ``` ...and we appear to be even able to load a model: [GIN] 2025/02/05 - 20:35:42 | 200 | 169.804µs | 127.0.0.1 | HEAD "/" [GIN] 2025/02/05 - 20:35:42 | 200 | 1.79409ms | 127.0.0.1 | GET "/api/tags" [GIN] 2025/02/05 - 20:35:52 | 200 | 21.853µs | 127.0.0.1 | HEAD "/" [GIN] 2025/02/05 - 20:35:52 | 200 | 29.667829ms | 127.0.0.1 | POST "/api/show" time=2025-02-05T20:35:52.772+10:00 level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/home/me/.ollama/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 gpu=GPU-bdc863557120278e parallel=4 available=205843881984 required="21.5 GiB" time=2025-02-05T20:35:52.772+10:00 level=INFO source=server.go:104 msg="system memory" total="2267.0 GiB" free="2155.1 GiB" free_swap="4.0 GiB" time=2025-02-05T20:35:52.773+10:00 level=INFO source=memory.go:356 msg="offload to rocm" layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[191.7 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB" time=2025-02-05T20:35:52.774+10:00 level=INFO source=server.go:376 msg="starting llama server" cmd="/scratch/project_mnt/S0156/ollama/lib/ollama/runners/rocm_avx/ollama_llama_server runner --model /home/me/.ollama/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --threads 128 --parallel 4 --port 43799" time=2025-02-05T20:35:52.779+10:00 level=INFO source=sched.go:449 msg="loaded runners" count=1 time=2025-02-05T20:35:52.779+10:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding" time=2025-02-05T20:35:52.783+10:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" time=2025-02-05T20:35:52.837+10:00 level=INFO source=runner.go:936 msg="starting go runner" ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 ROCm devices: Device 0: , compute capability 9.4, VMM: no time=2025-02-05T20:35:56.249+10:00 level=INFO source=runner.go:937 msg=system info="ROCm : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=128 time=2025-02-05T20:35:56.249+10:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:43799" llama_load_model_from_file: using device ROCm0 () - 195148 MiB free llama_model_loader: loaded meta data with 26 key-value pairs and 771 tensors from /home/me/.ollama/models/blobs/sha256-6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 32B llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen llama_model_loader: - kv 4: general.size_label str = 32B llama_model_loader: - kv 5: qwen2.block_count u32 = 64 llama_model_loader: - kv 6: qwen2.context_length u32 = 131072 llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120 llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 27648 llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40 llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8 llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 13: general.file_type u32 = 15 llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 15: tokenizer.ggml.pre str = deepseek-r1-qwen llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... time=2025-02-05T20:35:56.328+10:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646 llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643 llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de... 
llama_model_loader: - kv 25: general.quantization_version u32 = 2 llama_model_loader: - type f32: 321 tensors llama_model_loader: - type q4_K: 385 tensors llama_model_loader: - type q6_K: 65 tensors llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default' llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect llm_load_vocab: special tokens cache size = 22 llm_load_vocab: token to piece cache size = 0.9310 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 152064 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 131072 llm_load_print_meta: n_embd = 5120 llm_load_print_meta: n_layer = 64 llm_load_print_meta: n_head = 40 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 5 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 27648 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 131072 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: ssm_dt_b_c_rms = 0 llm_load_print_meta: model type = 32B llm_load_print_meta: model ftype = Q4_K - Medium llm_load_print_meta: model params = 32.76 B llm_load_print_meta: model size = 18.48 GiB (4.85 BPW) llm_load_print_meta: general.name = DeepSeek R1 Distill Qwen 32B llm_load_print_meta: BOS token = 151646 '<|begin▁of▁sentence|>' llm_load_print_meta: EOS token = 151643 '<|end▁of▁sentence|>' llm_load_print_meta: EOT token = 151643 '<|end▁of▁sentence|>' llm_load_print_meta: PAD token = 151643 '<|end▁of▁sentence|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' llm_load_print_meta: EOG token = 151643 '<|end▁of▁sentence|>' llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' llm_load_print_meta: EOG token = 151663 '<|repo_name|>' llm_load_print_meta: EOG token = 151664 '<|file_sep|>' llm_load_print_meta: max token length = 256 ...but if we try to interact with the model: ``` me@host:~$ ollama run deepseek-r1:32b >>> Tell me about why the sky is blue GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG >>> Send a message (/? for help) ``` Splat... 
``` llm_load_tensors: offloading 64 repeating layers to GPU llm_load_tensors: offloading output layer to GPU llm_load_tensors: offloaded 65/65 layers to GPU llm_load_tensors: CPU_Mapped model buffer size = 417.66 MiB llm_load_tensors: ROCm0 model buffer size = 18508.35 MiB llama_new_context_with_model: n_seq_max = 4 llama_new_context_with_model: n_ctx = 8192 llama_new_context_with_model: n_ctx_per_seq = 2048 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1 llama_kv_cache_init: ROCm0 KV buffer size = 2048.00 MiB llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 2.40 MiB llama_new_context_with_model: ROCm0 compute buffer size = 696.00 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 26.01 MiB llama_new_context_with_model: graph nodes = 2246 llama_new_context_with_model: graph splits = 2 time=2025-02-05T20:36:24.638+10:00 level=INFO source=server.go:594 msg="llama runner started in 31.86 seconds" [GIN] 2025/02/05 - 20:36:24 | 200 | 31.911820747s | 127.0.0.1 | POST "/api/generate" [GIN] 2025/02/05 - 20:37:22 | 200 | 1.082842864s | 127.0.0.1 | POST "/api/chat" ``` I've tried various models, recompiles, different ROCm versions (6.3.0 - and will test soon with 6.3.2 as well), different OS's and different BIOS level/UEFI level settings to no avail. Different iommu=pt variants too at boot time - still the same behaviour. 
The only clue I've got is in the kernel buffer output, and it might be a spurious or unhelpful lead:

```
[547815.107451] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.114438] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.127493] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.134476] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.150316] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.157324] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.170387] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.177364] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.185523] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.192492] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.199741] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.206709] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.214083] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.221056] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.229336] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.236308] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.249435] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.256437] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.272280] amdgpu: init_user_pages: Failed to get user pages: -1
[547815.279310] amdgpu: init_user_pages: Failed to get user pages: -1
```

I turned the AMD_LOGLEVEL right up so we could see all the HIP backend status, too, as we loaded a model:

```
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
:3:hip_device_runtime.cpp :634 : 550213171495d us: hipGetDevice ( 0x7ffca34532ec )
:3:hip_device_runtime.cpp :642 : 550213171503d us: hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :634 : 550213171506d us: hipGetDevice ( 0x7ffca34532ec )
:3:hip_device_runtime.cpp :642 : 550213171508d us: hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp :674 : 550213171514d us: hipMalloc ( 0x7ffca3453368, 2147483648 )
:3:rocdevice.cpp :2425: 550213171870d us: Device=0x561a2c38bd80, freeMem_ = 0x2afa3a6000
:3:hip_memory.cpp :676 : 550213171878d us: hipMalloc: Returned hipSuccess : 0x7faf1a800000: duration: 364d us
:3:hip_device_runtime.cpp :634 : 550213171888d us: hipGetDevice ( 0x7ffca345345c )
:3:hip_device_runtime.cpp :642 : 550213171890d us: hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :618 : 550213171892d us: hipDeviceSynchronize ( )
:3:hip_device_runtime.cpp :622 : 550213171901d us: hipDeviceSynchronize: Returned hipSuccess :
:3:hip_memory.cpp :3281: 550213171910d us: hipMemset ( 0x7faf1a800000, 0, 2147483648 )
:3:rocdevice.cpp :3064: 550213171922d us: Number of allocated hardware queues with low priority: 0, with normal priority: 1, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp :3140: 550213193990d us: Created SWq=0x7fb8199ec000 to map on HWq=0x7faf1a600000 with size 16384 with priority 1, cooperative: 0
:3:rocdevice.cpp :3232: 550213194033d us: acquireQueue refCount: 0x7faf1a600000 (1)
:3:rocvirtual.cpp :731 : 550213194253d us: Arg0: void* buf = ptr:0x7faf1a800000 obj:[0x7faf1a800000-0x7faf9a800000]
:3:rocvirtual.cpp :807 : 550213194257d us: Arg2: uint pattern_size = val:1
:3:rocvirtual.cpp :807 : 550213194259d us: Arg3: uint alignment = val:16
:3:rocvirtual.cpp :807 : 550213194261d us: Arg4: ulong end_ptr = val:140392188084224
:3:rocvirtual.cpp :807 : 550213194263d us: Arg5: uint next_chunk = val:77824
:3:rocvirtual.cpp :3056: 550213194266d us: ShaderName : __amd_rocclr_fillBufferAligned
:3:hip_memory.cpp :3282: 550213194275d us: hipMemset: Returned hipSuccess :
:3:hip_device_runtime.cpp :618 : 550213194279d us: hipDeviceSynchronize ( )
:3:rocvirtual.cpp :480 : 550213194286d us: Set Handler: handle(0x7fb81abfd280), timestamp(0x561a69bed8d0)
:3:rocvirtual.hpp :67 : 550213194288d us: Host active wait for Signal = (0x7fb81abfd280) for -1 ns
:3:hip_device_runtime.cpp :622 : 550213194683d us: hipDeviceSynchronize: Returned hipSuccess :
llama_kv_cache_init: ROCm0 KV buffer size = 2048.00 MiB
:3:rocvirtual.cpp :227 : 550213194708d us: Handler: value(0), timestamp(0x561a6ab11860), handle(0x7fb81abfd280)
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
:3:hip_memory.cpp :680 : 550213194755d us: hipHostMalloc ( 0x7ffca3453550, 2514944, 0 )
:3:hip_memory.cpp :686 : 550213195021d us: hipHostMalloc: Returned hipSuccess : 0x7faf0f000000: duration: 266d us
llama_new_context_with_model: ROCm_Host output buffer size = 2.40 MiB
:3:hip_device_runtime.cpp :634 : 550213196052d us: hipGetDevice ( 0x7ffca345354c )
:3:hip_device_runtime.cpp :642 : 550213196056d us: hipGetDevice: Returned hipSuccess :
:3:hip_stream.cpp :256 : 550213196063d us: hipStreamCreateWithFlags ( 0x561a6edfe6a8, 1 )
:3:rocdevice.cpp :3064: 550213196068d us: Number of allocated hardware queues with low priority: 0, with normal priority: 2, with high priority: 0, maximum per priority is: 4
:3:rocdevice.cpp :3140: 550213216828d us: Created SWq=0x7fb8199be000 to map on HWq=0x7faf0ba00000 with size 16384 with priority 1, cooperative: 0
:3:rocdevice.cpp :3232: 550213216838d us: acquireQueue refCount: 0x7faf0ba00000 (1)
:3:hip_stream.cpp :262 : 550213217077d us: hipStreamCreateWithFlags: Returned hipSuccess : stream:0x561a6e594ea0
:3:hip_stream.cpp :374 : 550213217084d us: hipStreamSynchronize ( stream:0x561a6e594ea0 )
:3:hip_stream.cpp :375 : 550213217090d us: hipStreamSynchronize: Returned hipSuccess :
:3:hip_device_runtime.cpp :634 : 550213217409d us: hipGetDevice ( 0x7ffca345344c )
:3:hip_device_runtime.cpp :642 : 550213217412d us: hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :634 : 550213217414d us: hipGetDevice ( 0x7ffca345344c )
:3:hip_device_runtime.cpp :642 : 550213217416d us: hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp :674 : 550213217420d us: hipMalloc ( 0x7ffca34534c8, 729812992 )
:3:rocdevice.cpp :2425: 550213217482d us: Device=0x561a2c38bd80, freeMem_ = 0x2aceba5000
:3:hip_memory.cpp :676 : 550213217487d us: hipMalloc: Returned hipSuccess : 0x7faed4800000: duration: 67d us
:3:hip_memory.cpp :680 : 550213217496d us: hipHostMalloc ( 0x7ffca34534d0, 27269120, 0 )
:3:hip_memory.cpp :686 : 550213220223d us: hipHostMalloc: Returned hipSuccess : 0x7faed2c00000: duration: 2727d us
:3:hip_stream.cpp :374 : 550213220720d us: hipStreamSynchronize ( stream:0x561a6e594ea0 )
:3:hip_stream.cpp :375 : 550213220725d us: hipStreamSynchronize: Returned hipSuccess :
:3:hip_stream.cpp :374 : 550213221356d us: hipStreamSynchronize ( stream:0x561a6e594ea0 )
:3:hip_stream.cpp :375 : 550213221360d us: hipStreamSynchronize: Returned hipSuccess :
llama_new_context_with_model: ROCm0 compute buffer size = 696.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 26.01 MiB
llama_new_context_with_model: graph nodes = 2246
llama_new_context_with_model: graph splits = 2
time=2025-02-05T21:16:23.691+10:00 level=INFO source=server.go:594 msg="llama runner started in 16.26 seconds"
[GIN] 2025/02/05 - 21:16:23 | 200 | 16.309228075s | 127.0.0.1 | POST "/api/generate"
```

And the output watching "GGGGGGGGG" slowly show up:

```
:3:rocvirtual.cpp :731 : 550318994888d us: Arg0: = ptr:0x7faed4800000 obj:[0x7faed4800000-0x7faf00001000]
:3:rocvirtual.cpp :731 : 550318994890d us: Arg1: = ptr:0x7fa676800000 obj:[0x7fa676800000-0x7faafb45a000]
:3:rocvirtual.cpp :731 : 550318994891d us: Arg2: = ptr:0x7faed4800000 obj:[0x7faed4800000-0x7faf00001000]
:3:rocvirtual.cpp :807 : 550318994893d us: Arg3: = val:5120
:3:rocvirtual.cpp :807 : 550318994894d us: Arg4: = val:1
:3:rocvirtual.cpp :807 : 550318994896d us: Arg5: = val:1
:3:rocvirtual.cpp :807 : 550318994897d us: Arg6: = val:1
:3:rocvirtual.cpp :807 : 550318994899d us: Arg7: = val:5120
:3:rocvirtual.cpp :807 : 550318994900d us: Arg8: = val:1
:3:rocvirtual.cpp :807 : 550318994902d us: Arg9: = val:1
:3:rocvirtual.cpp :807 : 550318994903d us: Arg10: = val:1
:3:rocvirtual.cpp :807 : 550318994905d us: Arg11: = val:5120
:3:rocvirtual.cpp :807 : 550318994906d us: Arg12: = val:5120
:3:rocvirtual.cpp :807 : 550318994908d us: Arg13: = val:5120
:3:rocvirtual.cpp :807 : 550318994909d us: Arg14: = val:5120
:3:rocvirtual.cpp :807 : 550318994911d us: Arg15: = val:5120
:3:rocvirtual.cpp :807 : 550318994912d us: Arg16: = val:5120
:3:rocvirtual.cpp :807 : 550318994914d us: Arg17: = val:5120
:3:rocvirtual.cpp :807 : 550318994915d us: Arg18: = val:5120
:3:rocvirtual.cpp :807 : 550318994917d us: Arg19: = val:5120
:3:rocvirtual.cpp :3056: 550318994918d us: ShaderName : _ZL11k_bin_bcastIXadL_ZL6op_mulffEEfffEvPKT0_PKT1_PT2_iiiiiiiiiiiiiiiii
:3:hip_module.cpp :687 : 550318994922d us: hipLaunchKernel: Returned hipSuccess :
:3:hip_error.cpp :36 : 550318994924d us: hipGetLastError ( )
:3:hip_device_runtime.cpp :634 : 550318994926d us: hipGetDevice ( 0x7fb818ff5bac )
:3:hip_device_runtime.cpp :642 : 550318994928d us: hipGetDevice: Returned hipSuccess :
:3:hip_platform.cpp :230 : 550318994930d us: __hipPushCallConfiguration ( {20,1,1}, {256,1,1}, 0, stream:0x561a6e594ea0 )
:3:hip_platform.cpp :234 : 550318994932d us: __hipPushCallConfiguration: Returned hipSuccess :
:3:hip_platform.cpp :239 : 550318994935d us: __hipPopCallConfiguration ( {0,0,4024810072}, {419391248,32696,0}, 0x7fb818ff5b18, 0x7fb818ff5b10 )
:3:hip_platform.cpp :248 : 550318994937d us: __hipPopCallConfiguration: Returned hipSuccess :
:3:hip_module.cpp :686 : 550318994940d us: hipLaunchKernel ( 0x7fb8f4a2a438, {20,1,1}, {256,1,1}, 0x7fb818ff5b60, 0, stream:0x561a6e594ea0 )
:3:rocvirtual.cpp :731 : 550318994942d us: Arg0: = ptr:0x7faed4800000 obj:[0x7faed4800000-0x7faf00001000]
:3:rocvirtual.cpp :731 : 550318994944d us: Arg1: = ptr:0x7faec7200000 obj:[0x7faec7200000-0x7faec721b800]
:3:rocvirtual.cpp :807 : 550318994946d us: Arg2: = val:5120
:3:rocvirtual.cpp :807 : 550318994947d us: Arg3: = val:5120
:3:rocvirtual.cpp :3056: 550318994948d us: ShaderName : _ZL13quantize_q8_1PKfPvll
:3:hip_module.cpp :687 : 550318994952d us: hipLaunchKernel: Returned hipSuccess :
:3:hip_error.cpp :36 : 550318994954d us: hipGetLastError ( )
:3:hip_device_runtime.cpp :634 : 550318994955d us: hipGetDevice ( 0x7fb818ff5d08 )
:3:hip_device_runtime.cpp :642 : 550318994957d us: hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :634 : 550318994959d us: hipGetDevice ( 0x7fb818ff5a94 )
:3:hip_device_runtime.cpp :642 : 550318994961d us: hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :634 : 550318994963d us: hipGetDevice ( 0x7fb818ff5a94 )
:3:hip_device_runtime.cpp :642 : 550318994964d us: hipGetDevice: Returned hipSuccess :
:3:hip_platform.cpp :230 : 550318994967d us: __hipPushCallConfiguration ( {152064,1,1}, {32,1,1}, 0, stream:0x561a6e594ea0 )
:3:hip_platform.cpp :234 : 550318994969d us: __hipPushCallConfiguration: Returned hipSuccess :
:3:hip_platform.cpp :239 : 550318994972d us: __hipPopCallConfiguration ( {0,0,5120}, {5120,0,3565158400}, 0x7fb818ff5ac0, 0x7fb818ff5ab8 )
:3:hip_platform.cpp :248 : 550318994973d us: __hipPopCallConfiguration: Returned hipSuccess :
:3:hip_module.cpp :686 : 550318994977d us: hipLaunchKernel ( 0x7fb8f4a2a1b8, {152064,1,1}, {32,1,1}, 0x7fb818ff5b00, 0, stream:0x561a6e594ea0 )
:3:rocvirtual.cpp :731 : 550318994980d us: Arg0: = ptr:0x7fa676805000 obj:[0x7fa676800000-0x7faafb45a000]
:3:rocvirtual.cpp :731 : 550318994982d us: Arg1: = ptr:0x7faec7200000 obj:[0x7faec7200000-0x7faec721b800]
:3:rocvirtual.cpp :731 : 550318994983d us: Arg2: = ptr:0x7faed5200000 obj:[0x7faed4800000-0x7faf00001000]
:3:rocvirtual.cpp :807 : 550318994985d us: Arg3: = val:5120
:3:rocvirtual.cpp :807 : 550318994986d us: Arg4: = val:152064
:3:rocvirtual.cpp :807 : 550318994988d us: Arg5: = val:5120
:3:rocvirtual.cpp :807 : 550318994989d us: Arg6: = val:152064
:3:rocvirtual.cpp :3056: 550318994991d us: ShaderName : _ZL13mul_mat_vec_qIL9ggml_type14ELi1EEvPKvS2_Pfiiii
:3:hip_module.cpp :687 : 550318994994d us: hipLaunchKernel: Returned hipSuccess :
:3:hip_error.cpp :36 : 550318994996d us: hipGetLastError ( )
:3:hip_error.cpp :36 : 550318994998d us: hipGetLastError ( )
:3:hip_memory.cpp :1543: 550318995009d us: hipMemcpyAsync ( 0x7faf0f000000, 0x7faed5200000, 608256, hipMemcpyDeviceToHost, stream:0x561a6e594ea0 )
:3:rocvirtual.hpp :67 : 550318995021d us: Host active wait for Signal = (0x7fb81abfaa80) for 10000 ns
:3:hip_memory.cpp :1544: 550318995042d us: hipMemcpyAsync: Returned hipSuccess : : duration: 33d us
:3:hip_memory.cpp :1543: 550318995045d us: hipMemcpyAsync ( 0x7faf0f252000, 0x7faed4800000, 20480, hipMemcpyDeviceToHost, stream:0x561a6e594ea0 )
:3:hip_memory.cpp :1544: 550318995050d us: hipMemcpyAsync: Returned hipSuccess : : duration: 5d us
:3:hip_stream.cpp :374 : 550318995107d us: hipStreamSynchronize ( stream:0x561a6e594ea0 )
:3:rocvirtual.cpp :480 : 550318995123d us: Set Handler: handle(0x7fb81abfa900), timestamp(0x561a6e29ebf0)
:3:rocvirtual.hpp :67 : 550318995126d us: Host active wait for Signal = (0x7fb81abfa900) for -1 ns
:3:hip_stream.cpp :375 : 550318995209d us: hipStreamSynchronize: Returned hipSuccess :
:3:rocvirtual.cpp :227 : 550318995213d us: Handler: value(0), timestamp(0x7fafefea3570), handle(0x7fb81abfa900)
```

And yes, the GPUs have been made accessible as appropriate via cgroups/groups/passwd for the "video" and "render" groups, as instructed by the ROCm install documentation.

Any help would be incredibly appreciated. Thank you.

### OS

Linux

### GPU

AMD

### CPU

AMD

### Ollama version

0.5.7
GiteaMirror added the amd, bug, gpu labels 2026-05-04 11:40:38 -05:00

@rick-github commented on GitHub (Feb 12, 2025):

What happens if you load the model in RAM:

```console
$ ollama run deepseek-r1:32b
>>> /set parameter num_gpu 0
Set parameter 'num_gpu' to '0'
>>> Tell me about why the sky is blue
...
```
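
The same CPU-only check can also be run over the HTTP API by setting the num_gpu option to 0; a minimal sketch, assuming the model is already pulled and the server is listening on the default port:

```console
$ curl http://127.0.0.1:11434/api/generate -d '{
  "model": "deepseek-r1:32b",
  "prompt": "Tell me about why the sky is blue",
  "options": { "num_gpu": 0 },
  "stream": false
}'
```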

@rick-github commented on GitHub (Feb 12, 2025):

If you limit ollama to one GPU with CUDA_VISIBLE_DEVICES=0 (perhaps try each device in turn), is there any change? If you force ollama to use all GPUs with OLLAMA_SCHED_SPREAD=1, is there any change?
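
A sketch of that test for a manually started server; these variables have to be set in the server's environment, not the client's:

```console
$ CUDA_VISIBLE_DEVICES=0 ollama serve    # restrict to a single GPU; repeat for devices 1..7
$ OLLAMA_SCHED_SPREAD=1 ollama serve     # separately: force the scheduler to spread across all GPUs
```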

@jinzhang119 commented on GitHub (Feb 12, 2025):

/set parameter num_gpu 0 does make the output normal again. But isn't that bypassing the GPU and using CPU only?

I just tried setting the environment variables CUDA_VISIBLE_DEVICES=0 and OLLAMA_SCHED_SPREAD=1; the output is still GGGGGG.

@rick-github commented on GitHub (Feb 12, 2025):

> But isn't that bypassing GPU and using CPU only?

Yes. Verifying that the GPU is the problem.

If you set ROCR_VISIBLE_DEVICES=0 and then ROCR_VISIBLE_DEVICES=1, does anything change?

@jinzhang119 commented on GitHub (Feb 12, 2025):

With ROCR_VISIBLE_DEVICES=0 and then ROCR_VISIBLE_DEVICES=1, the output is still GGGGG…

@zebrax0r commented on GitHub (Feb 13, 2025):

Same behaviour here whether it is set to ROCR_VISIBLE_DEVICES=0 or 1. Still:

```
me@host:~$ export ROCR_VISIBLE_DEVICES=0
me@host:~$ ollama run deepseek-r1:32b
>>> tell me why the sky is blue
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
```

Also tried explicit inline setting of the environment variable...

```
me@host:~$ ROCR_VISIBLE_DEVICES=0 ollama run deepseek-r1:32b
>>> tell me why the sky is blue
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
```

@rick-github commented on GitHub (Feb 13, 2025):

Set ROCR_VISIBLE_DEVICES in the server environment (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server), not the client environment.
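
For the standard Linux install, where ollama runs as a systemd service, one way to do that is a systemd override, roughly as the FAQ above describes; a sketch, adjusting the device list as needed:

```console
$ sudo systemctl edit ollama.service
# in the override file, add:
#   [Service]
#   Environment="ROCR_VISIBLE_DEVICES=0"
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama
```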

@zebrax0r commented on GitHub (Feb 13, 2025):

Same behaviour. Without the GPU it works; with the GPU, "GGGGGGGGGGGG".

Doing more testing with different models - but cannot yet find one that works at all.

Incidentally, I did go to the trouble of testing that the ROCm/HIP environment was sane. SGLang backends work perfectly on this system.

@rick-github commented on GitHub (Feb 13, 2025):

What was the last version of ollama that wasn't broken?

@jinzhang119 commented on GitHub (Feb 13, 2025):

My co-worker is using 0.1.48; it works for him.

Image: https://github.com/user-attachments/assets/1a598056-91e8-4b8d-a6ce-fab36e72aa0b

@jinzhang119 commented on GitHub (Feb 14, 2025):

The issue was likely introduced in 0.5.2, based on what my co-worker found with the Docker Hub ollama container version ollama/ollama:0.5.2-rc0-rocm.

@rick-github commented on GitHub (Feb 14, 2025):

What relevance is ollama/ollama:0.5.2-rc0-rocm? The original post says "No hypervisors or containers here".

@jinzhang119 commented on GitHub (Feb 14, 2025):

I uninstalled and installed version 0.5.1. Now I don't see "GGGGGG".

However, it does not seem to utilize the GPU.

ollama run llama3.1 --verbose

>>> hi
How's it going? Is there something I can help you with or would you like to chat?

total duration: 949.157032ms
load duration: 26.237067ms
prompt eval count: 11 token(s)
prompt eval duration: 99ms
prompt eval rate: 111.11 tokens/s
eval count: 21 token(s)
eval duration: 821ms
eval rate: 25.58 tokens/s

ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3.1:latest 46e0c10c039e 6.5 GB 100% CPU 4 minutes from now

@rick-github commented on GitHub (Feb 14, 2025):

If it's using the CPU, that's why you don't see "GGGGGGG". Server logs may show why the GPU isn't used.
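
A sketch of where to look, assuming the standard Linux install running under systemd; OLLAMA_DEBUG=1 in the server environment adds the GPU discovery detail:

```console
$ journalctl -e -u ollama --no-pager
$ OLLAMA_DEBUG=1 ollama serve    # alternatively, run the server by hand with debug logging
```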

@zebrax0r commented on GitHub (Feb 15, 2025):

Yes, llama.cpp backends that get forced to CPU-only seem functional. The instant you try to use the GPU, you end up with GGGGGG or ######…

AFAIK, the last version I had working was whichever ollama release was current around Christmas 2024. A lot has changed since then: ROCm versions, kernels and so forth.

I tried other llama.cpp front ends like KoboldCpp and they seem to behave the same way (the GPU is used and the output is GGGGGG…).

So is this perhaps a regression in llama.cpp in conjunction with ROCm somehow?
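
One way to take ollama out of the picture is to run upstream llama.cpp directly against the same GGUF, once offloaded to the GPU and once forced to CPU; a sketch, where the model path is a placeholder and the binary is llama-cli in current builds (older builds call it main):

```console
$ ./llama-cli -m ./model.gguf -ngl 99 -p "tell me why the sky is blue"
$ ./llama-cli -m ./model.gguf -ngl 0 -p "tell me why the sky is blue"
```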

@cowcomic commented on GitHub (Feb 16, 2025):

Suggest checking the model file. We have encountered the same output before and later found that the model file was damaged during the download process (the sha256 was different).
Re-downloading fixed it.
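
A quick way to check this, assuming the default OLLAMA_MODELS path and the sha256- blob naming used by recent releases: each blob file is named after its own sha256 digest, so the filename doubles as the expected checksum.

```console
$ cd ~/.ollama/models/blobs
$ for f in sha256-*; do printf '%s  %s\n' "${f#sha256-}" "$f"; done | sha256sum -c -
```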

@zebrax0r commented on GitHub (Feb 18, 2025):

A good suggestion, but that is not the case here. The same model, same path, same sha256/md5sum works correctly on almost any NVIDIA GPU I've tested (A16, L40, L40S, H100, A100, H100 SXM5). Working perfectly. Load it on an Instinct GPU like the MI300x, however, and it outputs "GGGGGGGGGGGG"…

@Xingwei-Tan commented on GitHub (Feb 24, 2025):

I'm dealing with the same issue with a (single) MI300X and ROCm 6.2.1.
The issue is likely related to the quantization. The models work completely fine when I load the fp16 versions. However, once I load the 8-bit version of any model (llama3.2, deepseek-distill, qwen2.5), they all output "GGGGGGGG".

My guess is the implementation of loading the quantization models on AMD GPUs is the cause of this issue.
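
A sketch of that comparison; the tags below are examples only, so check the library page for the exact fp16 and q8_0 tags of whichever model you test:

```console
$ ollama run llama3.2:3b-instruct-fp16 "tell me why the sky is blue"
$ ollama run llama3.2:3b-instruct-q8_0 "tell me why the sky is blue"
```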

@rick-github commented on GitHub (Feb 24, 2025):

Single or multi-GPU?

@Xingwei-Tan commented on GitHub (Feb 24, 2025):

> Single or multi-GPU?

Those errors occurred when I was using one MI300X

@zebrax0r commented on GitHub (Mar 2, 2025):

I did some more testing this morning with a few fp16 versions of different models. So far, I agree that fp16 models seem to work on the MI300x. If I swap back to fp8/4 of the same models, I end up with GGGGGGG…..

@rick-github commented on GitHub (Mar 2, 2025):

fp8/fp4 are not supported data types for llama.cpp/ollama. Do you mean q8/q4?

@zebrax0r commented on GitHub (Mar 2, 2025):

Yes, q8/q4. To be clear: in testing, q8 or q4 quantisation ends in "GGGG…".

@acb5764 commented on GitHub (Mar 2, 2025):

Hey guys, I've seen a handful of these tickets around in various states, and this seems like the most recent and active one. I had no issue with AMD/ROCm multi-GPU until I tried (and failed) to get some bifurcation working; I changed some BIOS settings and started having this issue.
For me, at least, this was resolved by reverting my BIOS settings.
Here is what I changed to get it working again:

  • PCIE Bifurcation 4x4x4x4x->auto (I don't think this is related)
  • Resizable Bar On -> Off
  • Above 4G Decoding On -> Off

Worth a try, at least
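
If rebooting into the BIOS to check is inconvenient, the effect of Resizable BAR / Above 4G Decoding can also be inspected from Linux; a sketch, where 1002 is AMD's PCI vendor ID and the Region lines show the BAR sizes actually assigned:

```console
$ sudo lspci -vv -d 1002: | grep -E 'controller|accelerator|Region'
```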

@rick-github commented on GitHub (Mar 2, 2025):

Thanks for the data points.

@svorster commented on GitHub (Mar 16, 2025):

Hey guys, hopefully I can help out. My system has PCIe Gen 4 and dual E5-2680 v2 CPUs with 128 GB RAM. I am running ollama 0.6.1 (tried both native install and Docker) with 3 MI50s, on Ubuntu 24.04.2 with ROCm 6.3.3 on bare metal. I found that I was getting gibberish outputs with any model that would span more than one card (16 GB), and even had the same issue with any model that wasn't fp16.

My Solution

  • Changed Above 4G Decoding On -> Off
  • Changed Pci Delay from 96 -> 128

Now I have no problem running >16 GB models and can run any quant type I have tried so far. There is also no problem running from either the command line or Open WebUI. I tested it in Docker with no problem using 0.6.1-rocm, and tried 0.5.13-rocm, which worked fine as well.

@rick-github commented on GitHub (Mar 16, 2025):

Thanks @svorster. When you say "gibberish outputs", do you mean strings of "GGG" or "@@@", or some other output that looks more like English but is disconnected syllables?

@svorster commented on GitHub (Mar 16, 2025):

> Thanks @svorster. When you say "gibberish outputs", do you mean strings of "GGG" or "@@@", or some other output that looks more like English but is disconnected syllables?

@rick-github I was getting a mix of issues. Depending on the model I would get "GGGGGGGG", random words, error-looking dialog in brackets ("[some text]"), or just random numbers, letters, and symbols all mashed up. The command line and Open WebUI also seemed to give different error types. I tried multiple models from gemma, qwq, deepseek-r1, llama3.3/3.2/3.1, and phi4. Every time I went over 16 GB I got the error, and it seemed that nothing except fp16 would really work.

@Mohamed0Hegazi commented on GitHub (Mar 16, 2025):

https://ollama.com/install.sh

@Mohamed0Hegazi commented on GitHub (Mar 18, 2025):

I have shared an item with you:

Canva Shortcut - DAGgO4gLKL0
https://drive.google.com/file/d/1phwF699SVfSzCoCpguViJiH1az_BRZim/view?usp=sharing&invite=CJf7uPgG&ts=67d9ec2b

It is not an attachment -- it is stored online. To open this item, click the link above.

Reference: github-starred/ollama#67786