[GH-ISSUE #15771] Qwen3.6-35B-A3B is much slower in Ollama 0.21.0 than llama.cpp on ROCm with the same GPU #72109

Open
opened 2026-05-05 03:29:45 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @lennarkivistik on GitHub (Apr 23, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15771

What is the issue?

On my system, Qwen3.6-35B-A3B is much slower in Ollama 0.21.0 than in standalone llama.cpp, using the same machine and the same AMD GPU.

  • Qwen3.6-35B-A3B runs much faster in llama.cpp directly on the same machine
  • gpt-oss:20b runs fast in Ollama on the same machine

This looks like an Ollama-specific performance issue for Qwen3.5 and Qwen3.6 MoE models on ROCm, possibly related to the bundled llama.cpp version, runner configuration, model handling, or default settings.

There are already several related reports: #14861, #14579, #15601

I wanted to add a ROCm reproduction with concrete side-by-side numbers from Ollama and llama.cpp.

If you need me to test anything I'm eager to help out; I can also build Ollama locally if needed.

Environment

Ollama version: 0.21.0
OS: Linux
GPU: AMD Radeon RX 7900 XTX 24 GB
CPU: Ryzen 9 7950X3D
Backend: ROCm

Observed results

Ollama: qwen3.6 MoE stays around 24.5 tok/s

Ongoing chat run:

total duration:       1m49.28193622s
load duration:        112.493407ms
prompt eval count:    980 token(s)
prompt eval duration: 830.104216ms
prompt eval rate:     1180.57 tokens/s
eval count:           2623 token(s)
eval duration:        1m47.111537835s
eval rate:            24.49 tokens/s
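
(For reference, these per-run statistics come from Ollama's verbose mode; the invocation is along these lines, assuming the pulled tag is qwen3.6:35b-a3b as used later in this thread, with the prompt entered interactively:)

ollama run qwen3.6:35b-a3b --verbose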

Ollama: gpt-oss:20b is much faster on the same machine

total duration:       2.547799345s
load duration:        120.750897ms
prompt eval count:    1472 token(s)
prompt eval duration: 443.048925ms
prompt eval rate:     3322.43 tokens/s
eval count:           227 token(s)
eval duration:        1.90281622s
eval rate:            119.30 tokens/s

So the machine and ROCm stack are capable of much higher throughput in Ollama with other models.

Standalone llama.cpp on the same machine is much faster for Qwen3.6-35B-A3B

Command:

llama-cli -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -p "Write one sentence about Arch Linux." -ngl 99 --device ROCm0

Result:

[ Prompt: 94.3 t/s | Generation: 89.3 t/s ]
llama_memory_breakdown_print: | memory breakdown [MiB]  | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (RX 7900 XTX) | 24560 = 1101 + (23164 = 20583 +    2087 +     493) +         294 |
llama_memory_breakdown_print: |   - Host                |                   725 =   515 +       0 +     210                |
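
If it helps reproducibility, llama.cpp also ships a llama-bench tool that gives steadier numbers than a single llama-cli run; a rough sketch (this is not what produced the numbers above, and the token counts are arbitrary):

llama-bench -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -ngl 99 -p 512 -n 128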

I am also interested in using imported GGUF models, but that path appears to have recent compatibility issues with newer Qwen GGUFs as well. I am not making that the primary issue here, but if the team thinks the best way to compare behavior is with a local GGUF import path, I am happy to test that too.

Relevant log output


OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.21.0

GiteaMirror added the bug label 2026-05-05 03:29:45 -05:00
Author
Owner

@AnyRock commented on GitHub (Apr 23, 2026):

I have the same problem. Also, Ollama updates always seem to revolve around gemma4, while Qwen gets almost no optimization.

Author
Owner

@chejh-amd commented on GitHub (Apr 24, 2026):

Hi @lennarkivistik, thanks for lining up the numbers side by side; that makes the comparison a lot easier to think about.

A few things that might be worth double-checking if useful:

  1. Ollama’s qwen3.6:… pull is not guaranteed to be the same blob as your local Qwen3.6-35B-A3B-UD-Q4_K_M.gguf. Different quant / tensor layout = totally different t/s. If you can, try importing the exact same GGUF you used with llama-cli and compare again.

  2. On MoE models, if a chunk of experts or routing ends up on CPU in one stack but fully on GPU in the other, decode can drop hard. Worth confirming from a debug run (layer/GPU load lines) whether all layers you expect on the 7900 XTX are actually on device for the Ollama case; see the sketch below.
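
One rough way to check that (a sketch, assuming Ollama runs as the usual systemd service and that the runner still prints llama.cpp-style offload lines; the exact log wording varies between versions):

journalctl -u ollama --since "15 min ago" | grep -iE "offload|compatibility mode"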

Author
Owner

@lennarkivistik commented on GitHub (Apr 24, 2026):

Thanks for the suggestions @chejh-amd; I'm aware of the points you made.

I did a bit more testing and tried to line up the comparison more carefully.

First, I added an “apples to apples-ish” baseline with the older qwen3:30b-a3b-thinking-2507-q4_K_M, since it is in the same rough active/model size class as qwen3.6:35b-a3b.

One thing I should clarify: I am not assuming Qwen3, Qwen3.5, and Qwen3.6 are the same internally. I know this is not a true architecture-equivalent comparison.

My understanding is roughly:

  • qwen3:30b-a3b is a more conventional sparse MoE transformer-style Qwen3 model.
  • Qwen3.5 / Qwen3-Next changed the internal design quite a bit: higher-sparsity MoE, shared experts, and a hybrid attention design using Gated DeltaNet / linear-attention-style blocks mixed with full attention.
  • Qwen3.6-35B-A3B appears to continue that newer architecture family: qwen35moe.

So I do expect Qwen3.6 to behave differently and to potentially be heavier in some places. I am not expecting identical throughput to qwen3:30b-a3b. The reason I brought up the older Qwen3 result is mainly as a sanity check: on the same ROCm system, Ollama can clearly run a similarly sized active MoE model around ~90 tok/s, while qwen3.6:35b-a3b is around ~24–25 tok/s in Ollama, even though the standalone llama-cli result for my local Qwen3.6 GGUF is much closer to the ~90 tok/s range.

qwen3:30b-a3b-thinking-2507-q4_K_M in Ollama

total duration:       9.768541354s
load duration:        59.170826ms
prompt eval count:    426 token(s)
prompt eval duration: 136.877327ms
prompt eval rate:     3112.28 tokens/s
eval count:           852 token(s)
eval duration:        9.371897025s
eval rate:            91.91 tokens/s

That result is much closer to what I see from llama-cli with Qwen3.6, while Ollama’s qwen3.6:35b-a3b path stays around ~24–25 tok/s on the same machine.

Regarding “same GGUF” testing

I agree that comparing Ollama’s pulled model vs my local GGUF may not be a perfect comparison. I tried to test the direct imported GGUF path as well.

At the moment I only have the Q2 imported variant in Ollama:

ollama run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q2_K_XL --verbose

But that currently fails to load in Ollama 0.21.0, either because GGUF import support regressed a few versions back or simply because of the "qwen35moe" architecture in these GGUFs:

Error: 500 Internal Server Error: unable to load model: /var/lib/ollama/blobs/sha256-******

Relevant sanitized log excerpt:

level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35moe
level=DEBUG source=server.go:156 msg="model not yet supported by Ollama engine, switching to compatibility mode" model=/var/lib/ollama/blobs/sha256-******
llama_model_loader: loaded meta data with 54 key-value pairs and 733 tensors from /var/lib/ollama/blobs/sha256-******
llama_model_loader: - kv 0:  general.architecture str = qwen35moe
llama_model_loader: - kv 5:  general.name str = Qwen3.6-35B-A3B
llama_model_loader: - kv 17: qwen35moe.block_count u32 = 40
llama_model_loader: - kv 18: qwen35moe.context_length u32 = 262144
llama_model_loader: - kv 25: qwen35moe.expert_count u32 = 256
llama_model_loader: - kv 26: qwen35moe.expert_used_count u32 = 8
print_info: file format = GGUF V3
print_info: file type   = Q2_K - Medium
print_info: file size   = 11.44 GiB
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen35moe'
llama_model_load_from_file_impl: failed to load model
level=INFO source=sched.go:462 msg="failed to create server" model=hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q2_K_XL error="unable to load model: /var/lib/ollama/blobs/sha256-******"

So I currently cannot use that path to do an exact same-GGUF benchmark inside Ollama, at least not with this imported Qwen3.6 GGUF. The standalone llama-cli path does load and run the local Qwen3.6-35B-A3B-UD-Q4_K_M.gguf.

Regarding GPU placement / CPU fallback

From the debug logs, Ollama is detecting the ROCm device correctly and using the ROCm backend:

load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100, VMM: no, Wave Size: 32, ID: GPU-2b91a683f3f9e991

During the failed imported Q2 run, I also see Ollama reporting the model/runner size and VRAM estimate before unload:

runner.size="25.3 GiB" runner.vram="21.1 GiB" runner.parallel=1

For the llama-cli Q4_K_M run, llama.cpp reported the model/context/compute mostly on the RX 7900 XTX:

ROCm0 (RX 7900 XTX): total 24560 MiB
model: 20583 MiB
context: 2087 MiB
compute: 493 MiB
Host: 725 MiB

So at least in llama-cli, the Q4_K_M case appears to be almost entirely GPU-resident.
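
If useful, GPU residency on the Ollama side can also be sanity-checked by watching VRAM while the model is generating; a rough sketch (rocm-smi option names may vary slightly between ROCm releases):

watch -n 1 rocm-smi --showmeminfo vram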

Runtime configuration

My Ollama service is tuned for this machine and is otherwise working very well with other models. The relevant parts are:

Environment=OLLAMA_LLM_LIBRARY=rocm
Environment=OLLAMA_LIBRARY_PATH=/usr/lib/ollama/rocm
Environment=ROCR_VISIBLE_DEVICES=GPU-2b91a683f3f9e991
Environment="LD_LIBRARY_PATH=/usr/lib/ollama/rocm:/usr/lib/ollama:/opt/rocm/lib:/opt/rocm/lib64"

Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_KV_CACHE_TYPE=q4_0
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_NUM_PARALLEL=4
Environment=OLLAMA_KEEP_ALIVE=10m
Environment=OLLAMA_MODELS=/var/lib/ollama
Environment=OLLAMA_HOST=0.0.0.0:11434
Environment=OLLAMA_DEBUG=2
Environment=OLLAMA_GPU_OVERHEAD=536870912

Environment=OPENBLAS_NUM_THREADS=16
Environment=OMP_NUM_THREADS=16

User=ollama
Group=ollama
PrivateDevices=no
LimitMEMLOCK=infinity

ExecStart=/usr/bin/ollama serve

I also tested changing OLLAMA_KV_CACHE_TYPE from q4_0 to q8_0, and it did not materially change the result.
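
For anyone reproducing this, one low-friction way to flip a single variable for such a test is a systemd drop-in rather than editing the unit file directly (a sketch; the override value is just an example):

sudo systemctl edit ollama
# add to the drop-in:
#   [Service]
#   Environment=OLLAMA_KV_CACHE_TYPE=q8_0
sudo systemctl restart ollama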

I do need Flash Attention enabled for this setup because the card has 24 GB VRAM and this MoE model is already close to the limit if I want any useful context window. I know TurboQuant or similar further context/VRAM reduction work is not implemented yet, so I am not expecting miracles there; I just want to make sure the current ROCm/Ollama path is not accidentally taking a much slower route than llama.cpp.

So the machine/ROCm stack appears capable of much higher throughput, and the slowdown seems specific to the Qwen3.6 MoE path in Ollama.

Happy to test a local build, a specific branch, or run any extra debug command if that helps narrow down whether this is model import support, compatibility mode, GPU offload placement, or a runner/kernel issue.

Author
Owner

@chejh-amd commented on GitHub (Apr 27, 2026):

Hi @lennarkivistik Solid comparison; the qwen3:30b-a3b at ~92 t/s really pins down that ROCm isn't the bottleneck. That compatibility mode fallback for qwen35moe you found in the logs lines up: the pulled qwen3.6:35b-a3b is probably taking the same slower path, since native qwen35moe support isn't in the Ollama engine yet. Once that lands the gap should mostly close.
Probably just a matter of waiting at this point, but if you do end up testing a local Ollama build with a newer llama.cpp, I would be curious what numbers you get.

Author
Owner

@lennarkivistik commented on GitHub (Apr 28, 2026):

Small update: I tested the imported GGUF path more deeply, and I found something interesting.

The direct HF import still fails:

ollama pull hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
ollama run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

This fails with:

Error: 500 Internal Server Error: unable to load model:
/var/lib/ollama/blobs/sha256-ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61

ollama show reveals why this path is different: Ollama imports the HF model as a vision-capable model with a separate projector:

Model
  architecture        qwen35moe
  parameters          34.7B
  context length      262144
  embedding length    2048
  quantization        unknown

Capabilities
  completion
  vision

Projector
  architecture        clip
  parameters          446.57M
  embedding length    1152
  dimensions          2048

In the logs this corresponds to Ollama switching away from the normal engine path:

model not yet supported by Ollama engine, switching to compatibility mode
error="split vision models aren't supported"

Then the compatibility loader fails on:

unknown model architecture: 'qwen35moe'

So I manually repackaged the exact same downloaded text GGUF blob as a local text-only Ollama model, without the projector.

Steps:

mkdir -p /data/ollama-repack/qwen36-a3b-q4

cp /var/lib/ollama/blobs/sha256-ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61 \
  /data/ollama-repack/qwen36-a3b-q4/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
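
To confirm the repack really uses the identical bytes, the copied file can be checked against the blob's digest (Ollama's blob store names files by their sha256, so the hash should match the filename):

sha256sum /data/ollama-repack/qwen36-a3b-q4/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
# expect: ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61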

Then I used this Modelfile:

FROM /data/ollama-repack/qwen36-a3b-q4/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}"""

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

PARAMETER num_ctx 8192
PARAMETER num_predict 1024
PARAMETER temperature 1
PARAMETER top_k 20
PARAMETER top_p 0.95

Created the model:

ollama create qwen3.6:a3b-q4-gguf -f /data/ollama-repack/qwen36-a3b-q4/Modelfile

That succeeded:

gathering model components
copying file sha256:ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61 100%
parsing GGUF
using existing layer sha256:ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61
writing manifest
success

Then running it worked:

ollama run qwen3.6:a3b-q4-gguf --verbose "Tell me about arch linux"

Result:

total duration:       27.092096404s
load duration:        2.371408921s
prompt eval count:    13 token(s)
prompt eval duration: 87.972274ms
prompt eval rate:     147.77 tokens/s
eval count:           1024 token(s)
eval duration:        24.182387354s
eval rate:            42.34 tokens/s

For comparison, the official Ollama model run I originally reported was:

total duration:       1m49.28193622s
load duration:        112.493407ms
prompt eval count:    980 token(s)
prompt eval duration: 830.104216ms
prompt eval rate:     1180.57 tokens/s
eval count:           2623 token(s)
eval duration:        1m47.111537835s
eval rate:            24.49 tokens/s

So the manually repackaged text-only GGUF runs at about 42.34 tok/s, compared with about 24.49 tok/s from the official Ollama model in my earlier test.

That still does not match standalone llama.cpp, where I saw around 89 tok/s, but it does suggest the situation is more nuanced than just “ROCm is slow” or “the GPU is the bottleneck”.

My current interpretation:

  1. The direct hf.co/unsloth/... import path fails because Ollama treats the repo as a split vision model with a projector.
  2. Repackaging the exact same GGUF blob as a text-only local Ollama model avoids that split vision/projector path.
  3. In that text-only path, qwen35moe does load and run.
  4. The manually repackaged GGUF is significantly faster than the official Ollama model in my previous run, but still much slower than standalone llama.cpp on the same machine.

So this may be two separate issues:

  • HF GGUF import/package handling for Qwen3.6 vision/projector repos.
  • Runtime performance gap between Ollama and standalone llama.cpp for Qwen3.6 MoE on ROCm.

@dhiltgen Hopefully I'm not disturbing you too much by asking for a bit of input; since you know Ollama inside and out, maybe this test can help the team out.

Author
Owner

@chejh-amd commented on GitHub (Apr 29, 2026):

Hi @lennarkivistik The text-only repack of the same blob is a really clean way to separate “HF vision/projector packaging” from “runtime decode path.” Thanks for digging this deep.

The ~42 tok/s vs ~24 tok/s gap on otherwise similar weights is a useful datapoint: it suggests the official pull path isn’t identical to a plain text GGUF model in practice, not just “ROCm is slow.”

The remaining gap vs standalone llama-cli on the same GGUF still looks like something worth profiling separately, but your breakdown already narrows what “slow” could mean in practice.


Reference: github-starred/ollama#72109