[GH-ISSUE #15771] Qwen3.6-35B-A3B is much slower in Ollama 0.21.0 than llama.cpp on ROCm with the same GPU #72109

Open
opened 2026-05-05 03:29:45 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @lennarkivistik on GitHub (Apr 23, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15771

What is the issue?

On my system, Qwen3.6-35B-A3B is much slower in Ollama 0.21.0 than in standalone llama.cpp, using the same machine and the same AMD GPU.

  • Qwen3.6-35B-A3B runs much faster in llama.cpp directly on the same machine
  • gpt-oss:20b runs fast in Ollama on the same machine

This looks like an Ollama-specific performance issue for Qwen3.5 and Qwen3.6 MoE models on ROCm, possibly related to the bundled llama.cpp version, runner configuration, model handling, or default settings.

There are already several related reports: #14861, #14579, #15601

I wanted to add a ROCm reproduction with concrete side-by-side numbers from Ollama and llama.cpp.

If you need me to test anything I'm eager to help out; I can also build Ollama locally if needed.

Environment

Ollama version: 0.21.0
OS: Linux
GPU: AMD Radeon RX 7900 XTX 24 GB
CPU: Ryzen 9 7950X3D
Backend: ROCm

Observed results

Ollama: qwen3.6 MoE stays around 24.5 tok/s

Ongoing chat run:

total duration:       1m49.28193622s
load duration:        112.493407ms
prompt eval count:    980 token(s)
prompt eval duration: 830.104216ms
prompt eval rate:     1180.57 tokens/s
eval count:           2623 token(s)
eval duration:        1m47.111537835s
eval rate:            24.49 tokens/s
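
(For reference, these per-run statistics come from Ollama's verbose mode; the invocation is along these lines, assuming the pulled tag is qwen3.6:35b-a3b as used later in this thread, with the prompt entered interactively:)

ollama run qwen3.6:35b-a3b --verbose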

Ollama: gpt-oss:20b is much faster on the same machine

total duration:       2.547799345s
load duration:        120.750897ms
prompt eval count:    1472 token(s)
prompt eval duration: 443.048925ms
prompt eval rate:     3322.43 tokens/s
eval count:           227 token(s)
eval duration:        1.90281622s
eval rate:            119.30 tokens/s

So the machine and ROCm stack are capable of much higher throughput in Ollama with other models.

Standalone llama.cpp on the same machine is much faster for Qwen3.6-35B-A3B

Command:

llama-cli -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -p "Write one sentence about Arch Linux." -ngl 99 --device ROCm0

Result:

[ Prompt: 94.3 t/s | Generation: 89.3 t/s ]
llama_memory_breakdown_print: | memory breakdown [MiB]  | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - ROCm0 (RX 7900 XTX) | 24560 = 1101 + (23164 = 20583 +    2087 +     493) +         294 |
llama_memory_breakdown_print: |   - Host                |                   725 =   515 +       0 +     210                |
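
If it helps reproducibility, llama.cpp also ships a llama-bench tool that gives steadier numbers than a single llama-cli run; a rough sketch (this is not what produced the numbers above, and the token counts are arbitrary):

llama-bench -m ./Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -ngl 99 -p 512 -n 128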

I am also interested in using imported GGUF models, but that path appears to have recent compatibility issues with newer Qwen GGUFs as well. I am not making that the primary issue here, but if the team thinks the best way to compare behavior is with a local GGUF import path, I am happy to test that too.

Relevant log output


OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.21.0

GiteaMirror added the bug label 2026-05-05 03:29:45 -05:00
Author
Owner

@AnyRock commented on GitHub (Apr 23, 2026):

I have the same problem. Also, Ollama updates always seem to revolve around gemma4, while Qwen gets almost no optimization.

Author
Owner

@chejh-amd commented on GitHub (Apr 24, 2026):

Hi @lennarkivistik, thanks for lining up the numbers side by side; that makes the comparison a lot easier to think about.

A few things that might be worth double-checking if useful:

  1. Ollama’s qwen3.6:… pull is not guaranteed to be the same blob as your local Qwen3.6-35B-A3B-UD-Q4_K_M.gguf. Different quant / tensor layout = totally different t/s. If you can, try importing the exact same GGUF you used with llama-cli and compare again.

  2. On MoE models, if a chunk of experts or routing ends up on CPU in one stack but fully on GPU in the other, decode can drop hard. Worth confirming from a debug run (layer/GPU load lines) whether all layers you expect on the 7900 XTX are actually on device for the Ollama case; see the sketch below.
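
One rough way to check that (a sketch, assuming Ollama runs as the usual systemd service and that the runner still prints llama.cpp-style offload lines; the exact log wording varies between versions):

journalctl -u ollama --since "15 min ago" | grep -iE "offload|compatibility mode"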

Author
Owner

@lennarkivistik commented on GitHub (Apr 24, 2026):

Thanks for the suggestions @chejh-amd; I'm aware of the points you made.

I did a bit more testing and tried to line up the comparison more carefully.

First, I added an “apples to apples-ish” baseline with the older qwen3:30b-a3b-thinking-2507-q4_K_M, since it is in the same rough active/model size class as qwen3.6:35b-a3b.

One thing I should clarify: I am not assuming Qwen3, Qwen3.5, and Qwen3.6 are the same internally. I know this is not a true architecture-equivalent comparison.

My understanding is roughly:

  • qwen3:30b-a3b is a more conventional sparse MoE transformer-style Qwen3 model.
  • Qwen3.5 / Qwen3-Next changed the internal design quite a bit: higher-sparsity MoE, shared experts, and a hybrid attention design using Gated DeltaNet / linear-attention-style blocks mixed with full attention.
  • Qwen3.6-35B-A3B appears to continue that newer architecture family: qwen35moe.

So I do expect Qwen3.6 to behave differently and to potentially be heavier in some places. I am not expecting identical throughput to qwen3:30b-a3b. The reason I brought up the older Qwen3 result is mainly as a sanity check: on the same ROCm system, Ollama can clearly run a similarly sized active MoE model around ~90 tok/s, while qwen3.6:35b-a3b is around ~24–25 tok/s in Ollama, even though the standalone llama-cli result for my local Qwen3.6 GGUF is much closer to the ~90 tok/s range.

qwen3:30b-a3b-thinking-2507-q4_K_M in Ollama

total duration:       9.768541354s
load duration:        59.170826ms
prompt eval count:    426 token(s)
prompt eval duration: 136.877327ms
prompt eval rate:     3112.28 tokens/s
eval count:           852 token(s)
eval duration:        9.371897025s
eval rate:            91.91 tokens/s

That result is much closer to what I see from llama-cli with Qwen3.6, while Ollama’s qwen3.6:35b-a3b path stays around ~24–25 tok/s on the same machine.

Regarding “same GGUF” testing

I agree that comparing Ollama’s pulled model vs my local GGUF may not be a perfect comparison. I tried to test the direct imported GGUF path as well.

At the moment I only have the Q2 imported variant in Ollama:

ollama run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q2_K_XL --verbose

But that currently fails to load in Ollama 0.21.0, either because GGUF import support regressed a few versions back or simply because of the "qwen35moe" architecture in these GGUFs:

Error: 500 Internal Server Error: unable to load model: /var/lib/ollama/blobs/sha256-******

Relevant sanitized log excerpt:

level=WARN source=sched.go:423 msg="model architecture does not currently support parallel requests" architecture=qwen35moe
level=DEBUG source=server.go:156 msg="model not yet supported by Ollama engine, switching to compatibility mode" model=/var/lib/ollama/blobs/sha256-******
llama_model_loader: loaded meta data with 54 key-value pairs and 733 tensors from /var/lib/ollama/blobs/sha256-******
llama_model_loader: - kv 0:  general.architecture str = qwen35moe
llama_model_loader: - kv 5:  general.name str = Qwen3.6-35B-A3B
llama_model_loader: - kv 17: qwen35moe.block_count u32 = 40
llama_model_loader: - kv 18: qwen35moe.context_length u32 = 262144
llama_model_loader: - kv 25: qwen35moe.expert_count u32 = 256
llama_model_loader: - kv 26: qwen35moe.expert_used_count u32 = 8
print_info: file format = GGUF V3
print_info: file type   = Q2_K - Medium
print_info: file size   = 11.44 GiB
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen35moe'
llama_model_load_from_file_impl: failed to load model
level=INFO source=sched.go:462 msg="failed to create server" model=hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q2_K_XL error="unable to load model: /var/lib/ollama/blobs/sha256-******"

So I currently cannot use that path to do an exact same-GGUF benchmark inside Ollama, at least not with this imported Qwen3.6 GGUF. The standalone llama-cli path does load and run the local Qwen3.6-35B-A3B-UD-Q4_K_M.gguf.

Regarding GPU placement / CPU fallback

From the debug logs, Ollama is detecting the ROCm device correctly and using the ROCm backend:

load_backend: loaded ROCm backend from /usr/lib/ollama/rocm/libggml-hip.so
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, gfx1100, VMM: no, Wave Size: 32, ID: GPU-2b91a683f3f9e991

During the failed imported Q2 run, I also see Ollama reporting the model/runner size and VRAM estimate before unload:

runner.size="25.3 GiB" runner.vram="21.1 GiB" runner.parallel=1

For the llama-cli Q4_K_M run, llama.cpp reported the model/context/compute mostly on the RX 7900 XTX:

ROCm0 (RX 7900 XTX): total 24560 MiB
model: 20583 MiB
context: 2087 MiB
compute: 493 MiB
Host: 725 MiB

So at least in llama-cli, the Q4_K_M case appears to be almost entirely GPU-resident.
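
If useful, GPU residency on the Ollama side can also be sanity-checked by watching VRAM while the model is generating; a rough sketch (rocm-smi option names may vary slightly between ROCm releases):

watch -n 1 rocm-smi --showmeminfo vram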

Runtime configuration

My Ollama service is tuned for this machine and is otherwise working very well with other models. The relevant parts are:

Environment=OLLAMA_LLM_LIBRARY=rocm
Environment=OLLAMA_LIBRARY_PATH=/usr/lib/ollama/rocm
Environment=ROCR_VISIBLE_DEVICES=GPU-2b91a683f3f9e991
Environment="LD_LIBRARY_PATH=/usr/lib/ollama/rocm:/usr/lib/ollama:/opt/rocm/lib:/opt/rocm/lib64"

Environment=OLLAMA_FLASH_ATTENTION=1
Environment=OLLAMA_KV_CACHE_TYPE=q4_0
Environment=OLLAMA_MAX_LOADED_MODELS=1
Environment=OLLAMA_NUM_PARALLEL=4
Environment=OLLAMA_KEEP_ALIVE=10m
Environment=OLLAMA_MODELS=/var/lib/ollama
Environment=OLLAMA_HOST=0.0.0.0:11434
Environment=OLLAMA_DEBUG=2
Environment=OLLAMA_GPU_OVERHEAD=536870912

Environment=OPENBLAS_NUM_THREADS=16
Environment=OMP_NUM_THREADS=16

User=ollama
Group=ollama
PrivateDevices=no
LimitMEMLOCK=infinity

ExecStart=/usr/bin/ollama serve

I also tested changing OLLAMA_KV_CACHE_TYPE from q4_0 to q8_0, and it did not materially change the result.
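
For anyone reproducing this, one low-friction way to flip a single variable for such a test is a systemd drop-in rather than editing the unit file directly (a sketch; the override value is just an example):

sudo systemctl edit ollama
# add to the drop-in:
#   [Service]
#   Environment=OLLAMA_KV_CACHE_TYPE=q8_0
sudo systemctl restart ollama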

I do need Flash Attention enabled for this setup because the card has 24 GB VRAM and this MoE model is already close to the limit if I want any useful context window. I know TurboQuant or similar further context/VRAM reduction work is not implemented yet, so I am not expecting miracles there; I just want to make sure the current ROCm/Ollama path is not accidentally taking a much slower route than llama.cpp.

So the machine/ROCm stack appears capable of much higher throughput, and the slowdown seems specific to the Qwen3.6 MoE path in Ollama.

Happy to test a local build, a specific branch, or run any extra debug command if that helps narrow down whether this is model import support, compatibility mode, GPU offload placement, or a runner/kernel issue.

Author
Owner

@chejh-amd commented on GitHub (Apr 27, 2026):

Hi @lennarkivistik Solid comparison; the qwen3:30b-a3b at ~92 t/s really pins down that ROCm isn't the bottleneck. That compatibility mode fallback for qwen35moe you found in the logs lines up: the pulled qwen3.6:35b-a3b is probably taking the same slower path, since native qwen35moe support isn't in the Ollama engine yet. Once that lands the gap should mostly close.
Probably just a matter of waiting at this point, but if you do end up testing a local Ollama build with a newer llama.cpp, I would be curious what numbers you get.

Author
Owner

@lennarkivistik commented on GitHub (Apr 28, 2026):

Small update: I tested the imported GGUF path more deeply, and I found something interesting.

The direct HF import still fails:

ollama pull hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M
ollama run hf.co/unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

This fails with:

Error: 500 Internal Server Error: unable to load model:
/var/lib/ollama/blobs/sha256-ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61

ollama show reveals why this path is different: Ollama imports the HF model as a vision-capable model with a separate projector:

Model
  architecture        qwen35moe
  parameters          34.7B
  context length      262144
  embedding length    2048
  quantization        unknown

Capabilities
  completion
  vision

Projector
  architecture        clip
  parameters          446.57M
  embedding length    1152
  dimensions          2048

In the logs this corresponds to Ollama switching away from the normal engine path:

model not yet supported by Ollama engine, switching to compatibility mode
error="split vision models aren't supported"

Then the compatibility loader fails on:

unknown model architecture: 'qwen35moe'

So I manually repackaged the exact same downloaded text GGUF blob as a local text-only Ollama model, without the projector.

Steps:

mkdir -p /data/ollama-repack/qwen36-a3b-q4

cp /var/lib/ollama/blobs/sha256-ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61 \
  /data/ollama-repack/qwen36-a3b-q4/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
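
To confirm the repack really uses the identical bytes, the copied file can be checked against the blob's digest (Ollama's blob store names files by their sha256, so the hash should match the filename):

sha256sum /data/ollama-repack/qwen36-a3b-q4/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
# expect: ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61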

Then I used this Modelfile:

FROM /data/ollama-repack/qwen36-a3b-q4/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}"""

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"

PARAMETER num_ctx 8192
PARAMETER num_predict 1024
PARAMETER temperature 1
PARAMETER top_k 20
PARAMETER top_p 0.95

Created the model:

ollama create qwen3.6:a3b-q4-gguf -f /data/ollama-repack/qwen36-a3b-q4/Modelfile

That succeeded:

gathering model components
copying file sha256:ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61 100%
parsing GGUF
using existing layer sha256:ac0e2c1189e055faa36eff361580e79c5bd6f8e76bffb4ce547f167d53e31a61
writing manifest
success

Then running it worked:

ollama run qwen3.6:a3b-q4-gguf --verbose "Tell me about arch linux"

Result:

total duration:       27.092096404s
load duration:        2.371408921s
prompt eval count:    13 token(s)
prompt eval duration: 87.972274ms
prompt eval rate:     147.77 tokens/s
eval count:           1024 token(s)
eval duration:        24.182387354s
eval rate:            42.34 tokens/s

For comparison, the official Ollama model run I originally reported was:

total duration:       1m49.28193622s
load duration:        112.493407ms
prompt eval count:    980 token(s)
prompt eval duration: 830.104216ms
prompt eval rate:     1180.57 tokens/s
eval count:           2623 token(s)
eval duration:        1m47.111537835s
eval rate:            24.49 tokens/s

So the manually repackaged text-only GGUF runs at about 42.34 tok/s, compared with about 24.49 tok/s from the official Ollama model in my earlier test.

That still does not match standalone llama.cpp, where I saw around 89 tok/s, but it does suggest the situation is more nuanced than just “ROCm is slow” or “the GPU is the bottleneck”.

My current interpretation:

  1. The direct hf.co/unsloth/... import path fails because Ollama treats the repo as a split vision model with a projector.
  2. Repackaging the exact same GGUF blob as a text-only local Ollama model avoids that split vision/projector path.
  3. In that text-only path, qwen35moe does load and run.
  4. The manually repackaged GGUF is significantly faster than the official Ollama model in my previous run, but still much slower than standalone llama.cpp on the same machine.

So this may be two separate issues:

  • HF GGUF import/package handling for Qwen3.6 vision/projector repos.
  • Runtime performance gap between Ollama and standalone llama.cpp for Qwen3.6 MoE on ROCm.

@dhiltgen Hopefully I'm not disturbing you too much by asking for a bit of input; since you know Ollama inside and out, maybe this test can help the team out.

Author
Owner

@chejh-amd commented on GitHub (Apr 29, 2026):

Hi @lennarkivistik The text-only repack of the same blob is a really clean way to separate “HF vision/projector packaging” from “runtime decode path.” Thanks for digging this deep.

The ~42 tok/s vs ~24 tok/s gap on otherwise similar weights is a useful datapoint: it suggests the official pull path isn’t identical to a plain text GGUF model in practice, not just “ROCm is slow.”

The remaining gap vs standalone llama-cli on the same GGUF still looks like something worth profiling separately, but your breakdown already narrows what “slow” could mean in practice.


Reference: github-starred/ollama#72109