[GH-ISSUE #14861] Massive difference in speed between Ollama and llama.cpp with qwen3.5:35b! #56098

Open
opened 2026-04-29 10:15:36 -05:00 by GiteaMirror · 30 comments

Originally created by @chigkim on GitHub (Mar 15, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14861

Originally assigned to: @pdevine on GitHub.

### What is the issue?

There's a massive difference in speed between Ollama and llama.cpp when running qwen3.5:35b-a3b-q8_0.
I haven't seen such a big difference between Ollama and llama.cpp before.

| Engine | Prompt Tokens | PP/s | TTFT (s) | Generated Tokens | TG/s | Duration (s) |
| ------ | ------------- | ---- | -------- | ---------------- | ---- | ------------ |
| LCP | 471 | 1001.39 | 0.47 | 2245 | 53.53 | 42.41 |
| Ollama | 471 | 640.08 | 0.74 | 2362 | 32.54 | 73.33 |
| LCP | 722 | 1043.92 | 0.69 | 2373 | 53.76 | 44.83 |
| Ollama | 722 | 605.77 | 1.19 | 2664 | 32.52 | 83.12 |
| LCP | 1140 | 1078.78 | 1.06 | 2729 | 53.38 | 52.19 |
| Ollama | 1140 | 691.40 | 1.65 | 3074 | 32.50 | 96.23 |
| LCP | 1845 | 1304.84 | 1.41 | 2865 | 51.51 | 57.03 |
| Ollama | 1845 | 739.67 | 2.49 | 2742 | 31.35 | 89.95 |
| LCP | 3067 | 1077.19 | 2.85 | 2531 | 46.42 | 57.37 |
| Ollama | 3067 | 650.21 | 4.72 | 3674 | 30.92 | 123.55 |
| LCP | 4852 | 1077.11 | 4.50 | 3052 | 48.45 | 67.50 |
| Ollama | 4852 | 695.72 | 6.97 | 3037 | 31.47 | 103.49 |
| LCP | 7950 | 1183.35 | 6.72 | 3545 | 48.72 | 79.49 |
| Ollama | 7950 | 703.69 | 11.30 | 3233 | 31.08 | 115.32 |
| LCP | 12753 | 1126.50 | 11.32 | 3049 | 47.41 | 75.64 |
| Ollama | 12753 | 689.74 | 18.49 | 3597 | 30.88 | 134.97 |
| LCP | 20762 | 1035.39 | 20.05 | 3061 | 45.36 | 87.53 |
| Ollama | 20762 | 666.16 | 31.17 | 3062 | 30.38 | 131.95 |
| LCP | 33057 | 924.02 | 35.78 | 3439 | 42.66 | 116.38 |
| Ollama | 33057 | 627.12 | 52.71 | 3124 | 28.81 | 161.14 |

LCP Total duration: 12m50s
Ollama Total duration: 20m3s

Here's a comparison with gpt-oss:20b-mxfp4. Ollama is slower, but not as dramatically as with qwen3.5.

| Engine | Prompt Tokens | PP/s | TTFT (s) | Generated Tokens | TG/s | Duration (s) |
| ------ | ------------- | ---- | -------- | ---------------- | ---- | ------------ |
| LCP | 520 | 1402.91 | 0.37 | 2244 | 93.66 | 24.33 |
| Ollama | 521 | 1048.47 | 0.50 | 1122 | 73.84 | 15.69 |
| LCP | 762 | 1278.40 | 0.60 | 1927 | 93.41 | 21.23 |
| Ollama | 763 | 1024.17 | 0.74 | 2187 | 72.90 | 30.75 |
| LCP | 1157 | 1282.74 | 0.90 | 2327 | 92.86 | 25.96 |
| Ollama | 1158 | 1044.81 | 1.11 | 1959 | 72.66 | 28.07 |
| LCP | 1827 | 1407.63 | 1.30 | 2117 | 92.43 | 24.20 |
| Ollama | 1828 | 1200.94 | 1.52 | 2197 | 72.34 | 31.89 |
| LCP | 3002 | 1449.61 | 2.07 | 2704 | 91.03 | 31.78 |
| Ollama | 3003 | 1248.64 | 2.41 | 2221 | 71.65 | 33.40 |
| LCP | 4700 | 1427.46 | 3.29 | 3221 | 89.50 | 39.28 |
| Ollama | 4701 | 1228.58 | 3.83 | 2432 | 71.00 | 38.08 |
| LCP | 7550 | 1386.35 | 5.45 | 2327 | 87.75 | 31.96 |
| Ollama | 7551 | 1247.96 | 6.05 | 2768 | 69.51 | 45.87 |
| LCP | 12051 | 1326.57 | 9.08 | 2295 | 84.94 | 36.10 |
| Ollama | 12052 | 1081.05 | 11.15 | 2508 | 62.94 | 51.00 |
| LCP | 19531 | 1188.91 | 16.43 | 2141 | 75.38 | 44.83 |
| Ollama | 19532 | 1017.17 | 19.20 | 2634 | 61.81 | 61.82 |
| LCP | 31277 | 941.32 | 33.23 | 1997 | 67.43 | 62.84 |
| Ollama | 31278 | 918.82 | 34.04 | 2056 | 58.79 | 69.01 |

LCP Total duration: 7m12s
Ollama Total duration: 8m15s
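
The per-row metrics line up with the timings Ollama prints in verbose mode: `prompt eval rate` corresponds to PP/s, `prompt eval duration` roughly to TTFT, `eval rate` to TG/s, and `total duration` roughly to the Duration column. A minimal sketch of collecting one Ollama data point (the prompt is a placeholder; the llama.cpp numbers would come from its own server/bench tooling):

```
ollama run --verbose qwen3.5:35b-a3b-q8_0 "<long prompt here>"
# prompt eval rate      -> PP/s
# prompt eval duration  -> ~TTFT
# eval rate             -> TG/s
# total duration        -> ~Duration
```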

### OS

macOS

### GPU

Apple

### CPU

Apple

### Ollama version

0.18.0

GiteaMirror added the performance and bug labels 2026-04-29 10:15:37 -05:00

@rick-github commented on GitHub (Mar 15, 2026):

ollama has its own implementation of the qwen3.5 model.

| model | tps |
| ----- | --- |
| qwen3.5:35b | 89.27 |
| frob/qwen3.5:35b-a3b-blind-ud-q4_K_XL | 86.21 |
| frob/qwen3.5:35b-a3b-ud-q4_K_XL | 218.86 |

`qwen3.5:35b` is the ollama library model running on the ollama engine; `qwen3.5:35b-a3b-blind-ud` is the unsloth model running on the ollama engine; `qwen3.5:35b-a3b-ud` is the unsloth model running on the llama.cpp engine.


@chigkim commented on GitHub (Mar 15, 2026):

I guess the Ollama engine needs more optimization for qwen3.5...


@rick-github commented on GitHub (Mar 15, 2026):

It seems that way. gpt-oss was similarly a lot slower when first implemented on the ollama engine.


@jkleckner commented on GitHub (Mar 16, 2026):

Apparently the llama.cpp PR https://github.com/ggml-org/llama.cpp/pull/19504 has made a big difference for qwen3+ performance. See also https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-16072258


@chigkim commented on GitHub (Mar 16, 2026):

@jmorganca, @mxyng, is there any plan to optimize the new engine for speed? Thanks!


@BeatWolf commented on GitHub (Mar 16, 2026):

@rick-github this is massive. Are there any plans to get this fixed? Because qwen3.5 seems better in all use-cases for us, but staying with ollama now looks irresponsible (in terms of resources/costs).


@rick-github commented on GitHub (Mar 16, 2026):

I expect it will be addressed; as I mentioned, gpt-oss had similar performance issues when initially released on the ollama engine. Using the llama.cpp engine in ollama is an alternative if the situation persists.
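
For example, one of the GGUF repacks from the earlier comment runs on the llama.cpp engine inside ollama (a sketch; the model tag is taken from the table above, and the prompt is a placeholder):

```
ollama run --verbose frob/qwen3.5:35b-a3b-ud-q4_K_XL "<your prompt>"
```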


@pdevine commented on GitHub (Mar 16, 2026):

@BeatWolf Yes, and I just posted two PRs for getting it to work with the new MLX engine. I see about a 3x speed-up for the 35B 4bit integer model (from 35 tps -> 96 tps) on my M3 Max. I'm still testing on the new M5 and there's still work to be done for CUDA of course.

#14878 addresses importing the safetensors models correctly, and
#14884 fixes some outstanding issues w/ qwen3.5


@pdevine commented on GitHub (Mar 16, 2026):

@BeatWolf regarding the difference in speed w.r.t. gpt-oss, that's because GGML.org over-quantizes a number of the weights (it's also why it performs worse on most benchmarks). You can try `ollama run pdevine/gpt-oss:20b-q8_0` if you want a more "apples to apples" comparison.

Here's an example run:

```
% ./ollama run pdevine/gpt-oss:20b-q8_0 --verbose
>>> hey there
Thinking...
The user says "hey there". That's just a greeting. We need to respond in a friendly way. The instruction: "You are ChatGPT, ...". No special constraints. So reply
with a friendly greeting and maybe ask how they are.
...done thinking.

Hey! 👋 How’s it going? Anything interesting on your mind today?

total duration:       1.438575166s
load duration:        258.866083ms
prompt eval count:    69 token(s)
prompt eval duration: 307.561292ms
prompt eval rate:     224.35 tokens/s
eval count:           75 token(s)
eval duration:        826.062839ms
eval rate:            90.79 tokens/s
```

That's about 15 toks/sec faster on my machine than gpt-oss:20b, but gets a lot worse real world performance (i.e. the model is an idiot).


@BeatWolf commented on GitHub (Mar 17, 2026):

@pdevine thank you for the details, but I was really talking about qwen3.5, which for my use case (quality vs. size/speed tradeoff) beats gpt-oss easily. This is why I'm particularly sensitive to speed issues in qwen3.5.


@pdevine commented on GitHub (Mar 17, 2026):

@BeatWolf Yes, see my first message. That is about getting Qwen3.5 to go fast. I'm getting about 95 toks/sec on an M3 Max and 135 toks/sec on an M5 Max w/ the 35B-A3B model.


@lapo-luchini commented on GitHub (Mar 18, 2026):

> I see about a 3x speed-up for the 35B 4bit integer model (from 35 tps -> 96 tps) on my M3 Max.

I built v0.18.2-rc0 (which includes both PRs) merged with my Prometheus-metrics branch (but that doesn't touch the engines), and I get the same 27.88 t/s I got on v0.18.0 on an M4 Pro… how can I check which backend is used?

```
time=2026-03-18T00:51:03.273+01:00 level=INFO source=ggml.go:136 msg="" architecture=qwen35moe file_type=Q4_K_M name="" description="" num_tensors=1959 num_key_values=57
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
```

@pdevine commented on GitHub (Mar 18, 2026):

@lapo-luchini Try the `pdevine/qwen3.5:35b-a3b-int4` model.


@rocaltair commented on GitHub (Mar 19, 2026):

> @lapo-luchini Try the `pdevine/qwen3.5:35b-a3b-int4` model.

Great!!! The speed has increased from 27 tokens per second to 77 tokens per second on my mbp with M4Pro and 48GB RAM.


@rocaltair commented on GitHub (Mar 19, 2026):

@pdevine Seems like pdevine/qwen3.5:27b-int4 and pdevine/qwen3.5:0.8b-bf16 are still slow.

And what's the difference between pdevine/qwen3.5:35b-a3b-int4 and pdevine/qwen3.5:35b-a3b-int4-all?


@pdevine commented on GitHub (Mar 19, 2026):

@rocaltair Those are older test models. I'll requantize and push again.


@rocaltair commented on GitHub (Mar 20, 2026):

> @rocaltair Those are older test models. I'll requantize and push again.

Thanks for the awesome contribution! @pdevine


@lapo-luchini commented on GitHub (Mar 20, 2026):

> @lapo-luchini Try the `pdevine/qwen3.5:35b-a3b-int4` model.

I had to struggle a bit to build an MLX-compatible ollama, but the result is huge!

```
% ollama run --verbose qwen3.5:35b "please write a quick Fibonacci in Go"
total duration:       26.537859334s
eval count:           382 token(s)
eval duration:        13.391695966s
eval rate:            28.53 tokens/s

% ollama run --verbose pdevine/qwen3.5:35b-a3b-int4 "please write a quick Fibonacci in Go"
total duration:       4.992956458s
eval count:           351 token(s)
eval duration:        4.102439208s
eval rate:            85.56 tokens/s
```

What's the difference between the two models?
Does the second contain some metadata to explicitly use the MLX backend, or does it use a different quantization style that is the only one the MLX backend supports and is therefore selected automatically?


@chigkim commented on GitHub (Mar 20, 2026):

How do you run mlx models with Ollama? Do you have to build a special branch? Is there an environment variable you have to set?


@rick-github commented on GitHub (Mar 20, 2026):

Mostly you have to use a Mac. Windows support is a WIP, Linux requires installing additional libraries and is slower than GGUF and crashes a lot.

If you have a Mac, you just need to pull the model and run it, the same as the GGUF models.
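
For example, with one of the MLX builds mentioned in this thread (a sketch; the model tag is the one pdevine posted above):

```
ollama pull pdevine/qwen3.5:35b-a3b-int4
ollama run --verbose pdevine/qwen3.5:35b-a3b-int4 "please write a quick Fibonacci in Go"
```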


@chigkim commented on GitHub (Mar 22, 2026):

Yeah, I have a Mac. Where do you pull from? The [Ollama Library](https://ollama.com/search?c=cloud&o=newest) doesn't seem to have MLX models.


@aboutlo commented on GitHub (Mar 22, 2026):

Hey, I managed to get a 27b model working by following these steps:

```
huggingface-cli download Qwen/Qwen3.5-27B

uv run --with mlx-lm mlx_lm.convert \
  --hf-path Qwen/Qwen3.5-35B-A3B \
  --mlx-path ~/.cache/huggingface/hub/models--custom--Qwen/Qwen3.5-27B-mlx-4bit \
  -q --q-bits 4 --q-group-size 64

cat > /tmp/Modelfile <<'EOF'
FROM ~/.cache/huggingface/hub/models--custom--Qwen/Qwen3.5-27B-mlx-4bit
PARAMETER num_ctx 32768
PARAMETER temperature 0.2
PARAMETER presence_penalty 0
PARAMETER repeat_penalty 1
PARAMETER top_k 20
PARAMETER top_p 0.9
PARAMETER min_p 0.05
EOF

ollama create --experimental aboutlo/Qwen3.5-27B-mlx-4bit -f Modelfile
```

```
ollama run --verbose qwen3.5:27b "please write a quick Fibonacci in Go" --think=false
total duration:       1m45.375069417s
load duration:        9.6815415s
prompt eval count:    19 token(s)
prompt eval duration: 2.621114583s
prompt eval rate:     7.25 tokens/s
eval count:           413 token(s)
eval duration:        1m32.960728961s
eval rate:            4.44 tokens/s

ollama run --verbose qwen35-27b-mlx-4bit:latest "please write a quick Fibonacci in Go" --think=false
total duration:       1m7.109506709s
load duration:        17.713709ms
prompt eval count:    20 token(s)
prompt eval duration: 848.72825ms
prompt eval rate:     23.56 tokens/s
eval count:           423 token(s)
eval duration:        1m6.242530667s
eval rate:            6.39 tokens/s
```

But when I tried the same approach to re-encode Qwen/Qwen3.5-35B-A3B with my custom Modelfile, I ended up with a crash at runtime.

`ollama run --verbose aboutlo/Qwen3.5-35B-A3B-mlx-4bit:latest "Hello" --think=false`

```
time=2026-03-22T12:47:58.673Z level=INFO source=sched.go:484 msg="system memory" total="32.0 GiB" free="22.5 GiB" free_swap="0 B"
time=2026-03-22T12:47:58.673Z level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="20.8 GiB" free="21.3 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-03-22T12:47:58.673Z level=INFO source=client.go:365 msg="starting mlx runner subprocess" model=aboutlo/Qwen3.5-35B-A3B-mlx-4bit:latest port=55268
time=2026-03-22T12:47:58.675Z level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-03-22T12:47:58.699Z level=INFO source=server.go:32 msg="MLX engine initialized" "MLX version"=0.30.6-0-g185b06d device=gpu
time=2026-03-22T12:47:58.708Z level=INFO source=base.go:67 msg="Model architecture" arch=Qwen3_5MoeForConditionalGeneration
time=2026-03-22T12:47:58.992Z level=INFO source=runner.go:135 msg="Loaded tensors from manifest" count=1757
time=2026-03-22T12:48:06.677Z level=INFO source=runner.go:169 msg="Starting HTTP server" host=127.0.0.1 port=55268
time=2026-03-22T12:48:06.776Z level=INFO source=server.go:183 msg=ServeHTTP method=GET path=/v1/status took=34.708µs status="200 OK"
time=2026-03-22T12:48:06.776Z level=INFO source=client.go:99 msg="mlx runner is ready" port=55268
time=2026-03-22T12:48:06.777Z level=INFO source=cache.go:173 msg="Cache miss" left=14
time=2026-03-22T12:48:06.778Z level=INFO source=pipeline.go:55 msg="peak memory" size="18.17 GiB"
panic: runtime error: index out of range [0] with length 0

goroutine 11 [running]:
github.com/ollama/ollama/x/models/qwen3_5.(*SparseMoE).Forward(0x14000987860, 0x1400131c940, 0x140008180f0)
	/Users/runner/work/ollama/ollama/x/models/qwen3_5/qwen3_5.go:1304 +0x278
github.com/ollama/ollama/x/models/qwen3_5.(*Layer).Forward(0x14000842f00, 0x1400131c340, {0x1037ca498, 0x1400131e300}, 0x1, 0xd, 0x140008180f0)
	/Users/runner/work/ollama/ollama/x/models/qwen3_5/qwen3_5.go:1335 +0x10c
github.com/ollama/ollama/x/models/qwen3_5.(*Model).Forward(0x1400011a120, 0x1400131c2c0, {0x14000154008, 0x28, 0x140013298f0?})
	/Users/runner/work/ollama/ollama/x/models/qwen3_5/qwen3_5.go:1349 +0xb8
github.com/ollama/ollama/x/mlxrunner.(*Runner).TextGenerationPipeline(0x1400011a5a0, {{{0x1400134a000, 0x4a}, {0x3e4ccccd, 0x3f666666, 0x3d4ccccd, 0x14, 0x40, 0x0, 0x3fff2, ...}}, ...})
	/Users/runner/work/ollama/ollama/x/mlxrunner/pipeline.go:105 +0x594
github.com/ollama/ollama/x/mlxrunner.(*Runner).Run.func1()
	/Users/runner/work/ollama/ollama/x/mlxrunner/runner.go:148 +0x110
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/Users/runner/go/pkg/mod/golang.org/x/sync@v0.17.0/errgroup/errgroup.go:93 +0x54
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1
	/Users/runner/go/pkg/mod/golang.org/x/sync@v0.17.0/errgroup/errgroup.go:78 +0x94
```

It would be super interesting to understand how pdevine/qwen3.5:35b-a3b-int4 has been encoded :)


@aboutlo commented on GitHub (Mar 22, 2026):

Never mind, I figured out how to encode it using ollama directly rather than `mlx-lm`:

`ollama create --experimental -q int4 aboutlo/Qwen3.5-35B-A3B-int4 -f Modelfile`

Modelfile:

```
FROM Users/YOUR_USER/.cache/huggingface/hub/models--Qwen--Qwen3.5-35B-A3B/snapshots/ec2d4ece1ffb563322cbee9a48fe0e3fcbce0307
TEMPLATE {{ .Prompt }}
RENDERER qwen3.5
PARSER qwen3.5
PARAMETER num_ctx 32768
PARAMETER repeat_penalty 1
PARAMETER temperature 0.2
PARAMETER top_k 20
PARAMETER top_p 0.9
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
PARAMETER presence_penalty 0
```

@pdevine commented on GitHub (Mar 22, 2026):

> It would be super interesting to understand how `pdevine/qwen3.5:35b-a3b-int4` has been encoded :)

Hey, sorry, I had an entire write-up on this that I misplaced, but yes, the format has changed. The short answer is that I changed the safetensors format to better support quantizations, and I also changed the way the experts are packed to make loading them more efficient. You should be able to get it to work with `ollama create -f <path/to/Modelfile> --experimental <modelname>`. Use the `--quantize` parameter if you want to quantize it.


@pdevine commented on GitHub (Mar 24, 2026):

I've posted:

- pdevine/qwen3.5:27b-coding-nvfp4 (20GB)
- pdevine/qwen3.5:27b-int4 (16GB)
- pdevine/qwen3.5:27b-nvfp4 (20GB)
- pdevine/qwen3.5:35b-a3b-coding-mxfp8 (38GB)
- pdevine/qwen3.5:35b-a3b-coding-nvfp4 (22GB)
- pdevine/qwen3.5:35b-a3b-int4 (20GB)
- pdevine/qwen3.5:35b-a3b-int8 (38GB)
- pdevine/qwen3.5:35b-a3b-nvfp4 (22GB)

Some notes:

- You'll need to build from main to get working support for mxfp8/nvfp4, or wait until the next version of Ollama.
- The 'coding' models have the hyperparameters correctly set for coding/agentic tasks (as defined by Alibaba). I've tested these out with Claude and Pi and they work quite well.
- Each of the models is imported from the HuggingFace BF16 models of Qwen3.5-35B-A3B or Qwen3.5-27B, except the mxfp8 models, which were converted from the FP8 versions of those same models.
- I do not recommend the affine 4-bit (-int4) models over their nvfp4 counterparts. The int4 models use a high group size (equivalent to q4_1 with group size 64), although I did quantize some of the tensors in those models at int8 instead of entirely int4 so the model doesn't act like a complete idiot. This is different from what the mlx-community publishes on HuggingFace. YMMV. Use the nvfp4 models if you want accuracy, or the int4 models if you want pure speed (but not great accuracy).

If you want to use the models which don't have coding in their name to do coding, you'll need to set these values in a Modelfile and recreate the model:

```
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER presence_penalty 0.0
PARAMETER repeat_penalty 1.0
```
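
Putting those values together, a minimal sketch of such a Modelfile (the base tag here is one of the nvfp4 models listed above; the created model name is a placeholder):

```
FROM pdevine/qwen3.5:35b-a3b-nvfp4
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER presence_penalty 0.0
PARAMETER repeat_penalty 1.0
```

then recreate it with something like `ollama create my-qwen3.5-coding -f Modelfile`.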

@BeatWolf commented on GitHub (Mar 24, 2026):

so, is the default ollama model a good choice or should other quants be used?


@pdevine commented on GitHub (Mar 24, 2026):

> so, is the default ollama model a good choice or should other quants be used?

The default model runs on GGML, not MLX. These models are for the experimental MLX engine and for now will only run on a Mac (although CUDA support is coming really soon).


@rocaltair commented on GitHub (Mar 25, 2026):

I think we need a label like `Cloud` to filter `MLX` models on https://ollama.com/search.


@athuljayaram commented on GitHub (Mar 27, 2026):

Getting an error with qwen3:

```
API Error: 500 {"type":"error","error":{"type":"api_error","message":"model requires more system memory (16.8 GiB) than is available (12.3 GiB)"},"request_id":"req_60ada344b2d69bf5252ed31f"}

ollama run qwen3.5:9b
pulling manifest
pulling dec52a44569a: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 6.6 GB
pulling 7339fa418c9a: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 11 KB
pulling 9371364b27a5: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 65 B
pulling be595b49fe22: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 475 B
verifying sha256 digest
writing manifest
success
Error: 500 Internal Server Error: model requires more system memory (16.8 GiB) than is available (13.8 GiB)
```


@baptistejamin commented on GitHub (Apr 19, 2026):

I noticed exactly the same thing yesterday with Qwen 3.6: Ollama is significantly slower compared to llama.cpp on the same Q4 quant of 35B-A3B on an L40S GPU.

Test with: 30 prompts, ~4000 input tokens / 200 output tokens each.

Ollama version: latest stable (0.21.0); llama.cpp: built from the latest Docker image on the same date (b8833, ghcr.io/ggml-org/llama.cpp:server-cuda).

| metric | llama.cpp | ollama | delta |
| ------ | --------- | ------ | ----- |
| TTFT | 0.79 s | 1.15 s | +45% slower |
| Prompt eval | 5393 t/s | 4436 t/s | −18% |
| Generation | 144.7 t/s | 115.5 t/s | −20% |
| Total time | 2.17 s | 2.98 s | +37% slower |

From what I understand, the GGML implementation in Ollama is a bit outdated and is missing a few optimizations that have been implemented around qwen35moe (gated delta net, the new topk_moe API, etc.).

I'll run some tests today to confirm it's the cause. 37% is massive, and Qwen 3.6 is super good, so I really think this should be tackled.

Reference: github-starred/ollama#56098