[GH-ISSUE #14861] Massive difference in speed between Ollama and llama.cpp with qwen3.5:35b! #56098

Open
opened 2026-04-29 10:15:36 -05:00 by GiteaMirror · 30 comments

Originally created by @chigkim on GitHub (Mar 15, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14861

Originally assigned to: @pdevine on GitHub.

### What is the issue?

There's a massive difference in speed between Ollama and llama.cpp when running qwen3.5:35b-a3b-q8_0.
I haven't seen such a big difference between Ollama and llama.cpp before.

| Engine | Prompt Tokens | PP/s | TTFT (s) | Generated Tokens | TG/s | Duration (s) |
| ------ | ------------- | ---- | -------- | ---------------- | ---- | ------------ |
| LCP | 471 | 1001.39 | 0.47 | 2245 | 53.53 | 42.41 |
| Ollama | 471 | 640.08 | 0.74 | 2362 | 32.54 | 73.33 |
| LCP | 722 | 1043.92 | 0.69 | 2373 | 53.76 | 44.83 |
| Ollama | 722 | 605.77 | 1.19 | 2664 | 32.52 | 83.12 |
| LCP | 1140 | 1078.78 | 1.06 | 2729 | 53.38 | 52.19 |
| Ollama | 1140 | 691.40 | 1.65 | 3074 | 32.50 | 96.23 |
| LCP | 1845 | 1304.84 | 1.41 | 2865 | 51.51 | 57.03 |
| Ollama | 1845 | 739.67 | 2.49 | 2742 | 31.35 | 89.95 |
| LCP | 3067 | 1077.19 | 2.85 | 2531 | 46.42 | 57.37 |
| Ollama | 3067 | 650.21 | 4.72 | 3674 | 30.92 | 123.55 |
| LCP | 4852 | 1077.11 | 4.50 | 3052 | 48.45 | 67.50 |
| Ollama | 4852 | 695.72 | 6.97 | 3037 | 31.47 | 103.49 |
| LCP | 7950 | 1183.35 | 6.72 | 3545 | 48.72 | 79.49 |
| Ollama | 7950 | 703.69 | 11.30 | 3233 | 31.08 | 115.32 |
| LCP | 12753 | 1126.50 | 11.32 | 3049 | 47.41 | 75.64 |
| Ollama | 12753 | 689.74 | 18.49 | 3597 | 30.88 | 134.97 |
| LCP | 20762 | 1035.39 | 20.05 | 3061 | 45.36 | 87.53 |
| Ollama | 20762 | 666.16 | 31.17 | 3062 | 30.38 | 131.95 |
| LCP | 33057 | 924.02 | 35.78 | 3439 | 42.66 | 116.38 |
| Ollama | 33057 | 627.12 | 52.71 | 3124 | 28.81 | 161.14 |

LCP Total duration: 12m50s
Ollama Total duration: 20m3s

Here's a comparison with gpt-oss:20b-mxfp4. Ollama is slower, but not as dramatically as with qwen3.5.

| Engine | Prompt Tokens | PP/s | TTFT (s) | Generated Tokens | TG/s | Duration (s) |
| ------ | ------------- | ---- | -------- | ---------------- | ---- | ------------ |
| LCP | 520 | 1402.91 | 0.37 | 2244 | 93.66 | 24.33 |
| Ollama | 521 | 1048.47 | 0.50 | 1122 | 73.84 | 15.69 |
| LCP | 762 | 1278.40 | 0.60 | 1927 | 93.41 | 21.23 |
| Ollama | 763 | 1024.17 | 0.74 | 2187 | 72.90 | 30.75 |
| LCP | 1157 | 1282.74 | 0.90 | 2327 | 92.86 | 25.96 |
| Ollama | 1158 | 1044.81 | 1.11 | 1959 | 72.66 | 28.07 |
| LCP | 1827 | 1407.63 | 1.30 | 2117 | 92.43 | 24.20 |
| Ollama | 1828 | 1200.94 | 1.52 | 2197 | 72.34 | 31.89 |
| LCP | 3002 | 1449.61 | 2.07 | 2704 | 91.03 | 31.78 |
| Ollama | 3003 | 1248.64 | 2.41 | 2221 | 71.65 | 33.40 |
| LCP | 4700 | 1427.46 | 3.29 | 3221 | 89.50 | 39.28 |
| Ollama | 4701 | 1228.58 | 3.83 | 2432 | 71.00 | 38.08 |
| LCP | 7550 | 1386.35 | 5.45 | 2327 | 87.75 | 31.96 |
| Ollama | 7551 | 1247.96 | 6.05 | 2768 | 69.51 | 45.87 |
| LCP | 12051 | 1326.57 | 9.08 | 2295 | 84.94 | 36.10 |
| Ollama | 12052 | 1081.05 | 11.15 | 2508 | 62.94 | 51.00 |
| LCP | 19531 | 1188.91 | 16.43 | 2141 | 75.38 | 44.83 |
| Ollama | 19532 | 1017.17 | 19.20 | 2634 | 61.81 | 61.82 |
| LCP | 31277 | 941.32 | 33.23 | 1997 | 67.43 | 62.84 |
| Ollama | 31278 | 918.82 | 34.04 | 2056 | 58.79 | 69.01 |

LCP Total duration: 7m12s
Ollama Total duration: 8m15s
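
The per-row metrics line up with the timings Ollama prints in verbose mode: `prompt eval rate` corresponds to PP/s, `prompt eval duration` roughly to TTFT, `eval rate` to TG/s, and `total duration` roughly to the Duration column. A minimal sketch of collecting one Ollama data point (the prompt is a placeholder; the llama.cpp numbers would come from its own server/bench tooling):

```
ollama run --verbose qwen3.5:35b-a3b-q8_0 "<long prompt here>"
# prompt eval rate      -> PP/s
# prompt eval duration  -> ~TTFT
# eval rate             -> TG/s
# total duration        -> ~Duration
```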

### OS

macOS

### GPU

Apple

### CPU

Apple

### Ollama version

0.18.0

GiteaMirror added the performance and bug labels 2026-04-29 10:15:37 -05:00

@rick-github commented on GitHub (Mar 15, 2026):

ollama has its own implementation of the qwen3.5 model.

| model | tps |
| ----- | --- |
| qwen3.5:35b | 89.27 |
| frob/qwen3.5:35b-a3b-blind-ud-q4_K_XL | 86.21 |
| frob/qwen3.5:35b-a3b-ud-q4_K_XL | 218.86 |

`qwen3.5:35b` is the ollama library model running on the ollama engine; `qwen3.5:35b-a3b-blind-ud` is the unsloth model running on the ollama engine; `qwen3.5:35b-a3b-ud` is the unsloth model running on the llama.cpp engine.


@chigkim commented on GitHub (Mar 15, 2026):

I guess the Ollama engine needs more optimization for qwen3.5...


@rick-github commented on GitHub (Mar 15, 2026):

It seems that way. gpt-oss was similarly a lot slower when first implemented on the ollama engine.


@jkleckner commented on GitHub (Mar 16, 2026):

Apparently the llama.cpp PR https://github.com/ggml-org/llama.cpp/pull/19504 has made a big difference for qwen3+ performance. See also https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-16072258


@chigkim commented on GitHub (Mar 16, 2026):

@jmorganca, @mxyng, is there any plan to optimize the new engine for speed? Thanks!


@BeatWolf commented on GitHub (Mar 16, 2026):

@rick-github this is massive. Are there any plans to get this fixed? Because qwen3.5 seems better in all use-cases for us, but staying with ollama now looks irresponsible (in terms of resources/costs).


@rick-github commented on GitHub (Mar 16, 2026):

I expect it will be addressed; as I mentioned, gpt-oss had similar performance issues when initially released on the ollama engine. Using the llama.cpp engine in ollama is an alternative if the situation persists.
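
For example, one of the GGUF repacks from the earlier comment runs on the llama.cpp engine inside ollama (a sketch; the model tag is taken from the table above, and the prompt is a placeholder):

```
ollama run --verbose frob/qwen3.5:35b-a3b-ud-q4_K_XL "<your prompt>"
```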


@pdevine commented on GitHub (Mar 16, 2026):

@BeatWolf Yes, and I just posted two PRs for getting it to work with the new MLX engine. I see about a 3x speed-up for the 35B 4bit integer model (from 35 tps -> 96 tps) on my M3 Max. I'm still testing on the new M5 and there's still work to be done for CUDA of course.

#14878 addresses importing the safetensors models correctly, and
#14884 fixes some outstanding issues w/ qwen3.5


@pdevine commented on GitHub (Mar 16, 2026):

@BeatWolf regarding the difference in speed w.r.t. gpt-oss, that's because GGML.org over-quantizes a number of the weights (it's also why it performs worse on most benchmarks). You can try `ollama run pdevine/gpt-oss:20b-q8_0` if you want a more "apples to apples" comparison.

Here's an example run:

```
% ./ollama run pdevine/gpt-oss:20b-q8_0 --verbose
>>> hey there
Thinking...
The user says "hey there". That's just a greeting. We need to respond in a friendly way. The instruction: "You are ChatGPT, ...". No special constraints. So reply
with a friendly greeting and maybe ask how they are.
...done thinking.

Hey! 👋 How’s it going? Anything interesting on your mind today?

total duration:       1.438575166s
load duration:        258.866083ms
prompt eval count:    69 token(s)
prompt eval duration: 307.561292ms
prompt eval rate:     224.35 tokens/s
eval count:           75 token(s)
eval duration:        826.062839ms
eval rate:            90.79 tokens/s
```

That's about 15 toks/sec faster on my machine than gpt-oss:20b, but gets a lot worse real world performance (i.e. the model is an idiot).


@BeatWolf commented on GitHub (Mar 17, 2026):

@pdevine thank you for the details, but I was really talking about qwen3.5, which for my use case (quality vs. size/speed tradeoff) beats gpt-oss easily. This is why I'm particularly sensitive to speed issues in qwen3.5.


@pdevine commented on GitHub (Mar 17, 2026):

@BeatWolf Yes, see my first message. That is about getting Qwen3.5 to go fast. I'm getting about 95 toks/sec on an M3 Max and 135 toks/sec on an M5 Max w/ the 35B-A3B model.


@lapo-luchini commented on GitHub (Mar 18, 2026):

> I see about a 3x speed-up for the 35B 4bit integer model (from 35 tps -> 96 tps) on my M3 Max.

I built v0.18.2-rc0 (which includes both PRs) merged with my Prometheus-metrics branch (but that doesn't touch the engines), and I get the same 27.88 t/s I got on v0.18.0 on an M4 Pro… how can I check which backend is used?

```
time=2026-03-18T00:51:03.273+01:00 level=INFO source=ggml.go:136 msg="" architecture=qwen35moe file_type=Q4_K_M name="" description="" num_tensors=1959 num_key_values=57
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
```

@pdevine commented on GitHub (Mar 18, 2026):

@lapo-luchini Try the `pdevine/qwen3.5:35b-a3b-int4` model.


@rocaltair commented on GitHub (Mar 19, 2026):

> @lapo-luchini Try the `pdevine/qwen3.5:35b-a3b-int4` model.

Great!!! The speed has increased from 27 tokens per second to 77 tokens per second on my mbp with M4Pro and 48GB RAM.


@rocaltair commented on GitHub (Mar 19, 2026):

@pdevine Seems like pdevine/qwen3.5:27b-int4 and pdevine/qwen3.5:0.8b-bf16 are still slow.

And what's the difference between pdevine/qwen3.5:35b-a3b-int4 and pdevine/qwen3.5:35b-a3b-int4-all?


@pdevine commented on GitHub (Mar 19, 2026):

@rocaltair Those are older test models. I'll requantize and push again.


@rocaltair commented on GitHub (Mar 20, 2026):

> @rocaltair Those are older test models. I'll requantize and push again.

Thanks for the awesome contribution! @pdevine


@lapo-luchini commented on GitHub (Mar 20, 2026):

> @lapo-luchini Try the `pdevine/qwen3.5:35b-a3b-int4` model.

I had to struggle a bit to build an MLX-compatible ollama, but the result is huge!

```
% ollama run --verbose qwen3.5:35b "please write a quick Fibonacci in Go"
total duration:       26.537859334s
eval count:           382 token(s)
eval duration:        13.391695966s
eval rate:            28.53 tokens/s

% ollama run --verbose pdevine/qwen3.5:35b-a3b-int4 "please write a quick Fibonacci in Go"
total duration:       4.992956458s
eval count:           351 token(s)
eval duration:        4.102439208s
eval rate:            85.56 tokens/s
```

What's the difference between the two models?
Does the second contain some metadata to explicitly use the MLX backend, or does it use a different quantization style that is the only one the MLX backend supports and is therefore selected automatically?


@chigkim commented on GitHub (Mar 20, 2026):

How do you run mlx models with Ollama? Do you have to build a special branch? Is there an environment variable you have to set?


@rick-github commented on GitHub (Mar 20, 2026):

Mostly you have to use a Mac. Windows support is a WIP, Linux requires installing additional libraries and is slower than GGUF and crashes a lot.

If you have a Mac, you just need to pull the model and run it, the same as the GGUF models.
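
For example, with one of the MLX builds mentioned in this thread (a sketch; the model tag is the one pdevine posted above):

```
ollama pull pdevine/qwen3.5:35b-a3b-int4
ollama run --verbose pdevine/qwen3.5:35b-a3b-int4 "please write a quick Fibonacci in Go"
```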


@chigkim commented on GitHub (Mar 22, 2026):

Yeah, I have a Mac. Where do you pull from? The [Ollama Library](https://ollama.com/search?c=cloud&o=newest) doesn't seem to have MLX models.


@aboutlo commented on GitHub (Mar 22, 2026):

Hey, I managed to get a 27b model working by following these steps:

```
huggingface-cli download Qwen/Qwen3.5-27B

uv run --with mlx-lm mlx_lm.convert \
  --hf-path Qwen/Qwen3.5-35B-A3B \
  --mlx-path ~/.cache/huggingface/hub/models--custom--Qwen/Qwen3.5-27B-mlx-4bit \
  -q --q-bits 4 --q-group-size 64

cat > /tmp/Modelfile <<'EOF'
FROM ~/.cache/huggingface/hub/models--custom--Qwen/Qwen3.5-27B-mlx-4bit
PARAMETER num_ctx 32768
PARAMETER temperature 0.2
PARAMETER presence_penalty 0
PARAMETER repeat_penalty 1
PARAMETER top_k 20
PARAMETER top_p 0.9
PARAMETER min_p 0.05
EOF

ollama create --experimental aboutlo/Qwen3.5-27B-mlx-4bit -f Modelfile
```

```
ollama run --verbose qwen3.5:27b "please write a quick Fibonacci in Go" --think=false
total duration:       1m45.375069417s
load duration:        9.6815415s
prompt eval count:    19 token(s)
prompt eval duration: 2.621114583s
prompt eval rate:     7.25 tokens/s
eval count:           413 token(s)
eval duration:        1m32.960728961s
eval rate:            4.44 tokens/s

ollama run --verbose qwen35-27b-mlx-4bit:latest "please write a quick Fibonacci in Go" --think=false
total duration:       1m7.109506709s
load duration:        17.713709ms
prompt eval count:    20 token(s)
prompt eval duration: 848.72825ms
prompt eval rate:     23.56 tokens/s
eval count:           423 token(s)
eval duration:        1m6.242530667s
eval rate:            6.39 tokens/s
```

But when I tried the same approach to re-encode Qwen/Qwen3.5-35B-A3B with my custom Modelfile, I ended up with a crash at runtime.

`ollama run --verbose aboutlo/Qwen3.5-35B-A3B-mlx-4bit:latest "Hello" --think=false`

```
time=2026-03-22T12:47:58.673Z level=INFO source=sched.go:484 msg="system memory" total="32.0 GiB" free="22.5 GiB" free_swap="0 B"
time=2026-03-22T12:47:58.673Z level=INFO source=sched.go:491 msg="gpu memory" id=0 library=Metal available="20.8 GiB" free="21.3 GiB" minimum="512.0 MiB" overhead="0 B"
time=2026-03-22T12:47:58.673Z level=INFO source=client.go:365 msg="starting mlx runner subprocess" model=aboutlo/Qwen3.5-35B-A3B-mlx-4bit:latest port=55268
time=2026-03-22T12:47:58.675Z level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-03-22T12:47:58.699Z level=INFO source=server.go:32 msg="MLX engine initialized" "MLX version"=0.30.6-0-g185b06d device=gpu
time=2026-03-22T12:47:58.708Z level=INFO source=base.go:67 msg="Model architecture" arch=Qwen3_5MoeForConditionalGeneration
time=2026-03-22T12:47:58.992Z level=INFO source=runner.go:135 msg="Loaded tensors from manifest" count=1757
time=2026-03-22T12:48:06.677Z level=INFO source=runner.go:169 msg="Starting HTTP server" host=127.0.0.1 port=55268
time=2026-03-22T12:48:06.776Z level=INFO source=server.go:183 msg=ServeHTTP method=GET path=/v1/status took=34.708µs status="200 OK"
time=2026-03-22T12:48:06.776Z level=INFO source=client.go:99 msg="mlx runner is ready" port=55268
time=2026-03-22T12:48:06.777Z level=INFO source=cache.go:173 msg="Cache miss" left=14
time=2026-03-22T12:48:06.778Z level=INFO source=pipeline.go:55 msg="peak memory" size="18.17 GiB"
panic: runtime error: index out of range [0] with length 0

goroutine 11 [running]:
github.com/ollama/ollama/x/models/qwen3_5.(*SparseMoE).Forward(0x14000987860, 0x1400131c940, 0x140008180f0)
	/Users/runner/work/ollama/ollama/x/models/qwen3_5/qwen3_5.go:1304 +0x278
github.com/ollama/ollama/x/models/qwen3_5.(*Layer).Forward(0x14000842f00, 0x1400131c340, {0x1037ca498, 0x1400131e300}, 0x1, 0xd, 0x140008180f0)
	/Users/runner/work/ollama/ollama/x/models/qwen3_5/qwen3_5.go:1335 +0x10c
github.com/ollama/ollama/x/models/qwen3_5.(*Model).Forward(0x1400011a120, 0x1400131c2c0, {0x14000154008, 0x28, 0x140013298f0?})
	/Users/runner/work/ollama/ollama/x/models/qwen3_5/qwen3_5.go:1349 +0xb8
github.com/ollama/ollama/x/mlxrunner.(*Runner).TextGenerationPipeline(0x1400011a5a0, {{{0x1400134a000, 0x4a}, {0x3e4ccccd, 0x3f666666, 0x3d4ccccd, 0x14, 0x40, 0x0, 0x3fff2, ...}}, ...})
	/Users/runner/work/ollama/ollama/x/mlxrunner/pipeline.go:105 +0x594
github.com/ollama/ollama/x/mlxrunner.(*Runner).Run.func1()
	/Users/runner/work/ollama/ollama/x/mlxrunner/runner.go:148 +0x110
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/Users/runner/go/pkg/mod/golang.org/x/sync@v0.17.0/errgroup/errgroup.go:93 +0x54
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 1
	/Users/runner/go/pkg/mod/golang.org/x/sync@v0.17.0/errgroup/errgroup.go:78 +0x94
```

It would be super interesting to understand how pdevine/qwen3.5:35b-a3b-int4 has been encoded :)


@aboutlo commented on GitHub (Mar 22, 2026):

Never mind, I figured out how to encode it using ollama directly rather than `mlx-lm`:

`ollama create --experimental -q int4 aboutlo/Qwen3.5-35B-A3B-int4 -f Modelfile`

Modelfile:

```
FROM Users/YOUR_USER/.cache/huggingface/hub/models--Qwen--Qwen3.5-35B-A3B/snapshots/ec2d4ece1ffb563322cbee9a48fe0e3fcbce0307
TEMPLATE {{ .Prompt }}
RENDERER qwen3.5
PARSER qwen3.5
PARAMETER num_ctx 32768
PARAMETER repeat_penalty 1
PARAMETER temperature 0.2
PARAMETER top_k 20
PARAMETER top_p 0.9
PARAMETER min_p 0.05
PARAMETER repeat_penalty 1.05
PARAMETER presence_penalty 0
```

@pdevine commented on GitHub (Mar 22, 2026):

> It would be super interesting to understand how `pdevine/qwen3.5:35b-a3b-int4` has been encoded :)

Hey, sorry, I had an entire write-up on this that I misplaced, but yes, the format has changed. The short answer is that I changed the safetensors format to better support quantizations, and I also changed the way the experts are packed to make loading them more efficient. You should be able to get it to work with `ollama create -f <path/to/Modelfile> --experimental <modelname>`. Use the `--quantize` parameter if you want to quantize it.


@pdevine commented on GitHub (Mar 24, 2026):

I've posted:

- pdevine/qwen3.5:27b-coding-nvfp4 (20GB)
- pdevine/qwen3.5:27b-int4 (16GB)
- pdevine/qwen3.5:27b-nvfp4 (20GB)
- pdevine/qwen3.5:35b-a3b-coding-mxfp8 (38GB)
- pdevine/qwen3.5:35b-a3b-coding-nvfp4 (22GB)
- pdevine/qwen3.5:35b-a3b-int4 (20GB)
- pdevine/qwen3.5:35b-a3b-int8 (38GB)
- pdevine/qwen3.5:35b-a3b-nvfp4 (22GB)

Some notes:

- You'll need to build from main to get working support for mxfp8/nvfp4, or wait until the next version of Ollama.
- The 'coding' models have the hyperparameters correctly set for coding/agentic tasks (as defined by Alibaba). I've tested these out with Claude and Pi and they work quite well.
- Each of the models is imported from the HuggingFace BF16 models of Qwen3.5-35B-A3B or Qwen3.5-27B, except the mxfp8 models, which were converted from the FP8 versions of those same models.
- I do not recommend the affine 4-bit (-int4) models over their nvfp4 counterparts. The int4 models use a high group size (equivalent to q4_1 with group size 64), although I did quantize some of the tensors in those models at int8 instead of entirely int4 so the model doesn't act like a complete idiot. This is different from what the mlx-community publishes on HuggingFace. YMMV. Use the nvfp4 models if you want accuracy, or the int4 models if you want pure speed (but not great accuracy).

If you want to use the models which don't have coding in their name to do coding, you'll need to set these values in a Modelfile and recreate the model:

```
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER presence_penalty 0.0
PARAMETER repeat_penalty 1.0
```
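
Putting those values together, a minimal sketch of such a Modelfile (the base tag here is one of the nvfp4 models listed above; the created model name is a placeholder):

```
FROM pdevine/qwen3.5:35b-a3b-nvfp4
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER min_p 0.0
PARAMETER top_k 20
PARAMETER presence_penalty 0.0
PARAMETER repeat_penalty 1.0
```

then recreate it with something like `ollama create my-qwen3.5-coding -f Modelfile`.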

@BeatWolf commented on GitHub (Mar 24, 2026):

so, is the default ollama model a good choice or should other quants be used?


@pdevine commented on GitHub (Mar 24, 2026):

> so, is the default ollama model a good choice or should other quants be used?

The default model runs on GGML, not MLX. These models are for the experimental MLX engine and for now will only run on a Mac (although CUDA support is coming really soon).


@rocaltair commented on GitHub (Mar 25, 2026):

I think we need a label like `Cloud` to filter `MLX` models on https://ollama.com/search.


@athuljayaram commented on GitHub (Mar 27, 2026):

Getting an error with qwen3:

```
API Error: 500 {"type":"error","error":{"type":"api_error","message":"model requires more system memory (16.8 GiB) than is available (12.3 GiB)"},"request_id":"req_60ada344b2d69bf5252ed31f"}

ollama run qwen3.5:9b
pulling manifest
pulling dec52a44569a: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 6.6 GB
pulling 7339fa418c9a: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 11 KB
pulling 9371364b27a5: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 65 B
pulling be595b49fe22: 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 475 B
verifying sha256 digest
writing manifest
success
Error: 500 Internal Server Error: model requires more system memory (16.8 GiB) than is available (13.8 GiB)
```


@baptistejamin commented on GitHub (Apr 19, 2026):

I noticed exactly the same thing yesterday with Qwen 3.6: Ollama is significantly slower compared to llama.cpp on the same Q4 quant of 35B-A3B on an L40S GPU.

Test with: 30 prompts, ~4000 input tokens / 200 output tokens each.

Ollama version: latest stable (0.21.0); llama.cpp: built from the latest Docker image on the same date (b8833, ghcr.io/ggml-org/llama.cpp:server-cuda).

| metric | llama.cpp | ollama | delta |
| ------ | --------- | ------ | ----- |
| TTFT | 0.79 s | 1.15 s | +45% slower |
| Prompt eval | 5393 t/s | 4436 t/s | −18% |
| Generation | 144.7 t/s | 115.5 t/s | −20% |
| Total time | 2.17 s | 2.98 s | +37% slower |

From what I understand, the GGML implementation in Ollama is a bit outdated and is missing a few optimizations that have been implemented around qwen35moe (gated delta net, the new topk_moe API, etc.).

I'll run some tests today to confirm it's the cause. 37% is massive, and Qwen 3.6 is super good, so I really think this should be tackled.

Reference: github-starred/ollama#56098