[GH-ISSUE #15176] Ollama.com model catalog has models that are platform specific #9715

Open
opened 2026-04-12 22:35:46 -05:00 by GiteaMirror · 9 comments

Originally created by @MarkWard0110 on GitHub (Mar 31, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15176

What is the issue?

Are these models in a file format that is limited to a specific framework, or are Ollama's other runtime engines unable to support these formats at this time?

Ollama's documentation states:
"[Ollama](https://ollama.com/) is the easiest way to get up and running with large language models such as gpt-oss, Gemma 3, DeepSeek-R1, Qwen3 and more."

Ollama users now need to be aware of which platform and framework they are using to determine which models they can use.
Ollama users have no way to know whether a specific model tag will work with their Ollama install because some model tags are platform-locked.

If Ollama.com's model catalog is to include platform-specific models, it must include a platform support field.

Ollama.com must adopt a consistent convention for model tags and stick to it.

A tag like `nvfp4` should not implicitly mean that the model only works on MLX platforms.
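
For illustration, here is a minimal sketch of the client-side guesswork this currently forces on users: mapping a tag suffix to the backend it appears to imply. The suffix-to-backend mapping below is an assumption inferred from this thread, not a documented Ollama convention, which is exactly the problem a platform support field would solve.

```shell
# Hypothetical helper: guess which backend a model tag suffix implies.
# The mapping is an assumption based on this discussion, not an official Ollama convention.
backend_for_tag() {
  case "$1" in
    *nvfp4|*mxfp8|*int8)  echo "MLX (currently macOS only)" ;;
    *q4_K_M|*q8_0)        echo "GGUF (all platforms)" ;;
    *)                    echo "unknown - check the model page on ollama.com" ;;
  esac
}

backend_for_tag "qwen3.5:35b-a3b-nvfp4"   # -> MLX (currently macOS only)
backend_for_tag "qwen3.5:27b-q8_0"        # -> GGUF (all platforms)
```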

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 22:35:46 -05:00
@cdsama commented on GitHub (Mar 31, 2026):

It's weird that my NVIDIA RTX PRO 6000 Blackwell does not support a model named **nvfp4** in Ollama.

@rick-github commented on GitHub (Mar 31, 2026):

NVFP4 support is a feature of the [MLX backend](https://ollama.com/blog/mlx#:~:text=running%20with%20%60int4%60%29.-,NVFP4,-support%3A%20higher%20quality), which is currently only supported on macOS. Support for Windows and Linux is in progress.

@cdsama commented on GitHub (Mar 31, 2026):

> NVFP4 support is a feature of the [MLX backend](https://ollama.com/blog/mlx#:~:text=running%20with%20%60int4%60%29.-,NVFP4,-support%3A%20higher%20quality), which is currently only supported on macOS. Support for Windows and Linux is in progress.

Great to hear that Windows and Linux support is in progress. Looking forward to future updates.

@mmartial commented on GitHub (Mar 31, 2026):

Any chance a new `nvfp4` tag can be added to the quick search?

@JoeLoginIsAlreadyTaken commented on GitHub (Mar 31, 2026):

> NVFP4 support is a feature of the [MLX backend](https://ollama.com/blog/mlx#:~:text=running%20with%20%60int4%60%29.-,NVFP4,-support%3A%20higher%20quality), which is currently only supported on macOS. Support for Windows and Linux is in progress.

I'm a bit confused at the moment.
Which format would be best suited for the Apple M3 Ultra?

@rick-github commented on GitHub (Mar 31, 2026):

MLX apparently has a speed advantage. Try a model of both types and compare:

```shell
# Run the same prompt against an MLX (nvfp4) and a GGUF (q4_K_M) tag and compare eval rates.
for m in qwen3.5:9b-nvfp4 qwen3.5:9b-q4_K_M ; do
  echo $m
  ollama run $m hello --verbose 2>&1 | grep eval.rate
done
```

Model selection should be based on fitness for purpose: try MLX/GGUF with your workload and see which performs at an acceptable rate while returning good results.

@JoeLoginIsAlreadyTaken commented on GitHub (Apr 3, 2026):

I tested all four versions of Qwen3.5-35B on a Mac Studio M3 Ultra (28-core CPU / 60-core GPU) with 256 GB RAM.

The test scenario involves question-based summarization of an English paper about the construction of the Great Pyramid.
The prompt and the answer are in German, which poses a challenge for Qwen3.5 in this configuration, resulting in occasional word errors. In my opinion, this represents a good, albeit subjectively measured, indicator.

The temperature was always 0.8, and min_p was 0.1.
All other parameters were set to their default values.
Context size: 32K tokens.
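
For anyone wanting to reproduce a run like this, here is a rough sketch of issuing one request with the same sampling settings through Ollama's generate API and deriving the token rates from the response counters. The model tag is taken from the table below and the prompt is a placeholder; the actual harness used for these numbers was not shared, so treat this purely as an illustration.

```shell
# Sketch only: one non-streaming request with temperature 0.8, min_p 0.1 and a 32K context.
# Durations in the /api/generate response are reported in nanoseconds.
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3.5:35b-a3b-nvfp4",
  "prompt": "<German summarization prompt here>",
  "stream": false,
  "options": { "temperature": 0.8, "min_p": 0.1, "num_ctx": 32768 }
}' | jq '{output_tokens: .eval_count,
         prompt_tps: (.prompt_eval_count / .prompt_eval_duration * 1e9),
         response_tps: (.eval_count / .eval_duration * 1e9)}'
```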

| Model | Output tokens | Prompt tokens/s | Response tokens/s |
|---|---|---|---|
| qwen3.5:35b-a3b-q8_0 | 3761 | 934.86 | 36.06 |
| qwen3.5:35b-a3b-int8 | 3507 | 1584.35 | 67.78 |
| qwen3.5:35b-a3b-mxfp8 | 3747 | 1552.66 | 63.26 |
| qwen3.5:35b-a3b-nvfp4 | 3662 | 1627.92 | 74.31 |

Result:
- All MLX models are much faster.
- The answer quality of `int8` is on the same level as `q8_0`.
- `nvfp4` is the fastest but has the most word errors (English words used or new German words invented).
- I cannot really tell a quality difference between `int8` and `mxfp8`, so I will use `int8` for the slightly higher performance.

@JoeLoginIsAlreadyTaken commented on GitHub (Apr 3, 2026):

Just for comparison, here are the results for the dense model.

| Model | Output tokens | Prompt tokens/s | Response tokens/s |
|---|---|---|---|
| qwen3.5:27b-q8_0 | 3645 | 235.34 | 13.6 |
| qwen3.5:27b-int8 | 3456 | 287.88 | 19.76 |

The German-language output is slightly better than that of the 35B-A3B model.
The `int8` version seems to prefer English vocabulary when unsure (it uses "antechamber").
The `q8_0` version produced some partly translated words, like "Antekammer" for "antechamber" instead of the proper "Vorkammer".

@rick-github commented on GitHub (Apr 3, 2026):

Sadly, performance on Linux+CUDA doesn't see as much of an increase with MLX, and the sparse models don't work at all.

| Model | Output tokens | Prompt tokens/s | Response tokens/s |
|---|---|---|---|
| qwen3.5:27b-q8_0 | 8289 | 1626.63 | 40.47 |
| qwen3.5:27b-int8 | 13927 | 4.72 | 37.88 |
| qwen3.5:27b-mxfp8 | 11324 | 114.11 | 42.95 |
| qwen3.5:27b-nvfp4 | 7289 | 42.95 | 64.17 |
Reference: github-starred/ollama#9715