[GH-ISSUE #9267] Allow Option to Use Original (Unquantized) Models in Model Registry #52552

Closed
opened 2026-04-28 23:39:34 -05:00 by GiteaMirror · 2 comments

Originally created by @kushteppalwar on GitHub (Feb 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9267

What is the issue?

When viewing models in the Ollama model registry, their sizes are significantly smaller than their original open-source versions. This indicates that, by default, all models are quantized using Q4_K_M, which results in a performance reduction compared to running the original versions.
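The size gap described above follows from simple arithmetic: weight size is roughly parameter count times bits per weight. The sketch below is illustrative only. The bits-per-weight figures are approximations (the Q4_K_M value in particular is an assumed average that varies by architecture), and the 3.2e9 parameter count is an assumed round number for a ~3B model, not a value taken from this issue.

```python
# Rough size estimate: parameters * bits-per-weight / 8 bytes.
# All numbers below are approximations chosen for illustration.

BITS_PER_WEIGHT = {
    "bf16": 16.0,    # 16-bit brain float
    "fp16": 16.0,    # 16-bit IEEE half precision
    "q8_0": 8.5,     # 32 int8 values plus one fp16 scale per block
    "q4_K_M": 4.85,  # assumed average; real files mix several block types
}


def estimated_size_gb(n_params: float, fmt: str) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    n_params = 3.2e9  # assumed parameter count for a ~3B model
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt:>7}: ~{estimated_size_gb(n_params, fmt):.1f} GB")
```

Under these assumptions a 4-bit file comes out at under a third of the 16-bit size, which is consistent with registry downloads looking much smaller than the original release.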

We have observed a noticeable degradation in performance, particularly in LLaMA 3.2 text and vision models. The quantization affects output quality and model capabilities, making Ollama-hosted models less effective than their counterparts outside Ollama.

Feature Request:
Would it be possible to introduce support for adding original (unquantized) models to the model registry? This would allow users to choose whether to use a quantized version for efficiency or the full model for maximum performance.

Expected Behavior:

- Users can select between original (unquantized) models and quantized models in the registry.
- The performance of LLaMA 3.2 and other models should be on par with their open-source versions when using the unquantized option.

Additional Context:
We primarily tested LLaMA 3.2 text and vision models and noticed the performance drop due to default quantization. Allowing users to opt for the original models would enhance flexibility and usability.

GiteaMirror added the bug label 2026-04-28 23:39:34 -05:00

@rick-github commented on GitHub (Feb 21, 2025):

> This indicates that, by default, all models are quantized using Q4_K_M

Have you tried a non-default quantization, e.g. fp16? The current inference backend only supports GGUF files, so some conversion is required, but fp16 should be close to the original quality.
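One way to confirm which quantization a given tag actually uses is to ask a locally running Ollama server through its `/api/show` endpoint, whose response includes a `details.quantization_level` field. The following is a minimal sketch, assuming the server listens on the default `http://localhost:11434` and accepts a `model` key in the request body (older releases used `name`); the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def quantization_level(show_response: dict) -> str:
    """Extract the quantization level from a decoded /api/show response."""
    return show_response.get("details", {}).get("quantization_level", "unknown")


def show_model(model: str, base_url: str = OLLAMA_URL) -> dict:
    """POST to /api/show and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def report(tags: list[str]) -> None:
    """Print the quantization level of each tag (requires a running server)."""
    for tag in tags:
        print(tag, "->", quantization_level(show_model(tag)))
```

Calling `report(["llama3.2", "llama3.2:3b-instruct-fp16"])` after pulling both tags would show whether the default tag and the fp16 tag differ in quantization level.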


@jmorganca commented on GitHub (Feb 21, 2025):

Hi @kushteppalwar, for now fp16 models are available (e.g. `ollama run llama3.2:3b-instruct-fp16`).

We are working on supporting the original bf16 weights in ollama and ollama.com - stay tuned!


Reference: github-starred/ollama#52552