[GH-ISSUE #13668] Following Ollama docs results in "unsupported quantization" error #55492

Closed
opened 2026-04-29 09:17:37 -05:00 by GiteaMirror · 5 comments

Originally created by @recursivenomad on GitHub (Jan 10, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13668

What is the issue?

I was hoping to run the 7B [comma-v0.1-2t](https://huggingface.co/common-pile/comma-v0.1-2t) model on an old GPU with 3.5 GB of usable VRAM. According to the [Ollama docs](https://docs.ollama.com/import#supported-quantizations), the q3_K_M quantization should be supported, and it would leave me just enough overhead to run on 3.5 GB with a small context window.

Much to my dismay, attempting to quantize from the *.safetensors files results in:

Error: unsupported quantization type Q3_K_M - supported types are F32, F16, Q4_K_S, Q4_K_M, Q8_0

This is quite the let-down after spending the better part of a week setting up my home environment to host a model in what the documentation states is a supported format, only to learn that the quantizations which accommodate my hardware seem to have been left behind.
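For reference, the documented import-and-quantize flow that produces this error looks roughly like the following sketch (the model name and paths are placeholders):

```shell
# Modelfile pointing at the downloaded safetensors directory (path is a placeholder)
echo 'FROM ./comma-v0.1-2t' > Modelfile

# Import and quantize in one step, per the import docs; q3_K_M is the type that fails
ollama create comma-q3 --quantize q3_K_M -f Modelfile
```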

This relates to many other issue reports and PR discussions:

- https://github.com/ollama/ollama/pull/10647#issuecomment-2873621611
- #10749
- #10886
- #11043
- #11443

However, this issue is specifically meant to highlight that the docs are misleading about this feature.

That said, hopefully this will open another conversation around bringing back wider quantization options for improved accessibility of this technology.

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.13.5

GiteaMirror added the bug label 2026-04-29 09:17:37 -05:00

@rick-github commented on GitHub (Jan 11, 2026):

The docs had been previously [updated](https://github.com/ollama/ollama/pull/10842) to indicate the change in supported quants, but it looks like the switch to the new documentation structure referenced old material; that will be remedied. To create quants not supported by `ollama create`, use llama.cpp.
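A minimal sketch of that llama.cpp route, assuming a built checkout of the llama.cpp repo with its Python requirements installed (paths and output names are placeholders):

```shell
# Convert the safetensors checkpoint to an F16 GGUF (requires llama.cpp's Python deps)
python convert_hf_to_gguf.py /path/to/comma-v0.1-2t --outtype f16 --outfile comma-f16.gguf

# Re-quantize the F16 GGUF to a type that ollama create no longer offers
./llama-quantize comma-f16.gguf comma-q3_K_M.gguf Q3_K_M
```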


@rick-github commented on GitHub (Jan 11, 2026):

Also note that this model is a base model and is not meant for the usual use cases of an ollama model - chat bot, tool use, etc.


@rick-github commented on GitHub (Jan 11, 2026):

Quantized to q3_K_M, this model needs 5.5 GB. You might have better luck with qwen3:4b-q4_K_M (3.7 GB), qwen3:1.7b-q8_0 (2.8 GB), granite4:3b (2.7 GB), or ministral-3:3b (4.6 GB).
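Any of these can be pulled and tried directly, for example:

```shell
ollama run qwen3:4b-q4_K_M
# check how much of the loaded model actually landed in VRAM
ollama ps
```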


@recursivenomad commented on GitHub (Jan 11, 2026):

Thanks for your feedback, Rick - I am intentionally using comma-v0.1-2t specifically because it's the only model I've found trained exclusively on openly-licensed text so far, and I am content with it being a base model.

As for the quantization: I am new to all of this, so please correct me if I'm wrong - but using a [VRAM calculator on Huggingface](https://huggingface.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator), I was hoping that Q3_K_M with a teeny tiny context window of 512 would be just barely enough to fit on my 3.5 GB card. Quantizing down even smaller to increase that context window would then be my next exploration in usability/coherence.

And to clarify "llama.cpp" - I assume this refers to the [llama.cpp](https://github.com/ggml-org/llama.cpp) project? (rather than a `.cpp` file itself)

Thanks again for your suggestions.


@rick-github commented on GitHub (Jan 12, 2026):

I created a few quants; the sizes below are the VRAM used on an Nvidia GPU with a context of 512.
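As a rough cross-check (back-of-envelope, not a measurement): Q3_K_M in llama.cpp averages roughly 3.9 bits per weight, so a 7B model works out to about 3.4 GB of weights alone; the KV cache and compute buffers for a 512-token context account for the remaining few hundred MB of the q3_K_M figure below.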

| name | size |
| -- | -- |
| [comma-v0.1:2t-q3_K_M](https://ollama.com/frob/comma-v0.1:2t-q3_K_M) | 3.8 GB |
| [comma-v0.1:2t-iq3_S](https://ollama.com/frob/comma-v0.1:2t-iq3_S) | 3.4 GB |
| [comma-v0.1:2t-q3_K_S](https://ollama.com/frob/comma-v0.1:2t-q3_K_S) | 3.4 GB |
| [comma-v0.1:2t-iq3_XS](https://ollama.com/frob/comma-v0.1:2t-iq3_XS) | 3.3 GB |
| [comma-v0.1:2t-q2_K](https://ollama.com/frob/comma-v0.1:2t-q2_K) | 3.0 GB |

The llama.cpp tool to use is usually `convert_hf_to_gguf.py`, although in this case it turns out the model is missing `tokenizer.model`, so instead I [imported](https://github.com/ollama/ollama/blob/main/docs/import.mdx) with ollama to get an FP16 GGUF and then used the llama.cpp tool `llama-quantize` to create the smaller quants.
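A sketch of that fallback path, assuming a default user install where ollama keeps converted GGUF blobs under ~/.ollama/models/blobs/ (model names and the digest are placeholders):

```shell
# Import the safetensors checkpoint; ollama converts it to an F16 GGUF blob
ollama create comma-f16 -f Modelfile

# The freshly converted F16 GGUF is the largest blob in the local store (default path shown)
ls -lhS ~/.ollama/models/blobs/ | head

# Quantize it down with llama.cpp, then re-import the result as a new model
./llama-quantize ~/.ollama/models/blobs/sha256-<digest> comma-q3_K_M.gguf Q3_K_M
echo 'FROM ./comma-q3_K_M.gguf' > Modelfile.q3
ollama create comma-v0.1-2t-q3_K_M -f Modelfile.q3
```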
