[GH-ISSUE #11130] Quantizing Magistral to Q4_K_S with AMD card breaks its output #53853

Open
opened 2026-04-29 04:53:05 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @ProjectMoon on GitHub (Jun 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11130

What is the issue?

Quantizing Magistral to Q4_K_S with ollama's built-in quantization (ollama create -q) breaks the model's output. Instead of intelligible text, it emits a stream of special tokens.

magistral:latest (which I assume is Q4_K_M) runs fine.

I have not tried quantizing the fp16 model to other quants.

Relevant log output

Console run (also happens via API, e.g. OpenWebUI):

~> ollama run mistra:omega
>>> what's up nerd
<SPECIAL_39><SPECIAL_39>

>>> blammoo
<SPECIAL_39><SPECIAL_39><SPECIAL_25><SPECIAL_29><SPECIAL_39>[SUFFIX][TOOL_CONTENT]<SPECIAL_33><pad>[/TOOL_RESULTS]

>>> /set nothink
Set 'nothink' mode.
>>> wow
[SYSTEM_PROMPT]<pad><SPECIAL_25><SPECIAL_34>[/AVAILABLE_TOOLS][AVAILABLE_TOOLS][IMG_BREAK]<SPECIAL_38><SPECIAL_37>[AVAILABLE_TOOLS]<SPECIAL_39><s><SPECIAL_29>[/SYSTEM_PROMPT]<SPECIAL_27>[SYSTEM_PROMPT]<pad><SPECIAL_25><SPECIAL_34>[/AVAILABLE_TOOLS][AVAILABLE_TOOLS][IMG_BREAK]<SPECIAL_38><SPECIAL_37>[AVAILABLE_TOOLS]<SPECIAL_39><s><SPECIAL_29>[/SYSTEM_PROMPT]<SPECIAL_27>[SUFFIX]<SPECIAL_39><SPECIAL_38><SPECIAL_27><SPECIAL_26><SPECIAL_34>[SUFFIX]<SPECIAL_25><SPECIAL_35><SPECIAL_39><SPECIAL_34><SPECIAL_26>[TOOL_CONTENT][/SYSTEM_PROMPT][MIDDLE]

>>> Send a message (/? for help)

Modelfile for mistra:omega:

FROM magistral:24b-small-2506-fp16
PARAMETER num_ctx 8000
PARAMETER num_predict 3000
PARAMETER num_gpu 100

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.9.2

GiteaMirror added the bug label 2026-04-29 04:53:05 -05:00

@rick-github commented on GitHub (Jun 19, 2025):

$ ollama -v
ollama version is 0.9.2
$ cat > Modelfile 
FROM magistral:24b-small-2506-fp16
PARAMETER num_ctx 8000
PARAMETER num_predict 3000
PARAMETER num_gpu 100
$ ollama create magistral:omega -q q4_K_S
gathering model components 
quantizing F16 model to Q4_K_S 100% ▕█████████████████████████████ ▏  47 GB/ 47 GB  151 MB/s      0s
verifying conversion 
using existing layer sha256:34af41fb04f221ec8a2b336618611ddeebfe31f4fa07082178be25bb525bffd3 
using existing layer sha256:35f7a1efc383aeaa73f17f770de9c1d3531693c65edf1e0cbadea7d17db23fa9 
using existing layer sha256:43c1db03bf38c4a9a096463d4b9de42ba9e835c084e4c7fdc20ffdef85ec8605 
using existing layer sha256:43070e2d4e532684de521b885f385d0841030efa2b1a20bafb76133a5e1379c1 
creating new layer sha256:01bf371b880f0ef89b1d4319c3cff33c574ee70d197732e96606c69e500dc2dd 
writing manifest 
success 
$ ollama run magistral:omega
>>> what's up nerd
Thinking...
Okay, the user is asking "what's up nerd". At first glance, it seems like a casual greeting or perhaps a reference to something. But I'm not sure if this is a serious question or just a friendly remark. 
Let me break it down:
...
Thus, the final answer is a friendly and slightly playful response that could apply in multiple contexts.

\boxed{Not\ much,\ just\ trying\ to\ be\ excellent!\ What's\ up\ with\ you?}

>>> Send a message (/? for help)

Try verifying that the sha256 hash of the model matches the filename.

$ sha256sum $(ollama show --modelfile magistral:omega | sed -ne 's/^FROM //p')
34af41fb04f221ec8a2b336618611ddeebfe31f4fa07082178be25bb525bffd3  /root/.ollama/models/blobs/sha256-34af41fb04f221ec8a2b336618611ddeebfe31f4fa07082178be25bb525bffd3
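The check above can be automated for every blob in the store. A minimal sketch, assuming the `sha256-<hexdigest>` filename convention shown in the listing above and the default `~/.ollama/models` location (override via OLLAMA_MODELS):

```shell
#!/bin/sh
# Verify that each blob's contents hash to the value encoded in its
# filename (sha256-<hexdigest>). Prints OK or MISMATCH per blob.
verify_blob() {
    blob="$1"
    expected="${blob##*sha256-}"                 # hash taken from the filename
    actual=$(sha256sum "$blob" | cut -d' ' -f1)  # hash of the actual contents
    if [ "$actual" = "$expected" ]; then
        echo "OK       $blob"
    else
        echo "MISMATCH $blob"
    fi
}

for blob in "${OLLAMA_MODELS:-$HOME/.ollama/models}"/blobs/sha256-*; do
    [ -e "$blob" ] && verify_blob "$blob"
done
```

A MISMATCH here would point to blob corruption rather than a quantization bug; in this thread both reporters' blobs checked out, which is what shifted suspicion to the GPU backend.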

@ProjectMoon commented on GitHub (Jun 19, 2025):

@rick-github what GPU are you running it on?


@rick-github commented on GitHub (Jun 19, 2025):

GeForce RTX 4070


@ProjectMoon commented on GitHub (Jun 19, 2025):

I quantized again, and ran the SHA sum:

> sha256sum $(ollama show --modelfile mistra:omega2 | sed -ne 's/^FROM //p')

93772e9d5819bcb53557261993dabae6bd987b5d4082c27431347f0970353777  /ollama/blobs/sha256-93772e9d5819bcb53557261993dabae6bd987b5d4082c27431347f0970353777

The output was broken in the same way. So maybe it's specific to AMD cards?


@rick-github commented on GitHub (Jun 19, 2025):

> So maybe it's specific to AMD cards?

Looks like it: I re-did the quantization on a Radeon 8060S and got the same SHA sum as yours. The 4070-quantized model works on both cards; the 8060-quantized model generates random tokens on both cards.


@rick-github commented on GitHub (Sep 29, 2025):

This has been fixed as of 0.11.5.


Reference: github-starred/ollama#53853