[GH-ISSUE #5918] Llama3.1 70b-instruct-q4_1 buggy #50205

Closed
opened 2026-04-28 14:43:52 -05:00 by GiteaMirror · 14 comments

Originally created by @velaia on GitHub (Jul 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5918

What is the issue?

When I run the **70b-instruct-q4_1** version of Llama 3.1, Ollama gives a buggy reply:

My sample request:

```
➜ ollama-tests curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-instruct-q4_1",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Tell me about the top 3 commercial industrial computer vision products on the market."
    }
  ],
  "stream": false
}'
```

The model's response:

```
{
  "model": "llama3.1:70b-instruct-q4_1",
  "created_at": "2024-07-24T15:45:23.026538Z",
  "message": {
    "role": "assistant",
    "content": "assistant\nassistantassistantassistant"
  },
  "done_reason": "stop",
  "done": true,
  "total_duration": 56814340833,
  "load_duration": 53209095791,
  "prompt_eval_count": 37,
  "prompt_eval_duration": 2107466000,
  "eval_count": 6,
  "eval_duration": 1493717000
}
```

The same request generates a proper response using **llama3.1:8b-instruct-q8_0**.

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.2.8

GiteaMirror added the bug label 2026-04-28 14:43:52 -05:00

@rick-github commented on GitHub (Jul 24, 2024):

Logs may help in the diagnosis of the issue.
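On macOS the server log typically lives at `~/.ollama/logs/server.log` (the path given in ollama's troubleshooting docs; adjust if your install differs). One way to capture useful output is to follow the log while re-sending the failing request:

```
# Follow the ollama server log while reproducing the bad response
tail -f ~/.ollama/logs/server.log
```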


@MaxJa4 commented on GitHub (Jul 24, 2024):

The issue is probably the template.
Check out the model page of Q4_1: https://ollama.com/library/llama3.1:70b-instruct-q4_1
It has two templates (it shouldn't), and the longer one is the correct one.

Correct: https://ollama.com/library/llama3.1:70b-instruct-q4_1/blobs/11ce4ee3e170
Wrong: https://ollama.com/library/llama3.1:70b-instruct-q4_1/blobs/8ab4849b038c

If you really want to use Q4_1 in the meantime, you can maybe fix it locally with manual edits or by making your own Modelfile, as described in the docs folder (see the sketch after this comment).

@Registry-Maintainers: I checked all the other tags, they are fine. It's only 70B-instruct-q4_1.
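
For anyone trying that workaround, here is a minimal sketch of such a Modelfile. It assumes the plain (non-tool-calling) Llama 3 instruct format is the "correct" template; check the correct blob linked above for the authoritative text, and treat the tag names here as illustrative:

```
# Modelfile (sketch): rebuild the broken tag with an explicit chat template
FROM llama3.1:70b-instruct-q4_1

# Assumed template: the standard Llama 3 instruct format
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER stop "<|eot_id|>"
```

Then `ollama create llama3.1-70b-q4_1-fixed -f Modelfile` and run the new tag (the `-fixed` name is just an example).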


@MaxJa4 commented on GitHub (Jul 25, 2024):

Update: The additional, incorrect template has apparently been removed from the registry. Q4_1 should work fine now.
#fixed :)


@rick-github commented on GitHub (Jul 25, 2024):

Re-pulled the model, no improvement.

```
$ curl -s localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-instruct-q4_1",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Tell me about the top 3 commercial industrial computer vision products on the market." }
  ],
  "stream": false
}' | jq .message
{
  "role": "assistant",
  "content": "assistantassistant"
}
```

@velaia commented on GitHub (Jul 25, 2024):

Same here. I had to `ollama rm` it and re-download. Even using a custom Modelfile with an adjusted TEMPLATE didn't work for me.
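
For reference, the remove-and-re-download sequence described here is just the standard CLI commands:

```
# Drop the local manifest (and any now-unreferenced layers), then fetch fresh
ollama rm llama3.1:70b-instruct-q4_1
ollama pull llama3.1:70b-instruct-q4_1
```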


@MaxJa4 commented on GitHub (Jul 25, 2024):

I guess the pull action only adds and updates, but doesn't remove...
Good to know 👍


@rick-github commented on GitHub (Jul 25, 2024):

Deleted, pulled, no change.


@MaxJa4 commented on GitHub (Jul 25, 2024):

What's the content of your manifest file? It's located at `<your_model_location>\manifests\registry.ollama.ai\library\llama3.1`.
Maybe it's not getting removed correctly; I've had that issue before.
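
One way to inspect it on macOS/Linux, assuming the default model location of `~/.ollama/models` (overridable via `OLLAMA_MODELS`):

```
# List the layers (weights, template, params, ...) the tag's manifest references
jq '.layers[] | {mediaType, digest, size}' \
  ~/.ollama/models/manifests/registry.ollama.ai/library/llama3.1/70b-instruct-q4_1
```

A duplicate or stale `application/vnd.ollama.image.template` layer in that output would support the leftover-template theory.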


@rick-github commented on GitHub (Jul 25, 2024):

Installed ollama 0.2.8 on a different machine that had never had ollama on it before.

```
$ ollama run llama3.1:70b-instruct-q4_1
pulling manifest
pulling 3177e205cbbb... 100% ▕██████████████████████████████████▏  44 GB
pulling 8cf247399e57... 100% ▕██████████████████████████████████▏ 1.7 KB
pulling f1cd752815fc... 100% ▕██████████████████████████████████▏  12 KB
pulling 56bb8bd477a5... 100% ▕██████████████████████████████████▏   96 B
pulling eaa5bd3ca6da... 100% ▕██████████████████████████████████▏  560 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> hello
##assistant

>>> Send a message (/? for help)
```

Just doesn't like me, I guess.


@MaxJa4 commented on GitHub (Jul 26, 2024):

Template and params are identical to the other quants, so that's really weird.
Tested on 0.2.8 with Q4_1 (broken) and Q4_K_M (works), and the same on 0.3.0 — so Q4_1 itself seems to be broken, yup.
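
A sketch of that kind of side-by-side check, assuming the listed library tags exist and a server is running on the default port:

```
# Ask each quantization the same question and print the start of its reply
for q in q4_1 q4_K_M q8_0; do
  printf '%-8s ' "$q"
  curl -s localhost:11434/api/chat -d '{
    "model": "llama3.1:70b-instruct-'"$q"'",
    "messages": [{"role": "user", "content": "hello"}],
    "stream": false
  }' | jq -r '.message.content' | head -c 60
  echo
done
```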


@Jsin01 commented on GitHub (Aug 23, 2024):

Same


@pdevine commented on GitHub (Sep 14, 2024):

I ended up rebuilding the q4_1 weights and still ran into issues. Talking with the Llama team, it sounds like the model is really sensitive to certain quantizations, although they only gave guidelines for which quantizations work best with 405b, not 8b or 70b.

I think the best option here is to just remove q4_1, since it's unlikely it will ever work.


@pdevine commented on GitHub (Sep 15, 2024):

We ended up removing the quantization. I think there was probably also an issue with the kv cache. There are some changes coming to improve kv cache performance, and I'm wondering if those might at least fix the `assistant` issue. Regardless, this quantization level isn't super great.

I'll go ahead and close out the issue as something we won't fix.

cc @jessegross


@jessegross commented on GitHub (Sep 15, 2024):

The KV changes are really just about speed, so I think they are unlikely to have an impact on this unless some other bug I'm not aware of happened to get fixed at the same time. I think quantization is the more likely issue.

