[GH-ISSUE #7984] llama3.3:70b-instruct-q8_0 generates garbage #30867

Closed
opened 2026-04-22 10:49:54 -05:00 by GiteaMirror · 7 comments

Originally created by @Baughn on GitHub (Dec 7, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7984

What is the issue?

➜  models ollama pull llama3.3:70b-instruct-q8_0
pulling manifest
pulling 4a8a92e57c0f... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  74 GB
pulling 948af2743fc7... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.5 KB
pulling bc371a43ce90... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 7.6 KB
pulling 53a87df39647... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 5.6 KB
pulling 56bb8bd477a5... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   96 B
pulling d95adcc05174... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  560 B
verifying sha256 digest
writing manifest
success
➜  models ollama run llama3.3:70b-instruct-q8_0
>>> Hi
;*?(9)6#94&>H<2%E-?7G5-?7/)'./7;2!12D&.."F#,+?72:D,?(>?7>GB9:("$.G%4(%BH<#!%&,?"%&>'%B"66.4:)>=-5$5EHG$<!:,F%%GH&=171+&-E:6#F.$5,!/?+/?CG,E-:'-?!1,(!>.9<E*/&3$#?5=>6*H$1:#%((=6?5;6?%2F5!35=-8#/+G'?7>8*;G?$/4'4;*?(9)6#94&>H<2%E-?7G5-?7/)'./7;2!12D&.."F#,+?72:D,?(>?7>GB9:("$.G%4(%BH<#!%&,?"%&>'%B"66.4:)>=-5$5EHG$<!:,F%%GH&=171+&-E:6#F.$5,!/?+/?CG,E-:'-?!1,(!>.9<E*/&3$#?5=>6*H$1:#%((=6?5;6?%2F5!35=-8#/+G'?7>8*;G?$/4'4?6?DB>?>'-'7:9H!CGGD#.!?GE4G!.&)4F,<71GD4//H<2#1$&7.'&5,/7E=/E(#/-6-",5^C

Nothing else needs be said.

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.5.1

GiteaMirror added the bug label 2026-04-22 10:49:54 -05:00

@rick-github commented on GitHub (Dec 7, 2024):

$ ollama run llama3.3:70b-instruct-q8_0
>>> Hi
How's it going? Is there something I can help you with or would you like to chat?

Thing that needs be said: [server log](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues)


@Baughn commented on GitHub (Dec 7, 2024):

[server.log](https://github.com/user-attachments/files/18049606/server.log)

Here you are. The out-of-memory errors are quite noticeable, but the laptop still has 30 GB free out of 96 GB total, and I've run models of this size in the past. That said, I notice the UI gets laggy, so this might be an OS issue.

I'd propose surfacing the error instead of picking random tokens, however. Or... these don't look quite random. Is it reading uninitialised memory?


@rick-github commented on GitHub (Dec 7, 2024):

memory.available="[72.0 GiB]"  memory.required.full="71.1 GiB"

The model is pushing the limit of the available resources; the transient allocations that llama.cpp makes might be sending it over the edge. A similar issue was [logged](https://github.com/ggerganov/llama.cpp/issues/9701) in llama.cpp and the solution was to reduce the size of the context, i.e. reduce memory usage. Since you're already using the default of 2048 there's not much in the way of savings there, but there are other mitigations you can try.

  1. Set [`OLLAMA_GPU_OVERHEAD`](https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L237) to give llama.cpp a buffer to grow into (e.g. `OLLAMA_GPU_OVERHEAD=536870912` to reserve 512 MiB).
  2. Enable flash attention by setting [`OLLAMA_FLASH_ATTENTION=1`](https://github.com/ollama/ollama/blob/5f8051180e3b9aeafc153f6b5056e7358a939c88/envconfig/config.go#L236) in the server environment. Flash attention uses memory more efficiently and may reduce memory pressure.
  3. Reduce the number of layers that Ollama thinks it can offload to the GPU, see [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650). Ollama is currently offloading 81 layers; try setting `num_gpu` to 75. (A combined sketch of these settings appears below.)

Regarding the errors, a similar [issue](https://github.com/ggerganov/llama.cpp/issues/9701) was filed against llama.cpp some time ago and a [PR](https://github.com/ggerganov/llama.cpp/pull/1826) to fix it was merged. However, the code has changed considerably since then, and it's possible there's been a regression. The PR has some discussion about various limits that may be informative to a Mac user (i.e., not me).
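
A minimal sketch of how these settings could be combined on macOS with the desktop app (the overhead and layer values are illustrative, not tuned recommendations; `launchctl setenv` is the usual way to set environment variables for the Ollama app):

```shell
# Give the server extra VRAM headroom and enable flash attention
# (restart the Ollama app afterwards so the new environment takes effect).
launchctl setenv OLLAMA_GPU_OVERHEAD 536870912   # reserve ~512 MiB
launchctl setenv OLLAMA_FLASH_ATTENTION 1

# Then cap the number of offloaded layers for the session:
ollama run llama3.3:70b-instruct-q8_0
>>> /set parameter num_gpu 75
```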


@dstadulis commented on GitHub (Dec 9, 2024):

When memory usage is at a critical level, reducing the context size eliminates the logged errors and results in coherent inference:

$ ollama run llama3.3:70b-instruct-q8_0
>>> hi
)$9H,(C6(42$
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
$ ollama run llama3.3:70b-instruct-q8_0
>>> /set parameter num_ctx 64
Set parameter 'num_ctx' to '64'
>>> hi
How's it going? Is there something I can help you with or would you like to chat?

vm_stat output is available if it would be helpful.
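
A reduced context can also be persisted instead of set per-session; a minimal sketch using a derived model (the model name `llama3.3-small-ctx` and the 1024 value are illustrative):

```shell
# Create a variant of the model with a smaller context window baked in
cat > Modelfile <<'EOF'
FROM llama3.3:70b-instruct-q8_0
PARAMETER num_ctx 1024
EOF
ollama create llama3.3-small-ctx -f Modelfile
ollama run llama3.3-small-ctx
```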


@rick-github commented on GitHub (Dec 10, 2024):

Yes, reducing VRAM usage improves performance. A context buffer of 64 tokens has limited utility, hence the other suggestions.
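
One quick way to sanity-check the effect of any of these settings is to look at how the loaded model is split between GPU and CPU (a sketch; the exact columns vary by Ollama version):

```shell
# Lists loaded models with their size and GPU/CPU processor split
ollama ps
```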


@dstadulis commented on GitHub (Dec 10, 2024):

> A context buffer of 64 tokens has limited utility

Agreed, size was chosen to demonstrate marginal effects.


@rick-github commented on GitHub (Dec 14, 2024):

Was this resolved?


Reference: github-starred/ollama#30867