[GH-ISSUE #8331] Llama3.3:70b-instruct-q5_K_M got EOF on MacMini M4 pro 64GB RAM #5339

Closed
opened 2026-04-12 16:32:22 -05:00 by GiteaMirror · 4 comments

Originally created by @simon4ddvd on GitHub (Jan 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8331

What is the issue?

Mac OS: Sequoia 15.2
Llama3.3:70b-instruct-q4_K_m is running OK.
Llama3.3:70b-instruct-q6_K is running OK.
But Llama3.3:70b-instruct-q5_K_m got EOF as below:

```
% ollama run llama3.3:70b-instruct-q5_k_m
>>> hi
Error: POST predict: Post "http://127.0.0.1:53684/completion": EOF
```

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.5.4

GiteaMirror added the bug label 2026-04-12 16:32:22 -05:00

@rick-github commented on GitHub (Jan 7, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.
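
(For anyone reproducing this on macOS: the guide linked above points at the server log on disk, and a minimal way to pull its most recent lines might be:)

```
# macOS: Ollama server log location per the linked troubleshooting guide
tail -n 100 ~/.ollama/logs/server.log
```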


@simon4ddvd commented on GitHub (Jan 7, 2025):

[server.20250107.log](https://github.com/user-attachments/files/18332112/server.20250107.log)
Please refer to the uploaded log. The error messages are as below:

```
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
llama_graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
llama_decode: failed to decode, ret = -3
panic: failed to decode batch: llama_decode failed with code -3
```

@rick-github commented on GitHub (Jan 7, 2025):

```
time=2025-01-07T18:10:10.780+08:00 level=INFO source=memory.go:356 msg="offload to metal" layers.requested=-1 layers.model=81 layers.offload=81 layers.split="" memory.available="[48.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="47.9 GiB" memory.required.partial="47.9 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[47.9 GiB]" memory.weights.total="45.7 GiB" memory.weights.repeating="44.9 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="324.0 MiB"
```

The GPU has 48G available and ollama wants to use 47.9G, so it's a really tight squeeze. I suspect that during inference llama.cpp is making a temporary allocation which exceeds the available spare VRAM and the process dies with an OOM error. The llama3.3:70b-instruct-q6_K model, which is larger, is probably overflowing into system RAM and leaving enough room for the transient memory requirements.
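
(Editor's aside on where the 48 GiB figure comes from: on Apple Silicon the GPU's wired-memory cap defaults to roughly three quarters of unified memory on higher-RAM machines, which on a 64 GB Mac works out to about 48 GiB. The percentage is an assumption from memory, not something stated in the log. The cap can be inspected with sysctl:)

```
# Hypothetical check, not from the original thread: show the GPU wired-memory cap.
# A value of 0 means macOS is using its built-in default (assumed ~75% of RAM here:
# 64 GiB * 0.75 = 48 GiB, matching memory.available in the log above).
sysctl iogpu.wired_limit_mb
```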

I'm not an Apple user but it's my understanding that you can [increase](https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315) the amount of VRAM your GPU has access to. If you can do that on your system, that will solve your problem. Otherwise, the only solution is to reduce the number of layers (`num_gpu`) being loaded into VRAM. See [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650) for ways to set `num_gpu`. Currently ollama is offloading 81 layers, try 75 and see how it goes.
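
(A concrete sketch of the `num_gpu` suggestion, not taken from the linked comment, so the exact values and model tag are illustrative: the layer count can be set per interactive session, per API request, or baked into a derived model via a Modelfile.)

```
# Per session, inside `ollama run`:
ollama run llama3.3:70b-instruct-q5_k_m
>>> /set parameter num_gpu 75

# Per request, via the REST API (default port 11434):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:70b-instruct-q5_k_m",
  "prompt": "hi",
  "options": { "num_gpu": 75 }
}'

# Persistently, via a Modelfile:
#   FROM llama3.3:70b-instruct-q5_k_m
#   PARAMETER num_gpu 75
# then: ollama create llama3.3-70b-q5-gpu75 -f Modelfile
```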


@simon4ddvd commented on GitHub (Jan 8, 2025):

Thank you so much.
After I set iogpu.wired_limit_mb=50176 (at least 49 GB), it's working fine.

```
sudo sysctl iogpu.wired_limit_mb=50176
```
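
(Editor's note on the value and its lifetime: 50176 MiB is 49 × 1024, which leaves roughly 1 GiB of headroom over the 47.9 GiB the model load requires, and a plain sysctl write does not survive a reboot, so it has to be reapplied after a restart. A quick way to confirm the limit currently in effect:)

```
# 49 GiB expressed in MiB: 49 * 1024 = 50176
# Show the current cap; this resets to the macOS default after a reboot.
sysctl iogpu.wired_limit_mb
```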