[GH-ISSUE #10900] Error: POST predict: Post "http://127.0.0.1:56330/completion": EOF of deepseek-r1-8b-qwen3 #7165

Closed
opened 2026-04-12 19:09:49 -05:00 by GiteaMirror · 3 comments

Originally created by @TatsuhiroC on GitHub (May 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10900

What is the issue?

I have no idea why, but I get the error `Error: POST predict: Post "http://127.0.0.1:56330/completion": EOF` when I try to use deepseek-r1-8b-qwen3 on Ollama version 0.9.0-rc0, which should support this model.

System: macOS Sequoia 15.5
Ollama version: 0.9.0-rc0
model: deepseek-r1:8b-0528-qwen3-q8_0

Relevant log output

```shell
ollama run deepseek-r1:8b-0528-qwen3-q8_0
>>> hello
Error: POST predict: Post "http://127.0.0.1:56330/completion": EOF
```
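
An EOF on the client side generally means the runner process serving the request exited mid-request, so the next step is to look at the server rather than the client. A quick first check (standard CLI commands, nothing model-specific):

```shell
# Shows which models are loaded and how much memory each occupies
ollama ps
```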

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 19:09:49 -05:00

@rick-github commented on GitHub (May 29, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.
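
For reference, on macOS the log that the troubleshooting guide points to can be read directly:

```shell
# Location used by the macOS app, per the troubleshooting doc
cat ~/.ollama/logs/server.log
```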


@TatsuhiroC commented on GitHub (May 29, 2025):

[server.log](https://github.com/user-attachments/files/20512749/server.log)
@rick-github
It's too long and I can't read it, so... I just pasted it all here.


@rick-github commented on GitHub (May 30, 2025):

```
[GIN] 2025/05/30 - 07:51:46 | 200 |  5.446009875s |       127.0.0.1 | POST     "/api/generate"
ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
```

The runner ran out of memory.

```
time=2025-05-30T07:51:41.359+08:00 level=INFO source=server.go:168 msg=offload library=metal layers.requested=-1
 layers.model=37 layers.offload=37 layers.split="" memory.available="[10.7 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="10.2 GiB" memory.required.partial="10.2 GiB" memory.required.kv="1.1 GiB"
 memory.required.allocations="[10.2 GiB]" memory.weights.total="7.6 GiB" memory.weights.repeating="7.0 GiB"
 memory.weights.nonrepeating="630.6 MiB" memory.graph.full="768.0 MiB" memory.graph.partial="768.0 MiB"
llama_model_load_from_file_impl: using device Metal (Apple M1 Pro) - 10922 MiB free
```
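
For context, the components in the log roughly sum to that estimate (a back-of-the-envelope reading, not Ollama's exact accounting): 7.6 GiB weights + 1.1 GiB KV cache + 0.75 GiB compute graph ≈ 9.5 GiB, with the remaining ~0.7 GiB presumably runtime/Metal overhead, arriving at the 10.2 GiB figure.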

During model loading, ollama estimated that it needed 10.2 GiB of the available 10.7 GiB to load the model. Since it OOM'ed, the estimation was incorrect. You can find ways to mitigate this [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288).
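
Two common mitigations for this kind of OOM are shrinking the KV cache via a smaller context window, or offloading fewer layers to the GPU. A minimal sketch using standard Ollama REPL commands (values are illustrative, not tuned):

```shell
ollama run deepseek-r1:8b-0528-qwen3-q8_0
# Smaller context window -> smaller KV cache
>>> /set parameter num_ctx 2048
# Or keep some layers on the CPU instead of Metal (the log shows 37 layers total)
>>> /set parameter num_gpu 30
```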
