[GH-ISSUE #9674] Error: POST predict: Post "http://127.0.0.1:62622/completion": read tcp 127.0.0.1:62627->127.0.0.1:62622: wsarecv: The remote host has closed a connection. #52823

Open
opened 2026-04-29 00:59:45 -05:00 by GiteaMirror · 49 comments

Originally created by @mswcap on GitHub (Mar 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9674

What is the issue?

When I run Gemma3:12b the first one or two prompts run fine. But for any prompt thereafter this error is thrown: Error: POST predict: Post "http://127.0.0.1:62622/completion": read tcp 127.0.0.1:62627->127.0.0.1:62622: wsarecv: The remote host has closed a connection.

Relevant log output


OS

Windows 11

GPU

Nvidia

CPU

AMD

Ollama version

0.6.0

GiteaMirror added the bug and needs more info labels 2026-04-29 00:59:45 -05:00

@SunnyOd commented on GitHub (Mar 12, 2025):

Similar thing with the 27B model, getting this sometimes when asking the first questions, but always after the second:

Error: POST predict: Post "http://127.0.0.1:44703/completion": EOF

Running v0.6.0 with a 3090; it maxes VRAM and uses an additional 27% of system memory.


@mswcap commented on GitHub (Mar 12, 2025):

This issue resembles this one as well: https://github.com/ollama/ollama/issues/9676. Makes me wonder whether it's really model-related or whether Ollama 0.6.0 is having issues.


@rick-github commented on GitHub (Mar 12, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@mswcap commented on GitHub (Mar 12, 2025):

server.log

app.log


@mswcap commented on GitHub (Mar 12, 2025):

Hi @rick-github, you're right. See above. Thank you for your time and efforts.


@rick-github commented on GitHub (Mar 12, 2025):

time=2025-03-12T14:29:55.607+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=31 layers.split="" memory.available="[6.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="9.5 GiB" memory.required.partial="6.2 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="519.5 MiB" memory.graph.partial="1.3 GiB"

[GIN] 2025/03/12 - 14:30:22 | 200 |    4.0387674s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/12 - 14:30:55 | 200 |   27.1031548s |       127.0.0.1 | POST     "/api/chat"

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 9487.76 MiB on device 0: cudaMalloc failed: out of memory

ollama allocated 6.2G of 6.3G to host the model, which worked for the first couple of requests, but then the runner ran out of memory and crashed. So either there were transient allocations that exceeded the remaining 0.1G, or something like a memory leak. You can reduce the memory footprint by following some of the recommendations here: https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288.


@mswcap commented on GitHub (Mar 12, 2025):

Hi @rick-github , thanks for the advice. But even with these settings, I do get the same error.
set OLLAMA_NUM_PARALLEL=1
set OLLAMA_GPU_OVERHEAD=536870912
set OLLAMA_FLASH_ATTENTION=1

And even these settings do not help either.
set OLLAMA_GPU_OVERHEAD=1073741824
set OLLAMA_NUM_CTX=2048

Strangest thing is that I can run bigger models (like phi4:latest) without any issues, besides being a bit slow in the response. But no OOMs.


@rick-github commented on GitHub (Mar 12, 2025):

Server log? And OLLAMA_NUM_CTX is not an ollama config variable, try OLLAMA_CONTEXT_LENGTH.


@Corredor-Mediterraneo commented on GitHub (Mar 12, 2025):

OS
Windows 11 Pro

GPU
Nvidia GeForce RTX 3060 12 GB

CPU
AMD Ryzen 7 5800X 8-Core Processor 3.80 GHz

Ollama version
0.6.0

Same error:

ollama run gemma3:4b
>>> hi
Error: POST predict: Post "http://127.0.0.1:56993/completion": read tcp 127.0.0.1:56995->127.0.0.1:56993: wsarecv: An existing connection was forcibly closed by the remote host.

app.log
config.json
server.log


@rick-github commented on GitHub (Mar 12, 2025):

@Corredor-Mediterraneo Your problem is different, it is this one: https://github.com/ollama/ollama/issues/9509.


@mswcap commented on GitHub (Mar 12, 2025):

config.json

Hi @rick-github, I am so sorry, see attachments. Tried these settings but no result :(. Thanks again for your time and patience.
set OLLAMA_NUM_PARALLEL=1
set OLLAMA_GPU_OVERHEAD=1073741824
set OLLAMA_CONTEXT_LENGTH=2048

server.log
app.log


@mswcap commented on GitHub (Mar 13, 2025):

When I run Gemma3:12b in LMStudio, it works without OOMs. No additional memory settings/config required. On the same machine. Perhaps there is an issue with Ollama 0.6.0?


@ecarmen16 commented on GitHub (Mar 15, 2025):

Yeah, this is a strange one that I've noticed seems to affect my M4 more than my Windows machines. I absolutely have the resources to run QWQ 32b, but I get this on the q8_0 version every time I try. I'll redownload the q4 (default) and report back, but this is super annoying and doesn't give great output, considering the failure is the result of Ollama inference from its own run command.


@rick-github commented on GitHub (Mar 15, 2025):

> Hi @rick-github, I am so sorry, see attachments. Tried these settings but no result :(. Thanks again for your time and patience.
> set OLLAMA_NUM_PARALLEL=1
> set OLLAMA_GPU_OVERHEAD=1073741824
> set OLLAMA_CONTEXT_LENGTH=2048

@mswcap These settings are not shown in the log:

OLLAMA_NUM_PARALLEL:0
OLLAMA_GPU_OVERHEAD:0 

You have to set them in the server environment (https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-windows) for them to take effect.
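
For reference, a minimal PowerShell sketch of doing that, assuming the default Windows tray install (the values are just the ones tried above, not recommendations):

# Set the variables as user environment variables so the Ollama server picks
# them up, then quit Ollama from the tray and restart it for them to take effect.
[Environment]::SetEnvironmentVariable("OLLAMA_NUM_PARALLEL", "1", "User")
[Environment]::SetEnvironmentVariable("OLLAMA_GPU_OVERHEAD", "1073741824", "User")
[Environment]::SetEnvironmentVariable("OLLAMA_CONTEXT_LENGTH", "2048", "User")
# After the restart, the values should show up in the configuration dump near the top of server.log.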


@rick-github commented on GitHub (Mar 15, 2025):

> When I run Gemma3:12b in LMStudio, it works without OOMs. No additional memory settings/config required. On the same machine. Perhaps there is an issue with Ollama 0.6.0?

I suspect LMStudio is much more conservative with its layer offloading than ollama. For example, when I load :27b on my test GPU (12G) it offloads 27 layers while ollama offloads 35. Tuning the VRAM allocation with the above variables or controlling the layer count directly with num_gpu should help while the developers zero in on the memory calculations.
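
As a sketch of the num_gpu route (illustrative only; the layer count of 20 is an arbitrary starting point to lower until the OOM goes away), it can be set per request through the API options, or interactively:

# Per-request via the API (PowerShell):
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -ContentType "application/json" -Body (@{
    model   = "gemma3:12b"
    prompt  = "Hello"
    stream  = $false
    options = @{ num_gpu = 20 }   # cap the number of layers offloaded to the GPU
} | ConvertTo-Json)

# Or interactively, inside `ollama run gemma3:12b`:
#   /set parameter num_gpu 20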


@mswcap commented on GitHub (Mar 15, 2025):

Hi @rick-github, I have set the variables as requested and updated to Ollama 0.6.1. Now I get the error 'Error: POST predict: Post "http://127.0.0.1:52573/completion": read tcp 127.0.0.1:53516->127.0.0.1:52573: wsarecv: De externe host heeft een verbinding verbroken.' (Dutch for "The remote host has closed the connection"). I will look for the cause and try to solve it.

See logs attached (I am learning it :).

app.log
app-1.log
config.json
server.log
server-1.log


@mswcap commented on GitHub (Mar 15, 2025):

No results. I even removed Ollama, all models and the environment variables. Reloaded the model (Gemma3:12b), but still the same error (Error: POST predict: Post "http://127.0.0.1:54429/completion": read tcp 127.0.0.1:54437->127.0.0.1:54429: wsarecv: De externe host heeft een verbinding verbroken.)

server.log


@mswcap commented on GitHub (Mar 15, 2025):

Hmm, when I use these environment variables, it works BUT nothing runs on the GPU (NVIDIA GeForce RTX 4060 Laptop GPU); everything runs on the CPU. Guess I am doing something wrong here.

server.log

OLLAMA_GPU_OVERHEAD 4294967296
OLLAMA_NUM_PARALLEL 1
OLLAMA_CONTEXT_LENGTH 2048


@mswcap commented on GitHub (Mar 15, 2025):

Changed OLLAMA_GPU_OVERHEAD to 2147483648 (2 GB), and now the GPU is partially used. But again the same error. :(

server.log


@mswcap commented on GitHub (Mar 15, 2025):

Guess Gemma3-12b doesn't run well on my system, even though other models run just fine with Ollama.


@rick-github commented on GitHub (Mar 15, 2025):

Something very strange is going on.

time=2025-03-15T21:51:51.016+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1
 layers.model=49 layers.offload=8 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="2.0 GiB"
 memory.required.full="11.6 GiB" memory.required.partial="4.8 GiB" memory.required.kv="768.0 MiB"
 memory.required.allocations="[4.8 GiB]" memory.weights.total="6.0 GiB" memory.weights.repeating="6.0 GiB"
 memory.weights.nonrepeating="787.5 MiB" memory.graph.full="519.5 MiB" memory.graph.partial="1.3 GiB"
 projector.weights="795.9 MiB" projector.graph="1.0 GiB"

Your 4060 has 8G VRAM, and ollama wants to use 4.8G of that to load the model. 2G is held aside by OLLAMA_GPU_OVERHEAD. So ollama offloads 8 layers to the GPU.

time=2025-03-15T21:51:56.911+01:00 level=INFO source=server.go:624 msg="llama runner started in 5.76 seconds"
[GIN] 2025/03/15 - 21:51:56 | 200 |    6.0910103s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/15 - 21:52:02 | 200 |       525.7µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/15 - 21:52:02 | 200 |       519.9µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/03/15 - 21:52:10 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/15 - 21:52:10 | 200 |     62.7702ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/15 - 21:52:10 | 200 |     45.0137ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/15 - 21:52:26 | 200 |    6.7491248s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/15 - 21:53:02 | 200 |   13.4244342s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/15 - 21:54:46 | 200 |         1m39s |       127.0.0.1 | POST     "/api/chat"

The model is loaded, and successfully answers both chat and generate requests. But then

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 21459.48 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 22501892096

it suddenly wants another 21G of memory! What? That makes no sense.

Do you recall what the prompt was that caused the crash? Did it involve vision?
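
One way to catch that transient allocation in the act (a sketch, assuming the NVIDIA driver's nvidia-smi tool is available) is to watch VRAM while re-running the chat that crashes:

# Poll GPU memory once per second in a separate window while reproducing the crash.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1

# Or capture it to a file to attach to the issue:
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 > vram.csv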


@mswcap commented on GitHub (Mar 16, 2025):

Hi @rick-github, the prompt involved wasn't vision related. I asked what Google as a company does for business activities. Just chatting away, in order to see whether the crashes were gone.


@mswcap commented on GitHub (Mar 16, 2025):

For testing I use this kind of chat: I ask what the model is, then who Google is, then what the activities of Google are. After that I ask random questions. Most of the time the model goes bad after the 2nd or 3rd question.


@mswcap commented on GitHub (Mar 17, 2025):

I just redownloaded Gemma3-12b and used Ollama 0.6.1. During a normal conversation (text only, no images involved) the model suddenly requires around 17 GB of memory, see attached server log. I don't know whether the model is buggy or Ollama is.

server.log

time=2025-03-17T08:15:58.642+01:00 level=INFO source=server.go:624 msg="llama runner started in 13.02 seconds"
[GIN] 2025/03/17 - 08:15:58 | 200 | 13.3177393s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/03/17 - 08:16:24 | 200 | 5.6807634s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/03/17 - 08:16:31 | 200 | 3.4231776s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/03/17 - 08:16:38 | 200 | 1.9263122s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/03/17 - 08:16:44 | 200 | 1.965793s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/03/17 - 08:17:02 | 200 | 6.3516731s | 127.0.0.1 | POST "/api/chat"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17481.62 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 18330807808
Exception 0xc0000005 0x0 0x58 0x7ff789a4ebd4
PC=0x7ff789a4ebd4
signal arrived during external code execution


@aalencia commented on GitHub (Mar 18, 2025):

I am able to reproduce this error and am available to help debug. Let me know.


@rick-github commented on GitHub (Mar 18, 2025):

@aalencia Does upgrading to 0.6.2 reduce the crashing?


@mswcap commented on GitHub (Mar 18, 2025):

@rick-github, can you please tell me how to install 0.6.2 on Windows? The download is version 0.6.1. I see 0.6.2 but no download. Guess I am doing something wrong, my bad.


@mswcap commented on GitHub (Mar 18, 2025):

Found it: https://github.com/ollama/ollama/releases


@mswcap commented on GitHub (Mar 18, 2025):

> @aalencia Does upgrading to 0.6.2 reduce the crashing?

YES it does! Woohooo! Just running Ollama 0.6.2 without additional settings for the server, and it runs great on prompts. Will test it with Open Web UI and images as well and report back.


@mswcap commented on GitHub (Mar 18, 2025):

Also describing images works like a charm! Thank you all!!!


@vitlib commented on GitHub (Mar 19, 2025):

I upgraded to 0.6.2; the model doesn't give the "Error: POST predict" anymore, but it still uses only the CPU.


@rick-github commented on GitHub (Mar 19, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@vitlib commented on GitHub (Mar 19, 2025):

server.log thanks


@rick-github commented on GitHub (Mar 19, 2025):

time=2025-03-19T15:57:45.943+01:00 level=ERROR source=ggml.go:88 msg="failed to get absolute path" error="The filename, directory name, or volume label syntax is incorrect."
load_backend: loaded CPU backend from C:\Users\libbb\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2025-03-19T15:57:46.211+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)

No GPU backends found. What's the output of

dir /s C:\Users\libbb\AppData\Local\Programs\Ollama

@vitlib commented on GitHub (Mar 19, 2025):

With the /s argument it returns an error ("not found"). But if I run other models it works just fine!


@rick-github commented on GitHub (Mar 19, 2025):

There should be at least something like the following

 Directory of C:\Users\libbb\AppData\Local\Programs\Ollama

07/03/2025  18:04    <DIR>          .
07/03/2025  18:02    <DIR>          ..
04/03/2025  04:11             7,502 app.ico
07/03/2025  18:02    <DIR>          lib
04/03/2025  04:12         7,046,080 ollama app.exe
04/03/2025  04:12        30,578,104 ollama.exe
04/03/2025  04:12            11,815 ollama_welcome.ps1
07/03/2025  18:04           233,835 unins000.dat
07/03/2025  18:01         3,446,712 unins000.exe
07/03/2025  18:04            24,381 unins000.msg
               7 File(s)     41,348,429 bytes

because that's where the ollama executable is.


@vitlib commented on GitHub (Mar 19, 2025):

Yes without the /s argument that's the output


@rick-github commented on GitHub (Mar 19, 2025):

Run it in a CMD window, not a PS window.


@rick-github commented on GitHub (Mar 19, 2025):

Or in a PS window:

 dir -recurse C:\Users\libbb\AppData\Local\Programs\Ollama

@vitlib commented on GitHub (Mar 19, 2025):

dir-recurse.log


@rick-github commented on GitHub (Mar 20, 2025):

The directory listing looks normal. The error is The filename, directory name, or volume label syntax is incorrect. but then it successfully loads the ggml-cpu-haswell.dll backend. It's not clear what's causing the error. You could try setting OLLAMA_DEBUG=1 in the server environment and post the log from that - it will be quite large but it might contain a clue.
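
For completeness, a sketch of what that might look like on Windows, assuming the default install and the usual log location from the troubleshooting doc linked above:

# Enable verbose server logging (quit Ollama from the tray and restart it afterwards).
[Environment]::SetEnvironmentVariable("OLLAMA_DEBUG", "1", "User")

# After reproducing the failure, grab the tail of the server log:
Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 200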


@rtmcrc commented on GitHub (Apr 4, 2025):

I have the same issue, but only with Phi4; all other models work fine.

server.log
app.log

phi4:latest 9.1 GB ERROR
vanilj/Phi-4:latest 9.1 GB ERROR
vanilj/phi-4-unsloth:latest 8.9 GB OK
hf.co/google/gemma-3-12b-it-qat-q4_0-gguf:latest 8.9 GB OK
deepseek-r1:14b 9.0 GB OK
mistral-nemo:12b-instruct-2407-q6_K 10 GB OK
deepseek-coder-v2:latest 8.9 GB OK
mightykatun/qwen2.5-math:7b 8.1 GB OK
hhao/qwen2.5-coder-tools:1.5b 1.6 GB OK
hhao/qwen2.5-coder-tools:7b 4.7 GB OK
hhao/qwen2.5-coder-tools:14b 9.0 GB OK

OS
Windows 10

GPU
Nvidia

CPU
Intel

Ollama version
0.6.4


@rick-github commented on GitHub (Apr 7, 2025):

time=2025-04-05T02:45:02.824+06:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1
 layers.model=41 layers.offload=3 layers.split="" memory.available="[1.6 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="9.9 GiB" memory.required.partial="1.6 GiB" memory.required.kv="400.0 MiB"
 memory.required.allocations="[1.6 GiB]" memory.weights.total="8.2 GiB" memory.weights.repeating="7.8 GiB"
 memory.weights.nonrepeating="402.0 MiB" memory.graph.full="266.7 MiB" memory.graph.partial="266.7 MiB"

CUDA error: out of memory
  current device: 0, in function alloc at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:467
  cuMemSetAccess((CUdeviceptr)((char *)(pool_addr) + pool_size), reserve_size, &access, 1)

ollama offloaded 3 of 41 layers to the GPU, taking 1.6 GiB of the available 1.6 GiB VRAM, i.e. all of it. A temporary allocation during inference caused an OOM error. See here for ways to mitigate this: https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288.
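
A sketch of one such mitigation (the phi4-small name and parameter values below are made up for the example): bake a smaller context and a lower layer count into a model variant instead of changing server-wide settings.

# Write a Modelfile that derives a lower-memory variant of the model.
@"
FROM phi4:latest
PARAMETER num_ctx 2048
PARAMETER num_gpu 20
"@ | Set-Content Modelfile

# Create and run the variant (the name is arbitrary).
ollama create phi4-small -f Modelfile
ollama run phi4-small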


@vitlib commented on GitHub (Apr 7, 2025):

> The directory listing looks normal. The error is The filename, directory name, or volume label syntax is incorrect. but then it successfully loads the ggml-cpu-haswell.dll backend. It's not clear what's causing the error. You could try setting OLLAMA_DEBUG=1 in the server environment and post the log from that - it will be quite large but it might contain a clue.

The issue fixed itself after updating Ollama.


@kuriel-dev commented on GitHub (Apr 28, 2025):

No matter what params I set for ollama serve, it breaks at once when long prompts are sent in requests. I tried all the options here:

https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288

I tried with ollama-js
I tried with custom HTTP requests (API)
I tried with chat
I tried with generate

I thought I was having a bad implementation, but Ollama just breaks.

The problem is the API service from ollama serve; it just can't handle long requests, which breaks the whole usability of the tool.

Steps to reproduce:

  1. Copy a long text of more than 2k characters (tested with 5k characters, which is normal when asking the AI for a summary).
     Next step: none, that's it; Ollama breaks on that answer and the following ones.

One weird thing is that ollama /generate handles image processing fine, and images are, by far, larger than text, so I can assume the prompt handling is the problem. That includes long contexts: when the API returns a context and I recycle it in the next request, if the context is large the requests error out.

Hope this helps to identify the bug, for the sake of people (like me) who are just frustrated because the service is broken.


@kuriel-dev commented on GitHub (Apr 29, 2025):

I'm back with the error log:

ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed
time=2025-04-29T10:05:30.623-06:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"

My only custom settings are:

      OLLAMA_GPU_OVERHEAD: 536870912,
      OLLAMA_FLASH_ATTENTION: 1,
      OLLAMA_NUM_PARALLEL: 1,
      OLLAMA_CONTEXT_LENGTH: 1024 * 16,

No extra API params, because it really doesn't matter; the result is the same error.


@rick-github commented on GitHub (Apr 29, 2025):

#10410


@kuriel-dev commented on GitHub (Apr 29, 2025):

> #10410

Hello rick, this is happening while I'm monitoring my 32 GB of RAM; while running the long request, 70% is in use

(which means I have 9.6 GB free DURING the operation).

Still failing to allocate memory with a 4 GB model and a 5 to 50 MB request input.


@kuriel-dev commented on GitHub (Apr 30, 2025):

Hi @rick-github, I'm happy to help contribute to resolving this issue and streamlining the process.

I've observed that when submitting an image alongside a lengthy prompt, the request completes successfully.

I believe the memory allocation algorithm currently used in the image processing pipeline could be re-used for this purpose.

Currently, I consistently submit very small images in my requests, which still results in a significant wait time, leading to a frustrating experience for users. I wanted to share this feedback as it highlights a potential area for improvement and efficiency.


Reference: github-starred/ollama#52823