[GH-ISSUE #9674] Error: POST predict: Post "http://127.0.0.1:62622/completion": read tcp 127.0.0.1:62627->127.0.0.1:62622: wsarecv: The remote host has closed a connection. #52823

Open
opened 2026-04-29 00:59:45 -05:00 by GiteaMirror · 49 comments

Originally created by @mswcap on GitHub (Mar 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9674

What is the issue?

When I run Gemma3:12b the first one or two prompts run fine. But for any prompt thereafter this error is thrown: Error: POST predict: Post "http://127.0.0.1:62622/completion": read tcp 127.0.0.1:62627->127.0.0.1:62622: wsarecv: The remote host has closed a connection.

Relevant log output


OS

Windows 11

GPU

Nvidia

CPU

AMD

Ollama version

0.6.0

GiteaMirror added the bug and needs more info labels 2026-04-29 00:59:45 -05:00

@SunnyOd commented on GitHub (Mar 12, 2025):

Similar thing with the 27B model, getting this sometimes when asking the first questions, but always after the second:

Error: POST predict: Post "http://127.0.0.1:44703/completion": EOF

Running v0.6.0 with a 3090; it maxes VRAM and uses an additional 27% of system memory.


@mswcap commented on GitHub (Mar 12, 2025):

This issue resembles this one as well: https://github.com/ollama/ollama/issues/9676. Makes me wonder whether it's really model-related or whether Ollama 0.6.0 is having issues.


@rick-github commented on GitHub (Mar 12, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@mswcap commented on GitHub (Mar 12, 2025):

server.log

app.log


@mswcap commented on GitHub (Mar 12, 2025):

Hi @rick-github, you're right. See above. Thank you for your time and efforts.


@rick-github commented on GitHub (Mar 12, 2025):

time=2025-03-12T14:29:55.607+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=31 layers.split="" memory.available="[6.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="9.5 GiB" memory.required.partial="6.2 GiB" memory.required.kv="768.0 MiB" memory.required.allocations="[6.2 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="519.5 MiB" memory.graph.partial="1.3 GiB"

[GIN] 2025/03/12 - 14:30:22 | 200 |    4.0387674s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/12 - 14:30:55 | 200 |   27.1031548s |       127.0.0.1 | POST     "/api/chat"

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 9487.76 MiB on device 0: cudaMalloc failed: out of memory

ollama allocated 6.2G of 6.3G to host the model, which worked for the first couple of requests, but then the runner ran out of memory and crashed. So either there were transient allocations that exceeded the remaining 0.1G, or something like a memory leak. You can reduce the memory footprint by following some of the recommendations here: https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288.


@mswcap commented on GitHub (Mar 12, 2025):

Hi @rick-github , thanks for the advice. But even with these settings, I do get the same error.
set OLLAMA_NUM_PARALLEL=1
set OLLAMA_GPU_OVERHEAD=536870912
set OLLAMA_FLASH_ATTENTION=1

And even these settings do not help either.
set OLLAMA_GPU_OVERHEAD=1073741824
set OLLAMA_NUM_CTX=2048

Strangest thing is that I can run bigger models (like phi4:latest) without any issues, besides being a bit slow in the response. But no OOMs.


@rick-github commented on GitHub (Mar 12, 2025):

Server log? And OLLAMA_NUM_CTX is not an ollama config variable, try OLLAMA_CONTEXT_LENGTH.


@Corredor-Mediterraneo commented on GitHub (Mar 12, 2025):

OS
Windows 11 Pro

GPU
Nvidia GeForce RTX 3060 12 GB

CPU
AMD Ryzen 7 5800X 8-Core Processor 3.80 GHz

Ollama version
0.6.0

Same error:

ollama run gemma3:4b
>>> hi
Error: POST predict: Post "http://127.0.0.1:56993/completion": read tcp 127.0.0.1:56995->127.0.0.1:56993: wsarecv: An existing connection was forcibly closed by the remote host.

app.log
config.json
server.log


@rick-github commented on GitHub (Mar 12, 2025):

@Corredor-Mediterraneo Your problem is different, it is this one: https://github.com/ollama/ollama/issues/9509.


@mswcap commented on GitHub (Mar 12, 2025):

config.json

Hi @rick-github, I am so sorry, see attachments. Tried these settings but no result :(. Thanks again for your time and patience.
set OLLAMA_NUM_PARALLEL=1
set OLLAMA_GPU_OVERHEAD=1073741824
set OLLAMA_CONTEXT_LENGTH=2048

server.log
app.log


@mswcap commented on GitHub (Mar 13, 2025):

When I run Gemma3:12b in LMStudio, it works without OOMs. No additional memory settings/config required. On the same machine. Perhaps there is an issue with Ollama 0.6.0?


@ecarmen16 commented on GitHub (Mar 15, 2025):

Yeah, this is a strange one that I've noticed seems to affect my M4 more than my Windows machines. I absolutely have the resources to run QWQ 32b, but I get this on the q8_0 version every time I try. I'll redownload the q4 (default) and report back, but this is super annoying and doesn't give great output, considering the failure is the result of Ollama inference from its own run command.


@rick-github commented on GitHub (Mar 15, 2025):

> Hi @rick-github, I am so sorry, see attachments. Tried these settings but no result :(. Thanks again for your time and patience.
> set OLLAMA_NUM_PARALLEL=1
> set OLLAMA_GPU_OVERHEAD=1073741824
> set OLLAMA_CONTEXT_LENGTH=2048

@mswcap These settings are not shown in the log:

OLLAMA_NUM_PARALLEL:0
OLLAMA_GPU_OVERHEAD:0 

You have to set them in the server environment (https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-windows) for them to take effect.
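
For reference, a minimal PowerShell sketch of doing that, assuming the default Windows tray install (the values are just the ones tried above, not recommendations):

# Set the variables as user environment variables so the Ollama server picks
# them up, then quit Ollama from the tray and restart it for them to take effect.
[Environment]::SetEnvironmentVariable("OLLAMA_NUM_PARALLEL", "1", "User")
[Environment]::SetEnvironmentVariable("OLLAMA_GPU_OVERHEAD", "1073741824", "User")
[Environment]::SetEnvironmentVariable("OLLAMA_CONTEXT_LENGTH", "2048", "User")
# After the restart, the values should show up in the configuration dump near the top of server.log.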


@rick-github commented on GitHub (Mar 15, 2025):

> When I run Gemma3:12b in LMStudio, it works without OOMs. No additional memory settings/config required. On the same machine. Perhaps there is an issue with Ollama 0.6.0?

I suspect LMStudio is much more conservative with its layer offloading than ollama. For example, when I load :27b on my test GPU (12G) it offloads 27 layers while ollama offloads 35. Tuning the VRAM allocation with the above variables or controlling the layer count directly with num_gpu should help while the developers zero in on the memory calculations.
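
As a sketch of the num_gpu route (illustrative only; the layer count of 20 is an arbitrary starting point to lower until the OOM goes away), it can be set per request through the API options, or interactively:

# Per-request via the API (PowerShell):
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post -ContentType "application/json" -Body (@{
    model   = "gemma3:12b"
    prompt  = "Hello"
    stream  = $false
    options = @{ num_gpu = 20 }   # cap the number of layers offloaded to the GPU
} | ConvertTo-Json)

# Or interactively, inside `ollama run gemma3:12b`:
#   /set parameter num_gpu 20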


@mswcap commented on GitHub (Mar 15, 2025):

Hi @rick-github, I have set the variables as requested and updated to Ollama 0.6.1. Now I get the error 'Error: POST predict: Post "http://127.0.0.1:52573/completion": read tcp 127.0.0.1:53516->127.0.0.1:52573: wsarecv: De externe host heeft een verbinding verbroken.' (Dutch for "The remote host has closed the connection"). I will look for the cause and try to solve it.

See logs attached (I am learning it :).

app.log
app-1.log
config.json
server.log
server-1.log


@mswcap commented on GitHub (Mar 15, 2025):

No results. I even removed Ollama, all models and the environment variables. Reloaded the model (Gemma3:12b), but still the same error (Error: POST predict: Post "http://127.0.0.1:54429/completion": read tcp 127.0.0.1:54437->127.0.0.1:54429: wsarecv: De externe host heeft een verbinding verbroken.)

server.log


@mswcap commented on GitHub (Mar 15, 2025):

Hmm, when I use these environment variables, it works BUT nothing runs on the GPU (NVIDIA GeForce RTX 4060 Laptop GPU); everything runs on the CPU. Guess I am doing something wrong here.

server.log

OLLAMA_GPU_OVERHEAD 4294967296
OLLAMA_NUM_PARALLEL 1
OLLAMA_CONTEXT_LENGTH 2048


@mswcap commented on GitHub (Mar 15, 2025):

Changed OLLAMA_GPU_OVERHEAD to 2147483648 (2 GB), and now the GPU is partially used. But again the same error. :(

server.log


@mswcap commented on GitHub (Mar 15, 2025):

Guess Gemma3-12b doesn't run well on my system, even though other models run just fine with Ollama.


@rick-github commented on GitHub (Mar 15, 2025):

Something very strange is going on.

time=2025-03-15T21:51:51.016+01:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1
 layers.model=49 layers.offload=8 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="2.0 GiB"
 memory.required.full="11.6 GiB" memory.required.partial="4.8 GiB" memory.required.kv="768.0 MiB"
 memory.required.allocations="[4.8 GiB]" memory.weights.total="6.0 GiB" memory.weights.repeating="6.0 GiB"
 memory.weights.nonrepeating="787.5 MiB" memory.graph.full="519.5 MiB" memory.graph.partial="1.3 GiB"
 projector.weights="795.9 MiB" projector.graph="1.0 GiB"

Your 4060 has 8G VRAM, and ollama wants to use 4.8G of that to load the model. 2G is held aside by OLLAMA_GPU_OVERHEAD. So ollama offloads 8 layers to the GPU.

time=2025-03-15T21:51:56.911+01:00 level=INFO source=server.go:624 msg="llama runner started in 5.76 seconds"
[GIN] 2025/03/15 - 21:51:56 | 200 |    6.0910103s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/15 - 21:52:02 | 200 |       525.7µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/15 - 21:52:02 | 200 |       519.9µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/03/15 - 21:52:10 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2025/03/15 - 21:52:10 | 200 |     62.7702ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/03/15 - 21:52:10 | 200 |     45.0137ms |       127.0.0.1 | POST     "/api/generate"
[GIN] 2025/03/15 - 21:52:26 | 200 |    6.7491248s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/15 - 21:53:02 | 200 |   13.4244342s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/03/15 - 21:54:46 | 200 |         1m39s |       127.0.0.1 | POST     "/api/chat"

The model is loaded, and successfully answers both chat and generate requests. But then

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 21459.48 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 22501892096

it suddenly wants another 21G of memory! What? That makes no sense.

Do you recall what the prompt was that caused the crash? Did it involve vision?
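
One way to catch that transient allocation in the act (a sketch, assuming the NVIDIA driver's nvidia-smi tool is available) is to watch VRAM while re-running the chat that crashes:

# Poll GPU memory once per second in a separate window while reproducing the crash.
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1

# Or capture it to a file to attach to the issue:
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 > vram.csv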


@mswcap commented on GitHub (Mar 16, 2025):

Hi @rick-github, the prompt involved wasn't vision related. I asked what Google as a company does for business activities. Just chatting away, in order to see whether the crashes were gone.


@mswcap commented on GitHub (Mar 16, 2025):

For testing I use this kind of chat: I ask what the model is, then who Google is, then what the activities of Google are. After that I ask random questions. Most of the time the model goes bad after the 2nd or 3rd question.


@mswcap commented on GitHub (Mar 17, 2025):

I just redownloaded Gemma3-12b and used Ollama 0.6.1. During a normal conversation (text only, no images involved) the model suddenly requires around 17 GB of memory, see attached server log. I don't know whether the model is buggy or Ollama is.

server.log

time=2025-03-17T08:15:58.642+01:00 level=INFO source=server.go:624 msg="llama runner started in 13.02 seconds"
[GIN] 2025/03/17 - 08:15:58 | 200 | 13.3177393s | 127.0.0.1 | POST "/api/generate"
[GIN] 2025/03/17 - 08:16:24 | 200 | 5.6807634s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/03/17 - 08:16:31 | 200 | 3.4231776s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/03/17 - 08:16:38 | 200 | 1.9263122s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/03/17 - 08:16:44 | 200 | 1.965793s | 127.0.0.1 | POST "/api/chat"
[GIN] 2025/03/17 - 08:17:02 | 200 | 6.3516731s | 127.0.0.1 | POST "/api/chat"
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 17481.62 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 18330807808
Exception 0xc0000005 0x0 0x58 0x7ff789a4ebd4
PC=0x7ff789a4ebd4
signal arrived during external code execution


@aalencia commented on GitHub (Mar 18, 2025):

I am able to reproduce this error and am available to help debug. Let me know.


@rick-github commented on GitHub (Mar 18, 2025):

@aalencia Does upgrading to 0.6.2 reduce the crashing?


@mswcap commented on GitHub (Mar 18, 2025):

@rick-github, can you please tell me how to install 0.6.2 on Windows? The download is version 0.6.1. I see 0.6.2 but no download. Guess I am doing something wrong, my bad.


@mswcap commented on GitHub (Mar 18, 2025):

Found it: https://github.com/ollama/ollama/releases


@mswcap commented on GitHub (Mar 18, 2025):

> @aalencia Does upgrading to 0.6.2 reduce the crashing?

YES it does! Woohooo! Just running Ollama 0.6.2 without additional settings for the server, and it runs great on prompts. Will test it with Open Web UI and images as well and report back.


@mswcap commented on GitHub (Mar 18, 2025):

Also describing images works like a charm! Thank you all!!!


@vitlib commented on GitHub (Mar 19, 2025):

I upgraded to 0.6.2; the model doesn't give the "Error: POST predict" anymore, but it still uses only the CPU.


@rick-github commented on GitHub (Mar 19, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging.


@vitlib commented on GitHub (Mar 19, 2025):

server.log thanks


@rick-github commented on GitHub (Mar 19, 2025):

time=2025-03-19T15:57:45.943+01:00 level=ERROR source=ggml.go:88 msg="failed to get absolute path" error="The filename, directory name, or volume label syntax is incorrect."
load_backend: loaded CPU backend from C:\Users\libbb\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
time=2025-03-19T15:57:46.211+01:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(clang)

No GPU backends found. What's the output of

dir /s C:\Users\libbb\AppData\Local\Programs\Ollama

@vitlib commented on GitHub (Mar 19, 2025):

With the /s argument it returns an error ("not found"). But if I run other models it works just fine!


@rick-github commented on GitHub (Mar 19, 2025):

There should be at least something like the following

 Directory of C:\Users\libbb\AppData\Local\Programs\Ollama

07/03/2025  18:04    <DIR>          .
07/03/2025  18:02    <DIR>          ..
04/03/2025  04:11             7,502 app.ico
07/03/2025  18:02    <DIR>          lib
04/03/2025  04:12         7,046,080 ollama app.exe
04/03/2025  04:12        30,578,104 ollama.exe
04/03/2025  04:12            11,815 ollama_welcome.ps1
07/03/2025  18:04           233,835 unins000.dat
07/03/2025  18:01         3,446,712 unins000.exe
07/03/2025  18:04            24,381 unins000.msg
               7 File(s)     41,348,429 bytes

because that's where the ollama executable is.


@vitlib commented on GitHub (Mar 19, 2025):

Yes without the /s argument that's the output


@rick-github commented on GitHub (Mar 19, 2025):

Run it in a CMD window, not a PS window.


@rick-github commented on GitHub (Mar 19, 2025):

Or in a PS window:

 dir -recurse C:\Users\libbb\AppData\Local\Programs\Ollama

@vitlib commented on GitHub (Mar 19, 2025):

dir-recurse.log


@rick-github commented on GitHub (Mar 20, 2025):

The directory listing looks normal. The error is The filename, directory name, or volume label syntax is incorrect. but then it successfully loads the ggml-cpu-haswell.dll backend. It's not clear what's causing the error. You could try setting OLLAMA_DEBUG=1 in the server environment and post the log from that - it will be quite large but it might contain a clue.
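
For completeness, a sketch of what that might look like on Windows, assuming the default install and the usual log location from the troubleshooting doc linked above:

# Enable verbose server logging (quit Ollama from the tray and restart it afterwards).
[Environment]::SetEnvironmentVariable("OLLAMA_DEBUG", "1", "User")

# After reproducing the failure, grab the tail of the server log:
Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 200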


@rtmcrc commented on GitHub (Apr 4, 2025):

I have the same issue, but only with Phi4; all other models work fine.

server.log
app.log

phi4:latest 9.1 GB ERROR
vanilj/Phi-4:latest 9.1 GB ERROR
vanilj/phi-4-unsloth:latest 8.9 GB OK
hf.co/google/gemma-3-12b-it-qat-q4_0-gguf:latest 8.9 GB OK
deepseek-r1:14b 9.0 GB OK
mistral-nemo:12b-instruct-2407-q6_K 10 GB OK
deepseek-coder-v2:latest 8.9 GB OK
mightykatun/qwen2.5-math:7b 8.1 GB OK
hhao/qwen2.5-coder-tools:1.5b 1.6 GB OK
hhao/qwen2.5-coder-tools:7b 4.7 GB OK
hhao/qwen2.5-coder-tools:14b 9.0 GB OK

OS
Windows 10

GPU
Nvidia

CPU
Intel

Ollama version
0.6.4


@rick-github commented on GitHub (Apr 7, 2025):

time=2025-04-05T02:45:02.824+06:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1
 layers.model=41 layers.offload=3 layers.split="" memory.available="[1.6 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="9.9 GiB" memory.required.partial="1.6 GiB" memory.required.kv="400.0 MiB"
 memory.required.allocations="[1.6 GiB]" memory.weights.total="8.2 GiB" memory.weights.repeating="7.8 GiB"
 memory.weights.nonrepeating="402.0 MiB" memory.graph.full="266.7 MiB" memory.graph.partial="266.7 MiB"

CUDA error: out of memory
  current device: 0, in function alloc at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:467
  cuMemSetAccess((CUdeviceptr)((char *)(pool_addr) + pool_size), reserve_size, &access, 1)

ollama offloaded 3 of 41 layers to the GPU, taking 1.6 GiB of the available 1.6 GiB VRAM, i.e. all of it. A temporary allocation during inference caused an OOM error. See here for ways to mitigate this: https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288.
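
A sketch of one such mitigation (the phi4-small name and parameter values below are made up for the example): bake a smaller context and a lower layer count into a model variant instead of changing server-wide settings.

# Write a Modelfile that derives a lower-memory variant of the model.
@"
FROM phi4:latest
PARAMETER num_ctx 2048
PARAMETER num_gpu 20
"@ | Set-Content Modelfile

# Create and run the variant (the name is arbitrary).
ollama create phi4-small -f Modelfile
ollama run phi4-small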


@vitlib commented on GitHub (Apr 7, 2025):

> The directory listing looks normal. The error is The filename, directory name, or volume label syntax is incorrect. but then it successfully loads the ggml-cpu-haswell.dll backend. It's not clear what's causing the error. You could try setting OLLAMA_DEBUG=1 in the server environment and post the log from that - it will be quite large but it might contain a clue.

The issue fixed itself after updating Ollama.


@kuriel-dev commented on GitHub (Apr 28, 2025):

No matter what params I set for ollama serve, it breaks at once when long prompts are sent in requests. I tried all the options here:

https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288

I tried with ollama-js
I tried with custom HTTP requests (API)
I tried with chat
I tried with generate

I thought I was having a bad implementation, but Ollama just breaks.

The problem is the API service from ollama serve; it just can't handle long requests, which breaks the whole usability of the tool.

Steps to reproduce:

  1. Copy a long text of more than 2k characters (tested with 5k characters, which is normal when asking the AI for a summary).
     Next step: none, that's it; Ollama breaks on that answer and the following ones.

One weird thing is that ollama /generate handles image processing fine, and images are, by far, larger than text, so I can assume the prompt handling is the problem. That includes long contexts: when the API returns a context and I recycle it in the next request, if the context is large the requests error out.

Hope this helps to identify the bug, for the sake of people (like me) who are just frustrated because the service is broken.


@kuriel-dev commented on GitHub (Apr 29, 2025):

I'm back with the error log:

ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed
time=2025-04-29T10:05:30.623-06:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"

My only custom settings are:

      OLLAMA_GPU_OVERHEAD: 536870912,
      OLLAMA_FLASH_ATTENTION: 1,
      OLLAMA_NUM_PARALLEL: 1,
      OLLAMA_CONTEXT_LENGTH: 1024 * 16,

No extra API params, because it really doesn't matter; the result is the same error.


@rick-github commented on GitHub (Apr 29, 2025):

#10410


@kuriel-dev commented on GitHub (Apr 29, 2025):

> #10410

Hello rick, this is happening while I'm monitoring my 32 GB of RAM; while running the long request, 70% is in use

(which means I have 9.6 GB free DURING the operation).

Still failing to allocate memory with a 4 GB model and a 5 to 50 MB request input.


@kuriel-dev commented on GitHub (Apr 30, 2025):

Hi @rick-github, I'm happy to help contribute to resolving this issue and streamlining the process.

I've observed that when submitting an image alongside a lengthy prompt, the request completes successfully.

I believe the memory allocation algorithm currently used in the image processing pipeline could be re-used for this purpose.

Currently, I consistently submit very small images in my requests, which still results in a significant wait time, leading to a frustrating experience for users. I wanted to share this feedback as it highlights a potential area for improvement and efficiency.


Reference: github-starred/ollama#52823