[GH-ISSUE #8779] When using open-webui and others ollama seems to struggle stopping/unloading/switching models #31461

Closed
opened 2026-04-22 11:54:50 -05:00 by GiteaMirror · 14 comments

Originally created by @AncientMystic on GitHub (Feb 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8779

I have noticed, across a few installs and a few different UIs and after quite a few attempts, that when ollama has to switch, stop, or unload models for whatever reason, it seems to hang for a while, often causing problems with requests.

In open-webui this causes problems with responses: the model is scheduled to unload after 5 minutes, and when you make another request it hangs because ollama is still struggling to unload the model from last time; sometimes it just never responds at all.

It seems ollama would be much better and more efficient if it could unload models more quickly. Some workflows also require rapid switching between models, for example memories and other features that use different models in succession.

GiteaMirror added the needs more info label 2026-04-22 11:54:50 -05:00

@AncientMystic commented on GitHub (Feb 3, 2025):

Just to add: it probably just needs a verification step or something to ensure the model was actually unloaded, or to force the unload if it hasn't happened and the model is still stuck in "stopping" after x seconds.
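A minimal sketch of that kind of check against the documented ollama HTTP API, assuming a default install listening on localhost:11434 (the model name is only an example):

```
# See which models are currently loaded and when they are due to expire.
curl http://localhost:11434/api/ps

# Force an immediate unload: an empty generate request with keep_alive set to 0
# asks the server to evict the model right away.
curl http://localhost:11434/api/generate -d '{"model": "llama3.3:70b-instruct-q4_K_M", "keep_alive": 0}'

# Recent releases expose the same thing on the CLI:
ollama ps
ollama stop llama3.3:70b-instruct-q4_K_M
```

If `/api/ps` still lists the model after a forced unload, that would confirm the runner is wedged rather than just slow.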

Sometimes it unloads fine and other times it just won't. Half the time when this happens I have to remotely connect to the host and force-stop ollama to restart it manually (or wait 5-20 minutes, which is just too long to have everything grind to a halt), and it happens often enough to be annoying.

It also tends not to handle requests properly while it is doing all of this; it just waits until keep_alive expires and unloads again, while open-webui hangs as if it is about to reply, but of course ollama has already unloaded the model and stopped trying.

Edit: it also often has issues with consecutive replies; the first 1-2 replies arrive within a few seconds, then it will randomly hang and take forever to reply for no apparent reason.


@rick-github commented on GitHub (Feb 3, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md) will aid in debugging.


@DarkIceXD commented on GitHub (Feb 20, 2025):

I'm running LibreChat and Ollama inside two separate Docker containers, and I intermittently experience the same issue. Today I tried limiting `OLLAMA_NUM_PARALLEL` to 1 as a test, but unfortunately it didn't make a difference.

Here are the server logs from when it happened again today: [logs.txt](https://github.com/user-attachments/files/18884036/logs.txt)

Should I enable `OLLAMA_DEBUG` and report back when the issue happens again?
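A minimal sketch of passing those variables to the official Docker image, following the standard `docker run` invocation from the ollama README (volume and container names are only examples):

```
docker run -d --gpus=all \
  -e OLLAMA_DEBUG=1 \
  -e OLLAMA_NUM_PARALLEL=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Follow the server log and watch for DEBUG-level lines.
docker logs -f ollama
```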


@rick-github commented on GitHub (Feb 20, 2025):

From the log, you loaded llama3.3:70b-instruct-q4_K_M and had chats using both the ollama API (`/api/chat`) and the OpenAI API (`/v1/chat/completions`). It looks like your last successful chat was at 08:51:51 using the OpenAI API; after that the chats take so long to complete that the client times out after 10 minutes. Do you know what the requests after 08:51:51 were about?

I speculate that the model has lost coherence and is "rambling", i.e. generating a stream of tokens without ever hitting an EOS (end of sequence) token. This can happen when a response fills the context buffer and the resulting shift to make room causes the model to lose track of what it was doing. Enabling `OLLAMA_DEBUG` should show lots of shifting happening. The reason this prevents model unloading is that ollama waits for a completion to finish before unloading the model that is generating it. You can mitigate this by setting `num_predict`. Since this is happening in the OpenAI API call, you would need to set this in the Modelfile.

Logs with `OLLAMA_DEBUG=1` would verify or refute the above.
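A minimal sketch of that Modelfile route (the derived model name and the 512-token cap are only examples):

```
# Derive a model with a hard cap on generation length, so a rambling
# completion cannot run until the client times out.
cat > Modelfile <<'EOF'
FROM llama3.3:70b-instruct-q4_K_M
PARAMETER num_predict 512
EOF

ollama create llama3.3-capped -f Modelfile
```

Clients that talk to the OpenAI-compatible endpoints would then request `llama3.3-capped` instead of the base tag.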


@DarkIceXD commented on GitHub (Feb 20, 2025):

> Do you know what the requests after 08:51:51 were about?

Not sure, but the last one was definitely a new chat just asking "hello" to see if there was any response. It didn't give me one, so I had to restart ollama.

Usually I only use `/v1/completions` via the VS Code extension Continue and `/v1/chat/completions` via LibreChat; the reason you see `/api/chat` is that I was testing whether I had disabled `OLLAMA_NUM_PARALLEL` correctly. Also, I exclusively use llama3.3:70b-instruct-q4_K_M and try to never unload it, for fast responses.

I will add `OLLAMA_DEBUG=1`.


@DarkIceXD commented on GitHub (Feb 21, 2025):

Here is another log for the freeze, but I had to redact the messages. It looks like the messages still get generated correctly but are not sent to the client?
[logs_freeze.txt](https://github.com/user-attachments/files/18902724/logs_freeze.txt)

Also, I encountered the "GGGGG" issue again. Should I open a new ticket for that?
[logs_gggg.txt](https://github.com/user-attachments/files/18902726/logs_gggg.txt)


@AncientMystic commented on GitHub (Feb 24, 2025):

I kept running into this issue on Windows, so I switched my VM environment to Linux, which never does it, so I am no longer experiencing the issue myself. It is still absolutely a problem on Windows, though: across multiple ollama installs on multiple Windows installs, both 10 and 11, the problem persisted.

It is also pretty easy to replicate in my experience: just run ollama on Windows with open-webui and try to get it to respond more than once. Sometimes it works, but most of the time it has a lot of problems replying and it's just completely unreliable.

Reading over my logs each time it happened, they said nothing about any error; at most, that the connection was closed and ollama unloaded the model, or the "gpu VRAM usage didn't recover within timeout" message, which I see you (rick-github) have said on other issue threads here is just a warning and can be ignored.

I will see if I can get any specific errors in debug-mode logs, since the normal log said nothing about a problem. It would respond fine the first time, then on the 2nd-3rd request stop and hang as if it were thinking forever, and it did this more often than not, making it mostly unusable unless I wanted to close ollama between every single reply.

(Linux/WSL always seems to work better than running directly on Windows for everything that has a Linux variant.)


@rick-github commented on GitHub (Feb 24, 2025):

> It is also pretty easy to replicate in my experience

If you can reproduce it reliably, it would be interesting to see whether setting `num_predict` has any effect - my suggestion that this might help is purely guesswork.
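For a quick test without creating a new model, `num_predict` can also be passed per request in the `options` field of the native API (the values here are placeholders; `options` is specific to `/api/...`, which is why the Modelfile route was suggested for OpenAI-style clients):

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "messages": [{"role": "user", "content": "hello"}],
  "options": {"num_predict": 200}
}'
```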


@rick-github commented on GitHub (Feb 24, 2025):

> logs_freeze.txt

There's a lot of truncation happening. Multiple conversation turns are filling up the context buffer and ollama is discarding earlier parts of the conversation. This may lead to the model losing track of the conversation and going off the rails. I suggest increasing `num_ctx` and/or adding `num_predict` to see if it helps.

> logs_gggg.txt

time=2025-02-20T15:22:28.984Z level=INFO source=server.go:596 msg="llama runner started in 8.02 seconds"
time=2025-02-20T15:22:28.984Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d
time=2025-02-20T15:22:28.984Z level=DEBUG source=routes.go:1461 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\ntest<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2025-02-20T15:22:28.985Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=11 used=0 remaining=11
time=2025-02-20T15:22:30.877Z level=DEBUG source=server.go:818 msg="prediction aborted, token repeat limit reached"
[GIN] 2025/02/20 - 15:22:30 | 200 | 10.355523674s |      172.20.0.7 | POST     "/v1/chat/completions"

Freshly started server and a short prompt. It's not clear why the model starts generating nonsense, but it's probably related to the multi-GPU setup - there are multiple tickets for this issue and they usually involve multiple GPUs. There are reports that setting `GGML_CUDA_NO_PEER_COPY` helps, but that's unfortunately a compile-time setting, and there is no strong evidence (yet) that it actually resolves the issue. What's interesting is that the nonsense generation is not more widespread; it appears to affect some users more than others. What's the configuration of the machine that Docker is running on? OS, CPU, virtualized, etc.


@DarkIceXD commented on GitHub (Feb 24, 2025):

> There's a lot of truncation happening. Multiple conversation turns are filling up the context buffer and ollama is discarding earlier parts of the conversation. This may lead to the model losing track of the conversation and going off the rails. I suggest increasing `num_ctx` and/or adding `num_predict` to see if it helps.

I will need to look into how I can incorporate `num_predict` into LibreChat, since I would have to send it on every request, is that correct?
And `num_ctx` will need to be set in the Modelfile, right?
Also, what values would you recommend?

> What's interesting is that the nonsense generation is not more widespread; it appears to affect some users more than others. What's the configuration of the machine that Docker is running on? OS, CPU, virtualized, etc.

I am running:

  • Debian 12 (bare metal, no virtualization)
  • 6.1.0-30-amd64 x86_64 GNU/Linux
  • AMD Ryzen Threadripper PRO 5955WX
  • 128 GB RAM
  • Official Docker (not the debian version) + NVIDIA Container Toolkit
  • Official Ollama Docker Image

@rick-github commented on GitHub (Feb 24, 2025):

> I will need to look into how I can incorporate `num_predict` into LibreChat, since I would have to send it on every request, is that correct?

It's probably easier to just create models as required:

$ ollama run llama3.3
>>> /set parameter num_ctx 4096
>>> /set parameter num_predict 200
>>> /save llama3.3:c4k-p200
>>> /set parameter num_ctx 8192
>>> /set parameter num_predict 400
>>> /save llama3.3:c8k-p400
>>> /bye

Then in LibreChat, choose llama3.3:c4k-p200 or llama3.3:c8k-p400, or whichever model you create, based on the parameters you want.

> Also, what values would you recommend?

It depends on what your inputs and outputs are. If it's just a conversation, then you probably want a large input buffer (`num_ctx`) to handle multiple turns, but only need a small number of tokens (`num_predict`) for the response. If you expect verbose responses, you would increase `num_predict` so as not to prematurely cut off the response. Note that if you make `num_ctx` larger it may push model layers off the GPU into system RAM, which will make the model run slower. You are borderline at the moment (`memory.available="[23.4 GiB 23.4 GiB]" memory.required.allocations="[23.3 GiB 22.5 GiB]"`); you can make room for more context by setting `OLLAMA_NUM_PARALLEL=1`. This will reduce the context memory requirement to a quarter of its current value, so you should be able to get `num_ctx` up to around 8192.
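A quick way to check whether a larger `num_ctx` has pushed layers off the GPU is `ollama ps` (shown as a sketch; the important part is the PROCESSOR column):

```
# If PROCESSOR shows anything other than "100% GPU", some layers have
# spilled into system RAM and generation will be noticeably slower.
ollama ps
```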


@DarkIceXD commented on GitHub (Feb 25, 2025):

> It depends on what your inputs and outputs are. If it's just a conversation, then you probably want a large input buffer (`num_ctx`) to handle multiple turns, but only need a small number of tokens (`num_predict`) for the response. If you expect verbose responses, you would increase `num_predict` so as not to prematurely cut off the response. Note that if you make `num_ctx` larger it may push model layers off the GPU into system RAM, which will make the model run slower. You are borderline at the moment (`memory.available="[23.4 GiB 23.4 GiB]" memory.required.allocations="[23.3 GiB 22.5 GiB]"`); you can make room for more context by setting `OLLAMA_NUM_PARALLEL=1`. This will reduce the context memory requirement to a quarter of its current value, so you should be able to get `num_ctx` up to around 8192.

Actually, I don't really want to push it past VRAM and into RAM. Is there no other way than increasing `num_ctx` and `num_predict`? I feel like it's a stopgap solution until the context in one chat pushes it to the new limit, right?


@rick-github commented on GitHub (Feb 25, 2025):

Increasing `num_ctx` allows longer conversations to fit in the context buffer; `num_predict` protects against the model losing coherence. You don't need to do both.

Long conversations aren't normally a problem because, as mentioned, ollama will trim the conversation down to fit in the context buffer.

The model losing coherence seems to be the issue. This is supposed to be uncommon; if it's an ongoing issue for you, it might be tied to your hardware/software setup, which may also be the cause of the "GGGG" output. Have you considered running a GPU/VRAM tester to see if anything pops up? If you are dual booting, [OCCT](https://www.ocbase.com/download) or [gpumemtest](https://www.programming4beginners.com/gpumemtest); otherwise there is a (less featureful) open source one [here](https://github.com/ihaque/memtestG80). There's also the possibility of a software issue with multi-GPU cards, as raised earlier; you could try building ollama from source and setting `GGML_CUDA_NO_PEER_COPY` (or other flags) to see if it resolves the problem.
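A rough sketch of what that source build might look like, assuming the vendored ggml CMake option can simply be passed through (this follows the generic developer build steps and is untested here):

```
git clone https://github.com/ollama/ollama.git
cd ollama

# GGML_CUDA_NO_PEER_COPY is a ggml compile-time option; pass it via CMake
# when configuring the native backends, then build the Go binary.
cmake -B build -DGGML_CUDA_NO_PEER_COPY=ON
cmake --build build
go build .

./ollama serve
```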


@rick-github commented on GitHub (Apr 13, 2025):

Marking this as a dupe of #7606 for the stopping issue.

Reference: github-starred/ollama#31461