[GH-ISSUE #7078] Model keeps running forever #4494

Closed
opened 2026-04-12 15:25:16 -05:00 by GiteaMirror · 6 comments

Originally created by @Krakonos on GitHub (Oct 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7078

What is the issue?

Hi!

I'm using ollama as a backend for code completion. I'm running OpenWebUI for authentication, which proxies requests to my ollama instance locally. However, after some requests (I've been unable to find any common pattern so far), the model gets stuck running forever (loaded and using 100% GPU). In my case this unfortunately ends up thermally throttling the GPU and wasting a lot of power, and it also prevents running more workloads on the model until I restart ollama.

A sample log is attached:

always-on.txt: https://github.com/user-attachments/files/17228827/always-on.txt

I'm running a Ryzen 5700X, but the model runs in a virtualized environment via Proxmox, with 2x Nvidia Tesla P40 passed through and 48GB of system RAM.

Under some workloads (specifically some models), I also had issues with GPUs dropping off the bus. I'm not sure it's related, and I haven't gathered enough information to say whether the problem is with ollama or with the hardware/virtualization in that case. I suspect it was due to a bad model (I was evaluating deepseek-coder-v2:16b-lite-base-q4_0 and found this is a common problem), but I think I've seen it once while running codellama:7b-code.

There are no relevant events in dmesg, and the GPUs have persistence mode enabled.

Let me know what further debugging steps I should take.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.10

GiteaMirror added the bug label 2026-04-12 15:25:16 -05:00

@rick-github commented on GitHub (Oct 2, 2024):

Oct 02 11:40:28 grasshopper-tesla ollama[23633]: DEBUG [update_slots] slot context shift | n_cache_tokens=2048 n_ctx=8192 n_discard=1021 n_keep=5 n_left=2042 n_past=2047 n_system_tokens=0 slot_id=1 task_id=10 tid="140304693719040" timestamp=1727869228

This model is codellama:7b-code. It's stuck in a loop; it appears to have forgotten how to send an EOS token. ollama has some heuristics to try to catch this, the main one being num_predict, the maximum number of tokens to predict. The default for this is 10 * num_context, which should be 20480 in this case (the 8192-token context is split across 4 parallel slots, so each slot gets 2048 tokens, and 10 * 2048 = 20480). Since the generated tokens are exceeding the context window, the model has to do a context shift which, looking at the statistics in the log, is incredibly slow. So it's taking a long time to reach the num_predict limit: slot 1 has been processing for 4 hours at the end of the log with what looks like a completion request for def t. It's not clear why it's so slow. You have OLLAMA_NUM_PARALLEL unset or set to 4, so the runner is handling 4 concurrent requests, but I can't imagine why that would cause such a performance hit.
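
For reference, num_predict can also be capped per request by passing it in the options object of a generate call. A minimal sketch (the 512 here is just an illustrative value):

$ curl -s localhost:11434/api/generate -d '{"model":"codellama:7b-code","prompt":"def t","stream":false,"options":{"num_predict":512}}' | jq .response

Whether that's practical depends on the client; a plugin that doesn't expose request options can't set it this way.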

Something else that's interesting is that the request to the model doesn't contain any special tokens, while the template in the ollama library (https://ollama.com/library/codellama/blobs/2e0493f67d0c) does. The model has been updated since it was published, so re-pulling the model might update the template and give the model more guidance on when to emit an EOS token.

The other thing I suggest is setting OLLAMA_NUM_PARALLEL=1 in the server environment to see if that makes a difference.
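
If ollama runs as a systemd service (the journald-style log lines suggest it does), a sketch of one way to set that:

$ sudo systemctl edit ollama.service
# in the override that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama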


@Krakonos commented on GitHub (Oct 2, 2024):

Thanks. I'll try to run with OLLAMA_NUM_PARALLEL=1. A few questions:

  • Is there a way to change the num_predict default? Or even cap it to prevent bad requests? This query is from a VS Code plugin that doesn't seem to support setting num_predict (I think that needs to go in the query body, right?). I mostly plan to use this for code completion, so having a num_predict cap would be OK (actually, I could use a large context and a small predict limit).
  • How do you know the context shift takes long? I thought it takes a long time to generate all the tokens needed to fill the context window and trigger a shift. This might be because my GPU is thermally constrained and clocks down quite a bit during continuous load. Cooling those beasts (quietly) is more challenging than I anticipated, unfortunately.
  • Who is responsible for applying the template? Ollama or the API client? I expect ollama does it. I pulled codellama:7b-code.

Interestingly, the template is in the modelfile:

krakonos@grasshopper-tesla:~$ ollama show --modelfile codellama:7b | head -n 10
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM codellama:7b

FROM /ollama/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac
TEMPLATE "[INST] <<SYS>>{{ .System }}<</SYS>>

{{ .Prompt }} [/INST]
"
PARAMETER rope_frequency_base 1e+06



krakonos@grasshopper-tesla:~$ ollama show --modelfile codellama:7b-code | head -n 10
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM codellama:7b-code

FROM /ollama/blobs/sha256-8b2eceb7b7a11c307bc9deed38b263e05015945dc0fa2f50c0744c5d49dd293e
TEMPLATE "{{- if .Suffix }}<PRE> {{ .Prompt }} <SUF>{{ .Suffix }} <MID>
{{- else }}{{ .Prompt }}
{{- end }}"
PARAMETER rope_frequency_base 1e+06
LICENSE """LLAMA 2 COMMUNITY LICENSE AGREEMENT	

However, I think this might be a red herring, since I have my own vim plugin configured to hopefully use the right template (I'm using the -code variant and inserting the <...> tags into the prompt myself).

My prompt looks like this:

Oct 02 17:29:40 grasshopper-tesla ollama[852]: time=2024-10-02T17:29:40.893Z level=DEBUG source=routes.go:211 msg="generate request" prompt="<PRE> \nstruct TestStruct {\n    int a;\n    int b;\n    int c[];\n};\n\nint main() {\n    static_assert(sizeof(TestStruct) == 2*sizeof(int));\n     <SUF>\n\n\n    return 0;\n}\n <MID>" images=[]

I will keep an eye on it, but I'm pretty sure I was able to trigger the same condition anyway.


@Gomez12 commented on GitHub (Oct 2, 2024):

What I do to experiment with num_predict is define two models whose only difference is the num_predict option. Just name one _large and the other _small.

On disk Ollama will use the same files.
It will reload the model on a change, but imho that is not the end of the world.
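
A minimal sketch of that approach (model names and values are just examples):

$ printf 'FROM codellama:7b-code\nPARAMETER num_predict 256\n' > Modelfile
$ ollama create codellama-small -f Modelfile
$ printf 'FROM codellama:7b-code\nPARAMETER num_predict 2048\n' > Modelfile
$ ollama create codellama-large -f Modelfile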


@rick-github commented on GitHub (Oct 3, 2024):

You can set num_predict as a parameter in a copy of the model:

$ ollama show --modelfile codellama:7b-code | sed -e 's/^FROM.*/FROM codellama:7b-code/' > Modelfile
$ echo "PARAMETER num_predict 1024" >> Modelfile
$ ollama create codellama:7b-code-np1k -f Modelfile

It takes the model 13 minutes to generate 1021 tokens and do a context shift:

$ grep task_id=276 ~/Downloads/always-on.txt | tail -2
Oct 02 11:30:16 grasshopper-tesla ollama[23633]: DEBUG [update_slots] slot context shift | n_cache_tokens=2048 n_ctx=8192 n_discard=1021 n_keep=5 n_left=2042 n_past=2047 n_system_tokens=0 slot_id=3 task_id=276 tid="140304693719040" timestamp=1727868616
Oct 02 11:43:45 grasshopper-tesla ollama[23633]: DEBUG [update_slots] slot context shift | n_cache_tokens=2048 n_ctx=8192 n_discard=1021 n_keep=5 n_left=2042 n_past=2047 n_system_tokens=0 slot_id=3 task_id=276 tid="140304693719040" timestamp=1727869425

but you're right that we don't know whether that time was spent on the shift itself or on generation. You can check the throttling parameters with

nvidia-smi -q -d TEMPERATURE,PERFORMANCE
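
To watch for throttling while a generation is running, something like this should also work (field names can vary between driver versions; nvidia-smi --help-query-gpu lists the supported ones):

$ nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,clocks_throttle_reasons.sw_thermal_slowdown --format=csv -l 5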

ollama applies the template.

The log line you quote is a generate request, i.e. /api/generate. The requests in always-on.txt are chat requests, i.e. /api/chat. I pulled the model and it does very poorly on low/no-context completions without the special tokens, for both the chat and generate endpoints, like the chat requests in the log. I think it comes down to the model only being suitable for FIM usage, so if you are inserting the <...> tags then I wouldn't expect the looping behaviour from earlier.

Note that you can have ollama apply the template and skip having to insert the tokens yourself by using the suffix parameter:

$ curl -s localhost:11434/api/generate -d '{"model":"codellama:7b-code","prompt":"<PRE> def hello(s: str) <SUF>  return 0 <MID>","stream":false,"options":{"seed":42}}' | jq .response
" -> int:\n import re\n if s == 'hello':\n     print('success')\n else:\n   "
$ curl -s localhost:11434/api/generate -d '{"model":"codellama:7b-code","prompt":"def hello(s: str)","suffix":"  return 0","stream":false,"options":{"seed":42}}' | jq .response
" -> int:\n import re\n if s == 'hello':\n     print('success')\n else:\n   "

Also note that codellama is an old model, counted in internet time. There are other models (https://ollama.com/search?c=code) which may perform better, depending on your requirements.


@Krakonos commented on GitHub (Oct 3, 2024):

Thanks for the tips. I've created a new model with num_predict 512; I'll see if it eventually gets stuck.

I'm actually testing with a friend and we didn't find a way to convince his plugin to put in the FIM markers. I'm using llm.nvim and that works reasonably well.

If the num_predict limit fixes the loop issue, I think we'll manage to tune out the rest. Once these initial issues are solved, I have a list of other models to test. But I found codellama to be widely supported by plugins (specifically, I found some VS Code plugins that didn't support other models due to different FIM markers), and I want to make this work first so the infrastructure is solid, then tinker with the models.

But TBH, for my uses codellama does reasonably well. My primary use is putting together click boilerplate in Python and similar things, which it does nicely.

As for the suffix parameter, I love that; I need to check whether my plugin supports it (and possibly add the support), since that would make trying out and switching models much more convenient.

I'll keep you posted, we'll run more prompts tomorrow!

In the meantime, thanks for the great software, I'm having a great time overall!


@Krakonos commented on GitHub (Oct 4, 2024):

Ok, a few observations:

  • OLLAMA_NUM_PARALLEL=1 doesn't help
  • Applying num_predict in the modelfile works to some extent. However, it's a bit error-prone, since I can easily select other models in different UIs. TBH I'd prefer a time limit for each query: running large models is so slow that a single runaway query would bog down the server for a very long time. Maybe I could set up a proxy to apply custom rules, or check whether OpenWebUI supports something like it. Applying this globally, for example as an environment variable, would be highly preferred (both num_predict and time restrictions could be implemented that way).

One thing that would help me a lot would be an API to list running prompts & cancel them. I could then easily implement a custom supervisor. However, I haven't found a way to do this (maybe it would be possible through nvidia-smi and killing processes, but that seems a bit extreme).
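
As a rough client-side workaround (a sketch; it assumes ollama aborts generation when a streaming client disconnects, which I haven't verified), the proxy or client could enforce a hard timeout, e.g. with curl:

$ curl -s --max-time 60 localhost:11434/api/generate -d '{"model":"codellama:7b-code","prompt":"def t"}'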

In any case, I don't think there is a fault in ollama, since the issue is caused mostly by hallucinations. I'm closing this issue. Thanks for the help with diagnostics!
