[GH-ISSUE #7078] Model keeps running forever #4494

Closed
opened 2026-04-12 15:25:16 -05:00 by GiteaMirror · 6 comments

Originally created by @Krakonos on GitHub (Oct 2, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7078

What is the issue?

Hi!

I'm using ollama as a backend for code completion. I'm running OpenWebUI for authentication, which proxies requests to my ollama instance locally. However, after some requests (I've been unable to find any common pattern so far), the model gets stuck running forever (loaded and using 100% GPU). In my case this unfortunately ends up thermally throttling the GPU and wasting a lot of power, and it also prevents running more workloads on the model until I restart ollama.

A sample log is attached:

always-on.txt: https://github.com/user-attachments/files/17228827/always-on.txt

I'm running a Ryzen 5700X, but the model runs in a virtualized environment via Proxmox, with 2x Nvidia Tesla P40 passed through and 48GB of system RAM.

Under some workloads (specifically some models), I also had issues with GPUs dropping off the bus. I'm not sure it's related, and I haven't gathered enough information to say whether the problem is with ollama or with the hardware/virtualization in that case. I suspect it was due to a bad model (I was evaluating deepseek-coder-v2:16b-lite-base-q4_0 and found this is a common problem), but I think I've seen it once while running codellama:7b-code.

There are no relevant events in dmesg, and the GPUs have persistence mode enabled.

Let me know what further debugging steps I should take.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.3.10

GiteaMirror added the bug label 2026-04-12 15:25:16 -05:00

@rick-github commented on GitHub (Oct 2, 2024):

Oct 02 11:40:28 grasshopper-tesla ollama[23633]: DEBUG [update_slots] slot context shift | n_cache_tokens=2048 n_ctx=8192 n_discard=1021 n_keep=5 n_left=2042 n_past=2047 n_system_tokens=0 slot_id=1 task_id=10 tid="140304693719040" timestamp=1727869228

This model is codellama:7b-code. It's stuck in a loop; it appears to have forgotten how to send an EOS token. ollama has some heuristics to try to catch this, the main one being num_predict, the maximum number of tokens to predict. The default for this is 10 * num_context, which should be 20480 in this case (the 8192-token context is split across 4 parallel slots, so each slot gets 2048 tokens, and 10 * 2048 = 20480). Since the generated tokens are exceeding the context window, the model has to do a context shift which, looking at the statistics in the log, is incredibly slow. So it's taking a long time to reach the num_predict limit: slot 1 has been processing for 4 hours at the end of the log with what looks like a completion request for def t. It's not clear why it's so slow. You have OLLAMA_NUM_PARALLEL unset or set to 4, so the runner is handling 4 concurrent requests, but I can't imagine why that would cause such a performance hit.
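
For reference, num_predict can also be capped per request by passing it in the options object of a generate call. A minimal sketch (the 512 here is just an illustrative value):

$ curl -s localhost:11434/api/generate -d '{"model":"codellama:7b-code","prompt":"def t","stream":false,"options":{"num_predict":512}}' | jq .response

Whether that's practical depends on the client; a plugin that doesn't expose request options can't set it this way.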

Something else that's interesting is that the request to the model doesn't contain any special tokens, while the template in the ollama library (https://ollama.com/library/codellama/blobs/2e0493f67d0c) does. The model has been updated since it was published, so re-pulling the model might update the template and give the model more guidance on when to emit an EOS token.

The other thing I suggest is setting OLLAMA_NUM_PARALLEL=1 in the server environment to see if that makes a difference.
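
If ollama runs as a systemd service (the journald-style log lines suggest it does), a sketch of one way to set that:

$ sudo systemctl edit ollama.service
# in the override that opens, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama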


@Krakonos commented on GitHub (Oct 2, 2024):

Thanks. I'll try to run with OLLAMA_NUM_PARALLEL=1. A few questions:

  • Is there a way to change the num_predict default? Or even cap it to prevent bad requests? This query is from a VS Code plugin that doesn't seem to support setting num_predict (I think that needs to go in the query body, right?). I mostly plan to use this for code completion, so having a num_predict cap would be OK (actually, I could use a large context and a small predict limit).
  • How do you know the context shift takes long? I thought it takes a long time to generate all the tokens needed to fill the context window and trigger a shift. This might be because my GPU is thermally constrained and clocks down quite a bit during continuous load. Cooling those beasts (quietly) is more challenging than I anticipated, unfortunately.
  • Who is responsible for applying the template? Ollama or the API client? I expect ollama does it. I pulled codellama:7b-code.

Interestingly, the template is in the modelfile:

krakonos@grasshopper-tesla:~$ ollama show --modelfile codellama:7b | head -n 10
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM codellama:7b

FROM /ollama/blobs/sha256-3a43f93b78ec50f7c4e4dc8bd1cb3fff5a900e7d574c51a6f7495e48486e0dac
TEMPLATE "[INST] <<SYS>>{{ .System }}<</SYS>>

{{ .Prompt }} [/INST]
"
PARAMETER rope_frequency_base 1e+06



krakonos@grasshopper-tesla:~$ ollama show --modelfile codellama:7b-code | head -n 10
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this, replace FROM with:
# FROM codellama:7b-code

FROM /ollama/blobs/sha256-8b2eceb7b7a11c307bc9deed38b263e05015945dc0fa2f50c0744c5d49dd293e
TEMPLATE "{{- if .Suffix }}<PRE> {{ .Prompt }} <SUF>{{ .Suffix }} <MID>
{{- else }}{{ .Prompt }}
{{- end }}"
PARAMETER rope_frequency_base 1e+06
LICENSE """LLAMA 2 COMMUNITY LICENSE AGREEMENT	

However, I think this might be a red herring, since I have my own vim plugin configured to hopefully use the right template (I'm using the -code variant and inserting the <...> tags into the prompt myself).

My prompt looks like this:

Oct 02 17:29:40 grasshopper-tesla ollama[852]: time=2024-10-02T17:29:40.893Z level=DEBUG source=routes.go:211 msg="generate request" prompt="<PRE> \nstruct TestStruct {\n    int a;\n    int b;\n    int c[];\n};\n\nint main() {\n    static_assert(sizeof(TestStruct) == 2*sizeof(int));\n     <SUF>\n\n\n    return 0;\n}\n <MID>" images=[]

I will keep an eye on it, but I'm pretty sure I was able to trigger the same condition anyway.


@Gomez12 commented on GitHub (Oct 2, 2024):

What I do to experiment with num_predict is define two models whose only difference is the num_predict option. Just name one _large and the other _small.

On disk Ollama will use the same files.
It will reload the model on a change, but imho that is not the end of the world.
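
A minimal sketch of that approach (model names and values are just examples):

$ printf 'FROM codellama:7b-code\nPARAMETER num_predict 256\n' > Modelfile
$ ollama create codellama-small -f Modelfile
$ printf 'FROM codellama:7b-code\nPARAMETER num_predict 2048\n' > Modelfile
$ ollama create codellama-large -f Modelfile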


@rick-github commented on GitHub (Oct 3, 2024):

You can set num_predict as a parameter in a copy of the model:

$ ollama show --modelfile codellama:7b-code | sed -e 's/^FROM.*/FROM codellama:7b-code/' > Modelfile
$ echo "PARAMETER num_predict 1024" >> Modelfile
$ ollama create codellama:7b-code-np1k -f Modelfile

It takes the model 13 minutes to generate 1021 tokens and do a context shift:

$ grep task_id=276 ~/Downloads/always-on.txt | tail -2
Oct 02 11:30:16 grasshopper-tesla ollama[23633]: DEBUG [update_slots] slot context shift | n_cache_tokens=2048 n_ctx=8192 n_discard=1021 n_keep=5 n_left=2042 n_past=2047 n_system_tokens=0 slot_id=3 task_id=276 tid="140304693719040" timestamp=1727868616
Oct 02 11:43:45 grasshopper-tesla ollama[23633]: DEBUG [update_slots] slot context shift | n_cache_tokens=2048 n_ctx=8192 n_discard=1021 n_keep=5 n_left=2042 n_past=2047 n_system_tokens=0 slot_id=3 task_id=276 tid="140304693719040" timestamp=1727869425

but you're right that we don't know whether that time was spent on the shift itself or on generation. You can check the throttling parameters with

nvidia-smi -q -d TEMPERATURE,PERFORMANCE
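
To watch for throttling while a generation is running, something like this should also work (field names can vary between driver versions; nvidia-smi --help-query-gpu lists the supported ones):

$ nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,clocks_throttle_reasons.sw_thermal_slowdown --format=csv -l 5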

ollama applies the template.

The log line you quote is a generate request, i.e. /api/generate. The requests in always-on.txt are chat requests, i.e. /api/chat. I pulled the model and it does very poorly on low/no-context completions without the special tokens, for both the chat and generate endpoints, like the chat requests in the log. I think it comes down to the model only being suitable for FIM usage, so if you are inserting the <...> tags then I wouldn't expect the looping behaviour from earlier.

Note that you can have ollama apply the template and skip having to insert the tokens yourself by using the suffix parameter:

$ curl -s localhost:11434/api/generate -d '{"model":"codellama:7b-code","prompt":"<PRE> def hello(s: str) <SUF>  return 0 <MID>","stream":false,"options":{"seed":42}}' | jq .response
" -> int:\n import re\n if s == 'hello':\n     print('success')\n else:\n   "
$ curl -s localhost:11434/api/generate -d '{"model":"codellama:7b-code","prompt":"def hello(s: str)","suffix":"  return 0","stream":false,"options":{"seed":42}}' | jq .response
" -> int:\n import re\n if s == 'hello':\n     print('success')\n else:\n   "

Also note that codellama is an old model, counted in internet time. There are other models (https://ollama.com/search?c=code) which may perform better, depending on your requirements.


@Krakonos commented on GitHub (Oct 3, 2024):

Thanks for the tips. I've created a new model with num_predict 512; I'll see if it eventually gets stuck.

I'm actually testing with a friend and we didn't find a way to convince his plugin to put in the FIM markers. I'm using llm.nvim and that works reasonably well.

If the num_predict limit fixes the loop issue, I think we'll manage to tune out the rest. Once these initial issues are solved, I have a list of other models to test. But I found codellama to be widely supported by plugins (specifically, I found some VS Code plugins that didn't support other models due to different FIM markers), and I want to make this work first so the infrastructure is solid, then tinker with the models.

But TBH, for my uses codellama does reasonably well. My primary use is putting together click boilerplate in Python and similar things, which it does nicely.

As for the suffix parameter, I love that; I need to check whether my plugin supports it (and possibly add the support), since that would make trying out and switching models much more convenient.

I'll keep you posted, we'll run more prompts tomorrow!

In the meantime, thanks for the great software, I'm having a great time overall!


@Krakonos commented on GitHub (Oct 4, 2024):

Ok, a few observations:

  • OLLAMA_NUM_PARALLEL=1 doesn't help
  • Applying num_predict in the modelfile works to some extent. However, it's a bit error-prone, since I can easily select other models in different UIs. TBH I'd prefer a time limit for each query: running large models is so slow that a single runaway query would bog down the server for a very long time. Maybe I could set up a proxy to apply custom rules, or check whether OpenWebUI supports something like it. Applying this globally, for example as an environment variable, would be highly preferred (both num_predict and time restrictions could be implemented that way).

One thing that would help me a lot would be an API to list running prompts & cancel them. I could then easily implement a custom supervisor. However, I haven't found a way to do this (maybe it would be possible through nvidia-smi and killing processes, but that seems a bit extreme).
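
As a rough client-side workaround (a sketch; it assumes ollama aborts generation when a streaming client disconnects, which I haven't verified), the proxy or client could enforce a hard timeout, e.g. with curl:

$ curl -s --max-time 60 localhost:11434/api/generate -d '{"model":"codellama:7b-code","prompt":"def t"}'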

In any case, I don't think there is a fault in ollama, since the issue is caused mostly by hallucinations. I'm closing this issue. Thanks for the help with diagnostics!
