[GH-ISSUE #12179] Model stuck in "Stopping..." state #8099

Closed
opened 2026-04-12 20:24:13 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @troy256 on GitHub (Sep 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12179

What is the issue?

One model, phi4:14b, has now been stuck "Stopping..." twice in the last hour. It will stay this way until we restart the ollama service. It is also splitting between GPU and CPU despite there being plenty of free GPU memory:

ollama ps

NAME                        ID              SIZE     PROCESSOR          CONTEXT    UNTIL
llama-guard3:latest         46f211c3d866    15 GB    100% GPU           8192       59 minutes from now
mxbai-embed-large:latest    468836162de7    1.2 GB   100% GPU           512        50 minutes from now
llama3.1:8b                 46e0c10c039e    15 GB    100% GPU           8192       33 minutes from now
gpt-oss:20b                 f2b8351c629c    21 GB    100% GPU           8192       29 minutes from now
phi4:14b                    ac896e5b8b34    26 GB    69%/31% CPU/GPU    8192       Stopping...

nvidia-smi

Thu Sep 4 12:50:11 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000001:00:00.0 Off | 0 |
| N/A 37C P0 84W / 300W | 25002MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000002:00:00.0 Off | 0 |
| N/A 39C P0 82W / 300W | 17972MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2329224 C /usr/local/bin/ollama 11472MiB |
| 0 N/A N/A 2329265 C /usr/local/bin/ollama 414MiB |
| 0 N/A N/A 2335806 C /usr/local/bin/ollama 1172MiB |
| 0 N/A N/A 2336699 C /usr/local/bin/ollama 11470MiB |
| 0 N/A N/A 2342692 C /usr/local/bin/ollama 440MiB |
| 1 N/A N/A 2329224 C /usr/local/bin/ollama 414MiB |
| 1 N/A N/A 2329265 C /usr/local/bin/ollama 14918MiB |
| 1 N/A N/A 2335806 C /usr/local/bin/ollama 1138MiB |
| 1 N/A N/A 2336699 C /usr/local/bin/ollama 414MiB |
| 1 N/A N/A 2342692 C /usr/local/bin/ollama 1054MiB |
+-----------------------------------------------------------------------------------------+

Ollama version: 0.11.7
OS: Ubuntu 22.04.5 LTS
CUDA: 12.6
GPU: 2x A100 80GB PCIe

Seems relevant to this PR, which we should already have: https://github.com/ollama/ollama/pull/10487

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.11.7

GiteaMirror added the bug label 2026-04-12 20:24:13 -05:00
Author
Owner

@rick-github commented on GitHub (Sep 4, 2025):

Server logs may help in debugging.

Author
Owner

@troy256 commented on GitHub (Sep 4, 2025):

ollama.log

Sorry, log attached.

Author
Owner

@rick-github commented on GitHub (Sep 4, 2025):

The log doesn't contain information about loading or unloading phi4. Add OLLAMA_DEBUG=1 to the server environment, and when the model gets into the "Stopping..." state again, run the following:

journalctl -u ollama --no-pager --since "$(systemctl show ollama --property=ActiveEnterTimestamp --value)"
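
If the server runs under systemd (as the journalctl command above assumes), one way to add that variable is a unit override; a minimal sketch, assuming the unit is named ollama:

$ sudo systemctl edit ollama
# add these two lines to the override file, then save and exit:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama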
Author
Owner

@troy256 commented on GitHub (Sep 4, 2025):

Debug logging enabled and will grab those entries when it happens again.

Author
Owner

@troy256 commented on GitHub (Sep 4, 2025):

Happened again and grabbed the log.
ollama.log

Author
Owner

@rick-github commented on GitHub (Sep 4, 2025):

There were still 6 inferences running or queued on the phi4 runner at the end of the log:

Sep 04 19:36:10 AZREUS2UIS0225 ollama[2428603]: time=2025-09-04T19:36:10.663Z level=DEBUG source=sched.go:304
 msg="after processing request finished event" runner.name=registry.ollama.ai/library/phi4:14b runner.inference=cuda
 runner.devices=2 runner.size="14.1 GiB" runner.vram="3.1 GiB" runner.parallel=4 runner.pid=2463430
 runner.model=/usr/share/ollama/.ollama/models/blobs/sha256-fd7b6731c33c57f61767612f56517460ec2d1e2e5a3f0163e0eb3d8d8cb5df20
 runner.num_ctx=4096 refCount=6

When the ollama server marks a model as "Stopping...", it's waiting for the in-progress inferences to be completed before evicting the model. So it looks like the runner is running some inferences.

What could be happening is that the model is losing coherence - it has gone off the beaten path and is generating a stream of tokens that never produces an end-of-sequence token. Since those inferences are still running, the server doesn't evict the model. This decoherence can happen when the context buffer fills up and is shifted. The log shows frequent prompt truncations:

Sep 04 19:30:37 AZREUS2UIS0225 ollama[2428603]: time=2025-09-04T19:30:37.562Z level=WARN
 source=runner.go:127 msg="truncating input prompt" limit=4096 prompt=11080 keep=4 new=4096

which means the buffer is starting out full, and the first few tokens generated by the model will cause a buffer shift:

Sep 04 19:37:13 AZREUS2UIS0225 ollama[2428603]: time=2025-09-04T19:37:13.788Z level=DEBUG source=cache.go:240
 msg="context limit hit - shifting" id=3 limit=4096 input=4096 keep=4 discard=2046

The reason the model keeps being evicted and reloaded is that the client(s) keep changing the context size:

     17 registry.ollama.ai/library/phi4:14b 4096
      9 registry.ollama.ai/library/phi4:14b 512
    459 registry.ollama.ai/library/phi4:14b 8192
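
(Counts like these can be derived from the debug log; a rough sketch, assuming the journalctl output was saved to ollama.log and that runner.name and runner.num_ctx appear on the same line, as in the scheduler entry quoted above:)

$ grep -o 'runner\.name=[^ ]*.*runner\.num_ctx=[0-9]*' ollama.log \
    | sed -E 's/runner\.name=([^ ]+).*runner\.num_ctx=([0-9]+)/\1 \2/' \
    | sort | uniq -c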

There are a couple of things you can try to resolve this.

Increase the size of the context. The default context (OLLAMA_CONTEXT_LENGTH) is already 8192, which probably accounts for the 459 calls to phi4:14b above. The problematic ones are the calls that set num_ctx to 4096, which is the client overriding the default. In most cases where the prompt is truncated, it's a bit less than 12k tokens, so the clients should be setting num_ctx to 12288. Unfortunately there's currently no mechanism for ignoring num_ctx from a client, so the client will have to be configured to either send num_ctx=12288, or better yet, not set num_ctx at all.
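
For illustration, a client-side override via the REST API looks like this (a sketch; the prompt is a placeholder):

$ curl http://localhost:11434/api/generate -d '{
    "model": "phi4:14b",
    "prompt": "Summarize this document ...",
    "options": { "num_ctx": 12288 }
  }'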

If the client doesn't set num_ctx, the model needs to be configured with a large context to handle the prompts without truncating. The easiest way to do this is to create a model that has that parameter set:

$ ollama cp phi4:14b phi4:14b-orig
$ ollama rm phi4:14b
$ echo FROM phi4:14b-orig > Modelfile
$ echo PARAMETER num_ctx 12288 >> Modelfile
$ ollama create phi4:14b

This creates a replacement phi4:14b with a larger context. If it's possible to modify the model name used by the client, then you can ollama create phi4:14b-c12k instead of overwriting the original model, and have the client use the new model name.
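
A sketch of that variant (the -c12k tag is arbitrary):

$ echo FROM phi4:14b > Modelfile
$ echo PARAMETER num_ctx 12288 >> Modelfile
$ ollama create phi4:14b-c12k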

Limit the number of tokens produced. If the problem is that the model has run away, then num_predict can be set to stop the inference after a certain number of tokens. This can either be configured in the client, or a new model can be created as above:

$ echo FROM phi4:14b > Modelfile
$ echo PARAMETER num_predict 4096 >> Modelfile
$ ollama create phi4:14b-p4096
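
If configuring it in the client instead, the same options field in the API request applies; a sketch with a placeholder prompt:

$ curl http://localhost:11434/api/generate -d '{
    "model": "phi4:14b",
    "prompt": "Summarize this document ...",
    "options": { "num_predict": 4096 }
  }'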
Author
Owner

@troy256 commented on GitHub (Sep 5, 2025):

Thank you very much for the insight into what's happening and the suggestions.

We're not aware of anyone setting num_ctx in their API calls but we have adjusted both num_ctx and num_predict as you suggested.

It makes sense that we are seeing runaway inferences, as that model stays in the "Stopping" state indefinitely until we restart ollama, and I have seen that behavior before during my own testing.

Will report back.

Author
Owner

@rick-github commented on GitHub (Oct 6, 2025):

Any progress?


Reference: github-starred/ollama#8099