[GH-ISSUE #14364] Model stuck in 'Stopping...' state indefinitely with active connections #55846

Closed
opened 2026-04-29 09:48:12 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @JRMeyer on GitHub (Feb 22, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14364

Description

ollama ps shows a model permanently stuck in the Stopping... state. The model never finishes unloading and never becomes available again. The original Ollama process continues to hold the port, so launchd restart attempts fail with bind: address already in use.

What triggered it

Two independent Python batch processes (8 workers total) were sending concurrent chat requests to the same model. The second batch connected while the first was already active. Shortly after, the model entered Stopping... and never recovered.
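
A minimal way to approximate that load pattern from the shell (the actual clients were Python scripts; this sketch assumes the default endpoint at 127.0.0.1:11434 and the gpt-oss:120b model from this report):

#!/usr/bin/env bash
# One "batch": 4 workers looping concurrent /api/chat requests.
# Run the script twice to mirror the two-batch, 8-worker setup.
MODEL="gpt-oss:120b"
URL="http://127.0.0.1:11434/api/chat"

worker() {
  while true; do
    curl -s "$URL" \
      -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"ping $RANDOM\"}],\"stream\":false}" \
      > /dev/null
  done
}

for i in 1 2 3 4; do worker & done
wait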

Observed behavior

$ ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:120b    a951a23b46a1    89 GB    100% GPU     131072     Stopping...

This state persists indefinitely (tested for 10+ minutes). The original Ollama process (PID 37884) still holds port 11434 with established connections from both batch processes. Meanwhile, launchd keeps trying to restart Ollama but fails:

Error: listen tcp 127.0.0.1:11434: bind: address already in use

(repeated hundreds of times in /opt/homebrew/var/log/ollama.log)

The batch processes stay alive but are blocked waiting on Ollama responses.
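
To confirm the bind conflict, lsof shows which PID still owns the listener and which client connections are still established (a quick diagnostic sketch; PIDs and ports will differ per machine):

lsof -nP -iTCP:11434 -sTCP:LISTEN        # should show the original PID (37884 here)
lsof -nP -iTCP:11434 -sTCP:ESTABLISHED   # the batch clients' open connections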

Expected behavior

The model should either:

  • Stay loaded and serve requests, or
  • Unload cleanly and reload, without getting stuck in a permanent Stopping... state

Steps to reproduce

  1. Start Ollama with OLLAMA_NUM_PARALLEL=8
  2. Launch a batch process with 4 concurrent workers sending chat requests
  3. Wait for model to load and begin serving
  4. Launch a second batch process with 4 more concurrent workers to the same model
  5. Observe ollama ps — model enters Stopping... and never recovers

Environment

  • Ollama version: 0.13.3
  • OS: macOS 26.1 (build 25B78)
  • Hardware: Mac Studio, Apple M4 Max, 128 GB unified memory
  • Model: gpt-oss:120b (65 GB weights, 89 GB loaded)
  • Ollama config:
    • OLLAMA_NUM_PARALLEL=8
    • OLLAMA_KV_CACHE_TYPE=q8_0
    • OLLAMA_FLASH_ATTENTION=1
  • Managed by: launchd (homebrew.mxcl.ollama)

Workaround

Kill Ollama manually and restart. The model reloads fine on a fresh start.

Author
Owner

@JRMeyer commented on GitHub (Feb 22, 2026):

Investigation Findings

After deeper investigation, the situation is more nuanced than originally reported. Here are the full findings:


1. OLLAMA_NUM_PARALLEL was actually 1, not 8

The running Ollama instance (PID 37884, uptime 19+ hours) was started from an older version of the launchd plist that did not include OLLAMA_NUM_PARALLEL. The startup log confirms:

routes.go:1554 msg="server config" env="map[...OLLAMA_NUM_PARALLEL:1...]"

So despite the plist now specifying OLLAMA_NUM_PARALLEL=8, the live process was running with the default of 1 parallel slot. Two batch processes (8 workers total) were hitting a single-slot instance.
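
For anyone checking their own instance: the effective configuration is printed once in the "server config" line at startup, so it can be read back from the log even after the plist has changed (log path from this setup):

grep 'server config' /opt/homebrew/var/log/ollama.log | tail -1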

2. API is partially responsive

Read-only endpoints respond immediately while in "Stopping..." state:

  • GET / → "Ollama is running"
  • GET /api/ps → returns model info
  • GET /api/tags → returns model list
  • GET /api/version → returns 0.13.3
  • POST /api/chat → hangs indefinitely (TCP connects, 0 bytes of response); a probe sketch follows the list
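
A rough curl probe along these lines reproduces the pattern; --max-time keeps the hanging /api/chat call from blocking the loop:

for ep in / /api/ps /api/tags /api/version; do
  echo "== GET $ep"
  curl -s --max-time 5 "http://127.0.0.1:11434$ep"; echo
done

echo "== POST /api/chat (times out while stuck)"
curl -s --max-time 10 http://127.0.0.1:11434/api/chat \
  -d '{"model":"gpt-oss:120b","messages":[{"role":"user","content":"hi"}],"stream":false}'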

3. expires_at is continuously refreshed

Despite showing "Stopping...", the expires_at field in /api/ps is refreshed to current wall-clock time every ~2 seconds:

10:53:44 → 2026-02-22T10:53:44.287
10:53:46 → 2026-02-22T10:53:46.338
10:53:48 → 2026-02-22T10:53:48.387

Something is actively refreshing the keep-alive timer.
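
The refresh is easy to watch with a small poll of /api/ps (sketch assumes jq is installed; drop the jq pipe to see the raw JSON):

while true; do
  printf '%s ' "$(date '+%H:%M:%S')"
  curl -s http://127.0.0.1:11434/api/ps | jq -r '.models[0].expires_at'
  sleep 2
done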

4. Runner process is alive and using CPU

The runner child process (ollama runner --ollama-engine --model ... --port 55342) is in RN (running) state with fluctuating CPU (0.2-21.2%) and 86.3 GB RSS. It's not hung at the OS level. Its internal TCP connections to the Ollama main process are dynamically changing between observations.
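
Sampling the runner a few times makes the fluctuation visible (ps keywords are macOS/BSD style; the [o] in the grep pattern just keeps grep from matching itself):

for i in 1 2 3 4 5; do
  ps -axo pid,stat,%cpu,rss,command | grep '[o]llama runner'
  sleep 2
done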

5. Runner was restarted at least once

Startup log shows initial runner on port 49397, but the current runner listens on port 55342. The model was unloaded/reloaded at some point during the 19-hour session — before the "Stopping..." state occurred.

6. One client died, leaving a stale connection

Two batch processes were started. Batch B died immediately after startup (never completed health check). Ollama main still has a CLOSED socket on fd 20 to port 55489, which was the dead client's connection. Batch A (4 workers, 4 ESTABLISHED connections) is alive but frozen — its log stopped growing and all threads are sleeping at 0% CPU, waiting on Ollama.
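
The socket states can be listed directly from the main process's TCP descriptors, for example (PID from this incident; adjust as needed):

lsof -a -nP -p 37884 -iTCP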

7. launchd has lost track of the process

active count = 0           # launchd thinks nothing is running
state = spawn scheduled    # trying to spawn another
runs = 5,652               # total spawn attempts
last exit code = 1

PID 37884 has PPID=1 but launchd's active count = 0. launchd continuously respawns new instances (~1 every 6 seconds due to KeepAlive=true), all failing with bind: address already in use. This has been going on for hours, adding ~5,852 error lines to the log.
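
The launchd figures above can be read back with launchctl print (Homebrew services normally run in the per-user GUI domain; adjust the domain if Ollama runs as a system daemon):

launchctl print gui/$(id -u)/homebrew.mxcl.ollama | grep -E 'active count|state|runs|last exit code'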

8. Log file is 99.99% crash noise

The 180,768-line, 6.6 MB log contains:

  • ~15,903 panic: $HOME is not defined entries (historical, from before the plist had HOME set)
  • 7 lines of actual operational log from the running instance
  • ~5,852 bind: address already in use errors (ongoing launchd respawn failures)

There is zero log output from Ollama between startup and the current state — no scheduler decisions, no model load/unload messages, no error about the "Stopping..." transition.

9. System resources are fine

  • Memory: 86 GB used by model, 1 GB free + 18 GB inactive, 0 swap
  • Disk: 755 GB available
  • Load: 0.97
  • No resource exhaustion

Environment (corrected)

  • Ollama version: 0.13.3 (latest available: 0.16.3, 3 versions behind)
  • Effective OLLAMA_NUM_PARALLEL: 1 (not 8 as originally reported)
  • Everything else unchanged from original report

Summary of observable state

The model is in a limbo where:

  1. The runner process is alive and doing something (variable CPU, changing sockets)
  2. Read-only API works, inference API hangs
  3. Keep-alive timer is actively refreshed
  4. ollama ps reports "Stopping..." but the model never finishes stopping
  5. One client has a stale CLOSED connection from a process that died
  6. The surviving client's 4 connections are established but getting no responses
Author
Owner

@rick-github commented on GitHub (Feb 22, 2026):

The Stopping state indicates that the server wanted to unload the model. This usually happens on a model change or when model parameters (e.g., num_ctx) are changed by the client. Since the state persists, the unload didn't complete, indicating the model runner got wedged. Just killing the runner (use ps w -u ollama | grep runner to identify it) should restore function.
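
A minimal version of that recovery step, matching by the runner's full command line instead of ps | grep:

pkill -f 'ollama runner'        # send SIGTERM to the wedged runner only
pkill -9 -f 'ollama runner'     # escalate only if it ignores SIGTERM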

One reason for an unload not to complete is that the runner is still processing a request. Since it's still using CPU, this seems the likely reason. It could be that the model ran out of context space, or that the parallel requests caused context corruption, leaving the model runner incoherent and stuck in a loop.

There have been previous instances of this bug that have been fixed, so upgrading to a more recent release may help.
