[GH-ISSUE #15923] [0.20.5][macOS Apple Silicon] Runner crashes under sustained multi-turn tool-calling on /v1/chat/completions (72% crash rate across 7 models) #72201

Open
opened 2026-05-05 03:37:31 -05:00 by GiteaMirror · 1 comment

Originally created by @emcee777 on GitHub (May 1, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15923

What is the issue?

The Ollama runner crashes reliably under sustained multi-turn tool-calling on 0.20.5 (and other 0.20.x releases). The crash manifests as one of three error signatures returned to the client mid-conversation:

  1. model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details (HTTP 500)
  2. an error was encountered while running the model: unexpected EOF (HTTP 500)
  3. connection refused (the entire ollama serve process dies; subsequent requests fail until restart)

In a 138-run benchmark suite covering 7 tool-using tasks across 7 models, 99 of 138 runs (72%) failed to complete. Of the 19 that returned a structured error, the three signatures above account for all of them. The remaining 78 running rows are runs where the runner crashed mid-stream and the client harness timed out without ever receiving [DONE] (consistent with the same root cause).

This is a regression from 0.19.x, narrowed by user reports to 0.20.0–0.20.5. It is closely related to, but distinct from, #14611 (which targets 0.17.5 and /api/generate):

  • This report is specifically about the OpenAI-compatible /v1/chat/completions path with tools and multi-turn loops (agent-style: assistant→tool_call→tool_result→assistant…; see the message-shape sketch after this list).
  • Single-shot /api/generate and single-shot /api/chat (no tools) are far more stable on the same machine with the same model versions. The issue surfaces specifically when the harness drives sequential tool-call/tool-result turns.
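
For concreteness, here is the shape of the second request in one agent turn — the previous assistant tool_call echoed back, followed by the tool result. Model name, call id, prompt, and file contents are illustrative; the tool schema mirrors the read_file-style tool used in the reproduction below.

# Turn 2 of an agent loop against the OpenAI-compatible endpoint:
# the assistant's tool_call from turn 1 is echoed back, then the tool
# result, and the model is asked for its next assistant message.
curl -sS http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma4:26b",
    "messages": [
      {"role": "user", "content": "Read config.yaml and summarize it."},
      {"role": "assistant", "tool_calls": [{
        "id": "call_1", "type": "function",
        "function": {"name": "read_file",
                     "arguments": "{\"path\": \"config.yaml\"}"}
      }]},
      {"role": "tool", "tool_call_id": "call_1",
       "content": "port: 8080\nworkers: 4"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a file from disk and return its contents",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]
        }
      }
    }],
    "tool_choice": "auto"
  }'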

Reproduction (minimal)

Full script: see repro.sh in the supporting gist (https://gist.github.com/emcee777/7485a16ec04a86d173c9cdcf17fa3572#file-repro-sh). The essence, condensed into the sketch after this list:

  1. Pull a tool-capable model: ollama pull gemma4:31b (also reproduces with gemma4:26b, mistral-small3.2, glm-4.7-flash, nemotron-cascade-2).
  2. In a loop, POST to /v1/chat/completions with tools: [...] and tool_choice: "auto", providing a single read_file-style tool. Each iteration is a fresh single-turn request — so this is not a context-blowup issue; it's frequency.
  3. After 1–10 iterations the runner exits and one of the three error signatures appears.
  4. Larger models (gemma4:31b, mistral-small3.2) often crash on the first tool-using request. Smaller models crash within ~5 iterations. gemma4:e4b (4B effective) is the only model in our matrix that survives sustained tool-using loops.
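
Condensed into a sketch (this is what repro.sh drives; host/port, prompt, and iteration count are illustrative):

#!/bin/sh
# Fire fresh single-turn tool-capable requests in a loop until the
# runner dies; detects the crash signatures listed at the top.
for i in $(seq 1 10); do
  resp=$(curl -sS --max-time 120 http://127.0.0.1:11434/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "gemma4:31b",
      "messages": [{"role": "user",
                    "content": "Read config.yaml and summarize it."}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "read_file",
          "description": "Read a file from disk and return its contents",
          "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"]
          }
        }
      }],
      "tool_choice": "auto"
    }')
  case "$resp" in
    ""|*"unexpectedly stopped"*|*"unexpected EOF"*)
      # An empty response covers the connection-refused case; the other
      # patterns match error signatures #1 and #2 above.
      echo "runner crashed on iteration $i"; exit 1;;
    *) echo "iteration $i ok";;
  esac
done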

Aggregate benchmark data (gist: https://gist.github.com/emcee777/7485a16ec04a86d173c9cdcf17fa3572)

138 runs; harness = a simple agent loop driving /v1/chat/completions with one tool, 120 s curl timeout, single in-flight request (a tally sketch follows the table):

Model               Completed   Total   Pass rate
gemma4:e4b              14        24       58%
gemma4:26b              23        40       58%
gemma4:31b               0        16        0%
mistral-small3.2         0        16        0%
glm-4.7-flash            1        32        3%
nemotron-cascade-2       1         8       12%
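
A tally along these lines reproduces the Completed/Total columns, assuming the harness logs one JSON object per run to a file such as runs.jsonl with model and status fields (the file name and field names are hypothetical here):

# Per-model completed/total from one-JSON-object-per-line run records.
jq -rs 'group_by(.model)[] |
        [.[0].model,
         (map(select(.status == "completed")) | length),
         length] |
        @tsv' runs.jsonl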

Three sample crash records (one per error class) are attached as JSON in the gist:

  • sample-runner-stopped-gemma4-31b.json — error #1
  • sample-unexpected-eof-glm.json — error #2
  • sample-connection-refused-gemma4-31b.json — error #3

Environment

  • Ollama: 0.20.5 (also reproduced on 0.20.6, 0.20.7; 0.21.x improves but still crashes on qwen3-coder-next via /v1/messages+tools)
  • Server config (excerpt from the server log; replayed as a launch command below):
    OLLAMA_FLASH_ATTENTION=true OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_CONTEXT_LENGTH=32768 OLLAMA_NUM_PARALLEL=2 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_MAX_QUEUE=512 OLLAMA_KEEP_ALIVE=5m0s OLLAMA_LOAD_TIMEOUT=5m0s OLLAMA_NEW_ENGINE=false
  • OS: macOS 15.5 (24F74)
  • Hardware: Apple M3 Max, 128 GB unified memory (96 GB recommendedMaxWorkingSetSize per Metal init)
  • Memory at crash: 58.7 GiB free (system), 95.5 GiB GPU available — runner is not OOM-pressured at the OS level, yet still terminates.
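
To reproduce under identical knobs, the same configuration can be replayed on a manual launch (values copied verbatim from the server-log excerpt above):

# Start the server with the exact environment the crashes occur under.
OLLAMA_FLASH_ATTENTION=true \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_CONTEXT_LENGTH=32768 \
OLLAMA_NUM_PARALLEL=2 \
OLLAMA_MAX_LOADED_MODELS=3 \
OLLAMA_MAX_QUEUE=512 \
OLLAMA_KEEP_ALIVE=5m0s \
OLLAMA_LOAD_TIMEOUT=5m0s \
OLLAMA_NEW_ENGINE=false \
ollama serve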

Server-side log fragments

From the runner subprocess that backs /v1/chat/completions during a typical crash window (preserved from the related Apr 2026 server log; the original 0.20.5 logs were rotated, but the same termination pattern persists on 0.21.2):

time=...  level=ERROR  source=server.go:1611  msg="post predict"  error="Post \"http://127.0.0.1:NNNNN/completion\": EOF"
[GIN] ... | 500 | ...s | 127.0.0.1 | POST "/v1/chat/completions"
time=... level=ERROR source=server.go:303 msg="llama runner terminated" error="exit status 2"

This is identical in shape to what #14611 reports on 0.17.5 /api/generate, suggesting the runner's exit-status-2 path is a long-standing hot edge that the multi-turn tool-call protocol now exercises far more frequently than single-shot generation did.
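
To catch the termination live, tailing the server log while the repro loop runs works well (default log path for the macOS app; a foreground ollama serve writes the same lines to stderr):

# Follow the server log and surface only the crash-window lines.
tail -f ~/.ollama/logs/server.log |
  grep --line-buffered -E 'post predict|llama runner terminated'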

Expected behavior

A tool-using agent loop should not be able to terminate the runner. Whatever path the runner takes through tool-call decoding (the harmony-format / function-call grammar, KV cache reuse across turns, or whatever piece of state survives between sequential /v1/chat/completions requests with tools) should not be capable of producing exit status 2 with no recoverable error.

What we've ruled out / additional notes

  • Not OOM at the OS level (free memory is 50%+ during crashes; spot-check below).
  • Not context overflow — single-turn single-tool requests crash too, no growing history.
  • Not concurrency — OLLAMA_NUM_PARALLEL=2 and a single in-flight request both reproduce.
  • Not specific to one quant — happens on Q4_K_M and Q5_K_M variants we tested.
  • Not specific to one model family — Gemma, Mistral, GLM, and Nemotron all reproduce.
  • The only mitigation we've found that meaningfully reduces crash frequency is restricting tool-using workloads to small models (gemma4:e4b, ~9.6 GB) — which suggests a per-runner state path that scales poorly with model size.
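
For anyone re-checking the OOM ruling, a macOS-side spot-check along these lines can run beside the loop (memory_pressure ships with the OS; its final output line reports the system-wide free percentage):

# Print the system-wide free-memory percentage every 5 s while the
# repro loop runs; it stays high right through the crashes.
while sleep 5; do memory_pressure | tail -n 1; done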

Happy to provide additional logs, run a test build, or tighten the reproducer if helpful. Thank you for the work on Ollama — the dual-endpoint architecture is genuinely useful and we'd love to keep building on it.


@rick-github commented on GitHub (May 4, 2026):

Server logs (https://docs.ollama.com/troubleshooting) with OLLAMA_DEBUG=1 will aid in debugging.
