[PR #14407] fix: deadlock when runner needs reload under concurrent load #61359

Open
opened 2026-04-29 16:26:45 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14407
Author: @JRMeyer
Created: 2/25/2026
Status: 🔄 Open

Base: main ← Head: fix/sched-deadlock-busy-runner


📝 Commits (1)

  • 1fd00cb sched: fix deadlock when runner needs reload under concurrent load

📊 Changes

1 file changed (+22 additions, -0 deletions)


📝 server/sched.go (+22 -0)

📄 Description

TL;DR: Any request with different options (e.g., different num_ctx) arriving while the runner is busy permanently deadlocks the scheduler. The server becomes completely unresponsive to any new request with non-matching options, and the only recovery is killing and restarting the Ollama process. This is trivially triggered in any multi-client setup with OLLAMA_NUM_PARALLEL > 1.


What

When a loaded runner needs to reload (e.g., a request arrives with different num_ctx than the loaded runner), processPending expires the runner and blocks on <-s.unloadedCh waiting for it to unload. If the runner has active requests (refCount > 0), this wait is indefinite under continuous parallel load — refCount never reaches 0 because completed requests are immediately replaced by new ones.

Since processPending is single-threaded, this blocks all future scheduling. The runner subprocess continues processing its existing requests, but no new requests can ever be dispatched through the slow path. The server is permanently deadlocked until the process is killed.

How to reproduce

Only two conditions are needed:

  1. A runner loaded with specific options (e.g., num_ctx: 4096)
  2. A concurrent request with different options (e.g., num_ctx: 8192) arriving while the runner has active requests

This happens easily in practice: a batch workload sets num_ctx, then a health check, monitoring probe, or second client sends a request without specifying num_ctx (which defaults to a different value). The server deadlocks within seconds.

Reproduction script (verified — FAIL on main, PASS with fix)
#!/usr/bin/env bash
#
# Reproduces the scheduler deadlock in Ollama when a request with different
# options arrives while the runner is busy under continuous load.
#
# Usage:
#   ./reproduce-sched-deadlock.sh [model]
#
# Requirements:
#   - Ollama running with OLLAMA_NUM_PARALLEL >= 2
#   - A model already pulled (defaults to "llama3.2:3b")
#
# What it does:
#   1. Loads the model with num_ctx=4096
#   2. Starts continuous worker loops that keep the runner perpetually busy
#   3. While busy, sends a request with num_ctx=8192 (triggers options mismatch)
#   4. Checks if any new request with different options can be served
#
# Expected result (without fix):
#   Steps 3 and 4 hang permanently — scheduler is deadlocked, requires kill
#
# Expected result (with fix):
#   All requests complete, server stays healthy

set -euo pipefail

MODEL="${1:-llama3.2:3b}"
OLLAMA_HOST="${OLLAMA_HOST:-http://localhost:11434}"
NUM_CTX=4096
MISMATCHED_NUM_CTX=8192

cleanup() {
    [ -n "${WORKER_PIDS:-}" ] && kill $WORKER_PIDS 2>/dev/null
    wait 2>/dev/null
}
trap cleanup EXIT

echo "=== Ollama Scheduler Deadlock Reproduction ==="
echo "Model:  $MODEL"
echo "Host:   $OLLAMA_HOST"
echo ""

# Verify Ollama is running
if ! curl -sf --max-time 5 "$OLLAMA_HOST/api/tags" > /dev/null 2>&1; then
    echo "ERROR: Ollama not responding at $OLLAMA_HOST"
    echo "Start with: OLLAMA_NUM_PARALLEL=2 ollama serve"
    exit 1
fi
echo "[OK] Ollama is running"

# Step 1: Load the model with explicit num_ctx
echo ""
echo "Step 1: Loading model with num_ctx=$NUM_CTX ..."
curl -sf --max-time 300 "$OLLAMA_HOST/api/chat" \
    -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"options\":{\"num_ctx\":$NUM_CTX,\"num_predict\":1},\"stream\":false}" \
    > /dev/null
echo "[OK] Model loaded with num_ctx=$NUM_CTX"

# Step 2: Start continuous worker loops to keep refCount > 0 at all times.
# Each worker sends requests back-to-back in a loop, mimicking a real batch
# workload where completed requests are immediately replaced by new ones.
echo ""
echo "Step 2: Starting 4 continuous workers to keep runner perpetually busy ..."
WORKER_PIDS=""
for w in $(seq 1 4); do
    (
        while true; do
            curl -sf --max-time 300 "$OLLAMA_HOST/api/chat" \
                -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"Write about topic $w iteration $RANDOM in great detail.\"}],\"options\":{\"num_ctx\":$NUM_CTX,\"num_predict\":500},\"stream\":false}" \
                > /dev/null 2>&1
        done
    ) &
    WORKER_PIDS="$WORKER_PIDS $!"
done
echo "[OK] Started 4 continuous workers"

# Wait for workers to establish steady-state load
echo "     Waiting 10s for workers to reach steady state ..."
sleep 10

# Step 3: Send a request with DIFFERENT num_ctx
echo ""
echo "Step 3: Sending request with num_ctx=$MISMATCHED_NUM_CTX (mismatch triggers reload) ..."
echo "     If this hangs for 30s, the scheduler is deadlocked."
MISMATCH_OK=false
START=$(date +%s)
if curl -sf --max-time 30 "$OLLAMA_HOST/api/chat" \
    -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}],\"options\":{\"num_ctx\":$MISMATCHED_NUM_CTX,\"num_predict\":1},\"stream\":false}" \
    > /dev/null 2>&1; then
    ELAPSED=$(( $(date +%s) - START ))
    MISMATCH_OK=true
    echo "[OK] Mismatched request completed (${ELAPSED}s)"
else
    ELAPSED=$(( $(date +%s) - START ))
    echo "[FAIL] Mismatched request timed out (${ELAPSED}s) — scheduler is deadlocked"
fi

# Step 4: Confirm — try another request with yet another num_ctx
CONFIRM_OK=false
echo ""
echo "Step 4: Confirming with another mismatched request ..."
if curl -sf --max-time 10 "$OLLAMA_HOST/api/chat" \
    -d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"options\":{\"num_ctx\":16384,\"num_predict\":1},\"stream\":false}" \
    > /dev/null 2>&1; then
    CONFIRM_OK=true
    echo "[OK] Request completed — scheduler is healthy"
else
    echo "[FAIL] Also timed out — confirms scheduler is permanently deadlocked"
    echo ""
    echo "The server cannot process any request with non-matching options."
    echo "Only recovery: kill the Ollama process."
fi

echo ""
echo "=== Result ==="
if $MISMATCH_OK && $CONFIRM_OK; then
    echo "PASS: No deadlock. Server handled options mismatch gracefully."
else
    echo "FAIL: Deadlock triggered. Server required restart."
    exit 1
fi

Root cause

  1. GetRunner → needsReload() returns true (options mismatch) → slow path → pendingReqCh
  2. processPending picks up request, confirms needsReload() → sets runnerToExpire
  3. Sets sessionDuration = 0, but refCount > 0 → runner not sent to expiredCh
  4. Blocks on <-s.unloadedCh — waits for runner to fully unload
  5. Under continuous parallel load, at least one slot is always active, so refCount never reaches 0
  6. processPending blocked forever → all future slow-path requests dead → server unusable

Fix

When needsReload() returns true but the runner has active requests (refCount > 0), serve the request using the existing runner rather than blocking the scheduler. Applied in both:

  • GetRunner: Avoids queueing to the slow path entirely when the runner is busy
  • processPending: Defense in depth — same check if a request reaches the slow path

The runner is still functional, so the request is served with the currently loaded options rather than the requested ones (e.g., the smaller num_ctx stays in effect). Serving with mismatched options is strictly better than a permanent deadlock.

Testing

Reproduced on Mac Studio M4 Max (128GB) with OLLAMA_NUM_PARALLEL=8:

  1. Started continuous batch workload (4 workers, num_ctx: 4096)
  2. Sent request with num_ctx: 8192 while workers active
  3. Before fix: Permanent scheduler deadlock — both mismatched requests timed out, server unusable. Only recovery: kill the process.
  4. After fix: Mismatched requests served in <1 second, workers continue uninterrupted, server stays healthy

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:26:45 -05:00
Reference: github-starred/ollama#61359