[GH-ISSUE #14578] Requests to different loaded models serialize / queue for ~50s in single Ollama server (qwen3.5:122b-a10b-q4_K_M + qwen3.5:9b-q4_K_M) #55964

Open
opened 2026-04-29 10:04:42 -05:00 by GiteaMirror · 1 comment

Originally created by @raucodes on GitHub (Mar 3, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14578

What is the issue?

Hi Ollama team,

I’m seeing what looks like global request serialization (or head-of-line blocking) in a single ollama serve process: one long-running request on a large model can cause a second request to a different model to wait in the queue for tens of seconds, even though both models are already loaded and the second request itself only needs ~300ms of compute.

I’m trying to confirm whether:

  1. I’m misunderstanding how concurrency is expected to work across models, or
  2. this is a known limitation/bug (possibly related to qwen35/qwen35moe backends), or
  3. the intended workaround is to run multiple ollama serve instances (separate ports) to isolate queues.

Models involved (exact names)

  • qwen3.5:122b-a10b-q4_K_M (ID 8b9d11d807c5)
  • qwen3.5:9b-q4_K_M (ID 6488c96fa5fa)

Expected behavior

With both models already loaded and plenty of unified memory available, I expected:

  • A request to the 122B model and a request to the 9B model to progress concurrently (or at least for the 9B request not to block behind the 122B request for tens of seconds).

Even if true parallel compute is limited, I expected the short 9B request to start promptly (no huge queue delay).

Actual behavior

Occasionally, a request to qwen3.5:9b-q4_K_M will take ~50 seconds total, while the measured compute phases remain ~0.3s:

Example output (same request repeated 10 times; run 01 shows the issue):

  • total_duration ~ 51.875s
  • load_duration ~ 0.072s
  • prompt_eval_duration ~ 0.148s
  • eval_duration ~ 0.118s

So ~51s is unexplained by model compute and appears to be queue wait time.

Runs 02–10 are normal (~0.32s total).

Reproduction steps

  1. Ensure both models are loaded and kept alive (I’m using keep-alive forever).
  2. Start/trigger a long-running request on the 122B model (e.g., a big prompt with generation).
  3. While that request is running, send a trivial short request to the 9B model.
  4. Observe that sometimes the short 9B request waits a very long time before executing.

Minimal request used for the 9B model

Endpoint: POST http://<host>:11434/api/chat

Payload:

{
  "model":"qwen3.5:9b-q4_K_M",
  "messages":[{"role":"user","content":"Reply with exactly: pong"}],
  "stream": false,
  "options": { "num_predict": 5 }
}

I used this loop to print timings:

import subprocess, json

url = "http://<host>:11434/api/chat"
payload = r'''{
  "model":"qwen3.5:9b-q4_K_M",
  "messages":[{"role":"user","content":"Reply with exactly: pong"}],
  "stream": false,
  "options": { "num_predict": 5 }
}'''

def s(ns): return f"{ns/1e9:.3f}s"  # the API reports durations in nanoseconds

for i in range(1, 11):
    out = subprocess.check_output([
        "curl","-sS",url,
        "-H","Content-Type: application/json",
        "-d",payload
    ])
    j = json.loads(out)
    print(f"run={i:02d} total={s(j.get('total_duration',0))} load={s(j.get('load_duration',0))} "
          f"prompt_eval={s(j.get('prompt_eval_duration',0))} eval={s(j.get('eval_duration',0))} "
          f"prompt_tok={j.get('prompt_eval_count')} eval_tok={j.get('eval_count')}")

Example problematic run:

run=01 total=51.875s load=0.072s prompt_eval=0.148s eval=0.118s prompt_tok=18 eval_tok=5
run=02 total=0.324s  load=0.076s prompt_eval=0.116s eval=0.126s prompt_tok=18 eval_tok=5
...

When the system is idle (even after 8 hours), the same request is consistently ~0.25s total, so this is not a “warmup” issue.

System / configuration

  • Hardware: Apple Silicon Mac Studio (128GB unified memory)
  • Both models kept loaded (I want them warm; memory is not a constraint)
  • OLLAMA_KEEP_ALIVE=-1
  • OLLAMA_MAX_LOADED_MODELS=2
  • (I can test OLLAMA_NUM_PARALLEL if you recommend it, but my intent is 1 request per model concurrently, not multiple requests to one model.)

Questions

  1. Is this behavior expected (single ollama serve effectively processes only one active inference request globally, even across different models)?
  2. If not expected: is this a known issue with qwen3.5 backends on Metal / Apple Silicon?
  3. Should OLLAMA_NUM_PARALLEL enable concurrent requests across different models, or is that strictly per-model and still subject to a global single-request pipeline?
  4. Is running two ollama serve instances on different ports the recommended way to guarantee that a small helper model (9B) can’t be blocked behind a long 122B request?

If you want, I can provide a more explicit reproducer with one terminal holding a long 122B request while the other fires the 9B ping loop, plus full ollama version output.
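
Roughly, a single-script version of that would look like this (untested sketch; <host>, the long prompt, and the sleep/num_predict values are placeholders):

import subprocess, json, threading, time

url = "http://<host>:11434/api/chat"

def post(payload):
    # Same curl-based call as the loop above.
    out = subprocess.check_output([
        "curl", "-sS", url,
        "-H", "Content-Type: application/json",
        "-d", payload,
    ])
    return json.loads(out)

def s(ns): return f"{ns/1e9:.3f}s"  # nanoseconds -> seconds

# Background thread: hold a long-running generation on the 122B model.
long_payload = json.dumps({
    "model": "qwen3.5:122b-a10b-q4_K_M",
    "messages": [{"role": "user", "content": "Write a very long, detailed essay about distributed systems."}],
    "stream": False,
    "options": {"num_predict": 2048},
})
threading.Thread(target=post, args=(long_payload,), daemon=True).start()
time.sleep(2)  # give the 122B request time to start generating

# Foreground: fire the trivial 9B ping loop and print the timings.
ping_payload = json.dumps({
    "model": "qwen3.5:9b-q4_K_M",
    "messages": [{"role": "user", "content": "Reply with exactly: pong"}],
    "stream": False,
    "options": {"num_predict": 5},
})
for i in range(1, 11):
    j = post(ping_payload)
    print(f"run={i:02d} total={s(j.get('total_duration', 0))} "
          f"load={s(j.get('load_duration', 0))} "
          f"prompt_eval={s(j.get('prompt_eval_duration', 0))} "
          f"eval={s(j.get('eval_duration', 0))}")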

Thanks!

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.17.5

GiteaMirror added the bug label 2026-04-29 10:04:42 -05:00

@ChharithOeun commented on GitHub (Mar 29, 2026):

I've submitted a fix for this in #15126.

Root cause: In server/sched.go, when processPending decides to evict a model, it enters a blocking select on <-s.unloadedCh. During this wait, all new requests queued in pendingReqCh are frozen — even requests for models that are already loaded and ready to serve. This is the ~50s queue wait you're seeing.

Fix: Replace the blocking select with a for/select loop that also drains pendingReqCh while waiting for eviction. Requests for already-loaded models get dispatched immediately via useLoadedRunner; others are buffered and re-enqueued after the eviction completes.
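
In Python-ish terms, the shape of the change is roughly this (an illustration of the pattern only, not the actual Go in the PR; the queue, event, and dispatch names below are stand-ins for pendingReqCh, unloadedCh, and useLoadedRunner):

import queue, threading

pending = queue.Queue()              # stand-in for pendingReqCh
eviction_done = threading.Event()    # stand-in for the signal on unloadedCh
loaded = {"qwen3.5:9b-q4_K_M"}       # models that already have a running server

def dispatch(model):
    print("served immediately:", model)  # stand-in for useLoadedRunner

def wait_for_eviction():
    deferred = []
    # Old behavior: block until eviction_done is set, freezing everything in `pending`.
    # New behavior: keep draining `pending` while waiting.
    while not eviction_done.is_set():
        try:
            req = pending.get(timeout=0.05)
        except queue.Empty:
            continue
        if req in loaded:
            dispatch(req)            # already loaded: no head-of-line blocking
        else:
            deferred.append(req)     # needs a load: defer until after the eviction
    for req in deferred:
        pending.put(req)             # re-enqueue once the eviction has completed

pending.put("qwen3.5:9b-q4_K_M")                 # the request that used to wait ~50s
threading.Timer(0.5, eviction_done.set).start()  # simulate the eviction finishing
wait_for_eviction()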

Includes a regression test (TestSchedNoHeadOfLineBlocking) that reproduces the exact scenario and verifies the fix. All 17 existing scheduler tests pass.
