[GH-ISSUE #4165] OLLAMA_NUM_PARALLEL and multi-modal models lead to failed processing images error #28347

Open
opened 2026-04-22 06:27:33 -05:00 by GiteaMirror · 10 comments

Originally created by @jmorganca on GitHub (May 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4165

What is the issue?

When processing multiple requests with multi-modal models such as llava or moondream, generation freezes and an error is printed in the server logs: failed processing images
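
For anyone trying to reproduce this, a minimal sketch (not from the original report; the model tag, image file, and host below are assumptions) is to fire two image requests at the same loaded model concurrently after starting the server with OLLAMA_NUM_PARALLEL set to 2 or higher:

import base64, json, urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"  # default local endpoint
IMG = base64.b64encode(open("test.jpg", "rb").read()).decode()  # any small image

def ask(prompt):
    body = json.dumps({
        "model": "llava",  # assumed tag; the report also names moondream
        "prompt": prompt,
        "images": [IMG],
        "stream": False,
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Two simultaneous image requests against the same model.
with ThreadPoolExecutor(max_workers=2) as pool:
    for answer in pool.map(ask, ["Describe this image.", "What colors appear in it?"]):
        print(answer)

A single request completes normally; per the report, it is the concurrent pair that hangs while the server log prints failed processing images.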

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 06:27:33 -05:00

@KevinTurnbull commented on GitHub (Mar 8, 2026):

It's not clear that this is related to #14510 -- though Qwen3.5 is a vision language model.

Is there a prioritization for adding parallelism for qwen3.5?


@elade9977 commented on GitHub (Mar 14, 2026):

Is there a prioritization for adding parallelism for qwen3.5?


@lclrd commented on GitHub (Mar 23, 2026):

Is there any timeline or more context for when support for these models will be added?


@charlesdrakon-cmyk commented on GitHub (Mar 31, 2026):

We are seeing the same underlying behavior on Apple Silicon in real production-style use.

Environment:

  • Platform: Apple Silicon (macOS)
  • Ollama: 0.19.0
  • Models in use: qwen3.5:35b family in our environment
  • Deployment style: local multi-user / shared-service usage
  • OLLAMA_NUM_PARALLEL configured > 1

Observed behavior:

  • Requests against Qwen 3.5 still behave as serialized / effectively single-active-request.
  • In practice, one request runs and the next waits, rather than true concurrent generation.
  • This matches the scheduler behavior already described here, where the qwen35/qwen35moe architectures are limited to Parallel=1 (a quick client-side timing check is sketched below).

Additional note:

  • We tested 0.19.0 on Apple Silicon after the MLX transition announcement.
  • We did not observe a meaningful real-world concurrency improvement for Qwen 3.5 workloads.
  • Single-user speed remains good, but concurrency is still the limiting factor.

Why this matters:

  • For shared local deployments, Qwen 3.5 speed can partially mask the issue, but true multi-user responsiveness still depends on actual parallel request support.
  • This is especially important on large-memory Apple Silicon systems, where the hardware is capable and the remaining bottleneck appears to be scheduler / architecture support.

Suggested action:

  • Please reopen / reassess this as distinct from #4165 if needed.
  • #4165 appears to be about multi-modal image-processing failures, while this issue is about qwen35/qwen35moe parallel request support being disabled.

If useful, we can provide reproduction details from an Apple Silicon / macOS setup as well.
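
For reference, a rough way to confirm the "one request runs and the next waits" behavior from the client side (an illustrative sketch, not part of the original report; the model tag and host are assumptions) is to time two identical requests fired together:

import json, time, urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"

def timed_generate(_):
    body = json.dumps({"model": "qwen3.5:35b",  # assumed tag
                       "prompt": "Count to twenty.",
                       "stream": False}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.time()
    urllib.request.urlopen(req).read()
    return round(time.time() - start, 1)

# With an effective Parallel=1 the second duration is roughly the first plus a
# full generation (e.g. [8.1, 16.3]); with real concurrency both stay close to
# the single-request baseline.
with ThreadPoolExecutor(max_workers=2) as pool:
    print(sorted(pool.map(timed_generate, range(2))))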


@Robinsane commented on GitHub (Apr 1, 2026):

Same issue for me today with Qwen3.5 122b as well as Nemotron 3 super.
Neither could handle parallel requests.

Edit:

  • "Could not handle" meaning both would set Parallel:1 according to the logs, while OLLAMA_NUM_PARALLEL was configured to be 2
  • Ollama version: 0.17.7

Edit2:

  • Also a problem for qwen3-next 80b on ollama v 0.19.0

@rick-github commented on GitHub (Apr 2, 2026):

https://github.com/ollama/ollama/blob/a8292dd85f234ef52f8b477dbbefbf9517f58ef5/server/sched.go#L419-L424


@lclrd commented on GitHub (Apr 2, 2026):

@rick-github you've posted that snippet multiple times across many related issues, but there's still no information about if or when support for these models will be implemented.

Is there any timeline or more context for when support for these models will be added?


@assinchu commented on GitHub (Apr 2, 2026):

I have 2 ollama services running on an H100 GPU.
ollama/ollama:0.16.3
docker inspect 785e619f910e | grep OLLAMA
"OLLAMA_DEBUG=1",
"OLLAMA_NUM_PARALLEL=8",
"OLLAMA_MAX_LOADED_MODELS=8",
"OLLAMA_MAX_QUEUE=4096",
"OLLAMA_KEEP_ALIVE=-1",
"OLLAMA_SCHED_SPREAD=1",
"OLLAMA_HOST=0.0.0.0:11434"
In this version, I can send multiple requests to the same model at the same time, and I see runner.parallel=8 in the log.

Same OLLAMA env in another service:
ollama/ollama:0.18.0
But here, despite OLLAMA_NUM_PARALLEL=8, I see runner.parallel=1 and all the requests go to the queue.

Is there any workaround for this?


@rick-github commented on GitHub (Apr 2, 2026):

@lclrd That's what this ticket is for. If/when it's implemented, this ticket will be updated.

@assinchu Run multiple servers and put a reverse proxy in front.


@rick-github commented on GitHub (Apr 2, 2026):

To flesh out my comment a bit: the listed models have architectures that prevent them from running parallel queries in an ollama server, so a workaround is to run multiple servers. A reverse proxy (nginx, caddy, etc.) can be run in front of the servers to present a single API for clients. Running multiple servers is straightforward in a docker environment; a non-docker environment just has to avoid port collisions. The drawback is that a single model has multiple copies of its weights loaded, taking up space that would otherwise be available for context.

x-ollama: &ollama
  image: ollama/ollama:${OLLAMA_DOCKER_TAG-latest}
  volumes:
    - ${OLLAMA_MODELS-./ollama}:/root/.ollama
  environment: &env
    OLLAMA_DEBUG: ${OLLAMA_DEBUG-1}
    OLLAMA_NUM_PARALLEL: 1
    OLLAMA_MAX_LOADED_MODELS: 1
    OLLAMA_MAX_QUEUE: ${OLLAMA_MAX_QUEUE-4096}
    OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE--1}
    OLLAMA_FLASH_ATTENTION: ${OLLAMA_FLASH_ATTENTION-0}
    OLLAMA_KV_CACHE_TYPE: ${OLLAMA_KV_CACHE_TYPE-}
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

services:
  ollama-1:
    << : *ollama

  ollama-2:
    << : *ollama

  ollama-3:
    << : *ollama

  ollama-4:
    << : *ollama

  ollama:
    image: nginx-lb
    build:
      dockerfile_inline: |
        FROM nginx:latest
        RUN cat > /etc/nginx/conf.d/default.conf <<EOF
        upstream ollama_group {
          least_conn;
          server ollama-1:11434 max_conns=4;
          server ollama-2:11434 max_conns=4;
          server ollama-3:11434 max_conns=4;
          server ollama-4:11434 max_conns=4;
        }
        server {
          listen 11434;
          server_name localhost;
          location / {
            proxy_pass http://ollama_group;
          }
        }
        EOF
    ports:
      - 11434:11434
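
Bring the stack up with docker compose up -d and point clients at http://localhost:11434 as before. nginx's least_conn directive routes each new request to the backend with the fewest active connections, so with four backends at OLLAMA_NUM_PARALLEL=1 up to four generations can run at once, at the cost of four copies of the model in memory. For streamed responses you may also want proxy_buffering off; in the location block.
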
Reference: github-starred/ollama#28347