[GH-ISSUE #14879] Concurrent processing with Qwen 3.5 family models #9592

Closed
opened 2026-04-12 22:30:08 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @charlesdrakon-cmyk on GitHub (Mar 16, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14879

What is the issue?

Summary

Qwen 3.5 models appear to ignore or bypass OLLAMA_NUM_PARALLEL, resulting in effectively single-request inference even when parallelism is configured and hardware resources are available.

Other models (e.g., Llama family) run concurrently under the same configuration.

Environment

Hardware: Apple Mac Studio (M3 Ultra)

RAM:

Test system (Colossus): 128 GB

Production systems (Hal / Sal): 512 GB

Backend: Ollama (Metal acceleration)

Frontend: Open WebUI

Ollama configuration:

OLLAMA_NUM_PARALLEL=8

Environment variable confirmed active in the running process.
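For context, this is how the variable is typically set and verified on macOS (a sketch: launchctl setenv applies to the launchd-managed Ollama.app after a restart, an exported variable only affects a manually started server, and the process name may differ for the app bundle):

# macOS app: set at the launchd level, then restart Ollama.app
launchctl setenv OLLAMA_NUM_PARALLEL 8
launchctl getenv OLLAMA_NUM_PARALLEL

# Manually started server: export inline
OLLAMA_NUM_PARALLEL=8 ollama serve

# Confirm in the running process (BSD ps; -E appends the environment)
ps -wwE -p "$(pgrep -x ollama | head -n 1)" | tr ' ' '\n' | grep OLLAMA_NUM_PARALLEL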

Observed Behavior

When running Qwen 3.5 models (tested with both qwen3.5:35b and qwen3.5:122b):

Requests appear to serialize rather than execute concurrently

Additional requests wait until the active generation completes

Effective parallelism is 1

This occurs even though:

Sufficient RAM is available

The model is fully loaded in GPU memory (a quick check is sketched after this list)

OLLAMA_NUM_PARALLEL is set and confirmed active.
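
One quick way to sanity-check the load state described above (ollama ps is a standard CLI command; the exact columns vary by version, and the output shown here is illustrative):

ollama ps
# NAME           ID            SIZE     PROCESSOR    UNTIL
# qwen3.5:35b    <model id>    ~24 GB   100% GPU     4 minutes from now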

Control Test

Under the same system and configuration, Llama models behave as expected:

Multiple requests generate simultaneously

Concurrency matches OLLAMA_NUM_PARALLEL

Token streaming occurs from multiple requests at once.

This suggests the issue is specific to the Qwen runner or architecture handling in Ollama.

Reproduction Example

Concurrent requests using the Ollama API:

curl http://localhost:11434/api/generate ...
curl http://localhost:11434/api/generate ...
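
A fuller version of this test, as a sketch (the prompt is illustrative; the model name is taken from the report above; timestamps make any serialization visible):

# Fire two identical requests in parallel and timestamp each completion.
for i in 1 2; do
  (
    start=$(date +%s)
    curl -s http://localhost:11434/api/generate \
      -d '{"model": "qwen3.5:35b", "prompt": "Write a haiku about the sea.", "stream": false}' \
      > /dev/null
    echo "request $i finished after $(( $(date +%s) - start ))s"
  ) &
done
wait
# Concurrent: both finish in roughly the same time.
# Serialized: the second takes roughly twice as long.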

Expected:

Both requests generate tokens simultaneously.

Observed with Qwen:

The second request waits

Generation starts only after the first request finishes (or partially completes).

Additional Observations

Token generation speed and time to first token (TTFT) are excellent for Qwen models.

GPU layers fully offload to Metal.

Memory pressure remains low.

The issue appears to be scheduling / concurrency, not performance.

Expected Behavior

Qwen 3.5 models should respect the configured parallelism (OLLAMA_NUM_PARALLEL > 1) and allow multiple concurrent inference streams, as Llama models do.

Impact

This prevents Qwen models from being used effectively in multi-user inference environments, even when hardware capacity exists.

Relevant log output

No response
OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 22:30:08 -05:00

Reference: github-starred/ollama#9592