[GH-ISSUE #7758] OLLAMA_MAX_QUEUE does not limit requests to the same model #4954

Open
opened 2026-04-12 16:00:49 -05:00 by GiteaMirror · 1 comment

Originally created by @yyx1111 on GitHub (Nov 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7758

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

It seems that OLLAMA_MAX_QUEUE is not taking effect. My environment is Windows 11, and I have set OLLAMA_NUM_PARALLEL=1 and OLLAMA_MAX_QUEUE=1, but excess requests still queue up instead of returning an error.
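For anyone trying to reproduce this, here is a minimal sketch in Go (not from the original report) that fires several concurrent requests at one model. It assumes the server is reachable at the default port 11434 and that a model named llama3 has been pulled; both are placeholders to substitute. With OLLAMA_MAX_QUEUE=1 one would expect most of these requests to be rejected rather than queued, but as described here they all queue and eventually succeed.

```go
// repro.go — hypothetical reproduction sketch. The model name "llama3"
// and the default port 11434 are assumptions; substitute your own.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
)

func main() {
	body := []byte(`{"model":"llama3","prompt":"hi","stream":false}`)
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			resp, err := http.Post("http://localhost:11434/api/generate",
				"application/json", bytes.NewReader(body))
			if err != nil {
				fmt.Printf("request %d: error: %v\n", n, err)
				return
			}
			resp.Body.Close()
			// If OLLAMA_MAX_QUEUE=1 limited same-model requests, most of
			// these would be rejected; in practice they all return 200 OK.
			fmt.Printf("request %d: %s\n", n, resp.Status)
		}(i)
	}
	wg.Wait()
}
```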

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.14

GiteaMirror added the feature request label 2026-04-12 16:00:49 -05:00

@dhiltgen commented on GitHub (Nov 21, 2024):

This is a side effect of the way the scheduler is implemented. The queue depth applies to processing incoming requests that need to be scheduled. If you make requests for different models, you will see the behavior you're looking for. Multiple requests to the same model pass through the scheduler quickly, since the model is already loaded, and then reach the underlying runner and queue up on the parallelism semaphore. Currently that layer of the system does not use a fixed-depth queue. Adjusting the implementation to use a queue and respect the max queue depth does seem like a good enhancement to make.
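To make the suggested enhancement concrete, here is an illustrative Go sketch (not Ollama's actual code): a buffered channel models the per-runner parallelism semaphore, and a second bounded channel caps how many callers may wait on it. Runner, Acquire, and ErrServerBusy are hypothetical names introduced for this example.

```go
// sched.go — illustrative sketch only, not Ollama's real scheduler.
// slots models the parallelism semaphore (capacity OLLAMA_NUM_PARALLEL);
// queue bounds how many callers may wait on it (capacity OLLAMA_MAX_QUEUE).
package main

import (
	"errors"
	"fmt"
	"time"
)

var ErrServerBusy = errors.New("server busy: queue full") // hypothetical error

type Runner struct {
	slots chan struct{} // parallelism semaphore
	queue chan struct{} // bounded waiting room in front of it
}

func NewRunner(parallel, maxQueue int) *Runner {
	return &Runner{
		slots: make(chan struct{}, parallel),
		queue: make(chan struct{}, maxQueue),
	}
}

// Acquire takes a waiting position if one is free, then blocks for a
// parallel slot. Callers beyond maxQueue waiters fail fast instead of
// piling up without bound, which is the behavior the reporter expected.
func (r *Runner) Acquire() error {
	select {
	case r.queue <- struct{}{}:
	default:
		return ErrServerBusy // waiting room full: reject immediately
	}
	r.slots <- struct{}{} // block until a parallel slot frees up
	<-r.queue             // leave the waiting room
	return nil
}

func (r *Runner) Release() { <-r.slots }

func main() {
	r := NewRunner(1, 1) // OLLAMA_NUM_PARALLEL=1, OLLAMA_MAX_QUEUE=1
	done := make(chan struct{})
	for i := 0; i < 4; i++ {
		go func(n int) {
			defer func() { done <- struct{}{} }()
			if err := r.Acquire(); err != nil {
				fmt.Printf("request %d: %v\n", n, err)
				return
			}
			defer r.Release()
			time.Sleep(100 * time.Millisecond) // simulate inference
			fmt.Printf("request %d: done\n", n)
		}(i)
	}
	for i := 0; i < 4; i++ {
		<-done
	}
}
```

How overflow should actually be reported to clients (HTTP status code, error body) would be a design choice for the real implementation.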


Reference: github-starred/ollama#4954