[GH-ISSUE #1187] Scaling/Concurrent Requests #62640

Closed
opened 2026-05-03 09:51:16 -05:00 by GiteaMirror · 4 comments

Originally created by @jjsarf on GitHub (Nov 18, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1187

Hello again. Great project. This may not be an issue, but I noticed that placing a second request while another one is currently processing makes the new request time out.
Is this by design? This is not the case when using the HuggingFace UI >0.4.
Thanks.

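For context, a minimal reproduction sketch of the reported behavior (assuming a local Ollama at the default `http://localhost:11434` with a pulled `llama2` model; the model name and timeout value are illustrative): two requests fired at once, where the second waits behind the first and can exceed the client-side timeout.

```python
import threading
import requests  # pip install requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default generate endpoint

def ask(tag: str) -> None:
    try:
        # stream=False returns a single JSON response once generation completes
        r = requests.post(
            OLLAMA_URL,
            json={"model": "llama2", "prompt": "Write a long story.", "stream": False},
            timeout=30,  # short client-side timeout to surface the queueing behavior
        )
        print(tag, "finished:", r.status_code)
    except requests.exceptions.Timeout:
        print(tag, "timed out while the other request held the model")

# Fire two requests simultaneously; the server handles them sequentially,
# so the second one waits (and may exceed the client timeout).
threads = [threading.Thread(target=ask, args=(f"req-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```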

@SMenigat commented on GitHub (Nov 20, 2023):

Yes, that's the current design as far as I understand it. All requests are currently handled sequentially. That allows the API to switch out the LLM it is using per request and allows for better planning of the resources needed to run the service. When implementing my app that uses Ollama, I implemented a worker queue that handles all requests in the background.

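To make the worker-queue idea concrete, here is a minimal sketch (not @SMenigat's actual code; the endpoint, model name, and payload follow Ollama's default API, everything else is an assumption): a single background worker drains a queue so only one request reaches Ollama at a time and callers never race each other.

```python
import queue
import threading
import requests  # pip install requests

jobs: "queue.Queue[dict]" = queue.Queue()

def worker() -> None:
    # Serialize all Ollama calls: one in flight at a time, the rest wait in the queue.
    while True:
        job = jobs.get()
        try:
            r = requests.post(
                "http://localhost:11434/api/generate",  # Ollama's default endpoint
                json={"model": job["model"], "prompt": job["prompt"], "stream": False},
            )
            job["callback"](r.json().get("response", ""))
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

# Enqueue from anywhere in the app; the queue absorbs bursts instead of timing out.
jobs.put({"model": "llama2", "prompt": "Hello!", "callback": print})
jobs.join()  # wait for the queue to drain (e.g., at shutdown)
```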

@jjsarf commented on GitHub (Nov 20, 2023):

> Yes, that's the current design as far as I understand it. All requests are currently handled sequentially. That allows the API to switch out the LLM it is using per request and allows for better planning of the resources needed to run the service. When implementing my app that uses Ollama, I implemented a worker queue that handles all requests in the background.

It would be great to have this mechanism as a configuration parameter (on or off), since being able to handle only a single request at a time is a limitation.


@ishaan-jaff commented on GitHub (Nov 22, 2023):

Hi @SMenigat, I'm the maintainer of LiteLLM. We provide an OpenAI-compatible endpoint + request queueing with workers for Ollama, if you're interested in using it (would love your feedback on this).

Here's a quick start on using it. It's compatible with Ollama, GPT-4, and any LiteLLM-supported LLM.
Docs: https://docs.litellm.ai/docs/routing#queuing-beta

### Quick Start

1. Add Redis credentials in a `.env` file

```bash
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default" # [OPTIONAL] if self-hosted
```

2. Start the litellm server with your model config

```bash
$ litellm --config /path/to/config.yaml --use_queue
```

Here's an example config for `ollama/llama2`:

**config.yaml**

```yaml
model_list:
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_key:
  - model_name: code-llama
    litellm_params:
      model: ollama/code-llama # actual model name
```

3. Test (in another window) → sends 100 simultaneous requests to the queue

```bash
$ litellm --test_async --num_requests 100
```

### Available Endpoints

- `/queue/request` - Queues a `/chat/completions` request. Returns a job id.
- `/queue/response/{id}` - Returns the status of a job. If completed, returns the response as well. Potential statuses are: `queued` and `finished`.
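A hypothetical polling sketch against those two endpoints (the proxy address, the payload shape, and the response field names are assumptions; see the linked docs for the actual contract):

```python
import time
import requests  # pip install requests

BASE = "http://0.0.0.0:8000"  # assumed local litellm proxy address

# Queue a /chat/completions request; the response is assumed to carry a job id.
job = requests.post(
    f"{BASE}/queue/request",
    json={"model": "llama2", "messages": [{"role": "user", "content": "Hi!"}]},
).json()

# Poll until the job moves from `queued` to `finished`.
while True:
    status = requests.get(f"{BASE}/queue/response/{job['id']}").json()  # field names illustrative
    if status.get("status") == "finished":
        print(status)  # per the docs, completed jobs include the response itself
        break
    time.sleep(1)
```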

@jmorganca commented on GitHub (Feb 20, 2024):

Merging with #358

