[GH-ISSUE #1927] Handling High traffic #1108

Closed
opened 2026-04-12 10:51:19 -05:00 by GiteaMirror · 3 comments

Originally created by @lauvindra on GitHub (Jan 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/1927

Assume I have Ollama on a server with a Tesla T4 GPU (16 GB VRAM) and 120 GB of RAM. How many requests can it handle per second?


@easp commented on GitHub (Jan 11, 2024):

That card apparently has ~320 GB/s of memory bandwidth. Tokens/s generated is approximately 8 bits/byte × 320 GB/s / (# model parameters × # bits per parameter). For a q4 quantization of a 7B model, that's probably about 100 tokens/s.
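
To sanity-check that arithmetic, here is a minimal Python sketch of the same back-of-envelope estimate (the ~320 GB/s bandwidth and the 7B/q4 model size are the assumptions from this comment, not measured values):

```python
# Back-of-envelope speed estimate for a memory-bandwidth-bound decoder:
# every generated token streams all model weights from VRAM once, so
# tokens/s ~= bandwidth (bytes/s) / model size (bytes).

def tokens_per_second(bandwidth_gb_s: float, params_billion: float,
                      bits_per_param: float) -> float:
    bytes_per_s = bandwidth_gb_s * 1e9
    model_bytes = params_billion * 1e9 * bits_per_param / 8  # 8 bits per byte
    return bytes_per_s / model_bytes

# Tesla T4 (~320 GB/s) running a 7B model quantized to q4 (~4 bits/param):
print(f"{tokens_per_second(320, 7, 4):.0f} tokens/s")  # -> 91, i.e. "about 100"
```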

Ollama currently queues concurrent requests and processes them serially. This isn't an efficient way to handle concurrent requests.
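
One way to see that serialization from the client side is to fire several requests at once and watch per-request latency climb. A minimal sketch, assuming a local Ollama server on the default port 11434 and a pulled model named `llama2` (both are illustrative assumptions, not part of the original report):

```python
# Send N identical generate requests concurrently; because Ollama queues
# them and runs them one at a time, total wall time grows roughly linearly
# with N instead of staying flat.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def generate(prompt: str) -> float:
    start = time.time()
    requests.post(
        "http://localhost:11434/api/generate",  # assumed local default port
        json={"model": "llama2", "prompt": prompt, "stream": False},
        timeout=600,
    )
    return time.time() - start

with ThreadPoolExecutor(max_workers=4) as pool:
    latencies = list(pool.map(generate, ["Why is the sky blue?"] * 4))

# Later requests wait behind earlier ones, so sorted latencies step upward.
print([f"{s:.1f}s" for s in sorted(latencies)])
```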


@jimscard commented on GitHub (Jan 12, 2024):

If Ollama is intended to be used as a *local* LLM system, then queuing requests and processing them serially is appropriate. Enabling concurrent processing, e.g., for a "server" scenario, would significantly increase the attack surface and complexity.


@pdevine commented on GitHub (Jan 26, 2024):

Going to close this as a dupe of #358
