[GH-ISSUE #5828] Will paged attention be added when OLLAMA_NUM_PARALLEL is set higher than 1? #3631

Open
opened 2026-04-12 14:24:36 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @b-Snaas on GitHub (Jul 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5828

I experimented with `OLLAMA_NUM_PARALLEL` on GPUs with a large amount of VRAM, but I saw no real gain in total aggregate tokens per second when posting 10 requests at the same time. I assume this is because Ollama does not have PagedAttention. Are there plans to optimize inference for large numbers of concurrent requests?
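
For anyone reproducing this measurement, a minimal sketch of how "aggregate tokens per second" can be computed for a batch of concurrent requests. The `RequestResult` type and `aggregate_tps` helper are hypothetical names for illustration; the sketch assumes you record a start and end timestamp per request and take the generated-token count from each response (Ollama's `/api/generate` reports it as `eval_count`).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestResult:
    """Hypothetical record of one finished request."""
    eval_count: int   # tokens generated for this request
    start_s: float    # wall-clock time the request was sent, in seconds
    end_s: float      # wall-clock time the last token arrived, in seconds


def aggregate_tps(results: List[RequestResult]) -> float:
    """Total generated tokens divided by the wall-clock span of the whole batch."""
    if not results:
        return 0.0
    wall = max(r.end_s for r in results) - min(r.start_s for r in results)
    if wall <= 0:
        return 0.0
    return sum(r.eval_count for r in results) / wall


# Two requests served in parallel, each taking 10 s for 100 tokens ...
parallel = [RequestResult(100, 0.0, 10.0), RequestResult(100, 0.0, 10.0)]
# ... versus the same two requests served back to back at 5 s each.
sequential = [RequestResult(100, 0.0, 5.0), RequestResult(100, 5.0, 10.0)]

print(aggregate_tps(parallel), aggregate_tps(sequential))
```

If both cases give the same number, as in this made-up example, parallelism only split the same throughput across requests, which is the symptom described above.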

GiteaMirror added the feature request label 2026-04-12 14:24:36 -05:00
Author
Owner

@kescherCode commented on GitHub (Oct 9, 2024):

When serving multiple requests for a single loaded model, you will end up sharing compute resources for a model across requests. That is why you couldn't see any benefit in tps.

Author
Owner

@kungfu-eric commented on GitHub (Feb 1, 2025):

As per https://github.com/ollama/ollama/issues/8741, PagedAttention would help with long-context inference. It is already implemented in vLLM.

vLLM is significantly better optimized for long contexts due to its use of PagedAttention, which efficiently manages the KV (key-value) cache. This reduces memory fragmentation and waste, allowing it to handle sequences up to 10x longer than traditional systems. It also supports continuous (dynamic) batching, which improves throughput.

PagedAttention splits the KV cache into non-contiguous "pages," avoiding wasted memory from padding or fragmentation. Optimized memory management reduces redundant data transfers, mitigating bandwidth bottlenecks.
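
To make the paging idea concrete, here is a toy sketch of a page-table style KV-cache allocator. It is an illustration under simplifying assumptions, not vLLM's or Ollama's actual implementation: `PagedKVAllocator`, `contiguous_waste`, and `paged_waste` are hypothetical names, and only slot bookkeeping is modeled, not the tensors themselves.

```python
from typing import Dict, List, Tuple


class PagedKVAllocator:
    """Toy allocator: each sequence maps to a list of fixed-size, non-contiguous pages."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.block_tables: Dict[str, List[int]] = {}  # seq_id -> page ids, in order
        self.lengths: Dict[str, int] = {}             # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a KV slot for one new token; returns (page id, offset in page)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # A new page is taken only when every page already held is full.
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.page_size], length % self.page_size

    def free(self, seq_id: str) -> None:
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


def contiguous_waste(lengths: List[int], max_len: int) -> int:
    """Unused slots if every sequence reserves max_len contiguous slots up front."""
    return sum(max_len - n for n in lengths)


def paged_waste(lengths: List[int], page_size: int) -> int:
    """Unused slots with paging: only the tail of each sequence's last page."""
    return sum(-n % page_size for n in lengths)


alloc = PagedKVAllocator(num_pages=4, page_size=16)
for _ in range(17):
    alloc.append_token("a")  # 17 tokens need two 16-slot pages

print(len(alloc.block_tables["a"]), len(alloc.free_pages))
print(contiguous_waste([17, 100], 2048), paged_waste([17, 100], 16))
```

With contiguous preallocation the waste grows with the reserved maximum length, while with paging it is bounded by one partially filled page per sequence.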

Ollama is simpler and more user-friendly but lacks vLLM's advanced memory optimizations. It uses standard attention mechanisms and contiguous memory allocation, which is inefficient for very long sequences. It may struggle with 50k-token contexts due to higher memory overhead and fragmentation.


Reference: github-starred/ollama#3631