[GH-ISSUE #10699] official batching support #7031

Closed
opened 2026-04-12 18:56:11 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @Master-Pr0grammer on GitHub (May 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10699

Summary:

True batching (merging multiple prompts into one tensor and running a single forward pass) offers significant throughput and memory-efficiency gains over the current concurrency implementation, which duplicates full inference contexts for each request.

Background:

Issue #2301 requested batching and was closed when Ollama introduced “concurrency.” However, concurrency, which allocates N separate contexts (and N× the memory for token buffers), is not the same as true batching.

Key differences between the two:

Batching: Loads the model weights once, stacks multiple prompts into a [batch_size, seq_len] tensor, runs one forward pass, then splits the output.

Concurrency: Loads the model once, but allocates N independent token-buffer contexts and runs N separate forward passes.
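
To make the difference concrete, here is a toy NumPy sketch (illustrative only, not Ollama or llama.cpp code) contrasting the two strategies, with a single matrix multiply standing in for the forward pass:

```python
import numpy as np

# Toy stand-in for a transformer forward pass: one matmul against shared weights.
hidden, vocab = 64, 128
weights = np.random.default_rng(0).normal(size=(hidden, vocab))

def forward(x):
    return x @ weights

# Three "prompts", each already embedded to [seq_len, hidden].
prompts = [np.ones((16, hidden)) * i for i in range(3)]

# Batching: stack into [batch_size, seq_len, hidden], one forward pass, then split.
batch = np.stack(prompts)            # shape (3, 16, 64)
batched_out = forward(batch)         # one pass over the whole batch
split_outputs = list(batched_out)    # back to per-prompt results

# Concurrency (as described above): N independent passes, one per prompt.
concurrent_outputs = [forward(p) for p in prompts]

# Same results either way; the difference is duplicated buffers and N separate passes.
assert all(np.allclose(a, b) for a, b in zip(split_outputs, concurrent_outputs))
```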

In my own quick testing:

  • 1 request: 8K context tokens max.
  • 2 concurrent requests: 4K context tokens each.
  • 4 concurrent requests: 2K context tokens each.

This 1/N context-size scaling empirically confirms that Ollama’s concurrency duplicates buffers rather than using real batching within each forward pass.

Why Batching Matters:

  • Much higher throughput for same hardware
  • Lower VRAM usage, leaving room for bigger models or longer contexts
  • Better hardware utilization
  • Scales better and can handle more parallel requests

Proposal & Next Steps:

The initial step would be to implement batching in the inference engine (which it should already have, since it is a fork of llama.cpp), and to add an API endpoint where the user can provide a batch of inputs to be processed together rather than serially. Optionally, automatic batching could be implemented with a “batch window” (i.e., waiting ~5–10 ms to gather prompts), along with appropriate config settings.
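
As a sketch of what such an endpoint might look like, here is a hypothetical example in Python (the `/api/generate_batch` path and the payload fields are assumptions for illustration; no such endpoint exists in Ollama today):

```python
import requests

# Hypothetical batch request: one call carries several prompts, and the server
# would stack them into a single batched forward pass.
payload = {
    "model": "llama3",
    "prompts": [
        "Summarize the plot of Hamlet in one sentence.",
        "Translate 'good morning' into French.",
        "List three uses for a paperclip.",
    ],
    "batch_window_ms": 10,  # optional: wait up to ~10 ms to gather more prompts
}

resp = requests.post("http://localhost:11434/api/generate_batch", json=payload)
for result in resp.json()["responses"]:
    print(result)
```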

GiteaMirror added the feature request label 2026-04-12 18:56:11 -05:00
Author
Owner

@jessegross commented on GitHub (May 14, 2025):

We already do what you describe as batching. You still need context space for each parallel request since that's the conversation's history, but they are run together as a single tensor in one forward pass. Requests that are runnable at the beginning of a forward pass are batched together. In practice, this means that the previous forward pass is the batch window for new requests to be staged.
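
A toy Python sketch of that scheduling idea (an illustration of the described behavior, not Ollama's actual scheduler code):

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    tokens: list = field(default_factory=list)
    done: bool = False

def forward_pass(batch):
    # Stand-in for one batched forward pass: one new "token" per runnable request.
    return [f"tok{len(r.tokens)}" for r in batch]

def step(pending):
    # Whatever is runnable when the pass starts is stacked into this pass's batch;
    # anything arriving later waits for the next pass, so a pass is the batch window.
    batch = [r for r in pending if not r.done]
    for r, tok in zip(batch, forward_pass(batch)):
        r.tokens.append(tok)
        r.done = len(r.tokens) >= 3  # toy stopping condition

requests = [Request("prompt A"), Request("prompt B")]
while not all(r.done for r in requests):
    step(requests)
print([r.tokens for r in requests])  # both requests advance together each pass
```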

P.S. Ollama is not a fork of llama.cpp

Author
Owner

@Master-Pr0grammer commented on GitHub (May 14, 2025):

ok awesome! I stand corrected, thanks for the clarification.

And my bad, I meant to say it started out by using a fork of llama.cpp, not that it is a fork.

Author
Owner

@barrkel commented on GitHub (Oct 12, 2025):

> We already do what you describe as batching. You still need context space for each parallel request since that's the conversation's history, but they are run together as a single tensor in one forward pass. Requests that are runnable at the beginning of a forward pass are batched together. In practice, this means that the previous forward pass is the batch window for new requests to be staged.

I think there's a misunderstanding.

How do we issue one API request, with multiple conversations, and have inference run for all of them in parallel?

For example, I have a web app and I want to generate multiple suggestions of actions to take based on the current context. I don't want to run a loop issuing API requests, since that will run them sequentially. I can try and issue a bunch of concurrent requests, but how do I guarantee that they all run together? The risk is the first request runs as a separate batch=1 inference and maybe the subsequent requests get batched. But I want to minimize latency for the entire batch.
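
For reference, the closest client-side approach today seems to be firing the requests concurrently against the existing /api/generate endpoint and relying on the server to batch whatever is runnable together; a minimal Python sketch (the model name and prompts are placeholders, and nothing here guarantees the requests share a forward pass):

```python
import requests
from concurrent.futures import ThreadPoolExecutor

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str) -> str:
    # stream=False returns the full completion as a single JSON object.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "llama3", "prompt": prompt, "stream": False},
    )
    return resp.json()["response"]

prompts = [
    "Suggest an action for an empty shopping cart.",
    "Suggest an action for an abandoned checkout.",
    "Suggest an action for a first-time visitor.",
]

# Fire all requests at once; whether they land in the same forward pass is up to
# the server's scheduler and its parallelism settings.
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    suggestions = list(pool.map(generate, prompts))

for s in suggestions:
    print(s)
```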
