[GH-ISSUE #12477] Ollama Multiple Trace Generation with Parameter Permutations #8290

Closed
opened 2026-04-12 20:50:28 -05:00 by GiteaMirror · 4 comments

Originally created by @kripper on GitHub (Oct 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12477

Overview

During experiments with OpenHands and Ollama (using the openai adapter), it was observed that Ollama generates multiple batches for a single multi-turn prompt (see the documentation in a comment below: https://github.com/ollama/ollama/issues/12477#issuecomment-3368159437).

When functions were exposed in the prompt, we noticed that their parameter order differs between batches.

Example of log entries

A single prompt generated 28 similar traces:

# First trace

Oct 02 06:03:04 myserver ollama[4178335]: time=2025-10-02T06:03:03.996-03:00 level=TRACE source=bytepairencoding.go:244 msg=encoded string="<|im_start|>system\nYou are OpenHands agent, a helpful AI assistant that[...]
Oct 02 06:03:04 myserver ollama[4178335]:  start the service. Must be blocking and include the cd command. If using python, use /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python</description>\n</parameter[...]
Oct 02 06:03:04 myserver ollama[4178335]: 3 3252 2921 9207 30961 3767 2124 397 522 16181 397 27 16181 397 27 606 29 4259 1089 522 606 397 27 1313 29 4082 522 1313 397 522 16181 397 522 13786 397 522 1688 397 27 1[...]

# Second trace (approx. 0.1 seconds later)

Oct 02 06:03:04 myserver ollama[4178335]: time=2025-10-02T06:03:04.093-03:00 level=TRACE source=bytepairencoding.go:244 msg=encoded string="<|im_start|>system\nYou are OpenHands agent, a helpful AI assistant that[...]
Oct 02 06:03:04 myserver ollama[4178335]:  start the service. Must be blocking and include the cd command. If using python, use /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python</description>\n</parameter[...]
Oct 02 06:03:04 myserver ollama[4178335]: 1874 382 262 17693 510 286 1874 842 25 3034 315 279 1874 311 17179 17770 504 13 1416 537 3897 11 5711 279 1638 1874 842 624 286 1537 1089 25 5624 315 1429 3213 17770 311 [...]

# Trace 3

Oct 02 06:03:04 myserver ollama[4178335]: time=2025-10-02T06:03:04.176-03:00 level=TRACE source=bytepairencoding.go:244 msg=encoded string="<|im_start|>system\nYou are OpenHands agent, a helpful AI assistant that[...]
Oct 02 06:03:04 myserver ollama[4178335]:  start the service. Must be blocking and include the cd command. If using python, use /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python</description>\n</parameter[...]
Oct 02 06:03:04 myserver ollama[4178335]: 522 6279 397 522 13786 397 522 1688 397 27 1688 397 27 606 29 455 19169 17932 522 606 397 27 4684 29 1949 458 5387 6821 504 279 4771 4938 553 1181 23698 382 262 17693 510[...]

# Trace 4

Oct 02 06:03:04 myserver ollama[4178335]: time=2025-10-02T06:03:04.238-03:00 level=TRACE source=bytepairencoding.go:244 msg=encoded string="<|im_start|>system\nYou are OpenHands agent, a helpful AI assistant that[...]
Oct 02 06:03:04 myserver ollama[4178335]:  start the service. Must be blocking and include the cd command. If using python, use /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python</description>\n</parameter[...]
Oct 02 06:03:04 myserver ollama[4178335]:  397 522 16181 397 27 6279 29 1183 17128 1341 522 6279 397 522 13786 397 522 1688 397 27 1688 397 27 606 29 4542 72224 522 606 397 27 4684 96204 458 9234 504 279 4771 493[...]

# Trace 5

Oct 02 06:03:04 myserver ollama[4178335]: time=2025-10-02T06:03:04.328-03:00 level=TRACE source=bytepairencoding.go:244 msg=encoded string="<|im_start|>system\nYou are OpenHands agent, a helpful AI assistant that[...]
Oct 02 06:03:04 myserver ollama[4178335]:  start the service. Must be blocking and include the cd command. If using python, use /openhands/poetry/openhands-ai-5O4_aCHf-py3.12/bin/python</description>\n</parameter[...]
Oct 02 06:03:04 myserver ollama[4178335]: 397 522 1688 397 27 1688 397 27 606 29 1836 19195 14896 522 606 397 27 4684 29 5890 279 4771 4938 369 9760 2436 68922 624 262 4220 6644 264 12126 315 678 315 264 2436 594[...]

....

When we compare Trace 1 with Trace 2, we can see that the only difference is the order of the function parameters:

[Image: side-by-side comparison of Trace 1 and Trace 2 showing the differing parameter order (https://github.com/user-attachments/assets/a78ab723-f47a-4e29-8e04-58eb1ead0f58)]

Question

Why is Ollama doing this?

Feature Request

When OLLAMA_DEBUG=2, please add a log line before the batches are generated with a message similar to:

Creating batches for (explain why we are doing this or link to this issue or some documentation)...

This would avoid confusion and save new developers considerable time: at first glance it looks as though the client is sending multiple prompts, so the developer's first instinct is to compare those sub-prompts to determine their origin.
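
A minimal sketch of what such a log line could look like, using Go's standard log/slog package. This is hypothetical and is not Ollama code; the message text, field names, and counter are only meant to illustrate the intent of the request:

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Hypothetical sketch only: neither this logger setup nor the message
	// text comes from the Ollama codebase.
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
		Level: slog.LevelDebug,
	}))

	numMessages := 9 // example value: number of turns in the incoming conversation

	// Emitted once, before the per-suffix encodings are produced, so a reader
	// of the trace knows the repeated encodings that follow are expected and
	// are not separate prompts sent by the client.
	logger.Debug("creating sub-prompt batches for prefix-cache lookup; the encoded strings below are suffixes of the same conversation",
		"messages", numMessages,
		"reference", "https://github.com/ollama/ollama/issues/12477")
}
```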


@kripper commented on GitHub (Oct 2, 2025):

Now the problem is that Ollama is taking excessive time: it seems to also be processing all those permutations, since we see multiple batches:

source=runner.go:557 msg="forwardBatch iBatch" batchID=54

But when testing a similar prompt (with the same context length) using OpenWebUI + Ollama, I observed that Ollama creates a batch for every message in a multi-turn prompt.
Therefore, I conclude that Ollama is not generating permutations, and that OpenHands is actually sending them.
I will need to analyze this further...


@kripper commented on GitHub (Oct 2, 2025):

I analyzed the logs generated in Ollama when using OpenWebUI in more detail and found that:

  • Ollama generates encodings for multiple different prompt endings:
    a sub-prompt for the last 2 messages, for the last 3 messages, for the last 4 messages, and so on. I guess this is done for searching the cache later.

  • So, if our multi-turn conversation has 9 messages, we will see 8 sub-prompts in the logs:

    • last 2
    • last 3
    • last 4
    • last 5
    • last 6
    • last 7
    • last 8
    • last 9 (the complete message)

    That makes 8 sub-prompts in total (see the sketch after this list).

  • Then, the last sub-prompt is logged again. I’m not sure why, maybe it’s a small performance bug,
    or perhaps it’s just logging the same information again at the end.

  • These sub-prompt encodings are logged with the tag: source=bytepairencoding.go:244 msg=encoded
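
A small Go sketch of the suffix enumeration described above, assuming the behavior inferred from the logs; it is illustrative only and not Ollama's actual implementation:

```go
package main

import "fmt"

func main() {
	// Illustrative only: reproduce the pattern inferred from the logs, where a
	// 9-message conversation yields 8 suffix sub-prompts ("last 2" .. "last 9").
	messages := []string{"M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9"}

	for k := 2; k <= len(messages); k++ {
		suffix := messages[len(messages)-k:]
		fmt.Printf("last %d: %v\n", k, suffix)
	}
	// Prints 8 lines, matching the N-1 sub-prompts observed for N = 9 messages.
}
```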

Next, I’ll try to figure out why Ollama received the same messages from OpenHands, but with the parameters in a different order.


@kripper commented on GitHub (Oct 4, 2025):

Autogenerated documentation (not validated):

Ollama Multi-Turn Conversation Batching Documentation

Overview

Ollama uses a multi-turn sub-prompt batching strategy to efficiently process multi-turn conversations in chat models.
When a conversation consists of multiple messages, the model does not process all messages as a single forward pass. Instead, it incrementally builds the context using sub-prompts derived from the most recent messages.

This approach allows for efficient memory usage, incremental KV cache construction, and coherent long-turn reasoning.


Key Concepts

  • Message: A single logical turn in the conversation (user, assistant, or system).
  • Segment: A formatted block of tokens representing a message, after applying the model’s chat template.
  • Sub-prompt: A slice of the conversation containing the last N messages. Used to incrementally build context.
  • Batch: A chunk of tokens processed together by the model during a forward pass. Often corresponds to a sub-prompt or part of a sub-prompt.

Sub-Prompt Logic

For a conversation with N messages:

  • Ollama generates N-1 sub-prompts in the logs.
  • Each sub-prompt contains the last k messages, where k ranges from 2 to N.
  • The final sub-prompt includes the full conversation and is used to produce the model’s reply.

Example: 9-message conversation

Sub-Prompt   Messages Included
Last 2       M8, M9
Last 3       M7, M8, M9
Last 4       M6, M7, M8, M9
Last 5       M5, M6, M7, M8, M9
Last 6       M4, M5, M6, M7, M8, M9
Last 7       M3, M4, M5, M6, M7, M8, M9
Last 8       M2, M3, M4, M5, M6, M7, M8, M9
Last 9       M1, M2, M3, M4, M5, M6, M7, M8, M9

Note: The first message alone does not generate a sub-prompt, so the total number of sub-prompts = N - 1.


How Sub-Prompts Are Processed

  1. Flattening and Tokenization

    • Each message is wrapped using the model’s chat template (e.g., <|user|>...<|end|>).
    • Messages in a sub-prompt are concatenated and tokenized.
  2. Batching

    • The tokenized sub-prompt is divided into token batches for processing (see the sketch after this list).
    • Batch size is determined by the model configuration or GPU constraints.
    • Each batch updates the KV cache incrementally.
  3. Incremental Context Building

    • Sub-prompts are processed sequentially, from the smallest (last 2) to the full conversation.
    • KV cache allows reusing previously computed context, reducing memory and computation overhead.
  4. Response Generation

    • Once the final sub-prompt (full conversation) is processed, the model generates the assistant response.
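
If the batching step described above is accurate, it can be pictured with a short sketch like the following; the batch size and token values are made-up example values, not real Ollama defaults:

```go
package main

import "fmt"

// splitIntoBatches divides a tokenized sub-prompt into fixed-size chunks,
// mirroring the batching step described above. The batch size is an arbitrary
// example value, not a real Ollama configuration default.
func splitIntoBatches(tokens []int, batchSize int) [][]int {
	var batches [][]int
	for start := 0; start < len(tokens); start += batchSize {
		end := start + batchSize
		if end > len(tokens) {
			end = len(tokens)
		}
		batches = append(batches, tokens[start:end])
	}
	return batches
}

func main() {
	tokens := make([]int, 10) // placeholder token IDs for one sub-prompt
	for i := range tokens {
		tokens[i] = i
	}
	for i, batch := range splitIntoBatches(tokens, 4) {
		// Each batch would be a separate forward pass that updates the KV cache.
		fmt.Printf("batch %d: %v\n", i, batch)
	}
}
```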

Benefits

  • Memory efficiency: Only small slices of conversation are processed at a time.
  • Incremental KV caching: Avoids recomputing the entire conversation for each turn.
  • Improved long-turn coherence: Allows the model to focus on recent context without losing older messages.
  • Flexible context truncation: Long conversations can be truncated or batched without breaking reasoning.

Summary

Ollama’s sub-prompt batching ensures that multi-turn conversations are processed efficiently:

  • Each message → formatted → tokenized → part of a sub-prompt
  • Sub-prompts → processed sequentially → KV cache updated
  • Final sub-prompt → full context → generates model reply

This mechanism is crucial for performance, memory management, and maintaining conversation coherence in long multi-turn chats.


@kripper commented on GitHub (Oct 8, 2025):

Closing, since I lost interest, and perhaps the client (OpenHands) was simply sending multiple different prompts. That would explain those strange permutations with the parameters in a different order.
