[GH-ISSUE #12557] Ollama Tool Calling + Streaming Issue #70388

Closed
opened 2026-05-04 21:23:20 -05:00 by GiteaMirror · 5 comments

Originally created by @yibie on GitHub (Oct 10, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12557

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

Date: October 10, 2025
Version: Ollama 0.12.3
Model Tested: qwen3-coder:30b-a3b-q8_0
Severity: High (Affects tool use functionality)

Issue Summary

Ollama's streaming implementation for tool/function calls is incomplete and inconsistent with standard streaming behavior observed in other LLM providers (OpenAI, Anthropic). This causes integration issues with client libraries that expect proper streaming responses for tool calls.

Detailed Analysis

Current Behavior

1. Non-Streaming Tool Calls (Working Correctly)

curl -X POST http://localhost:11434/api/chat -d '{
  "model": "qwen3-coder:30b-a3b-q8_0",
  "messages": [{"role": "user", "content": "Calculate 2+2"}],
  "tools": [{"type": "function", "function": {...}}],
  "stream": false
}'
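
For reference, the elided tools entry can be written out in full. The field layout below follows Ollama's documented tool schema (which mirrors OpenAI's); the calculate function itself is a hypothetical example:

{
  "type": "function",
  "function": {
    "name": "calculate",
    "description": "Evaluate a simple arithmetic expression",
    "parameters": {
      "type": "object",
      "properties": {
        "expression": {
          "type": "string",
          "description": "The expression to evaluate, e.g. \"2 + 2\""
        }
      },
      "required": ["expression"]
    }
  }
}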

Response (single, complete):

{
  "model": "qwen3-coder:30b-a3b-q8_0",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [{
      "function": {
        "name": "calculate",
        "arguments": {"expression": "2 + 2"}
      }
    }]
  },
  "done": true,
  "done_reason": "stop"
}

2. Streaming Tool Calls (Problematic)

# Same request but with "stream": true

Response (split into multiple chunks):

// Chunk 1: Tool call present but not done
{
  "model": "qwen3-coder:30b-a3b-q8_0",
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [{
      "function": {
        "name": "calculate",
        "arguments": {"expression": "2 + 2"}
      }
    }]
  },
  "done": false
}

// Chunk 2: Empty completion
{
  "model": "qwen3-coder:30b-a3b-q8_0",
  "message": {
    "role": "assistant",
    "content": ""
  },
  "done": true,
  "done_reason": "stop"
}

Problems Identified

  1. Incomplete Streaming: Tool calls don't stream progressively like text content
  2. No Follow-up Content: When asked to "calculate and explain", only tool calls are returned, no explanatory text
  3. Inconsistent Format: The two-chunk format differs from standard streaming APIs
  4. Client Library Issues: This behavior breaks integrations with tools like Emacs gptel

Expected Behavior (Based on OpenAI Standard)

Progressive streaming should include:

  1. Tool call construction streaming (if complex)
  2. Tool result processing
  3. Explanatory text streaming after tool execution
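
For comparison, OpenAI's API streams a tool call as incremental deltas, roughly as follows (an abbreviated sketch of the documented chunk format; the call id is illustrative):

// Delta 1: tool call opens; index, id, and function name arrive first
{"choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_abc123", "type": "function", "function": {"name": "calculate", "arguments": ""}}]}, "finish_reason": null}]}

// Subsequent deltas: arguments arrive as partial JSON string fragments
{"choices": [{"delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\"expression\": \"2 + 2\"}"}}]}, "finish_reason": null}]}

// Final delta: empty, with finish_reason "tool_calls" instead of "stop"
{"choices": [{"delta": {}, "finish_reason": "tool_calls"}]}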

Impact

  • High: Affects client integrations (Emacs gptel, LangChain, etc.)
  • Workaround Required: Must disable streaming for tool use scenarios
  • User Experience: Inconsistent behavior between streaming and non-streaming modes

Reproduction Steps

  1. Start Ollama with a tool-capable model (qwen3-coder tested)
  2. Send a request with both stream: true and tools array
  3. Observe the incomplete two-chunk response
  4. Compare with stream: false for expected single-chunk behavior
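
Using the full calculate tool definition shown earlier, step 2 can be reproduced with (the tool definition is illustrative; any valid tool shows the same behavior):

curl -X POST http://localhost:11434/api/chat -d '{
  "model": "qwen3-coder:30b-a3b-q8_0",
  "messages": [{"role": "user", "content": "Calculate 2+2 and explain the result"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "calculate",
      "description": "Evaluate a simple arithmetic expression",
      "parameters": {
        "type": "object",
        "properties": {
          "expression": {"type": "string", "description": "The expression to evaluate"}
        },
        "required": ["expression"]
      }
    }
  }],
  "stream": true
}'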

Technical Details

Environment

  • Ollama Version: 0.12.3
  • Model: qwen3-coder:30b-a3b-q8_0 (supports tool calling)
  • Platform: macOS Darwin 25.0.0
  • API Endpoint: /api/chat

Test Cases Performed

  1. Simple Tool Call: "Calculate 2+2"
  2. Complex Request: "Calculate 2+2 and explain the result"
  3. Regular Text Streaming: "What is the capital of France?" (works correctly)

All tool-related streaming tests show the same issue.
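
For contrast, test case 3 (plain text) streams as many small content chunks, consistent with Ollama's documented streaming format (the exact token split below is illustrative):

{"model": "qwen3-coder:30b-a3b-q8_0", "message": {"role": "assistant", "content": "The"}, "done": false}
{"model": "qwen3-coder:30b-a3b-q8_0", "message": {"role": "assistant", "content": " capital"}, "done": false}
...
{"model": "qwen3-coder:30b-a3b-q8_0", "message": {"role": "assistant", "content": ""}, "done": true, "done_reason": "stop"}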

Recommendations

Short-term

  1. Documentation: Clearly document tool calling streaming limitations
  2. Error Handling: Return proper error status when streaming + tools can't be handled
  3. Consistency: Ensure both streaming modes return equivalent tool call information

Long-term

  1. Implement Progressive Tool Streaming: Stream tool call construction when possible
  2. Follow Standard Format: Align with OpenAI/Anthropic streaming patterns
  3. Tool Result Integration: Properly stream post-tool-execution content
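
On item 3: per Ollama's tool-calling docs, the client is expected to execute the tool itself and send the result back as a role "tool" message, after which the model can stream explanatory text. A follow-up request would look roughly like this (the result value "4" is illustrative; the tools array is elided as in the earlier example):

curl -X POST http://localhost:11434/api/chat -d '{
  "model": "qwen3-coder:30b-a3b-q8_0",
  "messages": [
    {"role": "user", "content": "Calculate 2+2 and explain the result"},
    {"role": "assistant", "content": "", "tool_calls": [{"function": {"name": "calculate", "arguments": {"expression": "2 + 2"}}}]},
    {"role": "tool", "content": "4"}
  ],
  "tools": [{"type": "function", "function": {...}}],
  "stream": true
}'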

Workarounds for Users

Currently, users must:

  1. Detect when tools are being used
  2. Force stream: false in those cases
  3. Handle responses differently for tool vs. non-tool scenarios

This is the approach implemented by Emacs gptel:

;; Current workaround in gptel-ollama.el: when tools are enabled,
;; force :stream to :json-false so it serializes as JSON false.
(when (and gptel-use-tools gptel-tools)
  (plist-put prompts-plist :stream :json-false))

Conclusion

The current streaming implementation for tool calls in Ollama is incomplete and causes integration issues. A comprehensive fix would improve the user experience and make Ollama more compatible with existing client libraries and frameworks.


Contact: This report is based on testing with the Emacs gptel integration. Similar issues likely affect other client libraries.

Related Issues: This may be related to broader streaming implementation improvements needed in Ollama's API.

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.12.3

GiteaMirror added the bug label 2026-05-04 21:23:21 -05:00

@ParthSareen commented on GitHub (Oct 10, 2025):

Hey @yibie, which libraries are you having challenges with? LangChain should be working fine; I'm less sure about the gptel integration. You shouldn't have to disable streaming for tool use, as the model returns a complete tool call if one exists rather than streaming an incomplete one out, so you can directly use the single streamed message you get back.

Without seeing how you're making the request for "explain the result", I won't be able to help as much. Both the tool call and the tool result must be passed back to the model.


@yibie commented on GitHub (Oct 11, 2025):

Hi @ParthSareen, thanks for taking the time to reply. I actually test tool calls through curl. If you have a corresponding curl command, sharing it would be very helpful, so I can see how Ollama responds to tool calls and what values are returned along the way, because I don't understand the data format and return values of Ollama's tool calls.


@yibie commented on GitHub (Oct 14, 2025):

@ParthSareen I must clarify: in gptel, the author states that streaming output needs to be turned off to invoke tools properly, which is not the scenario you described. What I need help with now is the process and mechanism of Ollama's tool calling, so that I can understand how to work with it. I don't need you to solve any problems.


@ParthSareen commented on GitHub (Oct 18, 2025):

https://docs.ollama.com/capabilities/tool-calling


@ParthSareen commented on GitHub (Oct 18, 2025):

Doesn't seem like this is related to us so gonna close this out!

Reference: github-starred/ollama#70388