[GH-ISSUE #9092] Enabling tools breaks stream=True on /v1 endpoint, only returns a single complete response #5916

Open
opened 2026-04-12 17:15:15 -05:00 by GiteaMirror · 3 comments

Originally created by @huangdihd on GitHub (Feb 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9092

What is the issue?

Issue Description

When enabling tools, the Ollama API seems to break streaming (stream=True) on the /v1 endpoint. Instead of returning chunks of data progressively, it waits and sends the entire response as a single block.

Steps to Reproduce

  1. Run the following Python script using the OpenAI-compatible API (via the /v1 endpoint):

    import openai
    
    client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    
    response = client.chat.completions.create(
        model="qwen2.5:32b",
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True,
        tools=[{"type": "function", "function": {"name": "test_tool", "description": "Test tool"}}]
    )
    
    for chunk in response:
        print(chunk)
    
  2. Observe the behavior:

    • Without tools, stream=True correctly streams chunks of choices[0].delta.content.
    • With tools enabled, the response arrives only as a single complete block.

Expected Behavior

Even when using tools, Ollama should support proper streaming behavior and return incremental chunks instead of buffering the full response.
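
For reference, the OpenAI streaming format delivers a tool call incrementally as choices[0].delta.tool_calls fragments rather than as one block, roughly like this (the id and argument fragments below are illustrative, not actual Ollama output):

    data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1739506822,"model":"qwen2.5:32b","choices":[{"index":0,"delta":{"role":"assistant","tool_calls":[{"index":0,"id":"call_abc123","type":"function","function":{"name":"test_tool","arguments":""}}]},"finish_reason":null}]}

    data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1739506822,"model":"qwen2.5:32b","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{}"}}]},"finish_reason":null}]}

    data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1739506823,"model":"qwen2.5:32b","choices":[{"index":0,"delta":{},"finish_reason":"tool_calls"}]}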

Environment

  • Ollama Version: 0.5.10
  • Operating System: Linux / Mac / Windows
  • API Method: OpenAI-compatible API (/v1 endpoint) / requests / Ollama Python SDK
  • Model Tested: qwen2.5:32b

Additional Information

When testing with curl against the /v1 endpoint, streaming works correctly:

curl -N -X POST http://localhost:11434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
        "model": "qwen2.5:32b",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": true
     }'

However, when using Python and enabling tools, streaming does not work as expected.
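
To narrow down where the buffering happens, the same curl test can be repeated with a tools array in the payload (reusing the hypothetical test_tool from the repro script above); if the chunks stop arriving progressively here too, the buffering is server-side rather than in the Python client:

    curl -N -X POST http://localhost:11434/v1/chat/completions \
         -H "Content-Type: application/json" \
         -d '{
            "model": "qwen2.5:32b",
            "messages": [{"role": "user", "content": "Tell me a story"}],
            "stream": true,
            "tools": [{"type": "function", "function": {"name": "test_tool", "description": "Test tool"}}]
         }'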

Question

  • Is this an intentional limitation when using tools with /v1 endpoints?
  • If so, is there a workaround to allow streaming while still using tools? (One possible client-side fallback is sketched below.)
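
A minimal client-side fallback sketch, assuming the server buffers whenever tools is supplied: stream only for tool-free requests and make a single blocking call otherwise. This does not restore real streaming with tools; it just makes the behavior explicit:

    import openai

    client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    def chat(messages, tools=None):
        # Assumption: streaming is buffered whenever tools are supplied,
        # so only request a stream for tool-free calls.
        if tools is None:
            stream = client.chat.completions.create(
                model="qwen2.5:32b", messages=messages, stream=True
            )
            for chunk in stream:
                delta = chunk.choices[0].delta.content
                if delta:
                    print(delta, end="", flush=True)
        else:
            # Blocking call: the response arrives as one block anyway.
            response = client.chat.completions.create(
                model="qwen2.5:32b", messages=messages, tools=tools
            )
            print(response.choices[0].message)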

Thanks for your help!

Relevant log output

curl -N -X POST http://localhost:11434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
        "model": "qwen2.5:32b",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": true
     }'
...
data: {"id":"chatcmpl-634","object":"chat.completion.chunk","created":1739506822,"model":"qwen2.5:32b","system_fingerprint":"fp_ollama","choices":[{"index":0,"delta":{"role":"assistant","content":"."},"finish_reason":null}]}

data: {"id":"chatcmpl-634","object":"chat.completion.chunk","created":1739506823,"model":"qwen2.5:32b","system_fingerprint":"fp_ollama","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}

data: [DONE]

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.10

GiteaMirror added the bug label 2026-04-12 17:15:15 -05:00

@LeisureLinux commented on GitHub (Feb 14, 2025):

https://github.com/ollama/ollama/issues/8517


@nonsleepr commented on GitHub (Feb 28, 2025):

I have a similar issue hitting /v1/chat/completions with this payload:

{
  "model": "llama3.2:latest",
  "stream": true,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "tool",
      "content": "Write a short poem."
    }
  ],
  "tools": [{}]
}

It doesn't stream if the role is "tool".


@huangdihd commented on GitHub (Mar 8, 2025):

> I have a similar issue hitting /v1/chat/completions with this payload:
>
>     {
>       "model": "llama3.2:latest",
>       "stream": true,
>       "messages": [
>         {
>           "role": "system",
>           "content": "You are a helpful assistant."
>         },
>         {
>           "role": "tool",
>           "content": "Write a short poem."
>         }
>       ],
>       "tools": [{}]
>     }
>
> It doesn't stream if the role is "tool".

Role "tool" means this message is a response to a tool_call.
It requires parameters named "tool_call_id" and "name".
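
For illustration, a well-formed tool-result exchange pairs the "tool" message with a preceding assistant tool_call via "tool_call_id" (the function name, call id, and arguments below are made up):

    {
      "model": "llama3.2:latest",
      "stream": true,
      "messages": [
        {"role": "user", "content": "What is the weather in Paris?"},
        {
          "role": "assistant",
          "tool_calls": [
            {
              "id": "call_abc123",
              "type": "function",
              "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}
            }
          ]
        },
        {
          "role": "tool",
          "tool_call_id": "call_abc123",
          "name": "get_weather",
          "content": "Sunny, 21°C"
        }
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }
      ]
    }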
