[GH-ISSUE #9084] Enabling tools breaks stream=True on /v1 endpoint, only returns a single complete response #52422

Closed
opened 2026-04-28 23:12:15 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @huangdihd on GitHub (Feb 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9084

Issue Description

When enabling tools, the Ollama API seems to break streaming (stream=True) on the /v1 endpoint. Instead of returning chunks of data progressively, it waits and sends the entire response as a single block.

Steps to Reproduce

  1. Run the following Python script using the OpenAI-compatible API (via the /v1 endpoint):

    import openai
    
    client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    
    response = client.chat.completions.create(
        model="qwen2.5:32b",
        messages=[{"role": "user", "content": "Tell me a story"}],
        stream=True,
        tools=[{"type": "function", "function": {"name": "test_tool", "description": "Test tool"}}]
    )
    
    for chunk in response:
        print(chunk)
    
  2. Observed behavior:

    • Without tools, stream=True correctly streams chunks of choices[0].delta.content.
    • With tools enabled, the whole response arrives as a single complete block.

Expected Behavior

Even when using tools, Ollama should support proper streaming behavior and return incremental chunks instead of buffering the full response.
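One way to make "single block" versus "incremental chunks" measurable is to record when each non-empty content chunk arrives. The sketch below is only an illustration: `summarize_stream` is a hypothetical helper name, and it assumes chunks shaped like the OpenAI client's streaming objects (`choices[0].delta.content`).

```python
import time
from typing import Any, Callable, Iterable, List, Tuple


def summarize_stream(
    chunks: Iterable[Any],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, List[float]]:
    """Return (count of non-empty content chunks, arrival offsets in seconds).

    Hypothetical helper: it only reads chunk.choices[0].delta.content and
    ignores chunks that carry no text (for example role-only or final chunks).
    """
    start = clock()
    offsets: List[float] = []
    for chunk in chunks:
        choices = getattr(chunk, "choices", None) or []
        if not choices:
            continue
        delta = getattr(choices[0], "delta", None)
        content = getattr(delta, "content", None)
        if content:
            # Record how long after the start this piece of text arrived.
            offsets.append(clock() - start)
    return len(offsets), offsets
```

Passing the `response` iterator from the script above to `summarize_stream` should give many offsets spread over the generation time when streaming works, and a single offset near the end when the response is buffered.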

Environment

  • Ollama Version: 0.5.10
  • Operating System: Linux
  • API Method: OpenAI-compatible API (/v1 endpoint) / requests / Ollama Python SDK
  • Model Tested: qwen2.5:32b

Additional Information

When testing with curl against the /v1 endpoint (without tools in the request), streaming works correctly:

curl -N -X POST http://localhost:11434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
        "model": "qwen2.5:32b",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": true
     }'

However, when using Python and enabling tools, streaming does not work as expected.
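Because the curl request omits tools while the Python script includes them, the two tests differ in more than the client. A rough sketch of a like-for-like check is shown here using only the standard library; `build_request` and `stream_lines` are hypothetical helper names, and the payload simply reuses the model, message, and tool definition from this report.

```python
import json
import urllib.request
from typing import Iterator

# Endpoint taken from the curl example in this report.
CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_request(with_tools: bool) -> urllib.request.Request:
    """Build the same streaming request with or without a tools list."""
    payload = {
        "model": "qwen2.5:32b",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
    }
    if with_tools:
        # Same minimal tool definition as in the Python reproduction script.
        payload["tools"] = [
            {"type": "function",
             "function": {"name": "test_tool", "description": "Test tool"}}
        ]
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_lines(request: urllib.request.Request) -> Iterator[str]:
    """Yield non-empty response lines as they arrive (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        for raw in response:
            line = raw.decode("utf-8").strip()
            if line:
                yield line
```

Iterating over `stream_lines(build_request(False))` and `stream_lines(build_request(True))` against a local server would show whether the difference comes from the tools field itself rather than from the Python client.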

Question

  • Is this an intentional limitation when using tools with /v1 endpoints?
  • If so, is there a workaround to allow streaming while still using tools?

Thanks for your help!

Relevant log output

curl -N -X POST http://localhost:11434/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
        "model": "qwen2.5:32b",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": true
     }'
...
data: {"id":"chatcmpl-634","object":"chat.completion.chunk","created":1739506822,"model":"qwen2.5:32b","system_fingerprint":"fp_ollama","choices":[{"index":0,"delta":{"role":"assistant","content":"."},"finish_reason":null}]}

data: {"id":"chatcmpl-634","object":"chat.completion.chunk","created":1739506823,"model":"qwen2.5:32b","system_fingerprint":"fp_ollama","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}

data: [DONE]
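The log above is a server-sent event stream: each `data:` line carries one JSON chunk, and `data: [DONE]` ends the stream. As a minimal sketch (the function name `iter_delta_content` is made up for illustration), the text deltas can be pulled out of such lines like this:

```python
import json
from typing import Iterable, Iterator


def iter_delta_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty delta.content string from 'data:' lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and anything that is not an event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel seen in the log
        event = json.loads(data)
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content
```

Counting how many strings this yields, and when, gives a client-independent view of whether the server is sending incremental chunks.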

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.10

GiteaMirror added the bug label 2026-04-28 23:12:15 -05:00
Author
Owner

@tonylampada commented on GitHub (Mar 9, 2025):

@huangdihd do you know if this fix is available in any recent docker image?

Author
Owner

@huangdihd commented on GitHub (Mar 28, 2025):

> @huangdihd do you know if this fix is available in any recent docker image?

I don't know. I have never tried it.

Author
Owner

@anyon17 commented on GitHub (May 10, 2025):

Is this resolved? If not, why is the issue closed?

Author
Owner

@sarmadgulzar commented on GitHub (May 17, 2025):

Facing the same issue:

from openai import AsyncOpenAI
from agents import Agent, Runner, OpenAIChatCompletionsModel, set_tracing_disabled, function_tool
import asyncio
from openai.types.responses import ResponseTextDeltaEvent

ollama_client = AsyncOpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

set_tracing_disabled(True)

@function_tool
def get_current_weather(location: str):
    """Gets the current weather for a location."""
    if "tokyo" in location.lower():
        return "Sunny, 25°C"
    elif "london" in location.lower():
        return "Cloudy, 15°C"
    elif "lahore" in location.lower():
        return "Clear, 41°C"
    else:
        return "Weather data not available for this location."

async def main():
    local_llm_model = OpenAIChatCompletionsModel(model="qwen3:4b", openai_client=ollama_client)

    weather_agent = Agent(
        name="WeatherReporter",
        instructions="You are a helpful assistant that provides weather information.",
        model=local_llm_model,
        # tools=[get_current_weather],
    )

    query = "What's the weather like in Lahore?"
    print(f"User: {query}")

    try:
        # Stream the agent's response
        async for event in Runner.run_streamed(starting_agent=weather_agent, input=query).stream_events():
            if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
                print(event.data.delta, end="", flush=True)
        print()

    except Exception as e:
        print(f"An error occurred while running the agent: {e}")
        import traceback
        traceback.print_exc()

    finally:
        await ollama_client.close()

if __name__ == "__main__":
    asyncio.run(main())

When I uncomment the tools=[get_current_weather] line, the response stops streaming and I only get the whole thing at the end.

Author
Owner

@huangdihd commented on GitHub (May 18, 2025):

> Is this resolved ? If not why the issue is closed ?

Sorry, that was my mistake. I closed this issue by accident and have opened a new one: #9092

Reference: github-starred/ollama#52422