[GH-ISSUE #8887] The stream mode doesn't work with Function Calling #5762

Closed
opened 2026-04-12 17:05:12 -05:00 by GiteaMirror · 21 comments

Originally created by @dickens88 on GitHub (Feb 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8887

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

Hi,

I'm trying to use Function Calling with stream mode, and I just realized that when the function list is included in the request, the response always uses non-stream mode, even when the input doesn't need a function.

I used the OpenAI SDK to connect to the Ollama API. My testing code looks like this:

final_tool_calls = {}

response = self.openai.chat.completions.create(
    model="qwen2.5:14b",
    # messages must be a list of chat messages, not a bare string
    messages=[{"role": "user", "content": "please write me a 800 words post about AI"}],
    tools=registry.list_functions(),
    temperature=0.2,
    stream=True
)

for chunk in response:
    for tool_call in chunk.choices[0].delta.tool_calls or []:
        index = tool_call.index
        if index not in final_tool_calls:
            final_tool_calls[index] = tool_call
        if self.is_openai_model(self.model):
            # OpenAI models stream the arguments in fragments, so accumulate them
            final_tool_calls[index].function.arguments += tool_call.function.arguments

    if chunk.choices[0].delta.content is not None:
        # wrap the chunk with {"text": ...} and yield it as JSON
        yield json.dumps({"text": chunk.choices[0].delta.content}, ensure_ascii=False)

If I remove `tools=registry.list_functions(),`, the output looks like a stream: there are multiple chunks, each wrapped in JSON.

{"text": "Certainly"}{"text": "!"}{"text": " Here"}{"text": "'s"}{"text": " an"}{"text": " engaging"}{"text": " and"}{"text": " informative"}{"text": " blog"}

But after I add `tools=registry.list_functions(),` back, the output no longer looks like a stream; all the content arrives in a single chunk:

{"text": "Creating an 800-word article on AI is quite extensive for this format, but I can certainly provide you with a detailed ..."}

I'm not sure if the Ollama API really works fine with stream mode.

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7

GiteaMirror added the bug label 2026-04-12 17:05:12 -05:00

@rick-github commented on GitHub (Feb 6, 2025):

https://github.com/ollama/ollama/issues/7886


@ParthSareen commented on GitHub (Feb 6, 2025):

I'm not sure what the .list_functions() does, but assuming it is giving the schema of the tool, if the LLM does not respond with a tool then yes, Ollama will send the content back as a single chunk. See: https://github.com/ollama/ollama/issues/5796#issuecomment-2508764074

The current workaround is to only add tools when you're expecting to use them.
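
To illustrate, here is a minimal sketch of that workaround (not official guidance): the OpenAI SDK pointed at Ollama's OpenAI-compatible endpoint, with a hypothetical `wants_tools()` heuristic deciding per request whether to attach the schemas from `registry.list_functions()`.

```python
from openai import OpenAI

# Sketch only: Ollama's OpenAI-compatible endpoint on localhost is assumed,
# and wants_tools() is a hypothetical predicate you would implement yourself.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def chat(messages, tools, wants_tools):
    kwargs = dict(model="qwen2.5:14b", messages=messages, stream=True)
    if wants_tools(messages):
        # Attach tool schemas only when a tool call is actually expected;
        # otherwise the response keeps streaming token by token.
        kwargs["tools"] = tools
    return client.chat.completions.create(**kwargs)
```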


@ParthSareen commented on GitHub (Feb 6, 2025):

Happy to open this again if needed :)


@dickens88 commented on GitHub (Feb 7, 2025):

@ParthSareen Thank you so much for the reply. If we compare with the stream mode implemented by OpenAI, we can see that even with all the tool schemas added, the reply still streams, which means the response message is split into multiple chunks. With stream mode the frontend feels faster and smoother.

Therefore, I think the current Ollama stream mode does not work in the function-calling scenario. Please consider whether we can reopen the ticket.


@ParthSareen commented on GitHub (Feb 7, 2025):

> @ParthSareen Thank you so much for the reply. If we compare with the stream mode implemented by OpenAI, we can see that even with all the tool schemas added, the reply still streams, which means the response message is split into multiple chunks. With stream mode the frontend feels faster and smoother.
>
> Therefore, I think the current Ollama stream mode does not work in the function-calling scenario. Please consider whether we can reopen the ticket.

If you're using tool calling, it really shouldn't matter what the split chunks are if you're expecting a function call. The current design still returns fully parsed tool calls *as a stream*, which is a better design: you know which tools to call. If each chunk were streamed back, you'd still have to wait until it was recognized as a full tool call.
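
To illustrate the client-side effect of that design (a hedged sketch, not documented Ollama behavior): `response` below is the streaming result of the `create(..., tools=..., stream=True)` call from the original report, and each tool call is assumed to arrive already fully parsed rather than as argument fragments.

```python
# Sketch only: assumes each streamed tool call arrives fully parsed,
# as described above, so no delta re-assembly is needed.
for chunk in response:
    delta = chunk.choices[0].delta
    for tool_call in delta.tool_calls or []:
        # name and arguments should already be complete in this chunk
        print(tool_call.function.name, tool_call.function.arguments)
    if delta.content:
        print(delta.content, end="", flush=True)
```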


@dickens88 commented on GitHub (Feb 8, 2025):

> If you're using tool calling, it really shouldn't matter what the split chunks are if you're expecting a function call. The current design still returns fully parsed tool calls *as a stream*, which is a better design: you know which tools to call. If each chunk were streamed back, you'd still have to wait until it was recognized as a full tool call.

You are right, but tool calling is not only about calling tools; the function result also has to go back into the context for the next round of chat. Right now both the tool call and the final chat are in non-stream mode. The [OpenAI streaming function calling](https://platform.openai.com/docs/guides/function-calling?lang=curl&strict-mode=enabled) docs show how to collect a function call in stream mode, and the SDK code looks simple:

final_tool_calls = {}

# `stream` is the response from chat.completions.create(..., stream=True)
for chunk in stream:
    for tool_call in chunk.choices[0].delta.tool_calls or []:
        index = tool_call.index

        if index not in final_tool_calls:
            final_tool_calls[index] = tool_call

        final_tool_calls[index].function.arguments += tool_call.function.arguments

At the same time, the final answer, which includes the result of the function call, can still be output in stream mode. In short, OpenAI's API fully supports streaming both for function calling and for the follow-up chat with the function result. It uses the same API and lets the model decide when to call a function and when to return text.
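
To make that round trip concrete, here is a hedged sketch against the standard OpenAI chat-completions API (nothing Ollama-specific is assumed beyond the endpoint): execute the tool calls assembled above, append their results to the conversation, and stream the final answer. `run_tool()` is a hypothetical dispatcher.

```python
import json

# Sketch only: final_tool_calls is the dict assembled from the stream above,
# and run_tool() is a hypothetical function-name -> result dispatcher.
def finish_after_tools(client, model, messages, final_tool_calls, run_tool):
    # Echo the assistant's tool calls back into the conversation...
    messages.append({
        "role": "assistant",
        "tool_calls": [
            {
                "id": tc.id,
                "type": "function",
                "function": {"name": tc.function.name,
                             "arguments": tc.function.arguments},
            }
            for tc in final_tool_calls.values()
        ],
    })
    # ...then add one "tool" message per executed result.
    for tc in final_tool_calls.values():
        result = run_tool(tc.function.name, json.loads(tc.function.arguments or "{}"))
        messages.append({"role": "tool", "tool_call_id": tc.id,
                         "content": json.dumps(result)})

    # Second request: the tool-informed final answer can stream normally.
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            yield content
```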


@UlrikWKoren commented on GitHub (Mar 12, 2025):

Please FIX This, we need to be able to stream while we have given tools. Please!


@jbcallaghan commented on GitHub (Mar 22, 2025):

I can understand that when a tool is called there wouldn't be much benefit to streaming in chunks, but what about when the LLM doesn't require a tool and responds directly? I have never managed to get that part to stream in chunks.


@ParthSareen commented on GitHub (Mar 25, 2025):

Planning to fix this in the coming weeks, folks: we can check whether tool calls are coming down or not and then stream back the result! Sorry for the wait 🙏🏽


@RakeshReddyKondeti commented on GitHub (Mar 26, 2025):

Hi @ParthSareen

I'm working on an application where I'd like users to see responses in real-time, but I also need function calling capabilities. Is there any quick and dirty workaround that might help in the interim?


@ParthSareen commented on GitHub (Mar 26, 2025):

@RakeshReddyKondeti in the meantime you can just have two clients: one streaming without tools passed in, and one with tools for function calling. Let me know how that goes.
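
One way to read that interim suggestion, as a hedged sketch (endpoint and model name are assumptions): a non-streaming request path with tools attached for when a function call is wanted, and a separate streaming path without tools for plain chat.

```python
from openai import OpenAI

# Sketch only: Ollama's OpenAI-compatible endpoint on localhost is assumed.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def tool_request(messages, tools):
    # Tools attached: the reply may contain fully parsed tool calls (no streaming).
    return client.chat.completions.create(
        model="qwen2.5:14b", messages=messages, tools=tools)

def chat_request(messages):
    # No tools attached: the reply streams token by token.
    return client.chat.completions.create(
        model="qwen2.5:14b", messages=messages, stream=True)
```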


@RakeshReddyKondeti commented on GitHub (Mar 26, 2025):

Thanks for the response. I appreciate the suggested workaround, but I think it doesn't quite address my specific use case.

My application requires giving the LLM access to tools, but letting the model itself decide whether to use them for each query. The current behavior forces me to choose between:

  1. Providing tools but losing streaming entirely (even when the LLM chooses not to use tools)
  2. Having streaming but removing the LLM's ability to use tools when needed

The suggested approach of having two separate clients would require me to predict in advance whether the LLM will need tools for a given query, which defeats the purpose of letting the LLM make that decision during inference.

I'll wait for the update (hopefully soon), as this functionality is important for my use case. Thanks for working on this issue.


@NasonZ commented on GitHub (Apr 2, 2025):

> Thanks for the response. I appreciate the suggested workaround, but I think it doesn't quite address my specific use case.
>
> My application requires giving the LLM access to tools, but letting the model itself decide whether to use them for each query. The current behavior forces me to choose between:
>
> 1. Providing tools but losing streaming entirely (even when the LLM chooses not to use tools)
> 2. Having streaming but removing the LLM's ability to use tools when needed
>
> The suggested approach of having two separate clients would require me to predict in advance whether the LLM will need tools for a given query, which defeats the purpose of letting the LLM make that decision during inference.
>
> I'll wait for the update (hopefully soon), as this functionality is important for my use case. Thanks for working on this issue.

I also have the same use case. Commenting to keep an eye out for updates


@smileyboy2019 commented on GitHub (Apr 25, 2025):

@ParthSareen @dickens88 When will this be resolved so that content can stream back when tools are provided?


@ParthSareen commented on GitHub (Apr 25, 2025):

Working on it right now! @smileyboy2019


@danny-avila commented on GitHub (May 7, 2025):

Waiting for this :)


@anyon17 commented on GitHub (May 10, 2025):

Is this issue resolved? With MCP tools this issue is even more important, because MCP servers are pre-registered with the LLM. Right now I am getting a single chunk when tools are added, even if the response does not include any tool call. Can anyone help with this issue?


@ghassenbenghorbal commented on GitHub (May 17, 2025):

Any news?


@ParthSareen commented on GitHub (May 17, 2025):

Almost done folks. Just some last bit of cleanup left. Going to break this massive PR into some smaller chunks.

If you want to try it out: https://github.com/ollama/ollama/pull/10415


@danny-avila commented on GitHub (May 25, 2025):

> Almost done folks. Just some last bit of cleanup left. Going to break this massive PR into some smaller chunks.
>
> If you want to try it out: #10415

Looking forward to it!


@fwq418233640 commented on GitHub (Jun 3, 2025):

> Almost done folks. Just some last bit of cleanup left. Going to break this massive PR into some smaller chunks.
>
> If you want to try it out: #10415

Thank you very much and I am looking forward to the release of this feature!

Reference: github-starred/ollama#5762