[GH-ISSUE #5796] Streaming for tool calls is unsupported #65650

Closed
opened 2026-05-03 22:01:39 -05:00 by GiteaMirror · 42 comments

Originally created by @vertrue on GitHub (Jul 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5796

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

Hi everyone!

I am trying to use tools in requests to llama3-groq-tool-use:70b. Here is a simple Python example using langchain==0.2.9:

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langchain.prompts import (
    ChatPromptTemplate,
    MessagesPlaceholder,
)
from langchain.agents import AgentExecutor, create_openai_tools_agent

@tool
def function_1(a: int, b: int) -> int:
    """uses function function_1 for arguments a and b."""
    return a % b + 2

@tool
def function_2(a: int, b: int) -> int:
    """uses function function_2 for arguments a and b."""
    return a * b + 1

tools = [function_1, function_2]


llm = ChatOpenAI(
    model="llama3-groq-tool-use:70b",
    temperature=0,
)

default_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful AI assistant."),
            MessagesPlaceholder("chat_history", optional=True),
            ("human", "{input}"),
            MessagesPlaceholder("agent_scratchpad"),
        ]
    )

input_message = "What is function_1(10, 11)? Also what is function_2(10, 11)?"

agent = create_openai_tools_agent(
    llm=llm,
    tools=tools,
    prompt=default_prompt
)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=True,
    return_intermediate_steps=False,
)

res = agent_executor.invoke({"input": input_message})

print(res)

The result is the following:

> Entering new AgentExecutor chain...
<tool_call>
{"id": 0, "name": "function_1", "arguments": {"a": 10, "b": 11}}
</tool_call>
<tool_call>
{"id": 1, "name": "function_2", "arguments": {"a": 10, "b": 11}}
</tool_call>

> Finished chain.
{
    'input': 'What is function_1(10, 11)? Also what is function_2(10, 11)?',
    'output': '<tool_call>\n{"id": 0, "name": "function_1", "arguments": {"a": 10, "b": 11}}\n</tool_call>\n<tool_call>\n{"id": 1, "name": "function_2", "arguments": {"a": 10, "b": 11}}\n</tool_call>'
}

If I use langchain_community.chat_models.ollama.ChatOllama, the output is the same.

But if I use the same model (llama3-groq-70b-8192-tool-use-preview) through the Groq OpenAI-compatible API, it uses the tools and invokes the functions; output below:

> Entering new AgentExecutor chain...

Invoking: `function_1` with `{'a': 10, 'b': 11}`


12
Invoking: `function_2` with `{'a': 10, 'b': 11}`


111The result of function_1(10, 11) is 12, and the result of function_2(10, 11) is 111.

> Finished chain.
{
    'input': 'What is function_1(10, 11)? Also what is function_2(10, 11)?',
    'output': 'The result of function_1(10, 11) is 12, and the result of function_2(10, 11) is 111.'
}

Is this expected behaviour, or is this problem still being worked on?
Many thanks

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.2.7

GiteaMirror added the bug, api labels 2026-05-03 22:01:39 -05:00

@rick-github commented on GitHub (Jul 19, 2024):

The dedicated tool handling is a recent addition to ollama so probably needs some tweaking. Looking at your logs, it would seem that what ollama is returning is not what langchain is expecting, so some digging through the code on both sides would be needed to match them up.


@marcnnn commented on GitHub (Jul 19, 2024):

I found out that Ollama sends "stop" and not "finish_reason": "tool_calls"
like the Groq API that I tested it against.

I was using langchain in Elixir; deleting "finish_reason" => "tool_calls"
from the pattern match helped.

Ollama's OpenAI API should answer with tool_calls as the finish reason as well.

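For reference, the OpenAI streaming spec ends a tool-call turn with a final chunk shaped roughly like this (a sketch; the id and model values are illustrative, not from an actual run):

{
    "id": "chatcmpl-123",
    "object": "chat.completion.chunk",
    "model": "llama3-groq-70b-8192-tool-use-preview",
    "choices": [
        {
            "index": 0,
            "delta": {},
            "finish_reason": "tool_calls"
        }
    ]
}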

@vertrue commented on GitHub (Jul 20, 2024):

Nice! Hope this will get fixed soon


@KSemenenko commented on GitHub (Jul 20, 2024):

> Nice! Hope this will get fixed soon

me too!


@vertrue commented on GitHub (Jul 23, 2024):

@rick-github hi! Are you in contact with someone who can fix this issue or review the current PR?
This bug seems critical for langchain... or any other framework that can use tools.


@rick-github commented on GitHub (Jul 23, 2024):

Sorry, I'm not a member of the ollama team. I see that you've tagged Jeffrey; you'll have to wait until he or somebody with review powers takes a look. In the meantime you'll have to build locally.


@KSemenenko commented on GitHub (Jul 23, 2024):

Llama 3.1 is here, function calling is here - now this is a super important fix.


@rick-github commented on GitHub (Jul 23, 2024):

The current version of llama3.1 doesn't support tools, https://github.com/ollama/ollama/issues/5885


@vertrue commented on GitHub (Jul 23, 2024):

I dug deeper.

When the agent is executed in langchain, here are the outputs right before the function calls:

groq:

[llm/end] [chain:AgentExecutor > chain:RunnableSequence > llm:ChatOpenAI] [696ms] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "",
        "generation_info": {
          "finish_reason": "tool_calls",
          "model_name": "llama3-groq-70b-8192-tool-use-preview",
          "system_fingerprint": "fp_ee4b521143"
        },
        "type": "ChatGenerationChunk",
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
            "schema",
            "messages",
            "AIMessageChunk"
          ],
          "kwargs": {
            "content": "",
            "additional_kwargs": {
              "tool_calls": [
                {
                  "index": 0,
                  "id": "call_wqy8",
                  "function": {
                    "arguments": "{\"a\": 10, \"b\": 11}",
                    "name": "function_1"
                  },
                  "type": "function"
                },
                {
                  "index": 1,
                  "id": "call_my9j",
                  "function": {
                    "arguments": "{\"a\": 10, \"b\": 11}",
                    "name": "function_1"
                  },
                  "type": "function"
                }
              ]
            },
            "response_metadata": {
              "finish_reason": "tool_calls",
              "model_name": "llama3-groq-70b-8192-tool-use-preview",
              "system_fingerprint": "fp_ee4b521143"
            },
            "type": "AIMessageChunk",
            "id": "run-1527b935-91a4-417f-93fa-696d3f184c08",
            "tool_calls": [
              {
                "name": "function_1",
                "args": {
                  "a": 10,
                  "b": 11
                },
                "id": "call_wqy8",
                "type": "tool_call"
              },
              {
                "name": "function_1",
                "args": {
                  "a": 10,
                  "b": 11
                },
                "id": "call_my9j",
                "type": "tool_call"
              }
            ],
            "tool_call_chunks": [
              {
                "name": "function_1",
                "args": "{\"a\": 10, \"b\": 11}",
                "id": "call_wqy8",
                "index": 0,
                "type": "tool_call_chunk"
              },
              {
                "name": "function_1",
                "args": "{\"a\": 10, \"b\": 11}",
                "id": "call_my9j",
                "index": 1,
                "type": "tool_call_chunk"
              }
            ],
            "invalid_tool_calls": []
          }
        }
      }
    ]
  ],
  "llm_output": null,
  "run": null
}

ollama:

[llm/end] [chain:AgentExecutor > chain:RunnableSequence > llm:ChatOpenAI] [3.27s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "<tool_call>\n{\"id\": 0, \"name\": \"function_1\", \"arguments\": {\"a\": 10, \"b\": 11}}\n</tool_call>",
        "generation_info": {
          "finish_reason": "stop",
          "model_name": "llama3-groq-tool-use:8b",
          "system_fingerprint": "fp_ollama"
        },
        "type": "ChatGenerationChunk",
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
            "schema",
            "messages",
            "AIMessageChunk"
          ],
          "kwargs": {
            "content": "<tool_call>\n{\"id\": 0, \"name\": \"function_1\", \"arguments\": {\"a\": 10, \"b\": 11}}\n</tool_call>",
            "response_metadata": {
              "finish_reason": "stop",
              "model_name": "llama3-groq-tool-use:8b",
              "system_fingerprint": "fp_ollama"
            },
            "type": "AIMessageChunk",
            "id": "run-2099882f-1760-4d5e-9027-04f67a656a0a",
            "tool_calls": [],
            "invalid_tool_calls": []
          }
        }
      }
    ]
  ],
  "llm_output": null,
  "run": null
}

now I am not sure if it is a bug :)


@vertrue commented on GitHub (Jul 24, 2024):

I found out that ollama is not parsing tools if req.Stream = true.

I also found that ChatRequest.Stream defaults to true here:
a6cd8f6169/api/types.go (L93)

So if you are calling /v1/chat/completions, it just does not parse tools and returns a text response with the tool call in tags:
a6cd8f6169/server/routes.go (L1372)

Changing
if req.Stream != nil && !*req.Stream
to
if req.Stream != nil && *req.Stream
still gives an answer without tools.

Investigating further to see what langchain is looking for in the response, because fmt.Print(len(resp.Message.ToolCalls)) right after this line
a6cd8f6169/server/routes.go (L1401)
prints 1 (not 0) to the console.

To me it looks like api.ChatResponse should also have a ToolCalls field.

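For comparison, a minimal sketch of the non-streaming workaround that the thread converges on (assumes the default local endpoint http://localhost:11434 and reuses the function_1 tool schema from the request in the next comment; with "stream": false the request goes through the tool-parsing path):

import json
import urllib.request

# Hypothetical repro: non-streaming request to the OpenAI-compatible endpoint,
# which should return parsed tool_calls instead of <tool_call> text.
payload = {
    "model": "llama3-groq-tool-use:8b",
    "messages": [
        {"role": "user", "content": "What is function_1(10, 11)? use provided tools"}
    ],
    "stream": False,
    "tools": [{
        "type": "function",
        "function": {
            "name": "function_1",
            "description": "uses function function_1 for arguments a and b.",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
                "required": ["a", "b"],
            },
        },
    }],
}

req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    choice = json.load(resp)["choices"][0]
    print(choice["finish_reason"])            # "tool_calls" once parsing works; earlier builds reported "stop"
    print(choice["message"].get("tool_calls"))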

@KSemenenko commented on GitHub (Jul 24, 2024):

I found that Mistral 7B also supports tooling, let's check it! Maybe the Groq model is broken.


@vertrue commented on GitHub (Jul 24, 2024):

langchain works with chunks, and ollama does not return any chunk that includes tools.

Here is the request to /v1/chat/completions:

{
  "messages": [
    {
      "content": "You are a helpful AI assistant that can use tools.",
      "role": "system"
    },
    {
      "content": "What is function_1(10, 11)? Also what is function_1(11, 12)? use provided tools",
      "role": "user"
    }
  ],
  "model": "llama3-groq-tool-use:8b",
  "logprobs": false,
  "n": 1,
  "stream": true,
  "temperature": 0,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "function_1",
        "description": "uses function function_1 for arguments a and b.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": {
              "type": "integer"
            },
            "b": {
              "type": "integer"
            }
          },
          "required": [
            "a",
            "b"
          ]
        }
      }
    }
  ]
}

Here is the output of the last chunk:

{
    "id": "chatcmpl-934",
    "object": "chat.completion.chunk",
    "created": 1721826100,
    "model": "llama3-groq-tool-use:8b",
    "system_fingerprint": "fp_ollama",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "content": ""
            },
            "finish_reason": "stop"
        }
    ]
}

@rick-github commented on GitHub (Jul 24, 2024):

Looks like you pasted the last chunk instead of the request.


@vertrue commented on GitHub (Jul 24, 2024):

@rick-github fixed


@vertrue commented on GitHub (Jul 24, 2024):

@KSemenenko fixed, I believe!
You can pull my branch if it's urgent.


@vertrue commented on GitHub (Aug 8, 2024):

Still waiting for the PR :(


@vertrue commented on GitHub (Aug 21, 2024):

Interesting update:

The following code calls tools without any problems. I'm not sure if langchain-ollama uses stream=true.
ollama 0.3.6, langchain-ollama 0.1.1

from langchain_ollama import ChatOllama

from langchain.agents import (
    AgentExecutor,
    create_tool_calling_agent,
)

from langchain.globals import set_debug

from langchain_core.tools import tool
from langchain_core.prompts import ChatPromptTemplate


@tool("function_1")
def function_1(a: int, b: int) -> int:
    """uses function function_1 for arguments a and b."""
    return a * b


@tool("function_2")
def function_2(a: int, b: int) -> int:
    """uses function function_2 for arguments a and b."""
    return a // b

set_debug(True)

llm = ChatOllama(
    model='llama3.1:70b',
    temperature=0,
    base_url="http://localhost:11434",  # placeholder: the original value was omitted
)

tools = [function_1, function_2]

prompt = ChatPromptTemplate.from_messages([
  ("system", "You are a helpful assistant."),
  ("placeholder", "{chat_history}"),
  ("human", "{input}"),
  ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(
    llm=llm,
    tools=tools,
    prompt=prompt,
)

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    return_intermediate_steps=True,
    verbose=True,
)

agent_executor.invoke({"input": "What is function_1(10, 11)? Also, what is function_2(10, 11)?"})

@raducoravu commented on GitHub (Nov 13, 2024):

Any idea when this will get fixed in an official ollama release? It affects me too.


@Mhijazi16 commented on GitHub (Nov 15, 2024):

Any new updates? I can't get work done because of this issue.


@ParthSareen commented on GitHub (Nov 19, 2024):

Hi everyone! Thanks for being patient!

I'd love to understand the use case for streamed tool calls. I'd appreciate it if you could attach code samples as well (any framework/usage).

For any tool to be called you'd need the full response from the model to decide which function to call and with what parameters. And since these are not user-facing, one usually just waits for the response from the model to complete.

If this is a more framework-enabled concept I am a bit wary of adding it as core functionality to Ollama - but happy to reconsider.


@codefromthecrypt commented on GitHub (Nov 19, 2024):

@ParthSareen tool calls are the main way to integrate data besides RAG (feel free to argue). Streaming is currently in use by tools like kibana and many demos that render UIs, and core kibana functionality requires tool usage to integrate data.

A good start would be to fully support streaming options. We (elastic) raised a pull request on that recently, and afterwards could consider helping on tool calls. As you can imagine, maintaining diffs is its own task, so landing one thing before another is important https://github.com/ollama/ollama/pull/6784


@raducoravu commented on GitHub (Nov 19, 2024):

@ParthSareen in my case I have a GUI chat view implemented in Java. When chatting with the AI engine, the engine has various tools at its disposal that it can (if necessary) invoke from the client side. All interactions with the AI engine pass the "stream":true property so that the end user receives the final answer gradually as it is generated.
When working with an OpenAI server directly, the tool calls received from the server side are indeed chunked like this:

    data: {"id":"chatcmpl-AVBUA1EJdFS0SFdzMJWdsWoLw6kIm","object":"chat.completion.chunk","created":1731995682,"model":"gpt-4o-2024-05-13","system_fingerprint":"fp_b0dd3c3254","choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_ETUPF8laB8dVR8a6AwF2gt7W","type":"function","function":{"name":"retrieve_all_action_ids","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}
    data: {"id":"chatcmpl-AVBUA1EJdFS0SFdzMJWdsWoLw6kIm","object":"chat.completion.chunk","created":1731995682,"model":"gpt-4o-2024-05-13","system_fingerprint":"fp_b0dd3c3254","choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{}"}}]},"logprobs":null,"finish_reason":null}]}
    data: {"id":"chatcmpl-AVBUA1EJdFS0SFdzMJWdsWoLw6kIm","object":"chat.completion.chunk","created":1731995682,"model":"gpt-4o-2024-05-13","system_fingerprint":"fp_b0dd3c3254","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

It would not bother me whether the tool calls from the server side are received in one chunk or in three.


@ParthSareen commented on GitHub (Nov 19, 2024):

@raducoravu @codefromthecrypt

Thanks for the quick replies! Will dig into this a bit more and hopefully provide some clarity.


@tzolov commented on GitHub (Nov 19, 2024):

@ParthSareen,
Based on our Spring AI (https://docs.spring.io/spring-ai/reference/api/functions.html) implementation experience with various AI providers:
Because the tool_call messages require complete JSON content before processing, we pre-aggregate only the tool_call chunks into single messages, while keeping regular text responses streaming.
This satisfies both requirements - complete tool calls and streamed final responses.
It would have been nice if the providers did the JSON aggregation on the server/model side.

Note: Initially, we used to switch to non-streaming after detecting the first tool calls message, but user feedback favored keeping the final text responses streamed.

Ollama is an amazing tool! Looking forward to extending our Ollama Function Calling (https://docs.spring.io/spring-ai/reference/api/chat/ollama-chat.html#_function_calling) support with streaming when it's ready!

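A minimal Python sketch of that pre-aggregation strategy, against OpenAI-style chunks (the chunk shape follows the streaming format quoted earlier in this thread; consume_stream is a hypothetical helper):

# Stream text deltas straight through, but buffer tool-call fragments keyed by
# their "index" until the stream ends, then hand back complete calls.
def consume_stream(chunks):
    tool_calls = {}  # index -> {"id", "name", "arguments"} accumulated across chunks
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        content = delta.get("content")
        if content:
            print(content, end="", flush=True)  # regular text keeps streaming
        for tc in delta.get("tool_calls") or []:
            buf = tool_calls.setdefault(
                tc["index"], {"id": None, "name": None, "arguments": ""}
            )
            if tc.get("id"):
                buf["id"] = tc["id"]
            fn = tc.get("function") or {}
            if fn.get("name"):
                buf["name"] = fn["name"]
            buf["arguments"] += fn.get("arguments") or ""
    # Each buffered call now holds complete JSON in "arguments", ready to parse.
    return [tool_calls[i] for i in sorted(tool_calls)]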

@edmcman commented on GitHub (Nov 19, 2024):

My situation is very similar to @raducoravu's. I have a chat and want to stream the chat results. I don't care about the tool results streaming. (It's hard to imagine an application where streaming the tool results is important...)


@codefromthecrypt commented on GitHub (Nov 20, 2024):

PSA: not everyone has the influence or skill to hunt down the call sites that use a particular API. Sometimes they are buried in frameworks or otherwise not easy to change. In any case, it costs a significant amount of downstream effort to discover this limitation and then possibly make comments, as we've seen. If we look at this issue, we can see a myriad of projects linking problems found to it.

What I'm curious about relates to other parts of the OpenAI API. When there is incentive (I think we can agree there is incentive here, even if there are some arguments about the practice)... is it possible for someone to raise a PR and complete a change?

More and more products are normalizing on OpenAI as a portability layer, and that doesn't mean each agrees with all the API decisions. I guess what I mean to say is: how much stake is there in not completing this, or in allowing it to be completed by someone else?


@ParthSareen commented on GitHub (Nov 20, 2024):

Hey everyone!

Thank you for raising some great points - we'll be working over the next little while to get this in!

Still figuring out the exact details, as it could potentially break some experiences. But this is definitely high on my list - thankful to you all for bringing it up. It's all about making the experience better for you all while making good engineering decisions.


@jackmpcollins commented on GitHub (Nov 20, 2024):

> For any tool to be called you'd need the full response from the model to make the decision for which function to call and with what parameters. And since these are not user facing, one usually just waits for the response from the model to complete.

@ParthSareen I have a use case that would benefit from streaming the tool call arguments, like OpenAI does. In https://github.com/jackmpcollins/magentic, tool calling is used to generate structured outputs. When an iterable of structured objects X is requested, under the hood magentic submits a tool with return type list[X], and as the arguments are streamed back, each item is parsed out and yielded when it completes. The advantage of this approach is that structured items can start being displayed in the UI (or acted on in other ways) without waiting for the whole generation to finish.

Some details about this are in the docs here https://magentic.dev/streaming/#object-streaming with example code:

from collections.abc import Iterable
from time import time

from magentic import prompt
from pydantic import BaseModel


class Superhero(BaseModel):
    name: str
    age: int
    power: str
    enemies: list[str]


@prompt("Create a Superhero team named {name}.")
def create_superhero_team(name: str) -> Iterable[Superhero]: ...


start_time = time()
for hero in create_superhero_team("The Food Dudes"):
    print(f"{time() - start_time:.2f}s : {hero}")

# 2.23s : name='Pizza Man' age=30 power='Can shoot pizza slices from his hands' enemies=['The Hungry Horde', 'The Junk Food Gang']
# 4.03s : name='Captain Carrot' age=35 power='Super strength and agility from eating carrots' enemies=['The Sugar Squad', 'The Greasy Gang']
# 6.05s : name='Ice Cream Girl' age=25 power='Can create ice cream out of thin air' enemies=['The Hot Sauce Squad', 'The Healthy Eaters']

A similar approach is used in https://github.com/jxnl/instructor for the "partial responses" feature. More details in the docs here https://python.useinstructor.com/concepts/partial/


@lucaskatayama commented on GitHub (Nov 21, 2024):

Hey guys... sorry, I didn't read the entire thread, but I think I am in the right one.

I am trying to get langchain to receive chunks when using agents... basically I need ollama to accept streaming when using tools. I achieved that with the changes below:

  1. Modify the Modelfile to prefix the tool response:

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the exactly format below:
\f{"name": function name, "parameters": dictionary of argument name and its value}

Do not use variables
Do NOT forget the \f at the beginning.

  2. Modify the ollama server to identify the prefix \f and join all chunks into a tool message response, sending it through the stream:
    https://github.com/ollama/ollama/compare/main...lucaskatayama:ollama:feat/tool-stream?expand=1

  3. Modify langchain_ollama: change stream=False to stream=True

 if "tools" in kwargs:
            async for part in await self._async_client.chat( <<<<<<<<
                model=params["model"],
                messages=ollama_messages,
                stream=True, <<<<<<< 
                options=Options(**params["options"]),
                keep_alive=params["keep_alive"],
                format=params["format"],
                tools=kwargs["tools"],
            ) :
                yield part # type:ignore
if "tools" in kwargs:
            yield from self._client.chat( <<<<<<<<<<<<<
                model=params["model"],
                messages=ollama_messages,
                stream=True, <<<<<<<<<<<<<
                options=Options(**params["options"]),
                keep_alive=params["keep_alive"],
                format=params["format"],
                tools=kwargs["tools"],
            )

I am contributing an idea. I don't know if this is the right way.


@edmcman commented on GitHub (Nov 21, 2024):

It would be best to not require changing the modelfiles...


@ParthSareen commented on GitHub (Nov 28, 2024):

Hey everyone! Thanks for being so patient :) New release just went out with streaming tool call support. Will ping some folks around the community so they don't have to work around it.

Appreciate all the insight for this issue! https://github.com/ollama/ollama/releases/tag/v0.4.6

Quick Notes:

  • Each chunk returned to the user will contain a tool call (if any)
  • Multiple tool calls can be returned in a streamed manner
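A quick sketch of consuming the new behaviour with the ollama Python client (the model name is a placeholder, the tool schema mirrors the curl example later in this thread, and the exact chunk types vary between client versions, so treat this as illustrative):

import ollama

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

for chunk in ollama.chat(
    model="qwen2.5",  # placeholder: any tool-capable model
    messages=[{"role": "user", "content": "What is the weather in Toronto?"}],
    tools=[weather_tool],
    stream=True,
):
    message = chunk["message"]
    if message.get("tool_calls"):  # per the notes above, a chunk may carry a tool call
        for call in message["tool_calls"]:
            print("tool call:", call["function"]["name"], call["function"]["arguments"])
    elif message.get("content"):
        print(message["content"], end="", flush=True)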

@edmcman commented on GitHub (Nov 28, 2024):

Thank you!


@jackmpcollins commented on GitHub (Nov 29, 2024):

@ParthSareen I opened an issue about adding the index to each tool call, as its absence breaks compatibility for some use cases: https://github.com/ollama/ollama/issues/7881. Otherwise it is working well! Thank you


@ParthSareen commented on GitHub (Nov 29, 2024):

> @ParthSareen I opened an issue for adding the index to each tool call as this breaks compatibility for some use cases. https://github.com/ollama/ollama/issues/7881 Otherwise it is working well! Thank you

Ahh dang must have missed that field! Will add in the AM thanks for the ping!


@Rizaldy commented on GitHub (Nov 29, 2024):

curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
  "model": "qwen2.5:latest",
  "messages": [
    {
      "role": "user",
      "content": "What is your name? please explain in detail in 2 paragraph"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'

data: {"id":"chatcmpl-635","object":"chat.completion.chunk","created":1732905314,"model":"qwen2.5:latest","system_fingerprint":"fp_ollama","choices":[{"index":0,"delta":{"role":"assistant","content":"I am Qwen, an AI assistant created by Alibaba Cloud. My purpose is to assist and interact with users like you to provide information, answer questions, generate content, and perform various tasks. I don't have a personal identity or name outside of this context, but \"Qwen\" serves as my identifier for the purposes of communication. If you have any specific inquiries or need help with something, feel free to ask!"},"finish_reason":"stop"}]}

data: [DONE]

Hi @ParthSareen, I want to add feedback after updating Ollama to 0.4.6, as shown above:

  1. Add the tool to the payload
  2. Ask something unrelated to the tools
  3. Qwen answers conversationally
  4. The response is not streamed in chunks but arrives as one content block

But if I remove tools from the payload it streams like normal. I just want to know if this is the new implementation or if something is missing?


@ParthSareen commented on GitHub (Nov 30, 2024):

@Rizaldy Yes, in streaming mode this is expected. Essentially, we don't know when a tool call is going to come back from a model. If a tool call is present, the content should be dropped and only the call sent back; if there is no tool call, we should return whatever content the model produced. Hope this helps!
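
For illustration, a client consuming Ollama's OpenAI-compatible endpoint can branch on which field each chunk carries. A minimal sketch using the `openai` Python package, assuming the default local port; the model name and tool schema just mirror the examples in this thread:

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the api_key is required by the
# client library but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{"type": "function", "function": {
    "name": "get_current_weather",
    "description": "Get the current weather for a location",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]}}}]

stream = client.chat.completions.create(
    model="qwen2.5:latest",
    messages=[{"role": "user", "content": "What is your name?"}],
    tools=tools,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.tool_calls:
        # Tool-call path: content is suppressed, only the call comes back.
        for call in delta.tool_calls:
            print("tool call:", call.function.name, call.function.arguments)
    elif delta.content:
        # Plain-content path: with tools attached this currently arrives
        # as one large chunk rather than token by token.
        print(delta.content, end="", flush=True)
```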

@RippinRocket commented on GitHub (Nov 30, 2024):

@ParthSareen I'm seeing the same thing using the Python library. If I don't include tools in the payload, the response content streams token by token. If I do include tools, I still get a streamed response, but it contains the full response content in one go instead of token by token.
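
For reference, a minimal repro sketch of that behavior with the native `ollama` Python package, assuming a recent (0.4+) client where responses are typed objects:

```python
import ollama

tools = [{"type": "function", "function": {
    "name": "get_current_weather",
    "description": "Get the current weather for a location",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]}}}]

# With tools attached the full answer arrives in a single chunk; comment
# out the tools argument and the same prompt streams token by token.
for part in ollama.chat(
    model="qwen2.5:latest",
    messages=[{"role": "user", "content": "What is your name?"}],
    tools=tools,
    stream=True,
):
    print(part.message.content or "", end="", flush=True)
```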

@ParthSareen commented on GitHub (Nov 30, 2024):

Hey @Rizaldy @RippinRocket,

We wanted to get a quick implementation out to unblock people on this. We'll scope in work to eventually detect earlier whether tool calls are coming back, and then stream the rest of the response out (tracking in: https://github.com/ollama/ollama/issues/7886).

In the meantime I'd recommend passing tools in only when needed, and less for general chatting, especially with small models, since they overfit to sending tool calls back anyway. Appreciate y'all raising this!
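
One way to follow that recommendation, as a hedged sketch: gate whether tools are attached on a per-turn basis. The keyword heuristic and names below are made up for illustration; any classifier or routing step would do:

```python
import ollama

WEATHER_TOOLS = [{"type": "function", "function": {
    "name": "get_current_weather",
    "description": "Get the current weather for a location",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]}}}]

def chat_turn(user_text: str):
    # Naive keyword gate; in a real app this could be a classifier or a
    # routing model. Ordinary chat keeps token-by-token streaming because
    # no tools are attached, and small models aren't tempted to emit
    # spurious tool calls.
    wants_tools = any(w in user_text.lower()
                      for w in ("weather", "temperature", "forecast"))
    return ollama.chat(
        model="qwen2.5:latest",
        messages=[{"role": "user", "content": user_text}],
        tools=WEATHER_TOOLS if wants_tools else None,
        stream=True,
    )
```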

@saivishwak commented on GitHub (Feb 28, 2025):

```sh
curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "What is your name? please explain in detail in 2 paragraph"
    }
  ],
  "stream": true,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}'



data: {"id":"chatcmpl-432","object":"chat.completion.chunk","created":1740745072,"model":"llama3.2","system_fingerprint":"fp_ollama","choices":[{"index":0,"delta":{"role":"assistant","content":"","tool_calls":[{"id":"call_zd4aa4tf","index":0,"type":"function","function":{"name":"get_current_weather","arguments":"{\"format\":\"none\",\"location\":\"assistant AI system\"}"}}]},"finish_reason":null}]}

data: {"id":"chatcmpl-432","object":"chat.completion.chunk","created":1740745072,"model":"llama3.2","system_fingerprint":"fp_ollama","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":"stop"}]}

data: [DONE]
```

When using tools with streaming, the response contains a `tool_calls` entry even when the query is not tool related. Is this expected?

@ParthSareen commented on GitHub (Mar 3, 2025):

Hey @saivishwak, this is just model behavior. Smaller models, when provided tools, tend to lean toward making tool calls rather than not. If you're constrained to using small models, I'd recommend adding client-side handling to manage the non-tool responses.
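
For illustration of what that client-side handling could look like: a sketch that validates a returned call against the declared schema and falls back to a tool-free completion when the call is spurious. The function name and fallback policy are assumptions for the example, not part of Ollama's API:

```python
import ollama

WEATHER_TOOLS = [{"type": "function", "function": {
    "name": "get_current_weather",
    "description": "Get the current weather for a location",
    "parameters": {"type": "object", "properties": {
        "location": {"type": "string"},
        "format": {"type": "string", "enum": ["celsius", "fahrenheit"]}},
        "required": ["location", "format"]}}}]

def chat_with_guard(model: str, messages: list) -> str:
    resp = ollama.chat(model=model, messages=messages, tools=WEATHER_TOOLS)
    for call in resp.message.tool_calls or []:
        args = call.function.arguments  # a dict in the ollama package
        # A real client would validate against the full JSON schema; this
        # only checks the enum that the spurious call above violates.
        if args.get("format") not in ("celsius", "fahrenheit"):
            # Spurious call: retry without tools so the model just answers.
            return ollama.chat(model=model, messages=messages).message.content
        # ...otherwise execute the tool and feed the result back...
    return resp.message.content or ""
```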

@edmcman commented on GitHub (Mar 3, 2025):

@ParthSareen wrote:

> Smaller models, when provided tools, tend to lean toward making tool calls rather than not.

This may be true, but it's certainly not the only thing going on here. Ollama is using a poorly performing prompt template. See https://edmcman.github.io/blog/2025-02-21--lang-chain-and-ollama-make-building-local-tool-calling-agents-easy-it-s-a-shame-they-don-t-work-part-2/; I'm very curious to hear your thoughts.

@saivishwak You might want to try https://ollama.com/ejschwar/llama3.2-better-prompts or use a different host than Ollama. You can test Llama 3.2 on groq pretty easily for free. For instance, I found that on Ollama, `llama3.2` regularly responds to "Hello" with a tool call. On groq/llama.cpp, it does not. I believe it all boils down to the prompt template.
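
For anyone who wants to compare templates themselves, the prompt template a local model ships with can be read from Ollama's `/api/show` endpoint (the `ollama show llama3.2 --template` CLI command prints the same thing). A quick sketch:

```python
import json
import urllib.request

# POST /api/show returns model metadata, including the Go prompt template
# the server renders messages and tool definitions into.
req = urllib.request.Request(
    "http://localhost:11434/api/show",
    data=json.dumps({"model": "llama3.2"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["template"])
```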

@ParthSareen commented on GitHub (Mar 3, 2025):

@edmcman Cool work on hacking on the template! Llama3.2 emits tool calls as Python-style function expressions, which we don't parse as of yet. That probably explains some of the difference in behavior.
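
For context on the format in question: Llama 3.2's native zero-shot tool calling emits Python-style expressions such as `[get_current_weather(location="Paris", format="celsius")]` rather than the JSON that Ollama's parser expects. A hedged sketch of parsing that shape with the standard `ast` module (illustrative only, not Ollama's actual parser):

```python
import ast

def parse_pythonic_calls(text: str) -> list[tuple[str, dict]]:
    """Turn '[f(a=1, b="x")]'-style model output into (name, args) pairs."""
    tree = ast.parse(text.strip(), mode="eval")
    calls = []
    for node in tree.body.elts:  # the model emits a Python list of calls
        args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, args))
    return calls

print(parse_pythonic_calls('[get_current_weather(location="Paris", format="celsius")]'))
# [('get_current_weather', {'location': 'Paris', 'format': 'celsius'})]
```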
