[GH-ISSUE #12187] GPT-OSS not completing tool calls #54617

Open
opened 2026-04-29 06:34:01 -05:00 by GiteaMirror · 36 comments

Originally created by @Roberto-Candelario on GitHub (Sep 4, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12187

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

When using Ollama with Open WebUI, the model doesn't complete the tool calls being used. It starts the tool and after a few seconds "completes", ready for a new chat message, even though the model did not do anything.

Relevant log output


OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.11.10

GiteaMirror added the bug label 2026-04-29 06:34:01 -05:00

@rick-github commented on GitHub (Sep 5, 2025):

Can you give an example of a tool? I did a quick test and open-webui had no problems using the weather tool and gpt-oss to tell me what the forecast was.
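
For reference, a minimal sketch of that kind of test with the ollama Python client (the get_weather function here is a hypothetical stand-in, not the exact tool used in the test above):

    from ollama import chat

    def get_weather(city: str) -> str:
        """Get the current weather for a city."""
        return f"Sunny, 22C in {city}"  # stubbed result, enough for a tool-call test

    messages = [{"role": "user", "content": "What is the weather in Berlin?"}]
    # The ollama client accepts plain Python functions as tools.
    response = chat(model="gpt-oss", messages=messages, tools=[get_weather])

    # gpt-oss should answer with a tool call rather than plain text.
    for call in response.message.tool_calls or []:
        print(call.function.name, call.function.arguments)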


@z0rb commented on GitHub (Sep 6, 2025):

I have the same behavior with Open WebUI and the playwright tool: https://github.com/microsoft/playwright-mcp


@lefoulkrod commented on GitHub (Sep 6, 2025):

Maybe related to https://github.com/ollama/ollama/issues/12203 (thinking output being put into tool call results)


@z0rb commented on GitHub (Sep 6, 2025):

This behavior happens when "Function calling" for the model is set to "Native" in Open WebUI.

When function calling is set to "Default", tool calls happen again, but they are unstable: e.g. the browser gets started, but then there is no further navigation to the page.

For any other MCP server and tool I have tried, e.g. DuckDuckGo, the function calls work well with "Native".


@raffaeler commented on GitHub (Sep 16, 2025):

I am also seeing the same thing.
The call is the following:

    response = chat(
        model=model,
        messages=messages,
        tools=[store_document_section],
        options={"max_tokens": 6000, "temperature": 0.8}
    )

If I use llama3.2, the tool is always called. When using gpt-oss the tool is never called.
In this test the prompt is hard-coded and explicitly asks to call the tool.

Ollama version is 0.11.10


@rick-github commented on GitHub (Sep 16, 2025):

Can you provide a full script that demonstrates the problem?


@raffaeler commented on GitHub (Sep 16, 2025):

@rick-github thanks for the prompt response.
I also tried with 0.11.11 a few seconds ago. I am now trying to create the smallest possible repro of the issue.
Does a notebook work as well for you?


@rick-github commented on GitHub (Sep 16, 2025):

> Does a notebook work as well for you?

If that works best for you, sure.


@raffaeler commented on GitHub (Sep 16, 2025):

@rick-github Here is the repro.
Notes:

  • the last cell contains the functions to run with gpt-oss or llama3.2
  • when using gpt-oss it sometimes works. Strangely, when it works, the tool is called only once instead of multiple times as happens with llama3.2.
  • the cell containing the extract_chunks function currently uses a global function. At lines 15-21, you can switch to using an instance function. When you do this, calling the function is more reliable.

Please do run this multiple times by resetting the whole notebook every time.
Thank you

ollama_test.ipynb (https://github.com/user-attachments/files/22362450/ollama_test.ipynb)


@rick-github commented on GitHub (Sep 16, 2025):

gpt-oss:20b or gpt-oss:120b? Was the gpt-oss model downloaded from the ollama library or imported from somewhere else?


@raffaeler commented on GitHub (Sep 16, 2025):

@rick-github It is the latest tag: gpt-oss:latest.
I downloaded it via ollama pull gpt-oss


@rick-github commented on GitHub (Sep 16, 2025):

What configuration variables are you running ollama with?


@raffaeler commented on GitHub (Sep 16, 2025):

Plain default values, no env vars set


@raffaeler commented on GitHub (Sep 17, 2025):

@rick-github Out of my curiosity, were you able to reproduce the issue?


@rick-github commented on GitHub (Sep 17, 2025):

There are a few issues with the process.

You are using the defaults with a 4060 (https://github.com/ollama/ollama/issues/10956#issuecomment-3289637238), which means the context size used by the model is 4096 tokens. The template for gpt-oss (https://ollama.com/library/gpt-oss:20b/blobs/fa6710a93d78) is large, which uses up token space in the context buffer. gpt-oss is a reasoning model, meaning it generates reasoning tokens, further filling the context buffer. The net result is that the context buffer fills up and is shifted.

ollama  | time=2025-09-17T14:12:32.438Z level=DEBUG source=cache.go:280 msg="context limit hit - shifting" id=0 limit=4096 input=4096 keep=4 discard=2046

This can be impactful because the act of shifting the buffer causes the loss of tokens from the head of the buffer, which is where the system message is stored. As a result the integrity of the system message is impaired and the model is doing completions based on the remaining content, not the instructions.

This can be remedied by setting the size of the context buffer in the call:

--- 12187.py.orig	2025-09-17 15:49:07.063243867 +0200
+++ 12187.py	2025-09-17 16:10:09.837914523 +0200
@@ -172,7 +172,7 @@
         model=model,
         messages=messages,
         tools=tools,  
-        options={"max_tokens": 6000, "temperature": 0.8},
+        options={"num_ctx": 16384, "num_predict":6000, "temperature": 0.8},
     )
 
     if response.message.tool_calls:

As you noted above, gpt-oss only makes one tool call. This is because the model is not trained for parallel tool calling. In order for it to store multiple document records, the function must be written in a way that gpt-oss only needs to call it once.

--- 12187.py.orig	2025-09-17 15:49:07.063243867 +0200
+++ 12187.py	2025-09-17 17:00:05.602335918 +0200
@@ -99,21 +99,28 @@
 
 global_chunks = []
 
-def store_document_section(section_title: str, equivalent_concept: str) -> bool:
+from typing import List, Dict
+
+def store_document_section(data: List[Dict[str, str]]) -> bool:
     """
     Stores a portion of the document along with its distilled concepts
 
     Args:
-        section_title: The original heading of the section
-        equivalent_concept: A brief yet accurate summary of the section
+        data: a list of JSON objects of the form {"section_title": <The original heading of the section>, "equivalent_concept": <A brief yet accurate summary of the section>}
 
     Returns:
         bool: Whether or not the provided data was stored correctly
     """
 
-    global_chunks.append(
-        {"section_title": section_title, "equivalent_concept": equivalent_concept}
-    )
+    if not isinstance(data, list):
+      raise Exception("Not a list")
+    for d in data:
+      if "section_title" not in d:
+        raise Exception(f"Missing section title in `{d}`")
+      if "equivalent_concept" not in d:
+        raise Exception(f"Missing equivalent concept in `{d}`")
+
+    global_chunks.append(data)
     return True
 
 def get_system_prompt() -> str:

@@ -180,7 +187,10 @@
         for tool in response.message.tool_calls:
             # Ensure the function is available, and then call it
             if function_to_call := available_functions.get(tool.function.name):
-                output = function_to_call(**tool.function.arguments)
+                try:
+                  output = function_to_call(**tool.function.arguments)
+                except Exception as e:
+                  output = e
                 messages.append(response.message)
                 messages.append(
                     {

I'm not sure what the last chat call ("Final response") is trying to achieve. The call is made without tools, and since the instructions explicitly call for using a tool, the model hallucinates a tool call. For gpt-oss this results in an error logged in the server log and no output:

ollama  | time=2025-09-17T14:13:48.164Z level=WARN source=harmonyparser.go:401 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=store_document_section

and llama3.2 returns the call in the content as a <|python_tag|> chunk.
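
For reference, a minimal sketch of the loop pattern the fixes above add up to, keeping tools in every chat call (including the final one) so the model isn't pushed into hallucinating a call it can't make. It assumes the notebook's store_document_section function and messages list are in scope; the tool-result message shape follows the ollama examples and may vary by client version:

    from ollama import chat

    available_functions = {"store_document_section": store_document_section}

    while True:
        response = chat(
            model="gpt-oss",
            messages=messages,
            tools=[store_document_section],  # keep tools on every turn, even the last
            options={"num_ctx": 16384, "num_predict": 6000},
        )
        if not response.message.tool_calls:
            break  # the model produced a final answer instead of a tool call
        messages.append(response.message)
        for tool in response.message.tool_calls:
            if fn := available_functions.get(tool.function.name):
                try:
                    output = fn(**tool.function.arguments)
                except Exception as e:
                    output = e
                # feed the result back so the model can continue or wrap up
                messages.append({"role": "tool", "content": str(output), "tool_name": tool.function.name})

    print(response.message.content)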


@z0rb commented on GitHub (Sep 17, 2025):

Since you mentioned "ollama | time=2025-09-17T14:13:48.164Z level=WARN source=harmonyparser.go:401 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=store_document_section":

I started getting correct tool calls with Open WebUI and Ollama when I changed the model config in Open WebUI from native tool calls to "default". But then the line you mentioned started appearing. The model then really calls a function and I can see its result, but afterwards it hallucinates another tool call from the template and breaks off without a proper completion notification in Open WebUI.

I just didn't report back after my previous comment since I wasn't sure if it was relevant. @raffaeler You might try the same change in your case, switching away from native tool calls. I just wouldn't know how to set the parameter outside of Open WebUI.


@rick-github commented on GitHub (Sep 17, 2025):

Can you provide an example of a tool that you use in OpenWebUI that changes success rate based on native/default tool calling?


@raffaeler commented on GitHub (Sep 17, 2025):

@rick-github Thank you very much for the detailed explanation.
Anyway, this raises a number of question marks in my head:

  • The "overflow" is something that popped into my mind at a certain point. Shouldn't Ollama detect this and provide an error?
  • Is the 16384 for num_ctx an estimation/guess or anything else?
  • The official OpenAI and Hugging Face model cards for GPT-OSS are very scarce. How did you get the parameters (context length and parallel tool support)? Can I read them through Ollama?
  • I read that llama3.2 does not support parallel tool calling either. Anyway, if I use llama3.2 the tool is called multiple times (once for each piece of the document). Why?
  • The last call comes from my experiments using OpenAI endpoints. When I don't specify anything, the model typically returns all the content already provided to the tool, which is just a waste of tokens. While I could stop earlier, the idea is to let the model tell me if there were issues with the document.

With regard to making a single call, this is something I experimented with using OpenAI endpoints, and I saw far better results cycling the tool calls rather than trying to "digest" the document with a single tool call. I will have to find a different strategy (or change model).

Thanks for the patience


@rick-github commented on GitHub (Sep 17, 2025):

> @rick-github Thank you very much for the detailed explanation. Anyway, this raises a number of question marks in my head:
>
> • The "overflow" is something that popped into my mind at a certain point. Shouldn't Ollama detect this and provide an error?

It's a feature, not a bug. There are use cases where the client doesn't provide system instructions and just wants tokens. The effect of shifting can be mitigated by specifying num_predict, which stops generation at the given token count. If the number of input tokens plus num_predict is less than num_ctx, no shift occurs. This requires knowing how many tokens your input tokenizes to, so it is somewhat approximate. I have a PR (https://github.com/ollama/ollama/pull/9547) which allows maximum use of the context buffer, but sadly the PR has languished in the review queue for 6 months.

> • Is the 16384 for num_ctx an estimation/guess or anything else?

Estimation. The input is around 5K characters and is converted to around 2K tokens. Your original script set max_tokens to 6K, so about 8K tokens in total. Since gpt-oss is a reasoning model, I doubled it to leave room for reasoning tokens.

> • The official OpenAI and Hugging Face model cards for GPT-OSS are very scarce. How did you get the parameters (context length and parallel tool support)? Can I read them through Ollama?

The maximum context length supported by a model is available in ollama show <model>; look for "context length". The context length that ollama uses is configurable, see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size.

Model capabilities are a result of training and templating and are not always easily determined. Capabilities like thinking and tool use are again available via ollama show, or through the API in /api/show (sketched after this comment). Whether the model supports parallel tool calls is not exposed; I determined that with testing.

> • I read that llama3.2 does not support parallel tool calling either. Anyway, if I use llama3.2 the tool is called multiple times (once for each piece of the document). Why?

llama3.2 is obviously capable of parallel tool calling, so what you read is incorrect. The model processes the sections and generates a tool call for each section.
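
A quick way to read the capabilities mentioned above programmatically is the /api/show endpoint (a sketch against a default local server; exact field names can vary between versions):

    import requests

    # Equivalent to `ollama show gpt-oss`, but machine-readable.
    info = requests.post(
        "http://localhost:11434/api/show",
        json={"model": "gpt-oss"},
    ).json()

    print(info.get("capabilities"))  # e.g. ["completion", "tools", "thinking"]
    # Context length lives under model_info, keyed by architecture name.
    print(info.get("model_info", {}).get("gptoss.context_length"))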


@raffaeler commented on GitHub (Sep 17, 2025):

Thanks again @rick-github, precious info!

The info for gpt-oss and llama3.2 is:

$ ollama show gpt-oss
  Model
    architecture        gptoss    
    parameters          20.9B     
    context length      131072    
    embedding length    2880      
    quantization        MXFP4     

  Capabilities
    completion    
    tools         
    thinking      

  Parameters
    temperature    1    

  License
    Apache License               
    Version 2.0, January 2004    
    ...                     


$ ollama show llama3.2
  Model
    architecture        llama     
    parameters          3.2B      
    context length      131072    
    embedding length    3072      
    quantization        Q4_K_M    

  Capabilities
    completion    
    tools         

  Parameters
    stop    "<|start_header_id|>"    
    stop    "<|end_header_id|>"      
    stop    "<|eot_id|>"             

  License
    LLAMA 3.2 COMMUNITY LICENSE AGREEMENT                 
    Llama 3.2 Version Release Date: September 25, 2024    
    ...              
  • I don't see a difference regarding parallel tools between the two models

  • I see 128K context, the template is roughly 7.5K, and you previously said:

> You are using the defaults with a 4060 (https://github.com/ollama/ollama/issues/10956#issuecomment-3289637238), which means the context size used by the model is 4096 tokens.

This should leave more than 4K for my content, shouldn't it?
I also don't get how this depends on the GPU I am using.

Thanks!


@rick-github commented on GitHub (Sep 17, 2025):

> • I don't see a difference regarding parallel tools between the two models

Correct.

> • I see 128K context, the template is roughly 7.5K, and you previously said:
>
> > You are using the defaults with a 4060 (#10956 (comment)), which means the context size used by the model is 4096 tokens.
>
> This should leave more than 4K for my content, shouldn't it? I also don't get how this depends on the GPU I am using.

128k is the maximum context of the model. The context that ollama uses is configurable, see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size. The default context (i.e., not configured in the environment, Modelfile, or API call) is 4096 tokens. gpt-oss is a special case in that the template is large, so the default depends on how much VRAM your GPU has. If the GPU has > 20G VRAM, the default context for gpt-oss is 8k. If not, the default is the same as for other models, 4k.
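
In practice that means sizing the context per request (or via the environment) rather than relying on the default; a sketch with the Python client:

    from ollama import chat

    response = chat(
        model="gpt-oss",
        messages=[{"role": "user", "content": "..."}],
        # Raise num_ctx past the 4k default so the large gpt-oss template plus
        # reasoning tokens don't trigger a context shift; cap output with num_predict.
        options={"num_ctx": 16384, "num_predict": 6000},
    )

    # Alternatively, set it server-wide: OLLAMA_CONTEXT_LENGTH=16384 ollama serve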


@raffaeler commented on GitHub (Sep 17, 2025):

@rick-github Got it, thanks again.


@formigarafa commented on GitHub (Sep 26, 2025):

Here is an example that fails similarly when Python tools are enabled.

curl using the OpenAI format (the same happens using the Ollama API format):

$ curl 'http://localhost:11434/v1/chat/completions' \
    -H 'Content-Type: application/json' \
    --data-raw '{
    "model": "gpt-oss:20b",
    "messages":[
      {
        "role":"user",
        "content": "Considering that the 1st number in the Fibonacci sequence is 1, calculate the exact value for the 53rd number in the Fibonacci sequence."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "python"
        }
      }
    ],
    "stream":false
  }'

Response: 500 Internal Server Error

{
  "error": {
    "message": "error parsing tool call: raw=def fib(n):\n    a,b=1,1\n    for _ in range(3,n+1):\n        a,b=b,a+b\n    return b\nprint(fib(53))\n, err=invalid character d looking for beginning of value",
    "type": "api_error",
    "param": null,
    "code": null
  }
}
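
The same failure can be reproduced from Python through Ollama's OpenAI-compatible endpoint (a sketch; on affected versions the call raises with the parse error above):

    from openai import OpenAI

    # The api_key is required by the client but ignored by Ollama.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    try:
        resp = client.chat.completions.create(
            model="gpt-oss:20b",
            messages=[{"role": "user", "content": "Calculate the exact value of the 53rd Fibonacci number."}],
            tools=[{"type": "function", "function": {"name": "python"}}],
        )
        print(resp.choices[0].message)
    except Exception as e:
        print(e)  # "error parsing tool call: raw=def fib(n): ..."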

@YetheSamartaka commented on GitHub (Oct 7, 2025):

Hi. This has an impact in places other than Open WebUI, such as Roo Code (Cline) or GitHub Copilot when the model is provided through Ollama, and it makes the model unusable.

One then gets "warnings" that should really be errors, such as:
harmony parser: no reverse mapping found for function name" harmonyFunctionName=search_file


@mcr-ksh commented on GitHub (Oct 10, 2025):

My issue is somehow related. I'm using n8n in a 3-tool workflow.

[Image: screenshot of the n8n 3-tool workflow — https://github.com/user-attachments/assets/c50e4e7f-96d4-47f0-9f4d-5b48a1e78e63]

I've always had issues with ollama and tool calling. I'm able to run 1 tool, or at most 2 tools if I'm lucky, but it is not very deterministic. I've been using the Lamarck-14b model, phi3, deepseek-r1, and lately gpt-oss-20/120b.

I don't know what the issue with ollama is here, but once I switched to vLLM all the issues were gone. For the sake of testing I switched the model back to ollama, and the problem was immediately back while no prompt changes were made.


@ParthSareen commented on GitHub (Oct 18, 2025):

@mcr-ksh is this with cloud or local?


@mcr-ksh commented on GitHub (Oct 18, 2025):

> @mcr-ksh is this with cloud or local?

local.


@ParthSareen commented on GitHub (Nov 8, 2025):

@mcr-ksh what context size are you running with, in comparison to vLLM? Model behavior should be the same if the context length is the same.


@mcr-ksh commented on GitHub (Nov 11, 2025):

@ParthSareen I know, it should! That's why I was shocked to see the behavior is different. I've been fighting in the past with other ollama models as well to make the tool calls happen, and could only reliably achieve one, while two calls was already a challenge. With vLLM it's all fine. The context length is the max for gpt-oss, 131072.


@formigarafa commented on GitHub (Nov 11, 2025):

@mcr-ksh if your problem only happens with gpt-oss locally, I bet the problem is the same one mentioned above (harmony parser: no reverse mapping found).
You could probably verify it by checking your logs.

From what I understand, Ollama does not implement a way to output tool calls in a format other than JSON, which is the use case for the gpt-oss harmony format.

To reproduce the problem you just need to make a tool call enabling the native tool capabilities the model has, and ask for something other than JSON in the response. Python is an easy one (I mentioned this a little above: https://github.com/ollama/ollama/issues/12187#issuecomment-3336607670).

I believe this won't be a simple bug fix, because it will require a fundamental change in the assumptions about the API response format: it will need to be able to answer with a generic format (probably a string) containing the raw tool-call code. At the moment, as far as I could test or find in the Ollama docs, the tool calls the model responds with are all expected to be parsed as JSON objects, which usually contain the tool name and respective parameters.

Something like "custom tools" from the OpenAI API (https://platform.openai.com/docs/guides/function-calling#custom-tools) is needed to solve this.
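
For comparison, a custom tool in the linked OpenAI docs is declared roughly like this (a sketch of the shape as documented there, not something Ollama accepts today); the model's call then arrives as a raw string instead of parsed JSON arguments:

    tools = [
        {
            "type": "custom",
            "name": "python",
            "description": "Executes Python code and returns stdout.",
        }
    ]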


@ParthSareen commented on GitHub (Dec 11, 2025):

@formigarafa actually, the model outputs its own format (which you correctly identified as the harmony format); we then convert that to the spec our API expects. Also, looking at your example, I'm not sure what you're doing in this portion of your tools. What is the behavior you expect from the model?

    "tools": [
      {
        "type": "function",
        "function": {
          "name": "python"
        }
      }
    ],

@formigarafa commented on GitHub (Dec 11, 2025):

If you use the command I provided above and look at the logs, you will see the prompt generated according to the gpt-oss template, like this:

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-12-11

Reasoning: low

# Tools

## python

Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).

When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Considering that the 1st number in the Fibonacci sequence is 1, calculate the exact value for the 53rd number in the Fibonacci sequence.<|end|><|start|>assistant

The parameters you are asking about are used to enable the model's native Python code interpreter tool.

The logs also show the model generating the correct response, but at the end of the generation a parsing error is thrown.
The expected Python code is there in the error message, and it is correct! Just in the wrong place.

"message": "error parsing tool call: raw=def fib(n):\n    a,b=1,1\n    for _ in range(3,n+1):\n        a,b=b,a+b\n    return b\nprint(fib(53))\n, err=invalid character d looking for beginning of value"

Can you see this in the error log?

def fib(n):
    a,b=1,1
    for _ in range(3,n+1):
        a,b=b,a+b
    return b
print(fib(53))

What I expected would be to have access to that Python code without having to parse the error message to extract the successful answer.


@mcr-ksh commented on GitHub (Dec 11, 2025):

@ParthSareen Context is 128k max for gpt-oss, and that's what I'm running on.
@formigarafa it may well be the case that this is harmony related; however, I use n8n, and there I won't have the ability to change the tool calling or the output format. They integrate LangChain, and that is the backend calling ollama and parsing the output. Since LangChain is a massive framework, I would vote for making Ollama compatible with it instead of implementing workarounds. I don't recall observing any errors like harmony parser: no reverse mapping found, but I could be wrong. Right now I have moved on from ollama to vLLM and have had no problems since. I've posted a log in #12064, but there is no mention of any harmony reverse mapping.


@ParthSareen commented on GitHub (Dec 12, 2025):

Thanks @formigarafa and @mcr-ksh, I'll dig into this. We haven't tried a ton with the python built-in, so it would be good to make sure we work out any bugs there too. Thanks!


@rranabha commented on GitHub (Feb 26, 2026):

@ParthSareen in gpt-oss, with the Responses API, the rendered prompt doesn't include the tools either.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather like in Tokyo?"},
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name, e.g. New York",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["fahrenheit", "celsius"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["location"],
            },
        },
    }
]


resp_ollama = client.responses.create(
    # model="ollama/gpt-oss:20b",
    model="gpt-oss:20b",
    input=messages,
    tools=tools,
    tool_choice="auto",
)

The prompt on the ollama server:

You are a helpful assistant.<|end|><|start|>user<|message|>What's the weather like in Tokyo?<|end|><|start|>assistant
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2026-02-26

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Tools

## functions

namespace functions {

type  = () => any;

} // namespace functions

# Instructions

You are a helpful assistant.<|end|><|start|>user<|message|>What's the weather like in Tokyo?<|end|><|start|>assistant

@rick-github commented on GitHub (Feb 26, 2026):

With the Responses API, the tool fields go at the top level rather than nested under "function":

--- 12187.py.orig	2026-02-26 17:57:49.614790309 +0100
+++ 12187.py	2026-02-26 18:11:43.143812641 +0100
@@ -12,7 +12,6 @@
 tools = [
     {
         "type": "function",
-        "function": {
             "name": "get_weather",
             "description": "Get the current weather for a given location",
             "parameters": {
@@ -30,7 +29,6 @@
                 },
                 "required": ["location"],
             },
-        },
     }
 ]
With that change, the rendered prompt includes the get_weather tool:

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2026-02-26

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Tools

## functions

namespace functions {

// Get the current weather for a given location
type get_weather = (_: {
  // The city name, e.g. New York
  location: string,
  // Temperature unit
  unit: string,
}) => any;

} // namespace functions

# Instructions

You are a helpful assistant.<|end|><|start|>user<|message|>What's the weather like in Tokyo?<|end|><|start|>assistant