[GH-ISSUE #11786] Different results between Ollama Turbo and Ollama #7815

Closed
opened 2026-04-12 19:59:15 -05:00 by GiteaMirror · 13 comments

Originally created by @MarkWard0110 on GitHub (Aug 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11786

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

I have a benchmark tool that is getting different results for the same model between Ollama Turbo and Ollama.
The agent uses tools.

gpt-oss_20b is gpt-oss:20b on my local Ollama running on Windows with Nvidia RTX 3090
ollamaturbo_gpt-oss_20b is Ollama Turbo with gpt-oss:20b
The ollamaturbo_gpt-oss_20b run is returning "thinking" but is not calling any tools. Not shown in the screenshot: there is only one occurrence where it calls tools and passes a benchmark.

The screenshot shows my report comparing the two. Each run sends the same prompt request with the same data.

Image

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the cloud, bug labels 2026-04-12 19:59:15 -05:00

@EchoLynx commented on GitHub (Aug 7, 2025):

"Thinking" forever makes me think that they're having issues with their datacenter configuration somehow. Something that's making the model run without GPU acceleration.


@technovangelist commented on GitHub (Aug 7, 2025):

Where is the code for this benchmarking tool?


@rick-github commented on GitHub (Aug 7, 2025):

Seems like turbo doesn't do tool calls at all.

$ curl -s https://ollama.com/api/chat -H "Authorization: Bearer $OLLAMA_API_KEY" -d '{
  "model": "gpt-oss:20b",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather today in Toronto?"
    }
  ],
  "stream": false,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The location to get the weather for, e.g. San Francisco, CA"
            },
            "format": {
              "type": "string",
              "description": "The format to return the weather in, e.g. 'celsius' or 'fahrenheit'",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location", "format"]
        }
      }
    }
  ]
}' | jq
{
  "model": "gpt-oss:20b",
  "created_at": "2025-08-07T15:29:31.456910987Z",
  "message": {
    "role": "assistant",
    "content": "",
    "thinking": "We need to use the \"get_current_weather\" function to retrieve weather of Toronto. Output should call the function."
  },
  "done": true,
  "total_duration": 494977659,
  "prompt_eval_count": 177,
  "eval_count": 53
}

gpt-oss:120b does the same thing.
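
To make the symptom easy to detect in a test harness, here is a minimal sketch that classifies a non-streaming `/api/chat` response body by what the assistant message actually carries. The function name `classify_chat_response` and its return labels are hypothetical. The sample body is the response shown above, trimmed to the relevant fields, and the `tool_calls` key is where a tool call would normally be expected in the message.

```python
from typing import Any


def classify_chat_response(response: dict[str, Any]) -> str:
    """Classify a non-streaming /api/chat response body (hypothetical helper)."""
    message = response.get("message", {})
    if message.get("tool_calls"):
        return "tool_call"
    if message.get("content"):
        return "content"
    if message.get("thinking"):
        # The model reasoned about a call but emitted neither text nor a call.
        return "thinking_only"
    return "empty"


# Response body from the comment above, trimmed to the relevant fields.
turbo_response = {
    "model": "gpt-oss:20b",
    "message": {
        "role": "assistant",
        "content": "",
        "thinking": "We need to use the \"get_current_weather\" function to "
                    "retrieve weather of Toronto. Output should call the function.",
    },
    "done": True,
}

print(classify_chat_response(turbo_response))
```

For the response above this prints `thinking_only`, which is the case described in this issue.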


@MarkWard0110 commented on GitHub (Aug 8, 2025):

The model tags indicate that the model digests are not the same.

Ollama Turbo: gpt-oss:120b: digest: d98fe6ba01e6
Ollama 0.11.4: gpt-oss:120b: digest: 735371f916a9e1365c819eae4b05c1ee1dd390536036c1a85cc06075182e9f3d

Ollama Turbo: gpt-oss:20b: digest: 05afbac4bad6
Ollama 0.11.4: gpt-oss:20b: digest: f2b8351c629c005bd3f0a0e3046f905afcbffede19b648e4bd7c884cdfd63af6

Ollama Turbo

{
    "models": [
        {
            "name": "gpt-oss:120b",
            "model": "gpt-oss:120b",
            "modified_at": "2025-08-05T00:00:00Z",
            "size": 65290180781,
            "digest": "d98fe6ba01e6",
            "details": {
                "parent_model": "",
                "format": "",
                "family": "",
                "families": null,
                "parameter_size": "",
                "quantization_level": ""
            }
        },
        {
            "name": "gpt-oss:20b",
            "model": "gpt-oss:20b",
            "modified_at": "2025-08-05T00:00:00Z",
            "size": 13780162412,
            "digest": "05afbac4bad6",
            "details": {
                "parent_model": "",
                "format": "",
                "family": "",
                "families": null,
                "parameter_size": "",
                "quantization_level": ""
            }
        }
    ]
}

Ollama 0.11.4

        {
            "name": "gpt-oss:120b",
            "model": "gpt-oss:120b",
            "modified_at": "2025-08-06T11:22:25.3488601-05:00",
            "size": 65290192208,
            "digest": "735371f916a9e1365c819eae4b05c1ee1dd390536036c1a85cc06075182e9f3d",
            "details": {
                "parent_model": "",
                "format": "gguf",
                "family": "gptoss",
                "families": [
                    "gptoss"
                ],
                "parameter_size": "116.8B",
                "quantization_level": "MXFP4"
            }
        },
        {
            "name": "gpt-oss:20b",
            "model": "gpt-oss:20b",
            "modified_at": "2025-08-06T11:22:25.698919-05:00",
            "size": 13780173839,
            "digest": "f2b8351c629c005bd3f0a0e3046f905afcbffede19b648e4bd7c884cdfd63af6",
            "details": {
                "parent_model": "",
                "format": "gguf",
                "family": "gptoss",
                "families": [
                    "gptoss"
                ],
                "parameter_size": "20.9B",
                "quantization_level": "MXFP4"
            }
        },
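
Because Ollama Turbo lists 12-character digests while the local server lists 64-character ones, a like-for-like comparison has to allow for truncation. The sketch below treats the shorter value as a possible prefix of the longer one. That reading of the Turbo digests is an assumption, and `digests_match` is a hypothetical helper. The digest values are the ones listed above.

```python
def digests_match(a: str, b: str) -> bool:
    """Return True if the shorter digest is a prefix of the longer one."""
    a, b = a.lower(), b.lower()
    shorter, longer = sorted((a, b), key=len)
    return bool(shorter) and longer.startswith(shorter)


# (Ollama Turbo digest, local Ollama 0.11.4 digest) as reported above.
pairs = {
    "gpt-oss:120b": (
        "d98fe6ba01e6",
        "735371f916a9e1365c819eae4b05c1ee1dd390536036c1a85cc06075182e9f3d",
    ),
    "gpt-oss:20b": (
        "05afbac4bad6",
        "f2b8351c629c005bd3f0a0e3046f905afcbffede19b648e4bd7c884cdfd63af6",
    ),
}

for name, (turbo, local) in pairs.items():
    print(name, digests_match(turbo, local))
```

Both comparisons print `False`, so the digests differ even when truncation is allowed for. The reported sizes differ as well.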

@ParthSareen commented on GitHub (Aug 8, 2025):

Hey @MarkWard0110 - some updates should have rolled out to improve tool calling. Can you re-run on both and verify? We are still seeing some inconsistencies with 120b, but 20b on Turbo should be working well. It is worth testing both again.


@MarkWard0110 commented on GitHub (Aug 8, 2025):

@ParthSareen
I have rerun the tests. The response only contains the LLM's thinking about using the tools and does not contain the tool calls.

For example:

"role": "assistant",
"content": "",
"thinking": "We need to parse email. .... We'll call list_events."

@ParthSareen commented on GitHub (Aug 8, 2025):

@MarkWard0110 this model is pretty sensitive to how thinking and tool results are passed back in – I can help debug but would need to see the benchmark script.
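
As an illustration of what "passed back in" can look like on the client side, here is a minimal sketch that appends one tool round to the message history. It assumes the assistant turn is returned to the server unchanged (including `thinking` and `tool_calls`) and that each tool result is sent as a `role: "tool"` message labeled with `tool_name`. Which fields the model actually needs is not confirmed in this thread. `append_tool_round` is a hypothetical helper, and the tool call and tool output in the sample are made-up data.

```python
from typing import Any


def append_tool_round(
    messages: list[dict[str, Any]],
    assistant_message: dict[str, Any],
    tool_outputs: dict[str, str],
) -> list[dict[str, Any]]:
    """Return a new history with the assistant turn and its tool results."""
    history = list(messages)
    # Assumption: keep the assistant turn intact, thinking and tool_calls included.
    history.append(dict(assistant_message))
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        # Assumption: tool results are labeled with the tool's name.
        history.append(
            {"role": "tool", "tool_name": name, "content": tool_outputs[name]}
        )
    return history


messages = [{"role": "user", "content": "What is the weather today in Toronto?"}]

# Hypothetical assistant turn that does contain a tool call.
assistant_message = {
    "role": "assistant",
    "content": "",
    "thinking": "We need to use the \"get_current_weather\" function.",
    "tool_calls": [
        {
            "function": {
                "name": "get_current_weather",
                "arguments": {"location": "Toronto", "format": "celsius"},
            }
        }
    ],
}

# Hypothetical tool output.
history = append_tool_round(
    messages, assistant_message, {"get_current_weather": "11 degrees celsius"}
)
print([m["role"] for m in history])
```

This prints `['user', 'assistant', 'tool']`. The next request would send `history` as its `messages`.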


@MarkWard0110 commented on GitHub (Aug 8, 2025):

@ParthSareen , may I share an example with you on Discord?


@MarkWard0110 commented on GitHub (Aug 8, 2025):

The model might be sensitive, but among the prompts this was the best performing one for gpt-oss:20b on local Ollama 0.11.4, and it does not perform the same on Ollama Turbo.

What I don't know is whether the tool calls are being dropped in Ollama Turbo, or what the issue is, but the client isn't receiving the tool calls that I would expect.
Image

The benchmark uses the same system and user prompt with temp:0 and top_k:1. The rest is up to the LLM model and its host.
If these were the same Ollama and the same model, I would expect the results to be identical or to differ only slightly.
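
The setup described above can be sketched as follows: one request body with greedy sampling options, sent unchanged to both hosts, and a reduced view of each response for comparison. The helper names `build_chat_request` and `comparable_view` are hypothetical, the two sample responses are made-up data in the shape reported in this thread, and the prompts are placeholders. This is not the actual benchmark code, which is not shown here.

```python
import json
from typing import Any


def build_chat_request(
    model: str,
    system_prompt: str,
    user_prompt: str,
    tools: list[dict[str, Any]],
) -> dict[str, Any]:
    """Build a non-streaming /api/chat body with greedy sampling options."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "tools": tools,
        "stream": False,
        # temp:0 and top_k:1 from the comment, using Ollama's option names.
        "options": {"temperature": 0, "top_k": 1},
    }


def comparable_view(response: dict[str, Any]) -> dict[str, Any]:
    """Reduce a response to the parts that should match across hosts."""
    message = response.get("message", {})
    calls = [
        (
            call["function"]["name"],
            json.dumps(call["function"].get("arguments", {}), sort_keys=True),
        )
        for call in message.get("tool_calls") or []
    ]
    return {"content": message.get("content", ""), "tool_calls": calls}


# Hypothetical responses: one with a tool call, one with thinking only.
local = {
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [{"function": {"name": "list_events", "arguments": {}}}],
    }
}
turbo = {
    "message": {
        "role": "assistant",
        "content": "",
        "thinking": "We'll call list_events.",
    }
}

request = build_chat_request("gpt-oss:20b", "system prompt", "user prompt", [])
print(request["options"])
print(comparable_view(local) == comparable_view(turbo))
```

The comparison prints `False` here because only one of the two sample responses carries a tool call. Timing fields and `thinking` are left out of the comparison on purpose, since they can vary between hosts without changing the outcome of a test.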

A prompt version runs through 30 different tests. In each test, the LLM must perform a task using the system prompt version provided. The output is graded.

gpt-oss:20b on Ollama Turbo is scoring lower than qwen3 0.6.

Image

It is performing poorly because the client isn't getting any tool calls from Ollama Turbo. Why? I don't know.

My top-performing models and prompts are

Image

When Ollama Turbo supports more models, I'll run and compare the results against my local ones.

@MarkWard0110 commented on GitHub (Aug 9, 2025):

I finally have some local 120b results, and they are similar: local does not match Ollama Turbo.

Image

@linharrrrrt commented on GitHub (Aug 12, 2025):

I also tested tool calls on Ollama Turbo, and indeed it cannot call tools. It only returned thinking that concluded a call was needed, but no calls were actually made.


@MarkWard0110 commented on GitHub (Aug 19, 2025):

@ParthSareen , I am testing Ollama Turbo again and I am seeing improved results!

Image

@MarkWard0110 commented on GitHub (Aug 19, 2025):

gpt-oss:120b looks to be working

Image
Reference: github-starred/ollama#7815