[GH-ISSUE #11691] Structured output with OpenAI SDK and gpt-oss:20b not working #69795

Open
opened 2026-05-04 19:19:20 -05:00 by GiteaMirror · 87 comments
Owner

Originally created by @taagarwa-rh on GitHub (Aug 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11691

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

OpenAI SDK is unable to parse structured output from gpt-oss:20b responses. Ollama is supposed to be compatible with OpenAI SDK structured outputs per this Blog Post.

Reproducer:

import openai
from pydantic import BaseModel

class Response(BaseModel):
    
    response: str

client = openai.OpenAI(api_key="NONE", base_url="http://localhost:11434/v1")
response = client.beta.chat.completions.parse(
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        model="gpt-oss:20b",
        response_format=Response,
)
print(response)

Relevant log output

pydantic_core._pydantic_core.ValidationError: 1 validation error for Response
  Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='The user says "\n\n    \t}\n       \t\t \t\t    ', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.11.0

Originally created by @taagarwa-rh on GitHub (Aug 5, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/11691 Originally assigned to: @ParthSareen on GitHub. ### What is the issue? OpenAI SDK is unable to parse structured output from gpt-oss:20b responses. Ollama is supposed to be compatible with OpenAI SDK structured outputs per this [Blog Post](https://ollama.com/blog/structured-outputs). Reproducer: ```python import openai from pydantic import BaseModel class Response(BaseModel): response: str client = openai.OpenAI(api_key="NONE", base_url="http://localhost:11434/v1") response = client.beta.chat.completions.parse( messages=[{"role": "user", "content": "Hello, how are you?"}], model="gpt-oss:20b", response_format=Response, ) print(response) ``` ### Relevant log output ```shell pydantic_core._pydantic_core.ValidationError: 1 validation error for Response Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='The user says "\n\n \t}\n \t\t \t\t ', input_type=str] For further information visit https://errors.pydantic.dev/2.11/v/json_invalid ``` ### OS macOS ### GPU Apple ### CPU Apple ### Ollama version 0.11.0
GiteaMirror added the gpt-ossbug labels 2026-05-04 19:19:20 -05:00
Author
Owner

@BeatWolf commented on GitHub (Aug 5, 2025):

i think i have a similar issue with langchain and the ollamachatmodel. my pipelines that depend on structured output dont work, making the model unuseable

<!-- gh-comment-id:3156654637 --> @BeatWolf commented on GitHub (Aug 5, 2025): i think i have a similar issue with langchain and the ollamachatmodel. my pipelines that depend on structured output dont work, making the model unuseable
Author
Owner

@jbcallaghan commented on GitHub (Aug 5, 2025):

I can report the same issue, structured output doesn't work. Content = ''

<!-- gh-comment-id:3156672329 --> @jbcallaghan commented on GitHub (Aug 5, 2025): I can report the same issue, structured output doesn't work. Content = ''
Author
Owner

@sheneman commented on GitHub (Aug 6, 2025):

Structured outputs doesn't work with gpt-oss due to the use of the new Harmony response format.

I believe this can be addressed via an integration layer by the Ollama team, but the lack of structured outputs really makes using gpt-oss model useless for most serious purposes.

<!-- gh-comment-id:3157223324 --> @sheneman commented on GitHub (Aug 6, 2025): Structured outputs doesn't work with gpt-oss due to the use of the new **Harmony** response format. I believe this can be addressed via an integration layer by the Ollama team, but the lack of structured outputs really makes using gpt-oss model useless for most serious purposes.
Author
Owner

@frozenkp commented on GitHub (Aug 6, 2025):

Same issue here. I'm using Pydantic with ChatOllama, and nothing in the response content is causing a parsing exception.

<!-- gh-comment-id:3157732806 --> @frozenkp commented on GitHub (Aug 6, 2025): Same issue here. I'm using Pydantic with ChatOllama, and nothing in the response content is causing a parsing exception.
Author
Owner

@tneQpx commented on GitHub (Aug 6, 2025):

Same issue. Message.Content = "

When using ollama chat with format

<!-- gh-comment-id:3157840032 --> @tneQpx commented on GitHub (Aug 6, 2025): Same issue. Message.Content = " When using ollama chat with format
Author
Owner

@KlausGPaul commented on GitHub (Aug 6, 2025):

As a workaround, it seems as if adding the desired response schema to the prompt could work, have not tried it at scale yet, though.

f"""text text
<prompt>
...
Use the provided JSON schema for your reply:
``
{ThisIsMySchema.model_json_schema()}
``
"""

The response, though will also enclose the JSON inside markdown, but it follows the schema.

<!-- gh-comment-id:3158981502 --> @KlausGPaul commented on GitHub (Aug 6, 2025): As a workaround, it seems as if adding the desired response schema to the prompt could work, have not tried it at scale yet, though. ``` f"""text text <prompt> ... Use the provided JSON schema for your reply: `` {ThisIsMySchema.model_json_schema()} `` """ ``` The response, though will also enclose the JSON inside markdown, but it follows the schema.
Author
Owner

@lachlansleight commented on GitHub (Aug 6, 2025):

Adding +1 - same issue here. Adding some tests:

Sending:

{
    "model": "gpt-oss:20b",
    "system": "You are a helpful assistant that always responds with valid JSON",
    "prompt": "Hello there! Respond in the JSON format { \"response\": \"your-response-here\" }",
    "stream": false,
    "format": "json"
}

Results in the following response:

{
    "model": "gpt-oss:20b",
    "created_at": "2025-08-06T11:22:27.512287Z",
    "response": "The user says: \": \"Hello there! Respond in the JSON format { \"\n\n    }",
    "done": true,
    "done_reason": "stop",
    "context": [...],
    "total_duration": 1491802500,
    "load_duration": 73731292,
    "prompt_eval_count": 105,
    "prompt_eval_duration": 712212500,
    "eval_count": 22,
    "eval_duration": 655166083
}

Sometimes response is empty, sometimes it begins with some of the thinking text, as above. Often this is just a tiny fragment, such as "response": "{\"\n\n }". Removing the system prompt seems to give me these little error fragments much more often (about 60% of the time, as opposed to 10% of the time with a system prompt)

If I remove the "format": "json" parameter altogether, I get the following response:

{
    "model": "gpt-oss:20b",
    "created_at": "2025-08-06T11:26:53.037969Z",
    "response": "{\"response\":\"Hello! How can I help you today?\"}",
    "thinking": "User says: \"Hello there! Respond in the JSON format { \"response\": \"your-response-here\" }\". So we need to output JSON with key \"response\" and value as our response. We should greet. So response: \"Hello! How can I help you today?\" Should wrap. Ensure JSON.",
    "done": true,
    "done_reason": "stop",
    "context": [...],
    "total_duration": 3144792167,
    "load_duration": 70313792,
    "prompt_eval_count": 105,
    "prompt_eval_duration": 674871500,
    "eval_count": 87,
    "eval_duration": 2399115875
}

Finally, if I try setting the format to a specific format, I always get the empty response text, with or without a system prompt.

<!-- gh-comment-id:3159779076 --> @lachlansleight commented on GitHub (Aug 6, 2025): Adding +1 - same issue here. Adding some tests: Sending: ```json { "model": "gpt-oss:20b", "system": "You are a helpful assistant that always responds with valid JSON", "prompt": "Hello there! Respond in the JSON format { \"response\": \"your-response-here\" }", "stream": false, "format": "json" } ``` Results in the following response: ```json { "model": "gpt-oss:20b", "created_at": "2025-08-06T11:22:27.512287Z", "response": "The user says: \": \"Hello there! Respond in the JSON format { \"\n\n }", "done": true, "done_reason": "stop", "context": [...], "total_duration": 1491802500, "load_duration": 73731292, "prompt_eval_count": 105, "prompt_eval_duration": 712212500, "eval_count": 22, "eval_duration": 655166083 } ``` Sometimes `response` is empty, sometimes it begins with some of the thinking text, as above. Often this is just a tiny fragment, such as `"response": "{\"\n\n }"`. Removing the system prompt seems to give me these little error fragments much more often (about 60% of the time, as opposed to 10% of the time with a system prompt) If I remove the `"format": "json"` parameter altogether, I get the following response: ```json { "model": "gpt-oss:20b", "created_at": "2025-08-06T11:26:53.037969Z", "response": "{\"response\":\"Hello! How can I help you today?\"}", "thinking": "User says: \"Hello there! Respond in the JSON format { \"response\": \"your-response-here\" }\". So we need to output JSON with key \"response\" and value as our response. We should greet. So response: \"Hello! How can I help you today?\" Should wrap. Ensure JSON.", "done": true, "done_reason": "stop", "context": [...], "total_duration": 3144792167, "load_duration": 70313792, "prompt_eval_count": 105, "prompt_eval_duration": 674871500, "eval_count": 87, "eval_duration": 2399115875 } ``` Finally, if I try setting the format to a specific format, I always get the empty response text, with or without a system prompt.
Author
Owner

@jbcallaghan commented on GitHub (Aug 6, 2025):

As a workaround, it seems as if adding the desired response schema to the prompt could work, have not tried it at scale yet, though.

f"""text text
<prompt>
...
Use the provided JSON schema for your reply:
``
{ThisIsMySchema.model_json_schema()}
``
"""

The response, though will also enclose the JSON inside markdown, but it follows the schema.

I tried this and it works randomly, sometimes I get a response with the correct formatting and other times no output at all. This is using exactly the same query each time. I also noticed there is a lot of blank content before the structured output is populated when it does work, almost like thinking is being shown as blank content

<!-- gh-comment-id:3160838057 --> @jbcallaghan commented on GitHub (Aug 6, 2025): > As a workaround, it seems as if adding the desired response schema to the prompt could work, have not tried it at scale yet, though. > > ``` > f"""text text > <prompt> > ... > Use the provided JSON schema for your reply: > `` > {ThisIsMySchema.model_json_schema()} > `` > """ > ``` > > The response, though will also enclose the JSON inside markdown, but it follows the schema. I tried this and it works randomly, sometimes I get a response with the correct formatting and other times no output at all. This is using exactly the same query each time. I also noticed there is a lot of blank content before the structured output is populated when it does work, almost like thinking is being shown as blank content
Author
Owner

@duxor commented on GitHub (Aug 6, 2025):

Image

It's a little bit crazy, but it works...

You need to find a position of ```json in the string, in some cases there is more text as a prefix:

Image
<!-- gh-comment-id:3161106620 --> @duxor commented on GitHub (Aug 6, 2025): <img width="1518" height="809" alt="Image" src="https://github.com/user-attachments/assets/e43436b7-1fd6-44c0-bb12-5c5d8962a56c" /> It's a little bit crazy, but it works... You need to find a position of ```json in the string, in some cases there is more text as a prefix: <img width="1525" height="879" alt="Image" src="https://github.com/user-attachments/assets/ffae6a71-3c56-4358-b21b-aac109561a50" />
Author
Owner

@sheneman commented on GitHub (Aug 6, 2025):

@duxor - Yeah, you can specify the desired format in the prompt, but that's not really enforced structured output with a compiled grammar. No guarantee that it will work, and yes - you have to filter other stuff around it.

<!-- gh-comment-id:3161178718 --> @sheneman commented on GitHub (Aug 6, 2025): @duxor - Yeah, you can specify the desired format in the prompt, but that's not really enforced structured output with a compiled grammar. No guarantee that it will work, and yes - you have to filter other stuff around it.
Author
Owner

@frozenkp commented on GitHub (Aug 7, 2025):

My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working.

<!-- gh-comment-id:3162551085 --> @frozenkp commented on GitHub (Aug 7, 2025): My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working.
Author
Owner

@duxor commented on GitHub (Aug 7, 2025):

My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working.

You are right. Do you think it's worth it?

I will definitely avoid gpt-oss, for now.

<!-- gh-comment-id:3162704832 --> @duxor commented on GitHub (Aug 7, 2025): > My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working. You are right. Do you think it's worth it? I will definitely avoid `gpt-oss`, for now.
Author
Owner

@frozenkp commented on GitHub (Aug 7, 2025):

My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working.

You are right. Do you think it's worth it?

I will definitely avoid gpt-oss, for now.

Well, it depends. In my case, I used it as one of my research evaluations.

I would suggest using it after the issue is fixed. I didn't expect that I would take this two-layer approach that I used in the very beginning of the LLM era back. LOL

<!-- gh-comment-id:3162736286 --> @frozenkp commented on GitHub (Aug 7, 2025): > > My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working. > > You are right. Do you think it's worth it? > > I will definitely avoid `gpt-oss`, for now. Well, it depends. In my case, I used it as one of my research evaluations. I would suggest using it after the issue is fixed. I didn't expect that I would take this two-layer approach that I used in the very beginning of the LLM era back. LOL
Author
Owner

@rick-github commented on GitHub (Aug 7, 2025):

Image

<!-- gh-comment-id:3163148074 --> @rick-github commented on GitHub (Aug 7, 2025): [<img width="546" height="58" alt="Image" src="https://github.com/user-attachments/assets/190eca3a-a01a-4e7a-8213-c6a0554ecd31" />](https://discord.com/channels/1128867683291627614/1402425163903013036/1402450640654831739)
Author
Owner

@andreys42 commented on GitHub (Aug 7, 2025):

+1 here
I guess absence of StructuredOutput support is signigicant drawback now

<!-- gh-comment-id:3163486720 --> @andreys42 commented on GitHub (Aug 7, 2025): +1 here I guess absence of StructuredOutput support is signigicant drawback now
Author
Owner

@Mohammadtvk commented on GitHub (Aug 7, 2025):

+1
this feature is very important

<!-- gh-comment-id:3164093381 --> @Mohammadtvk commented on GitHub (Aug 7, 2025): +1 this feature is very important
Author
Owner

@dontriskit commented on GitHub (Aug 8, 2025):

same issue with vLLM

resolved with official vllm docs

<!-- gh-comment-id:3166409626 --> @dontriskit commented on GitHub (Aug 8, 2025): same issue with vLLM --- resolved with official vllm docs
Author
Owner

@Koki-Itai commented on GitHub (Aug 8, 2025):

same issue

<!-- gh-comment-id:3167024296 --> @Koki-Itai commented on GitHub (Aug 8, 2025): same issue
Author
Owner

@tttturtle-russ commented on GitHub (Aug 8, 2025):

same issue here, it's an important feature.

<!-- gh-comment-id:3168600426 --> @tttturtle-russ commented on GitHub (Aug 8, 2025): same issue here, it's an important feature.
Author
Owner

@ddudek commented on GitHub (Aug 11, 2025):

As a better workaround, you should put the schema in "developer" role, e.g.:

[
  {
    "role": "system",
    "content": "
You are helpful coding expert that outputs JSON response.
  },
  {
    "role": "developer",
    "content": "
        # Instructions
        Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.
    JSON schema: {'$defs': ... <your schema here> }
"
  },
  {
    "role": "user",
    "content": "## The task
...
"
  }
]

The above works pretty good, although the model also outputs "analysis" and "commentary" streams, e.g.:

<|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... >

<|start|>assistant<|channel|>final<|message|>{"my_field": "some content"}

So adding this to the system prompt again improves the output:

"Reasoning: low
# Valid channels: final."

Full example:

[
  {
    "role": "system",
    "content": "
You are helpful coding expert that outputs JSON response.

Reasoning: low
# Valid channels: final.
  },
  {
    "role": "developer",
    "content": "
        # Instructions
        Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.
    JSON schema: {'$defs': ... <your schema here> }
"
  },
  {
    "role": "user",
    "content": "## The task
...
"
  }
]

Output:
<|channel|>final<|message|>{"my_field": "some content"}
still needs removing "<|channel|>final<|message|>" but gives very stable behavior.

This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well.

<!-- gh-comment-id:3173445658 --> @ddudek commented on GitHub (Aug 11, 2025): As a better workaround, you should put the schema in "developer" role, e.g.: ``` [ { "role": "system", "content": " You are helpful coding expert that outputs JSON response. }, { "role": "developer", "content": " # Instructions Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included. JSON schema: {'$defs': ... <your schema here> } " }, { "role": "user", "content": "## The task ... " } ] ``` The above works pretty good, although the model also outputs "analysis" and "commentary" streams, e.g.: ``` <|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... > <|start|>assistant<|channel|>final<|message|>{"my_field": "some content"} ``` So adding this to the **system prompt** again improves the output: ``` "Reasoning: low # Valid channels: final." ``` Full example: ``` [ { "role": "system", "content": " You are helpful coding expert that outputs JSON response. Reasoning: low # Valid channels: final. }, { "role": "developer", "content": " # Instructions Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included. JSON schema: {'$defs': ... <your schema here> } " }, { "role": "user", "content": "## The task ... " } ] ``` Output: ```<|channel|>final<|message|>{"my_field": "some content"}``` still needs removing "<|channel|>final<|message|>" but gives very stable behavior. This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well.
Author
Owner

@youngbinkim0 commented on GitHub (Aug 11, 2025):

same issue with vLLM

resolved with official vllm docs

can you link to the official vLLM doc referred? running into the same issue @dontriskit

<!-- gh-comment-id:3174515634 --> @youngbinkim0 commented on GitHub (Aug 11, 2025): > ## same issue with vLLM > resolved with official vllm docs can you link to the official vLLM doc referred? running into the same issue @dontriskit
Author
Owner

@youngbinkim0 commented on GitHub (Aug 11, 2025):

As a better workaround, you should put the schema in "developer" role, e.g.:

[
  {
    "role": "system",
    "content": "
You are helpful coding expert that outputs JSON response.
  },
  {
    "role": "developer",
    "content": "
        # Instructions
        Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.
    JSON schema: {'$defs': ... <your schema here> }
"
  },
  {
    "role": "user",
    "content": "## The task
...
"
  }
]

The above works pretty good, although the model also outputs "analysis" and "commentary" streams, e.g.:

<|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... >

<|start|>assistant<|channel|>final<|message|>{"my_field": "some content"}

So adding this to the system prompt again improves the output:

"Reasoning: low
# Valid channels: final."

Full example:

[
  {
    "role": "system",
    "content": "
You are helpful coding expert that outputs JSON response.

Reasoning: low
# Valid channels: final.
  },
  {
    "role": "developer",
    "content": "
        # Instructions
        Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.
    JSON schema: {'$defs': ... <your schema here> }
"
  },
  {
    "role": "user",
    "content": "## The task
...
"
  }
]

Output: <|channel|>final<|message|>{"my_field": "some content"} still needs removing "<|channel|>final<|message|>" but gives very stable behavior.

This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well.

While this does work for many schemas, it's not the same as enforcing structured outputs. As referred in the structured output section of the cookbook:

"This prompt alone will, however, only influence the model’s behavior but doesn’t guarantee the full adherence to the schema. For this you still need to construct your own grammar and enforce the schema during sampling."

https://cookbook.openai.com/articles/openai-harmony#structured-output

<!-- gh-comment-id:3174525970 --> @youngbinkim0 commented on GitHub (Aug 11, 2025): > As a better workaround, you should put the schema in "developer" role, e.g.: > > ``` > [ > { > "role": "system", > "content": " > You are helpful coding expert that outputs JSON response. > }, > { > "role": "developer", > "content": " > # Instructions > Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included. > JSON schema: {'$defs': ... <your schema here> } > " > }, > { > "role": "user", > "content": "## The task > ... > " > } > ] > ``` > > The above works pretty good, although the model also outputs "analysis" and "commentary" streams, e.g.: > > ``` > <|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... > > > <|start|>assistant<|channel|>final<|message|>{"my_field": "some content"} > ``` > > So adding this to the **system prompt** again improves the output: > > ``` > "Reasoning: low > # Valid channels: final." > ``` > > Full example: > > ``` > [ > { > "role": "system", > "content": " > You are helpful coding expert that outputs JSON response. > > Reasoning: low > # Valid channels: final. > }, > { > "role": "developer", > "content": " > # Instructions > Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included. > JSON schema: {'$defs': ... <your schema here> } > " > }, > { > "role": "user", > "content": "## The task > ... > " > } > ] > ``` > > Output: `<|channel|>final<|message|>{"my_field": "some content"}` still needs removing "<|channel|>final<|message|>" but gives very stable behavior. > > This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well. While this does work for many schemas, it's not the same as enforcing structured outputs. As referred in the structured output section of the cookbook: "This prompt alone will, however, only influence the model’s behavior but doesn’t guarantee the full adherence to the schema. For this you still need to construct your own grammar and enforce the schema during sampling." https://cookbook.openai.com/articles/openai-harmony#structured-output
Author
Owner

@lachlansleight commented on GitHub (Aug 12, 2025):

Yeah, if a model does not support JSON output format schema, then it is functionally a fun toy to play with, but not actually useful as an agent.

Agreed - since identifying this as an issue I basically put GPT-OSS down and haven't touched it since. It's interesting to see how it differs from other agents, but without JSON output it's completely useless for anything other than basic chat applications.

I didn't realise how harmful harmony would be to the update of GPT-OSS. It's so bad that I almost want to put on my tin foil hat and wonder whether Open AI is trying to intentionally harm the open-weight community by fragmenting the ecosystem with a complex, difficult-to-implement response format.

<!-- gh-comment-id:3179650003 --> @lachlansleight commented on GitHub (Aug 12, 2025): > Yeah, if a model does not support JSON output format schema, then it is functionally a fun toy to play with, but not actually useful as an agent. Agreed - since identifying this as an issue I basically put GPT-OSS down and haven't touched it since. It's interesting to see how it differs from other agents, but without JSON output it's completely useless for anything other than basic chat applications. I didn't realise how harmful harmony would be to the update of GPT-OSS. It's so bad that I almost want to put on my tin foil hat and wonder whether Open AI is trying to intentionally harm the open-weight community by fragmenting the ecosystem with a complex, difficult-to-implement response format.
Author
Owner

@ParthSareen commented on GitHub (Aug 12, 2025):

Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!

<!-- gh-comment-id:3181220084 --> @ParthSareen commented on GitHub (Aug 12, 2025): Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!
Author
Owner

@sheneman commented on GitHub (Aug 13, 2025):

@ParthSareen : Thank you so much for your attention to the issue of Structured outputs and thinking models in Ollama, including gpt-oss. This issue been a roadblock for my organization using Ollama for awhile. We love Ollama but have been considering alternatives because of this limitation with effective use of thinking models.

I assume the issue with gpt-oss is at least related to open issue #523.

But I assume the problem is compounded with gpt-oss because of the use of the Harmony response format.

Again, thank you and the Ollama team for prioritizing this!

<!-- gh-comment-id:3183751374 --> @sheneman commented on GitHub (Aug 13, 2025): @ParthSareen : **Thank you so much** for your attention to the issue of **_Structured outputs and thinking models_** in Ollama, including gpt-oss. This issue been a roadblock for my organization using Ollama for awhile. We love Ollama but have been considering alternatives because of this limitation with effective use of thinking models. I assume the issue with gpt-oss is at least related to open issue [#523](https://github.com/ollama/ollama-python/issues/523). But I assume the problem is compounded with gpt-oss because of the use of the Harmony response format. Again, thank you and the Ollama team for prioritizing this!
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 13, 2025):

Yeah progress, the GBNF generator is now pretty stable and I wrote a function to turn chat history into Harmony. The model is quite smart (for a local model) but also unhinged (although it might be that I have not yet tuned llama.cpp well). It tries to put reasoning output in URL params of tools, they really need to find a way to make it think without emitting those tokens. It can handle web browsing tasks using Playwright but I keep overflowing the context window so now I am writing a full node based infrastructure where we can have nodes consolidate / summarise the history and have tool calls be able to be more context aware and veto certain steps.

I got that idea from Haystack but their tooling is hardly type-safe (it was good as a proof of concept) and it was only after a tonne of complaining that they fixed the issue that by importing Haystack, half your python file lines light up bright red with compiler errors. I am still baffled by the fact that everyone is using old Python syntax, type-safety as an afterthought. Like I am new to Python and somehow it feels like I have to write almost everything because too many of the libraries are missing stubs or have serious issues. I mean, is it so hard for, OpenAI, Firecrawl-py, etc that if you give the function a Pydantic class, that it give back an instance of that class, and if you hand it a dict, you get back a dict? If you are streaming then you can use a small state machine to parse half-completed json and you use Pydantic alias generator so that the JSON is camelCase (JSON is inherently born from JS and this is best practice) but when you await the finished Pydantic class instance in Python, the fields are in snake_case as is idiomatic for Python.

Python still has poor generic templating (things like cannot specify the generic type of the function you are calling without it parsing it as array indexing, and it does not allow each overload to have its own code, because it has no idea which one will be used until runtime) and inheritance support (it only enforces checking if you mark as @final, and there is no way of requiring an instance passed in having been marked as final), but it is enough to do the trick. Well, at least it has come a long way, right? Pydantic is a life-saver, although under the hood it has some code I am not pleased with, and it had failed to implement certain cases, and the field aliasing is a nightmare - but I found a good workaround - use an alias generator and a switch statement inside of the generator with a case for each field.

Golden rule of LLMs, because they are extremely obtuse and get distracted easily (I believe next-token prediction to be merely a stepping stone). Use highly constrained schemas by rooting under the hood of Pydantic and cut down the number of choices it has and filter out noisy data (although there was a research paper recently "Let me speak freely", which argues against this, but does not seem to have used GBNF as a constraint method). The most frustrating thing is that I should not need to prompt an LLM to understand causality and the passage of time. The models try to emit all of the tool calls in one round, making bad assumptions about what the future state will be. Making them follow instructions is near impossible. But this gpt-oss:20b is more promising. We will have the hardware to run gpt-oss:120b in a couple of weeks, but my opinion is it will likely be a dud, not worth the 6x RAM and compute required compared to the 20b version.

I personally hope that all of these things get sorted. I mean there is no agent library I can download which has good general-purpose performance out of the box. I hope to change that. But OpenAI has just muddied the water with Harmony (ironic), there is no universally standardised LLM interface in Python, and the fact that I have to implement JSON schema support on my own is just wild. Again, the industry has billions of dollars and somehow they did not add support for JSON schema in llama.cpp when it's only a little over 300 lines of code. And I don't see how this code couldn't have been in there before, rendering the issue with gpt-oss to be only a matter of converting chat history to Harmony. Maybe they have kept a lot of features closed-source. Anyway, I want to be dealing with LLM concepts, not Python concepts. There are things like MCP (relatively clear) and A2A (can't even figure out at what level of abstraction is it meant to be at) and neither of these things have standardised the LLM interface.

One thing is clear, though: GBNF is the way. I see limitless possibilities with this, provided I can continue writing working compilers for it. It is also food for thought for how I was trying to train my own model (not next-token predictor) in my own time: I had issues formalising grammatical concepts for the training process, and this could be the thing I need. Ditching next-token should allow the inlining of classical functions (latch float inputs to closest binary state, one bit per input) to give LLMs extroadinary mathematical abilities without needing to execute Python or other script parsers. I would also wager similar techniques but for neural nets could solve the RSA problem. I also was struggling to understand how the Ollama JSON response format was implemented - it seemed like a mix of prompting, two-shot examples, and re-prompting. But GBNF, from what I can tell, eliminates illegal next tokens from the probability list (compiled into, meaning that it is relatively fail-safe and does not waste time having to go back and re-generate.

If anyone knows of other gems like GBNF, I would love to hear - I may be overlooking other great solutions.

<!-- gh-comment-id:3184544782 --> @nicholas-johnson-techxcel commented on GitHub (Aug 13, 2025): Yeah progress, the GBNF generator is now pretty stable and I wrote a function to turn chat history into Harmony. The model is quite smart (for a local model) but also unhinged (although it might be that I have not yet tuned llama.cpp well). It tries to put reasoning output in URL params of tools, they really need to find a way to make it think without emitting those tokens. It can handle web browsing tasks using Playwright but I keep overflowing the context window so now I am writing a full node based infrastructure where we can have nodes consolidate / summarise the history and have tool calls be able to be more context aware and veto certain steps. I got that idea from Haystack but their tooling is hardly type-safe (it was good as a proof of concept) and it was only after a tonne of complaining that they fixed the issue that by importing Haystack, half your python file lines light up bright red with compiler errors. I am still baffled by the fact that everyone is using old Python syntax, type-safety as an afterthought. Like I am new to Python and somehow it feels like I have to write almost everything because too many of the libraries are missing stubs or have serious issues. I mean, is it so hard for, OpenAI, Firecrawl-py, etc that if you give the function a Pydantic class, that it give back an instance of that class, and if you hand it a dict, you get back a dict? If you are streaming then you can use a small state machine to parse half-completed json and you use Pydantic alias generator so that the JSON is camelCase (JSON is inherently born from JS and this is best practice) but when you await the finished Pydantic class instance in Python, the fields are in snake_case as is idiomatic for Python. Python still has poor generic templating (things like cannot specify the generic type of the function you are calling without it parsing it as array indexing, and it does not allow each overload to have its own code, because it has no idea which one will be used until runtime) and inheritance support (it only enforces checking if you mark as @final, and there is no way of requiring an instance passed in having been marked as final), but it is enough to do the trick. Well, at least it has come a long way, right? Pydantic is a life-saver, although under the hood it has some code I am not pleased with, and it had failed to implement certain cases, and the field aliasing is a nightmare - but I found a good workaround - use an alias generator and a switch statement inside of the generator with a case for each field. Golden rule of LLMs, because they are extremely obtuse and get distracted easily (I believe next-token prediction to be merely a stepping stone). Use highly constrained schemas by rooting under the hood of Pydantic and cut down the number of choices it has and filter out noisy data (although there was a research paper recently "Let me speak freely", which argues against this, but does not seem to have used GBNF as a constraint method). The most frustrating thing is that I should not need to prompt an LLM to understand causality and the passage of time. The models try to emit all of the tool calls in one round, making bad assumptions about what the future state will be. Making them follow instructions is near impossible. But this gpt-oss:20b is more promising. 
We will have the hardware to run gpt-oss:120b in a couple of weeks, but my opinion is it will likely be a dud, not worth the 6x RAM and compute required compared to the 20b version. I personally hope that all of these things get sorted. I mean there is no agent library I can download which has good general-purpose performance out of the box. I hope to change that. But OpenAI has just muddied the water with Harmony (ironic), there is no universally standardised LLM interface in Python, and the fact that I have to implement JSON schema support on my own is just wild. Again, the industry has billions of dollars and somehow they did not add support for JSON schema in llama.cpp when it's only a little over 300 lines of code. And I don't see how this code couldn't have been in there before, rendering the issue with gpt-oss to be only a matter of converting chat history to Harmony. Maybe they have kept a lot of features closed-source. Anyway, I want to be dealing with LLM concepts, not Python concepts. There are things like MCP (relatively clear) and A2A (can't even figure out at what level of abstraction is it meant to be at) and neither of these things have standardised the LLM interface. One thing is clear, though: GBNF is the way. I see limitless possibilities with this, provided I can continue writing working compilers for it. It is also food for thought for how I was trying to train my own model (not next-token predictor) in my own time: I had issues formalising grammatical concepts for the training process, and this could be the thing I need. Ditching next-token should allow the inlining of classical functions (latch float inputs to closest binary state, one bit per input) to give LLMs extroadinary mathematical abilities without needing to execute Python or other script parsers. I would also wager similar techniques but for neural nets could solve the RSA problem. I also was struggling to understand how the Ollama JSON response format was implemented - it seemed like a mix of prompting, two-shot examples, and re-prompting. But GBNF, from what I can tell, eliminates illegal next tokens from the probability list (compiled into, meaning that it is relatively fail-safe and does not waste time having to go back and re-generate. If anyone knows of other gems like GBNF, I would love to hear - I may be overlooking other great solutions.
Author
Owner

@rick-github commented on GitHub (Aug 13, 2025):

I also was struggling to understand how the Ollama JSON response format was implemented

Ollama use GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and only apply the GBNF grammer during content generation.

<!-- gh-comment-id:3184590697 --> @rick-github commented on GitHub (Aug 13, 2025): > I also was struggling to understand how the Ollama JSON response format was implemented Ollama use GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and only apply the GBNF grammer during content generation.
Author
Owner

@Croups commented on GitHub (Aug 13, 2025):

same issue here, I noticed that sometimes it returns null while using it via pydantic-ai, I tested it without defining a structured output schema, it is generating the answer with tags here is a sample :

AgentRunResult(output='{"analysis":"The user says 'hi how are you', which is a greeting and question about how I am. The user wants to know what ChatGPT is. So respond with a friendly greeting and explanation.<|channel|>commentary:"} ')

This is why when you define a model, pydantic can't parse it.

<!-- gh-comment-id:3185021224 --> @Croups commented on GitHub (Aug 13, 2025): same issue here, I noticed that sometimes it returns null while using it via pydantic-ai, I tested it without defining a structured output schema, it is generating the answer with tags here is a sample : AgentRunResult(output='{"analysis":"The user says \'hi how are you\', which is a greeting and question about how I am. The user wants to know what ChatGPT is. So respond with a friendly greeting and explanation.<|channel|>commentary:"} ') This is why when you define a model, pydantic can't parse it.
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 15, 2025):

I also was struggling to understand how the Ollama JSON response format was implemented

Ollama use GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and only apply the GBNF grammer during content generation.

Can't it reason silently and not emit these symbols? Besides, I have been trying to disable reasoning as it just adds latency, and I can easily add reasoning fields to the json output of non-reasoning models if and when it makes sense for the application. Sometimes I have found it creates a better agent, other times it is just wasting electricity and time.

<!-- gh-comment-id:3190491870 --> @nicholas-johnson-techxcel commented on GitHub (Aug 15, 2025): > > I also was struggling to understand how the Ollama JSON response format was implemented > > Ollama use GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and only apply the GBNF grammer during content generation. Can't it reason silently and not emit these symbols? Besides, I have been trying to disable reasoning as it just adds latency, and I can easily add reasoning fields to the json output of non-reasoning models if and when it makes sense for the application. Sometimes I have found it creates a better agent, other times it is just wasting electricity and time.
Author
Owner

@rick-github commented on GitHub (Aug 15, 2025):

Models don't have an internal monologue or subconscious, all they do is probabilistically generate tokens. "Reasoning" models are trained to generate "thinking" tokens as a way to guide the generation of tokens in the response phase, but it's just tokens all the way down.

<!-- gh-comment-id:3191207273 --> @rick-github commented on GitHub (Aug 15, 2025): Models don't have an internal monologue or subconscious, all they do is probabilistically generate tokens. "Reasoning" models are trained to generate "thinking" tokens as a way to guide the generation of tokens in the response phase, but it's just tokens all the way down.
Author
Owner

@adamoutler commented on GitHub (Aug 15, 2025):

Confirmed. Same issue. Any model except gpt-oss seems to work with Structured Outputs. gpt-oss returns a blank. I hope to see an adapter later in Ollama soon.

<!-- gh-comment-id:3192394511 --> @adamoutler commented on GitHub (Aug 15, 2025): Confirmed. Same issue. Any model except `gpt-oss` seems to work with Structured Outputs. `gpt-oss` returns a blank. I hope to see an adapter later in Ollama soon.
Author
Owner

@adamoutler commented on GitHub (Aug 15, 2025):

Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point.

<!-- gh-comment-id:3192468285 --> @adamoutler commented on GitHub (Aug 15, 2025): Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point.
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 20, 2025):

Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point.

I did this in Python and llama.cpp but I since found in the Ollama source code a JSON=>GBNF compiler. You just need to hit llama.cpp with grammar=gbnf and then it works, but because it is a reasoning model, it then becomes a bit unhinged and tries to insert reasoning into json fields instead of keeping reasoning internally.

All Ollama had to do is use the grammar just like it already does and it would work to the extent which I get from llama.cpp (we still need to stop it from reasoning when we use think=False which it ignores) but for some reason they seemed to have made an exception for this model and hence they broke it. If I get some time I can look at their code and give a patch.

This also begs the question: if Ollama is a wrapper around llama.cpp then it could just become a python library which adds features to llama.cpp (basically a llama.cpp client library) and the actual Ollama server become a heap of scripts for running llama.cpp as a service and automatically pulling models down for it.

<!-- gh-comment-id:3203890207 --> @nicholas-johnson-techxcel commented on GitHub (Aug 20, 2025): > Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point. I did this in Python and llama.cpp but I since found in the Ollama source code a JSON=>GBNF compiler. You just need to hit llama.cpp with `grammar=gbnf` and then it works, but because it is a reasoning model, it then becomes a bit unhinged and tries to insert reasoning into json fields instead of keeping reasoning internally. All Ollama had to do is use the grammar just like it already does and it would work to the extent which I get from llama.cpp (we still need to stop it from reasoning when we use `think=False` which it ignores) but for some reason they seemed to have made an exception for this model and hence they broke it. If I get some time I can look at their code and give a patch. This also begs the question: if Ollama is a wrapper around llama.cpp then it could just become a python library which adds features to llama.cpp (basically a llama.cpp client library) and the actual Ollama server become a heap of scripts for running llama.cpp as a service and automatically pulling models down for it.
Author
Owner

@mpauly commented on GitHub (Aug 20, 2025):

@nicholas-johnson-techxcel With regards to llama.cpp: there is an open issue for structured outputs in llama.cpp and things are mostly working.
Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama

<!-- gh-comment-id:3207428303 --> @mpauly commented on GitHub (Aug 20, 2025): @nicholas-johnson-techxcel With regards to llama.cpp: there is an [open issue](https://github.com/ggml-org/llama.cpp/issues/15276#issuecomment-3201937062) for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama
Author
Owner

@ParthSareen commented on GitHub (Aug 20, 2025):

Hey @nicholas-johnson-techxcel @mpauly that's not how it works - we're not consuming any of the llama.cpp changes for structured outputs - although we do use GBNF.

You can't turn thinking "off" for this model - those tokens have to get generated as this model follows the Harmony format.

@adamoutler As mentioned above I have started working on this. It's not a trivial change unfortunately and needs some moving around of where our parsers current live. Thanks for your patience friends!

Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!

<!-- gh-comment-id:3208186557 --> @ParthSareen commented on GitHub (Aug 20, 2025): Hey @nicholas-johnson-techxcel @mpauly that's not how it works - we're not consuming any of the llama.cpp changes for structured outputs - although we do use GBNF. You can't turn thinking "off" for this model - those tokens have to get generated as this model follows the [Harmony](https://github.com/openai/harmony) format. @adamoutler As mentioned above I have started working on this. It's not a trivial change unfortunately and needs some moving around of where our parsers current live. Thanks for your patience friends! > Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 25, 2025):

@nicholas-johnson-techxcel With regards to llama.cpp: there is an open issue for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama

I thought that llama.cpp did not have the structured output field, that you have to give GBNF, and it most certainly is working already, I now just compile JSON to GBNF and reluctantly make the root element to be <thinking>.*</thinking>{json} or just <thinking></thinking>{json} if I want to disable reasoning, and I just capture the thinking tags in a state machine, emit chunk messages with role="thinking" for those (and handle any split messages) and either put the structured messages through a JSON stream parser, or accumulate them and then parse for tool calling.

The way I see it, the issue is Ollama.

If anyone is wondering, forcing it to output <thinking></thinking> does seem to properly disable reasoning - despite those saying it cannot be - it stops it from trying to cram reasoning into JSON fields, and this results in significant decreases in latency.

This model still might be one of my favourites for local use, but it has still not caught up to 4o.

<!-- gh-comment-id:3219264325 --> @nicholas-johnson-techxcel commented on GitHub (Aug 25, 2025): > [@nicholas-johnson-techxcel](https://github.com/nicholas-johnson-techxcel) With regards to llama.cpp: there is an [open issue](https://github.com/ggml-org/llama.cpp/issues/15276#issuecomment-3201937062) for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama I thought that llama.cpp did not have the structured output field, that you have to give GBNF, and it most certainly is working already, I now just compile JSON to GBNF and reluctantly make the root element to be `<thinking>.*</thinking>{json}` or just `<thinking></thinking>{json}` if I want to disable reasoning, and I just capture the thinking tags in a state machine, emit chunk messages with role="thinking" for those (and handle any split messages) and either put the structured messages through a JSON stream parser, or accumulate them and then parse for tool calling. The way I see it, the issue is Ollama. If anyone is wondering, forcing it to output `<thinking></thinking>` does seem to properly disable reasoning - despite those saying it cannot be - it stops it from trying to cram reasoning into JSON fields, and this results in significant decreases in latency. This model still might be one of my favourites for local use, but it has still not caught up to 4o.
Author
Owner

@CL415 commented on GitHub (Aug 26, 2025):

In case somebody needs to force GPT-OSS to output valid JSON despite Ollama's shortcomings like I had, I found some success using Pydantic AI .run_sync, with the Ollama server as provider, although using the retrial parameter since sometimes OSS does not comply on the first shot.

<!-- gh-comment-id:3223006183 --> @CL415 commented on GitHub (Aug 26, 2025): In case somebody needs to force GPT-OSS to output valid JSON despite Ollama's shortcomings like I had, I found some success using Pydantic AI `.run_sync`, with the Ollama server as `provider`, although using the retrial parameter since sometimes OSS does not comply on the first shot.
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 27, 2025):

Okay in terms of progress on my side. It was working extremely well except I had to clamp the number of analysis/thinking chars to stop it rambling. But it seems that if I wrap the JSON-GBNF with GBNF for the harmony format, it no longer needs clamping. But I cannot force it with GBNF to emit tokens like <|end|> which is an issue, because then feeding it into the openai-harmony library encoder, it does not strip the harmony frames from it properly. It's not actually that hard to process streaming chunks using a state machine, but still, it would be best doing it properly. Anyone have any ideas on forcing <|end|> to be emitted with GBNF?

<!-- gh-comment-id:3226206654 --> @nicholas-johnson-techxcel commented on GitHub (Aug 27, 2025): Okay in terms of progress on my side. It was working extremely well except I had to clamp the number of analysis/thinking chars to stop it rambling. But it seems that if I wrap the JSON-GBNF with GBNF for the harmony format, it no longer needs clamping. But I cannot force it with GBNF to emit tokens like <|end|> which is an issue, because then feeding it into the openai-harmony library encoder, it does not strip the harmony frames from it properly. It's not actually that hard to process streaming chunks using a state machine, but still, it would be best doing it properly. Anyone have any ideas on forcing <|end|> to be emitted with GBNF?
Author
Owner

@erennyuksell commented on GitHub (Aug 29, 2025):

any reliable solution?

<!-- gh-comment-id:3236450633 --> @erennyuksell commented on GitHub (Aug 29, 2025): any reliable solution?
Author
Owner

@steenharsted commented on GitHub (Aug 29, 2025):

I’m experiencing the same issue with gpt-oss:20b usingchat_ollama() and chat_structured() from ellmer. The model consistently returns truncated or non-JSON responses. This error persists even after:

  • Explicit JSON schema enforcement
  • Prompting “reply only in JSON”
  • Minimizing prompt size
  • Switching to other models (which work)

The output appears to be malformed almost every time.

Are there any plans to improve structured output compliance for gpt-oss:20b?

Thanks

<!-- gh-comment-id:3236562075 --> @steenharsted commented on GitHub (Aug 29, 2025): I’m experiencing the same issue with `gpt-oss:20b` using`chat_ollama()` and `chat_structured()` from `ellmer`. The model consistently returns truncated or non-JSON responses. This error persists even after: - Explicit JSON schema enforcement - Prompting “reply only in JSON” - Minimizing prompt size - Switching to other models (which work) The output appears to be malformed almost every time. Are there any plans to improve structured output compliance for `gpt-oss:20b`? Thanks
Author
Owner

@josemita87 commented on GitHub (Aug 29, 2025):

Same issue here with structured outputs...

<!-- gh-comment-id:3236568441 --> @josemita87 commented on GitHub (Aug 29, 2025): Same issue here with structured outputs...
Author
Owner

@rick-github commented on GitHub (Aug 29, 2025):

> Are there any plans to improve structured output compliance for gpt-oss:20b?

https://github.com/ollama/ollama/issues/11691#issuecomment-3181220084

<!-- gh-comment-id:3236570458 --> @rick-github commented on GitHub (Aug 29, 2025): > Are there any plans to improve structured output compliance for `gpt-oss:20b`? https://github.com/ollama/ollama/issues/11691#issuecomment-3181220084
Author
Owner

@rakadam commented on GitHub (Sep 1, 2025):

I have the same problem. I looked at the code, and Ollama is designed so that the HarmonyParser runs at a high level, while the sampler runs in the C++ code with some Go code as glue. It is not possible to connect them, so the sampler cannot know when it is supposed to apply the grammar. Since the grammar is only valid inside the message, and the Harmony formatting sits outside the message, this is a big problem.

One not terribly insane solution, already mentioned in this thread: implement a minimalistic Harmony parser in the Go sampler glue code so it knows when to enable grammar constraining. Alternatively, the HarmonyParser could basically be called in both layers.

<!-- gh-comment-id:3240513568 --> @rakadam commented on GitHub (Sep 1, 2025): I have the same problem. I looked at the code and Ollama was designed in a way that the HarmonyParser runs on high level, while the sampler runs in the cpp code, with some go code for glue. And it is not possible to connect them, so the sampler cannot know when it is supposed to apply the grammar or not. Since the grammar is only valid _inside_ the message, and Harmony formatting is outside the message, this is a big problem. One not terribly insane solution, already mentioned in this thread: implementing a minimalistic Harmony parser in the go sampler glue code, so it knows when to enable the grammar constraining. Or this could be calling HarmonyParser basically in both layers.
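For illustration only, a minimal Python sketch of that idea: a tiny Harmony-aware gate that tells the sampler when the JSON grammar should be active. The real change would live in Ollama's Go/C++ glue code; the marker strings and the final-channel detection below are simplified assumptions, not Ollama's actual implementation:

```python
class GrammarGate:
    """Track Harmony framing and report whether grammar constraining should be on."""

    FINAL_HEADER = "<|channel|>final<|message|>"   # simplified: ignores <|constrain|> variants
    CLOSERS = ("<|end|>", "<|return|>")

    def __init__(self) -> None:
        self.tail = ""
        self.constrain = False

    def feed(self, piece: str) -> bool:
        """Feed the decoded text of the latest token; True while inside the final message body."""
        if self.constrain and piece in self.CLOSERS:
            self.constrain = False      # message body closed, stop constraining
        self.tail = (self.tail + piece)[-64:]
        if not self.constrain and self.tail.endswith(self.FINAL_HEADER):
            self.constrain = True       # entering the final message body
            self.tail = ""
        return self.constrain
```

The sampler would consult feed() after each decoded token and only apply the JSON grammar while it returns True, which is roughly what "calling HarmonyParser in both layers" would achieve.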
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Sep 2, 2025):

Okay got it done. The steps are:

  • Use llamacpp /completions (not /v1/completions - this is not the same endpoint)
  • With prompt as list[int] which is list of output tokens from enc.render_conversation_for_completion(Conversation.from_messages(messages), Role.ASSISTANT) from openai_harmony library
  • tokens=True in body
  • Write a gbnf compiler which takes a json schema as input
  • Rename the "root" rule to something else
  • gbnf["root"] = f'thinking-block "{tok_start}" "{tok_assistant}" "{tok_channel}" "final" "{tok_constrain}" "json" "{tok_message}" json-root "{tok_end}"' where the tok_X are like "<|start|>", etc
  • thinking-block is basically the same thing except with analysis in the channel name, and any number of characters which are not "<" as the thought content
  • Send the request
  • Parse response tokens through StreamableParser.process() one by one
  • You have a new channel every time you find the token=200002 - use this to split up messages - read StreamableParser.current_channel to know if
  • We thought there was a bug in StreamableParser.last_content_delta: it appeared to be contaminated with Harmony tags that should have been filtered out, so we used current_content and diffed it against the previous iteration to get the channel delta. Never mind, though: I have since switched back to last_content_delta and it has stopped doing that.

Finally the model has stopped being unhinged. It is extremely fast and consistent as an agent. This is gpt-oss:20b; I still doubt that gpt-oss:120b will justify its size with performance, but I guess we will find out in a while once we have the 128GB MacBook Pro. Mine is only 64GB.

<!-- gh-comment-id:3244671027 --> @nicholas-johnson-techxcel commented on GitHub (Sep 2, 2025): Okay got it done. The steps are: - Use llamacpp `/completions` (not `/v1/completions` - this is not the same endpoint) - With prompt as list[int] which is list of output tokens from `enc.render_conversation_for_completion(Conversation.from_messages(messages), Role.ASSISTANT)` from openai_harmony library - `tokens=True` in body - Write a gbnf compiler which takes a json schema as input - Rename the "root" rule to something else - `gbnf["root"] = f'thinking-block "{tok_start}" "{tok_assistant}" "{tok_channel}" "final" "{tok_constrain}" "json" "{tok_message}" json-root "{tok_end}"'` where the tok_X are like "<|start|>", etc - `thinking-block` is basically the same thing except with `analysis` in the channel name, and any number of characters which are not "<" as the thought content - Send the request - Parse response tokens through `StreamableParser.process()` one by one - You have a new channel every time you find the token=200002 - use this to split up messages - read `StreamableParser.current_channel` to know if - We have a bug in `StreamableParser.last_content_delta` where we cannot use it because it is contaminated with Harmony tags which should have been filtered out so instead we use `current_content` and diff that from last iteration to get the channel delta. Actually nevermind, turns out I have been using last_content_delta now, and it has stopped doing that. Finally the model has stopped being unhinged. It is extremely fast and consistent as an agent. `gpt-oss:20b` and I still doubt the `gpt-oss:120b` will justify its size with performance, but I guess we will find out in a while once we have the 128GB Macbook Pro. Mine is only 64GB.
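A condensed sketch of the Harmony side of those steps, using the openai_harmony Python package; the llama.cpp request itself is omitted here, and completion_tokens stands in for whatever token IDs the server returns:

```python
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    StreamableParser,
    load_harmony_encoding,
)

enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

# Render the conversation into prompt token IDs (these are what gets POSTed to
# llama.cpp /completions together with the harmony-wrapped GBNF, per the steps above).
convo = Conversation.from_messages(
    [Message.from_role_and_content(Role.USER, "Hello, how are you?")]
)
prompt_tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)

# Token IDs returned by the server; empty placeholder for this sketch.
completion_tokens: list[int] = []

parser = StreamableParser(enc, role=Role.ASSISTANT)
thinking, final = [], []
for tok in completion_tokens:
    parser.process(tok)
    delta = parser.last_content_delta or ""
    if parser.current_channel == "analysis":
        thinking.append(delta)   # reasoning trace
    elif parser.current_channel == "final":
        final.append(delta)      # the JSON that the grammar constrained

print("".join(final))
```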
Author
Owner

@adamoutler commented on GitHub (Sep 2, 2025):

> Re: @nicholas-johnson-techxcel
> ...
>
>   • Send the request
>   • Parse response tokens through StreamableParser.process() one by one
>   • You have a new channel every time you find the token=200002 - use this to split up messages - read StreamableParser.current_channel to know if
>
> ...

This caught my attention. I asked ChatGPT more about harmony format.

| Token ID | Role / Function                  |
|----------|----------------------------------|
| 199998   | Beginning of sequence (BOS)      |
| 199999   | Padding (PAD)                    |
| 200000   | End of text (EOT)                |
| 200001   | Reserved special (unused)        |
| 200002   | End of sequence / Return (EOS)   |
<!-- gh-comment-id:3244953457 --> @adamoutler commented on GitHub (Sep 2, 2025): > Re: @nicholas-johnson-techxcel > ... > * Send the request > * Parse response tokens through `StreamableParser.process()` one by one > * You have a new channel every time you find the token=200002 - use this to split up messages - read `StreamableParser.current_channel` to know if > > ... This caught my attention. I asked ChatGPT more about harmony format. | Token ID | Role / Function | |----------|----------------------------------| | 199998 | Beginning of sequence (BOS) | | 199999 | Padding (PAD) | | 200000 | End of text (EOT) | | 200001 | Reserved special (unused) | | 200002 | End of sequence / Return (EOS) |
Author
Owner

@MarioRicoIbanez commented on GitHub (Sep 3, 2025):

+1 having the same issue

<!-- gh-comment-id:3248532858 --> @MarioRicoIbanez commented on GitHub (Sep 3, 2025): +1 having the same issue
Author
Owner

@inf-bud commented on GitHub (Sep 3, 2025):

+1 having the same issue

<!-- gh-comment-id:3249122091 --> @inf-bud commented on GitHub (Sep 3, 2025): +1 having the same issue
Author
Owner

@shahidazim commented on GitHub (Sep 3, 2025):

+1 having the same issue

<!-- gh-comment-id:3250790300 --> @shahidazim commented on GitHub (Sep 3, 2025): +1 having the same issue
Author
Owner

@ParthSareen commented on GitHub (Sep 3, 2025):

Hey everyone! This is currently being worked on - trying to get it to y'all asap. https://github.com/ollama/ollama/pull/12052

<!-- gh-comment-id:3250795684 --> @ParthSareen commented on GitHub (Sep 3, 2025): Hey everyone! This is currently being worked on - trying to get it to y'all asap. https://github.com/ollama/ollama/pull/12052
Author
Owner

@sheneman commented on GitHub (Sep 3, 2025):

@ParthSareen Thank you SO much!

<!-- gh-comment-id:3250880082 --> @sheneman commented on GitHub (Sep 3, 2025): @ParthSareen Thank you SO much!
Author
Owner

@MarioRicoIbanez commented on GitHub (Sep 4, 2025):

@ParthSareen Thanks!

<!-- gh-comment-id:3252381334 --> @MarioRicoIbanez commented on GitHub (Sep 4, 2025): @ParthSareen Thanks!
Author
Owner

@kiwamizamurai commented on GitHub (Sep 4, 2025):

@ParthSareen so nice

<!-- gh-comment-id:3253111982 --> @kiwamizamurai commented on GitHub (Sep 4, 2025): @ParthSareen so nice
Author
Owner

@Seyid-cmd commented on GitHub (Sep 5, 2025):

@ParthSareen thanks

<!-- gh-comment-id:3257370259 --> @Seyid-cmd commented on GitHub (Sep 5, 2025): @ParthSareen thanks
Author
Owner

@ParthSareen commented on GitHub (Sep 6, 2025):

You guys can use this branch until I get it into main: https://github.com/ollama/ollama/tree/parth/gpt-oss-structured-outputs 😁

Would also love to know what you use structured outputs for if you do give the branch a shot

<!-- gh-comment-id:3260191601 --> @ParthSareen commented on GitHub (Sep 6, 2025): You guys can use this branch until I get it into main: https://github.com/ollama/ollama/tree/parth/gpt-oss-structured-outputs 😁 Would also love to know what you use structured outputs for if you do give the branch a shot
Author
Owner

@adamoutler commented on GitHub (Sep 6, 2025):

> Would also love to know what you use structured outputs for if you do give the branch a shot

Analyzing test results and reacting to binary/enum decisions.

How does it work? What did you do with the thinking?

<!-- gh-comment-id:3260811595 --> @adamoutler commented on GitHub (Sep 6, 2025): > Would also love to know what you use structured outputs for if you do give the branch a shot Analyzing test results and reacting to binary/enum decisions. How does it work? What did you do with the thinking?
Author
Owner

@asabla commented on GitHub (Sep 6, 2025):

Mostly, when interacting with LLMs, I want to avoid writing too much fuzzy validation code (e.g. making sure all the needed data is there). Structured output is basically a very convenient shortcut for doing so. On top of that, most agentic frameworks for building reliable workflows use structured output under the hood for the same reasons.

Haven't had the time to test out the feature branch yet, but I'll get back to you when I've done so @ParthSareen

<!-- gh-comment-id:3261686248 --> @asabla commented on GitHub (Sep 6, 2025): Mostly when interacting with LLMs I want to avoid writing too much fuzzy validation code (e.g making sure all needed data is there). Structured output is basically a very convenient shortcut for doing so. On top of that, most agentic frameworks for building reliable workflows, is using structured output under the hood for the same reasons. Haven't had the time to test out the feature branch yet, but I'll get back to you when I've done so @ParthSareen
Author
Owner

@sheneman commented on GitHub (Sep 6, 2025):

@ParthSareen The fix you implemented appears to generally work! Structured outputs with gpt-oss are working for me as expected and are present in the content field of the response. Reasoning traces are located in the "thinking" field. This is much improved behavior, and I am very grateful for your help! THANK YOU.

I did have a couple of observations, as there are still some inconsistencies between responses from gpt-oss and other thinking models:

  1. gpt-oss now provides separate reasoning traces, even if you specify "think": False. This is not a horrible default behavior, but technically it is incorrect and differs from other thinking models (e.g. qwen3), which honor the "think" boolean for controlling thinking output as described here: https://ollama.com/blog/thinking

  2. Other thinking models (qwen3) don't behave in the same way as gpt-oss:
    a. If you set "thinking": False, there will be no thinking trace (correct for qwen, fails for gpt-oss)
    b. If you use thinking mode and structured outputs with qwen3, it still will not emit a thinking trace (BUG)

<!-- gh-comment-id:3263238687 --> @sheneman commented on GitHub (Sep 6, 2025): @ParthSareen **_The fix you implemented appears to generally work_**! Structured outputs with gpt-oss are working for me as expected and are present in the content field of the response. Reasoning traces are located in the "thinking" field. This is very improved behavior, and I am _very_ grateful for your help! **THANK YOU**. I did have a couple observations, as there still are some inconsistencies with responses from gpt-oss compared to other thinking models: 1. gpt-oss now provides separate reasoning traces, **_even if you specify "think": False._** This is not a horrible default behavior, but technically it is incorrect and different than other thinking models (e.g. qwen3) which honor the "think" boolean for controlling thinking output as described here: [https://ollama.com/blog/thinking](https://ollama.com/blog/thinking) 2. Other thinking models (qwen3) don't behave in the same way as gpt-oss: a. If you set "thinking": False, there will be no thinking trace (correct for qwen, fails for gpt-oss) b. If you use thinking mode **_and_** structured outputs with qwen3, it still will **_not_** emit a thinking trace (BUG) <img width="688" height="546" alt="Image" src="https://github.com/user-attachments/assets/d19c2f37-11ad-4797-b861-81b6cd63ce9d" /> <img width="700" height="494" alt="Image" src="https://github.com/user-attachments/assets/860a5838-32f0-464b-93b6-3ec474d65d56" />
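A quick way to reproduce observation 1 against the native /api/chat endpoint, assuming the branch build is running locally; the think flag and the message.thinking field used below are the ones described in the thinking blog post linked above:

```python
import requests

schema = {
    "type": "object",
    "properties": {"response": {"type": "string"}},
    "required": ["response"],
}

r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gpt-oss:20b",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "format": schema,
        "think": False,   # honored by qwen3; gpt-oss still emits a reasoning trace
        "stream": False,
    },
    timeout=300,
)
message = r.json()["message"]
print("thinking:", message.get("thinking"))  # non-empty for gpt-oss even with think=False
print("content:", message["content"])        # the schema-constrained JSON
```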
Author
Owner

@adamoutler commented on GitHub (Sep 7, 2025):

I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.

<!-- gh-comment-id:3263313782 --> @adamoutler commented on GitHub (Sep 7, 2025): I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.
Author
Owner

@ParthSareen commented on GitHub (Sep 7, 2025):

@sheneman @adamoutler is correct. The thinking cannot be turned off for gpt-oss - you can only do low, medium, and high. And currently my PR only supports gpt-oss as a trial. Going to do thinking models as a whole next!

<!-- gh-comment-id:3264037852 --> @ParthSareen commented on GitHub (Sep 7, 2025): @sheneman @adamoutler is correct. the thinking cannot be turned off for gpt-oss - you can only do `low` `medium`, and `high`. And currently my PR only supports gpt-oss as a trial. Going to do thinking models as a whole next!
Author
Owner

@sheneman commented on GitHub (Sep 7, 2025):

@adamoutler @ParthSareen Thank you! While you can't actually turn off thinking in gpt-oss, you could set thinking to "low" and then suppress or mask the thinking trace. This would maintain response-format compatibility with other thinking models. I could also see why you would prefer to output the thinking trace since it's being generated anyway. It's easy enough to ignore if needed, so not a huge deal either way.

And Thank you @ParthSareen for now attacking the issue of structured outputs X thinking mode in the other models!!! With that, Ollama becomes so much more compelling for our organization!

<!-- gh-comment-id:3264116471 --> @sheneman commented on GitHub (Sep 7, 2025): @adamoutler @ParthSareen Thank you! While you can't actually turn off thinking in gpt-oss, you _could_ set thinking to "low" and then suppress or mask the thinking trace. This would maintain response format compatibility with other thinking models. I could also see why you would prefer to output the thinking trace since its being generated anyway. It's easy enough to ignore if needed, so not a huge deal either way. And **Thank you** @ParthSareen for now attacking the issue of structured outputs X thinking mode in the other models!!! With that, Ollama becomes so much more compelling for our organization!
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Sep 8, 2025):

> I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.

You can force it to emit an empty thinking tag using GBNF if you wish to save time.

<!-- gh-comment-id:3264478540 --> @nicholas-johnson-techxcel commented on GitHub (Sep 8, 2025): > I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though. You can force it to emit an empty thinking tag using GBNF if you wish to save time.
Author
Owner

@ParthSareen commented on GitHub (Sep 8, 2025):

> You can force it to emit an empty thinking tag using GBNF if you wish to save time.

You could, but you'd be breaking the format the model was trained on. From experience, the model is very sensitive to breaking the format, which results in poor outputs. So your mileage may vary with that.

<!-- gh-comment-id:3267288766 --> @ParthSareen commented on GitHub (Sep 8, 2025): > You can force it to emit an empty thinking tag using GBNF if you wish to save time. You could but you're breaking the format the model was trained on. From experience the model is very sensitive to breaking the format which results in poor outputs. So your mileage may vary with that.
Author
Owner

@vishalgoel2 commented on GitHub (Sep 14, 2025):

I tested the fix PR branch for structured outputs and it does improve things — simple structured outputs work now.

However, I’m running into mixed results when using it with browser-use + gpt-oss:20b. With the release version of Ollama, it fails consistently with the familiar

Invalid JSON: expected value at line 1 column 1 [type=json_invalid]

On the fix branch, sometimes it works, but other times I see warnings like this in the logs:

level=WARN source=harmonyparser.go:429 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=browser.extract_structured_data

and then browser-use errors out with

 ("1 validation error for AgentOutput\n  Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid", 502)

So it looks like the PR handles some structured output cases, but not all. Not sure yet if browser-use is passing the tool schema in a way Ollama doesn’t expect, or if the fix still misses some scenarios.

<!-- gh-comment-id:3289904118 --> @vishalgoel2 commented on GitHub (Sep 14, 2025): I tested the fix PR branch for structured outputs and it does improve things — simple structured outputs work now. However, I’m running into mixed results when using it with [`browser-use`](https://github.com/browser-use/browser-use) + `gpt-oss:20b`. With the release version of Ollama, it fails consistently with the familiar ``` Invalid JSON: expected value at line 1 column 1 [type=json_invalid] ``` On the fix branch, sometimes it works, but other times I see warnings like this in the logs: ``` level=WARN source=harmonyparser.go:429 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=browser.extract_structured_data ``` and then `browser-use` errors out with ``` ("1 validation error for AgentOutput\n Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]\n For further information visit https://errors.pydantic.dev/2.11/v/json_invalid", 502) ``` So it looks like the PR handles some structured output cases, but not all. Not sure yet if `browser-use` is passing the tool schema in a way Ollama doesn’t expect, or if the fix still misses some scenarios.
Author
Owner

@trebor commented on GitHub (Sep 20, 2025):

i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json schema in the format field.

<!-- gh-comment-id:3315185715 --> @trebor commented on GitHub (Sep 20, 2025): i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json scheme in the format field.
Author
Owner

@ParthSareen commented on GitHub (Sep 20, 2025):

> i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json schema in the format field.

Hi @trebor it's not released yet

<!-- gh-comment-id:3315186530 --> @ParthSareen commented on GitHub (Sep 20, 2025): > i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json scheme in the format field. Hi @trebor it's not released yet
Author
Owner

@srshkmr commented on GitHub (Sep 24, 2025):

Hi @ParthSareen, any ETA on the release? Are there changes required on the PR?

<!-- gh-comment-id:3326947186 --> @srshkmr commented on GitHub (Sep 24, 2025): Hi @ParthSareen any ETA on the release? is there changes required on the PR?
Author
Owner

@MarioRicoIbanez commented on GitHub (Sep 29, 2025):

Any news on when it will be released?

<!-- gh-comment-id:3346091473 --> @MarioRicoIbanez commented on GitHub (Sep 29, 2025): Any news on when it will be released?
Author
Owner

@AlexanderKozhevin commented on GitHub (Oct 2, 2025):

funny thing, structured output does work on Groq cloud

<!-- gh-comment-id:3359845010 --> @AlexanderKozhevin commented on GitHub (Oct 2, 2025): funny thing, structured output does work on Groq cloud
Author
Owner

@ParthSareen commented on GitHub (Oct 2, 2025):

Had to make some updates to how we ran it. Just put up another PR. Aiming for next release.

<!-- gh-comment-id:3359852913 --> @ParthSareen commented on GitHub (Oct 2, 2025): Had to make some updates to how we ran it. Just put up another PR. Aiming for next release.
Author
Owner

@bogzbonny commented on GitHub (Oct 12, 2025):

haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)

<!-- gh-comment-id:3394006595 --> @bogzbonny commented on GitHub (Oct 12, 2025): haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)
Author
Owner

@ParthSareen commented on GitHub (Oct 12, 2025):

> haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)

Hmm it should be working... can you try running ollama run gpt-oss --format json hello!

and see if it shows thinking + the final output? if so it might be some weird client behavior

<!-- gh-comment-id:3394009427 --> @ParthSareen commented on GitHub (Oct 12, 2025): > haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen) Hmm it should be working... can you try running `ollama run gpt-oss --format json hello!` and see if it shows thinking + the final output? if so it might be some weird client behavior
Author
Owner

@vansatchen commented on GitHub (Oct 12, 2025):

> Hmm it should be working... can you try running ollama run gpt-oss --format json hello!
>
> and see if it shows thinking + the final output? if so it might be some weird client behavior

ollama run gpt-oss:20b --format json hello!
We need to respond to ":", basically greet, friendly.Hello! 👋 How can I help you today?Thinking...
We responded.We are done.
...done thinking.

Hey there! What's on your mind today? 😊 "}Error: error parsing tool call: raw='Hey there! What’s on your mind today? 😊 <|constrain|>  <|constrain|><|constrain|>.} ', err=invalid character 'H' looking for beginning of value

ollama -v
ollama version is 0.12.5

<!-- gh-comment-id:3394319282 --> @vansatchen commented on GitHub (Oct 12, 2025): > Hmm it should be working... can you try running `ollama run gpt-oss --format json hello!` > > and see if it shows thinking + the final output? if so it might be some weird client behavior ollama run gpt-oss:20b --format json hello! We need to respond to ":", basically greet, friendly.Hello! 👋 How can I help you today?Thinking... We responded.We are done. ...done thinking. Hey there! What's on your mind today? 😊✨ "}Error: error parsing tool call: raw='Hey there! What’s on your mind today? 😊 <|constrain|>  <|constrain|><|constrain|>.} ', err=invalid character 'H' looking for beginning of value ollama -v ollama version is 0.12.5
Author
Owner

@sheneman commented on GitHub (Oct 12, 2025):

So just to be clear, the fix for this issue has not yet been merged to main, as of 0.12.5?

<!-- gh-comment-id:3394677804 --> @sheneman commented on GitHub (Oct 12, 2025): So just to be clear, the fix for this issue has not yet been merged to main, as of 0.12.5?
Author
Owner

@ParthSareen commented on GitHub (Oct 12, 2025):

Ah I gave the wrong query @vansatchen @bogzbonny. Run just ollama run gpt-oss --format json and then type something to the model.

I haven't updated the generate endpoint yet there's some refactoring to do. Let me know if this doesn't work.

<!-- gh-comment-id:3394899420 --> @ParthSareen commented on GitHub (Oct 12, 2025): Ah I gave the wrong query @vansatchen @bogzbonny. Run just `ollama run gpt-oss --format json` and then type something to the model. I haven't updated the generate endpoint yet there's some refactoring to do. Let me know if this doesn't work.
Author
Owner

@vansatchen commented on GitHub (Oct 12, 2025):

> Ah I gave the wrong query @vansatchen @bogzbonny. Run just ollama run gpt-oss --format json and then type something to the model.
>
> I haven't updated the generate endpoint yet there's some refactoring to do. Let me know if this doesn't work.

ollama run gpt-oss --format json
>>> John Dohn 26 yo
Thinking...
The user: "John Dohn 26 yo". Likely they want to talk about health? The user might be asking for medical advice, perhaps about being 
26-year-old male named John Dohn. Maybe they want to know about his health, fitness, sleep, nutrition, mental health. Or it's a user 
profile snippet. The user hasn't asked a specific question. We need to respond appropriately. Usually, we ask clarifying question or 
ask what they need. The user could be prompting for an assessment. The system guidelines: cannot provide medical advice. But can 
provide general wellness tips, encourage professional help. So we can ask: "What can I help you with regarding John Dohn? Are you 
looking for health tips?" Provide general wellness info. Let's do that.
...done thinking.

{"response":"It looks like you’re mentioning a 26‑year‑old male named John Dohn. Could you let me know what you’d like help with? For 
example, are you looking for general wellness and lifestyle advice, or is there a specific concern or goal you have in mind? I’m 
happy to offer general information and resources—just keep in mind that I can’t give personalized medical advice or replace a 
professional consultation."}

>>> Send a message (/? for help)

<!-- gh-comment-id:3394916418 --> @vansatchen commented on GitHub (Oct 12, 2025): > Ah I gave the wrong query [@vansatchen](https://github.com/vansatchen) [@bogzbonny](https://github.com/bogzbonny). Run just `ollama run gpt-oss --format json` and then type something to the model. > > I haven't updated the generate endpoint yet there's some refactoring to do. Let me know if this doesn't work. ``` ollama run gpt-oss --format json >>> John Dohn 26 yo Thinking... The user: "John Dohn 26 yo". Likely they want to talk about health? The user might be asking for medical advice, perhaps about being 26-year-old male named John Dohn. Maybe they want to know about his health, fitness, sleep, nutrition, mental health. Or it's a user profile snippet. The user hasn't asked a specific question. We need to respond appropriately. Usually, we ask clarifying question or ask what they need. The user could be prompting for an assessment. The system guidelines: cannot provide medical advice. But can provide general wellness tips, encourage professional help. So we can ask: "What can I help you with regarding John Dohn? Are you looking for health tips?" Provide general wellness info. Let's do that. ...done thinking. {"response":"It looks like you’re mentioning a 26‑year‑old male named John Dohn. Could you let me know what you’d like help with? For example, are you looking for general wellness and lifestyle advice, or is there a specific concern or goal you have in mind? I’m happy to offer general information and resources—just keep in mind that I can’t give personalized medical advice or replace a professional consultation."} >>> Send a message (/? for help) ```
Author
Owner

@bogzbonny commented on GitHub (Oct 12, 2025):

@ParthSareen Okay cool, appreciated. I tried it and got similar output to @vansatchen. I'm not sure how to feed a schema from the CLI, but within ollama-rs it appears to be using the generate endpoint, hence I think I'm still blocked on the endpoint refactor you mentioned to get this working.

(also https://github.com/ollama/ollama/pull/12460 was merged into 0.12.5 @sheneman if you look at the commit history)

<!-- gh-comment-id:3395024144 --> @bogzbonny commented on GitHub (Oct 12, 2025): @ParthSareen Okay cool, appreciated. I tried it and got similar output to @vansatchen I'm not sure how to feed a schema from the CLI but within ollama-rs it appears to be using the generate endpoints HENCE I think I'm still blocked on that endpoint refactor you mentioned get this operating. (also https://github.com/ollama/ollama/pull/12460 was merged into 0.12.5 @sheneman if you look at the commit history)
Author
Owner

@trebor commented on GitHub (Oct 16, 2025):

i have been testing ollama 0.12.5 with the most recent gpt-oss:20b, see below for specific examples. is this the expected behavior? is the change maybe still percolating through the system? am i calling it wrong?

curl commands i used to test:

curl 'http://localhost:11434/api/generate' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'

and am still seeing an empty response. for completeness:

{"model":"gpt-oss:20b","created_at":"2025-10-16T22:39:42.787141Z","response":"","done":true,"done_reason":"stop","context":[200006,17360,200008,3575,553,17554,162016,11,261,4410,6439,2359,22203,656,7788,17527,558,87447,100594,25,220,1323,19,12,3218,198,6576,3521,25,220,1323,20,12,702,12,1125,279,30377,289,25,14093,279,2,13888,18403,25,8450,11,49159,11,1721,13,21030,2804,413,7360,395,1753,3176,13,200007,200006,1428,200008,47312,261,2086,198,200007,200006,173781,16,220],"total_duration":639591959,"load_duration":152973209,"prompt_eval_count":71,"prompt_eval_duration":279207292,"eval_count":3,"eval_duration":50198457}%

if i use qwen:14b, for example:

curl 'http://localhost:11434/api/generate' --data-raw '{"model":"qwen3:14b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'

i see what i would expect:

{"model":"qwen3:14b","created_at":"2025-10-16T22:45:54.903555Z","response":"8\n\n","done":true,"done_reason":"stop","context":[151644,872,198,27052,264,1372,198,151645,198,151644,77091,198,23,271],"total_duration":594441292,"load_duration":85178125,"prompt_eval_count":12,"prompt_eval_duration":394819125,"eval_count":3,"eval_duration":89476292}%

<!-- gh-comment-id:3413155345 --> @trebor commented on GitHub (Oct 16, 2025): i have been testing ollama 0.12.5 with the most recent gpt-oss:20b, see below for specific examples. is this the expected behavior? is the change maybe still percolating through the system? am i calling it wrong? curl commands i used to test: `curl 'http://localhost:11434/api/generate' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'` and am still seeing an empty response. for completeness: `{"model":"gpt-oss:20b","created_at":"2025-10-16T22:39:42.787141Z","response":"","done":true,"done_reason":"stop","context":[200006,17360,200008,3575,553,17554,162016,11,261,4410,6439,2359,22203,656,7788,17527,558,87447,100594,25,220,1323,19,12,3218,198,6576,3521,25,220,1323,20,12,702,12,1125,279,30377,289,25,14093,279,2,13888,18403,25,8450,11,49159,11,1721,13,21030,2804,413,7360,395,1753,3176,13,200007,200006,1428,200008,47312,261,2086,198,200007,200006,173781,16,220],"total_duration":639591959,"load_duration":152973209,"prompt_eval_count":71,"prompt_eval_duration":279207292,"eval_count":3,"eval_duration":50198457}%` if i use qwen:14b, for example: `curl 'http://localhost:11434/api/generate' --data-raw '{"model":"qwen3:14b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'` i see what i would expect: `{"model":"qwen3:14b","created_at":"2025-10-16T22:45:54.903555Z","response":"8\n\n","done":true,"done_reason":"stop","context":[151644,872,198,27052,264,1372,198,151645,198,151644,77091,198,23,271],"total_duration":594441292,"load_duration":85178125,"prompt_eval_count":12,"prompt_eval_duration":394819125,"eval_count":3,"eval_duration":89476292}%`
Author
Owner

@ParthSareen commented on GitHub (Oct 16, 2025):

hi @trebor sorry for the lack of documentation at the moment. It should work for the /chat endpoint. Need to do some cleanup on /generate before we can support it there

<!-- gh-comment-id:3413160708 --> @ParthSareen commented on GitHub (Oct 16, 2025): hi @trebor sorry for the lack of documentation at the moment. It should work for the `/chat` endpoint. Need to do some cleanup on `/generate` before we can support it there
Author
Owner

@trebor commented on GitHub (Oct 16, 2025):

oh got it, thank you!

<!-- gh-comment-id:3413161757 --> @trebor commented on GitHub (Oct 16, 2025): oh got it, thank you!
Author
Owner

@ParthSareen commented on GitHub (Oct 16, 2025):

Hey folks it should be out! Closing this issue. It'll work as expected with the /chat endpoint. /generate will come at some point but might be a bit. Just wanted to unblock everyone!

<!-- gh-comment-id:3413162849 --> @ParthSareen commented on GitHub (Oct 16, 2025): Hey folks it should be out! Closing this issue. It'll work as expected with the `/chat` endpoint. `/generate` will come at some point but might be a bit. Just wanted to unblock everyone!
Author
Owner

@dhicks commented on GitHub (Oct 16, 2025):

Could I suggest leaving this open until the issue has been resolved for /generate as well?

<!-- gh-comment-id:3413168288 --> @dhicks commented on GitHub (Oct 16, 2025): Could I suggest leaving this open until the issue has been resolved for `/generate` as well?
Author
Owner

@trebor commented on GitHub (Oct 16, 2025):

btw: here is a minimal example curl that worked for me:

curl 'http://localhost:11434/api/chat' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"messages":[{"content": "choose a number", "role": "user"}]}'

huge thanks to @ParthSareen!

<!-- gh-comment-id:3413205363 --> @trebor commented on GitHub (Oct 16, 2025): btw: here is a minimal example curl that worked for me: `curl 'http://localhost:11434/api/chat' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"messages":[{"content": "choose a number", "role": "user"}]}'` huge thanks to @ParthSareen!
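The same call from Python, assuming the official ollama client package (0.4 or newer), which accepts a JSON schema in the format argument:

```python
import ollama

resp = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "choose a number"}],
    format={"type": "integer", "minimum": 1, "maximum": 10},
)
# Newer clients return a typed response object; older ones also support dict-style
# access via resp["message"]["content"].
print(resp.message.content)  # e.g. "7"
```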
Author
Owner

@jacksimpsoncartesian commented on GitHub (Oct 31, 2025):

So glad I found this - thought I was doing something wrong when I was getting the wrong structured outputs. Any word on whether this is likely to be fixed?

<!-- gh-comment-id:3470810744 --> @jacksimpsoncartesian commented on GitHub (Oct 31, 2025): So glad I found this - thought I was doing something wrong when I was getting the wrong structured outputs. Any word on whether this is likely to be fixed?
Author
Owner

@sheneman commented on GitHub (Nov 23, 2025):

Hello @ParthSareen - Has this fix been addressed in /generate and pulled into the main branch? Thank you for your consideration.

<!-- gh-comment-id:3568030561 --> @sheneman commented on GitHub (Nov 23, 2025): Hello @ParthSareen - Has this fix been addressed in /generate and pulled into the main branch? Thank you for your consideration.
Author
Owner

@chakka-guna-sekhar-venkata-chennaiah commented on GitHub (Nov 26, 2025):

@sheneman
hey hi, but it's working for me when I use the import statement from langchain_ollama import ChatOllama, where I call the model as

llm = ChatOllama(
    base_url="https://ollama-testing-gpt-oss-20b-433688334338.europe-west1.run.app",
    model="gpt-oss:20b",
    temperature=0
)

llm_structured = llm.with_structured_output(SystemHierarchyResult, include_raw=True)

it worked for me. In the response I'm getting JSON like

{
  'raw': xxx,
  'response_metadata' : xxx,
  'parsed': xxx,
  'parsed_error' : None
}
<!-- gh-comment-id:3578911754 --> @chakka-guna-sekhar-venkata-chennaiah commented on GitHub (Nov 26, 2025): @sheneman hey hi, but for its workign when i used the improt statement `from langchain_ollama import ChatOllama` where i called model as ``` llm = ChatOllama( base_url="https://ollama-testing-gpt-oss-20b-433688334338.europe-west1.run.app", model="gpt-oss:20b", temperature=0 ) llm_structured = llm.with_structured_output(SystemHierarchyResult, include_raw=True) ``` its worked for me. In the reponse im getting json as ``` { 'raw': xxx, 'response_metadata' : xxx, 'parsed': xxx, 'parsed_error' : None } ```
Author
Owner

@bogzbonny commented on GitHub (Nov 27, 2025):

still doesn't work for me (using ollama-rs)

<!-- gh-comment-id:3587179887 --> @bogzbonny commented on GitHub (Nov 27, 2025): still doesn't work for me with (using `ollama-rs`)
Author
Owner

@4IbWNsis3S commented on GitHub (Dec 7, 2025):

> still doesn't work for me (using ollama-rs)

The OpenAI harmony format used in OSS 20b and 120b breaks Ollama's response handling. It's been over two months since 20b and 120b were released, and it's still broken in the current release, 0.13.1.

<!-- gh-comment-id:3623099823 --> @4IbWNsis3S commented on GitHub (Dec 7, 2025): > still doesn't work for me with (using `ollama-rs`) OpenAI harmony format used in OSS 20b and 120b breaks Ollama response handling. It's been over two-months since 20b and 120b were released and it's still broken in current/0.13.1