[GH-ISSUE #11691] Structured output with OpenAI SDK and gpt-oss:20b not working #69795

Open
opened 2026-05-04 19:19:20 -05:00 by GiteaMirror · 87 comments
Owner

Originally created by @taagarwa-rh on GitHub (Aug 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11691

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

OpenAI SDK is unable to parse structured output from gpt-oss:20b responses. Ollama is supposed to be compatible with OpenAI SDK structured outputs per this Blog Post.

Reproducer:

import openai
from pydantic import BaseModel

class Response(BaseModel):
    
    response: str

client = openai.OpenAI(api_key="NONE", base_url="http://localhost:11434/v1")
response = client.beta.chat.completions.parse(
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        model="gpt-oss:20b",
        response_format=Response,
)
print(response)

Relevant log output

pydantic_core._pydantic_core.ValidationError: 1 validation error for Response
  Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='The user says "\n\n    \t}\n       \t\t \t\t    ', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.11.0

Originally created by @taagarwa-rh on GitHub (Aug 5, 2025). Original GitHub issue: https://github.com/ollama/ollama/issues/11691 Originally assigned to: @ParthSareen on GitHub. ### What is the issue? OpenAI SDK is unable to parse structured output from gpt-oss:20b responses. Ollama is supposed to be compatible with OpenAI SDK structured outputs per this [Blog Post](https://ollama.com/blog/structured-outputs). Reproducer: ```python import openai from pydantic import BaseModel class Response(BaseModel): response: str client = openai.OpenAI(api_key="NONE", base_url="http://localhost:11434/v1") response = client.beta.chat.completions.parse( messages=[{"role": "user", "content": "Hello, how are you?"}], model="gpt-oss:20b", response_format=Response, ) print(response) ``` ### Relevant log output ```shell pydantic_core._pydantic_core.ValidationError: 1 validation error for Response Invalid JSON: expected value at line 1 column 1 [type=json_invalid, input_value='The user says "\n\n \t}\n \t\t \t\t ', input_type=str] For further information visit https://errors.pydantic.dev/2.11/v/json_invalid ``` ### OS macOS ### GPU Apple ### CPU Apple ### Ollama version 0.11.0
GiteaMirror added the gpt-ossbug labels 2026-05-04 19:19:20 -05:00
Author
Owner

@BeatWolf commented on GitHub (Aug 5, 2025):

i think i have a similar issue with langchain and the ollamachatmodel. my pipelines that depend on structured output dont work, making the model unuseable

<!-- gh-comment-id:3156654637 --> @BeatWolf commented on GitHub (Aug 5, 2025): i think i have a similar issue with langchain and the ollamachatmodel. my pipelines that depend on structured output dont work, making the model unuseable
Author
Owner

@jbcallaghan commented on GitHub (Aug 5, 2025):

I can report the same issue, structured output doesn't work. Content = ''

<!-- gh-comment-id:3156672329 --> @jbcallaghan commented on GitHub (Aug 5, 2025): I can report the same issue, structured output doesn't work. Content = ''
Author
Owner

@sheneman commented on GitHub (Aug 6, 2025):

Structured outputs doesn't work with gpt-oss due to the use of the new Harmony response format.

I believe this can be addressed via an integration layer by the Ollama team, but the lack of structured outputs really makes using gpt-oss model useless for most serious purposes.

<!-- gh-comment-id:3157223324 --> @sheneman commented on GitHub (Aug 6, 2025): Structured outputs doesn't work with gpt-oss due to the use of the new **Harmony** response format. I believe this can be addressed via an integration layer by the Ollama team, but the lack of structured outputs really makes using gpt-oss model useless for most serious purposes.
Author
Owner

@frozenkp commented on GitHub (Aug 6, 2025):

Same issue here. I'm using Pydantic with ChatOllama, and nothing in the response content is causing a parsing exception.

<!-- gh-comment-id:3157732806 --> @frozenkp commented on GitHub (Aug 6, 2025): Same issue here. I'm using Pydantic with ChatOllama, and nothing in the response content is causing a parsing exception.
Author
Owner

@tneQpx commented on GitHub (Aug 6, 2025):

Same issue. Message.Content = "

When using ollama chat with format

<!-- gh-comment-id:3157840032 --> @tneQpx commented on GitHub (Aug 6, 2025): Same issue. Message.Content = " When using ollama chat with format
Author
Owner

@KlausGPaul commented on GitHub (Aug 6, 2025):

As a workaround, it seems as if adding the desired response schema to the prompt could work, have not tried it at scale yet, though.

f"""text text
<prompt>
...
Use the provided JSON schema for your reply:
``
{ThisIsMySchema.model_json_schema()}
``
"""

The response, though will also enclose the JSON inside markdown, but it follows the schema.

<!-- gh-comment-id:3158981502 --> @KlausGPaul commented on GitHub (Aug 6, 2025): As a workaround, it seems as if adding the desired response schema to the prompt could work, have not tried it at scale yet, though. ``` f"""text text <prompt> ... Use the provided JSON schema for your reply: `` {ThisIsMySchema.model_json_schema()} `` """ ``` The response, though will also enclose the JSON inside markdown, but it follows the schema.
Author
Owner

@lachlansleight commented on GitHub (Aug 6, 2025):

Adding +1 - same issue here. Adding some tests:

Sending:

{
    "model": "gpt-oss:20b",
    "system": "You are a helpful assistant that always responds with valid JSON",
    "prompt": "Hello there! Respond in the JSON format { \"response\": \"your-response-here\" }",
    "stream": false,
    "format": "json"
}

Results in the following response:

{
    "model": "gpt-oss:20b",
    "created_at": "2025-08-06T11:22:27.512287Z",
    "response": "The user says: \": \"Hello there! Respond in the JSON format { \"\n\n    }",
    "done": true,
    "done_reason": "stop",
    "context": [...],
    "total_duration": 1491802500,
    "load_duration": 73731292,
    "prompt_eval_count": 105,
    "prompt_eval_duration": 712212500,
    "eval_count": 22,
    "eval_duration": 655166083
}

Sometimes response is empty, sometimes it begins with some of the thinking text, as above. Often this is just a tiny fragment, such as "response": "{\"\n\n }". Removing the system prompt seems to give me these little error fragments much more often (about 60% of the time, as opposed to 10% of the time with a system prompt)

If I remove the "format": "json" parameter altogether, I get the following response:

{
    "model": "gpt-oss:20b",
    "created_at": "2025-08-06T11:26:53.037969Z",
    "response": "{\"response\":\"Hello! How can I help you today?\"}",
    "thinking": "User says: \"Hello there! Respond in the JSON format { \"response\": \"your-response-here\" }\". So we need to output JSON with key \"response\" and value as our response. We should greet. So response: \"Hello! How can I help you today?\" Should wrap. Ensure JSON.",
    "done": true,
    "done_reason": "stop",
    "context": [...],
    "total_duration": 3144792167,
    "load_duration": 70313792,
    "prompt_eval_count": 105,
    "prompt_eval_duration": 674871500,
    "eval_count": 87,
    "eval_duration": 2399115875
}

Finally, if I try setting the format to a specific format, I always get the empty response text, with or without a system prompt.

<!-- gh-comment-id:3159779076 --> @lachlansleight commented on GitHub (Aug 6, 2025): Adding +1 - same issue here. Adding some tests: Sending: ```json { "model": "gpt-oss:20b", "system": "You are a helpful assistant that always responds with valid JSON", "prompt": "Hello there! Respond in the JSON format { \"response\": \"your-response-here\" }", "stream": false, "format": "json" } ``` Results in the following response: ```json { "model": "gpt-oss:20b", "created_at": "2025-08-06T11:22:27.512287Z", "response": "The user says: \": \"Hello there! Respond in the JSON format { \"\n\n }", "done": true, "done_reason": "stop", "context": [...], "total_duration": 1491802500, "load_duration": 73731292, "prompt_eval_count": 105, "prompt_eval_duration": 712212500, "eval_count": 22, "eval_duration": 655166083 } ``` Sometimes `response` is empty, sometimes it begins with some of the thinking text, as above. Often this is just a tiny fragment, such as `"response": "{\"\n\n }"`. Removing the system prompt seems to give me these little error fragments much more often (about 60% of the time, as opposed to 10% of the time with a system prompt) If I remove the `"format": "json"` parameter altogether, I get the following response: ```json { "model": "gpt-oss:20b", "created_at": "2025-08-06T11:26:53.037969Z", "response": "{\"response\":\"Hello! How can I help you today?\"}", "thinking": "User says: \"Hello there! Respond in the JSON format { \"response\": \"your-response-here\" }\". So we need to output JSON with key \"response\" and value as our response. We should greet. So response: \"Hello! How can I help you today?\" Should wrap. Ensure JSON.", "done": true, "done_reason": "stop", "context": [...], "total_duration": 3144792167, "load_duration": 70313792, "prompt_eval_count": 105, "prompt_eval_duration": 674871500, "eval_count": 87, "eval_duration": 2399115875 } ``` Finally, if I try setting the format to a specific format, I always get the empty response text, with or without a system prompt.
Author
Owner

@jbcallaghan commented on GitHub (Aug 6, 2025):

As a workaround, it seems as if adding the desired response schema to the prompt could work, have not tried it at scale yet, though.

f"""text text
<prompt>
...
Use the provided JSON schema for your reply:
``
{ThisIsMySchema.model_json_schema()}
``
"""

The response, though will also enclose the JSON inside markdown, but it follows the schema.

I tried this and it works randomly, sometimes I get a response with the correct formatting and other times no output at all. This is using exactly the same query each time. I also noticed there is a lot of blank content before the structured output is populated when it does work, almost like thinking is being shown as blank content

<!-- gh-comment-id:3160838057 --> @jbcallaghan commented on GitHub (Aug 6, 2025): > As a workaround, it seems as if adding the desired response schema to the prompt could work, have not tried it at scale yet, though. > > ``` > f"""text text > <prompt> > ... > Use the provided JSON schema for your reply: > `` > {ThisIsMySchema.model_json_schema()} > `` > """ > ``` > > The response, though will also enclose the JSON inside markdown, but it follows the schema. I tried this and it works randomly, sometimes I get a response with the correct formatting and other times no output at all. This is using exactly the same query each time. I also noticed there is a lot of blank content before the structured output is populated when it does work, almost like thinking is being shown as blank content
Author
Owner

@duxor commented on GitHub (Aug 6, 2025):

Image

It's a little bit crazy, but it works...

You need to find a position of ```json in the string, in some cases there is more text as a prefix:

Image
<!-- gh-comment-id:3161106620 --> @duxor commented on GitHub (Aug 6, 2025): <img width="1518" height="809" alt="Image" src="https://github.com/user-attachments/assets/e43436b7-1fd6-44c0-bb12-5c5d8962a56c" /> It's a little bit crazy, but it works... You need to find a position of ```json in the string, in some cases there is more text as a prefix: <img width="1525" height="879" alt="Image" src="https://github.com/user-attachments/assets/ffae6a71-3c56-4358-b21b-aac109561a50" />
Author
Owner

@sheneman commented on GitHub (Aug 6, 2025):

@duxor - Yeah, you can specify the desired format in the prompt, but that's not really enforced structured output with a compiled grammar. No guarantee that it will work, and yes - you have to filter other stuff around it.

<!-- gh-comment-id:3161178718 --> @sheneman commented on GitHub (Aug 6, 2025): @duxor - Yeah, you can specify the desired format in the prompt, but that's not really enforced structured output with a compiled grammar. No guarantee that it will work, and yes - you have to filter other stuff around it.
Author
Owner

@frozenkp commented on GitHub (Aug 7, 2025):

My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working.

<!-- gh-comment-id:3162551085 --> @frozenkp commented on GitHub (Aug 7, 2025): My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working.
Author
Owner

@duxor commented on GitHub (Aug 7, 2025):

My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working.

You are right. Do you think it's worth it?

I will definitely avoid gpt-oss, for now.

<!-- gh-comment-id:3162704832 --> @duxor commented on GitHub (Aug 7, 2025): > My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working. You are right. Do you think it's worth it? I will definitely avoid `gpt-oss`, for now.
Author
Owner

@frozenkp commented on GitHub (Aug 7, 2025):

My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working.

You are right. Do you think it's worth it?

I will definitely avoid gpt-oss, for now.

Well, it depends. In my case, I used it as one of my research evaluations.

I would suggest using it after the issue is fixed. I didn't expect that I would take this two-layer approach that I used in the very beginning of the LLM era back. LOL

<!-- gh-comment-id:3162736286 --> @frozenkp commented on GitHub (Aug 7, 2025): > > My current alternative workaround is not asking gpt-oss to reply with a structured format, and asking another small model to produce the structured format from gpt-oss's response. It's redundant while stable and working. > > You are right. Do you think it's worth it? > > I will definitely avoid `gpt-oss`, for now. Well, it depends. In my case, I used it as one of my research evaluations. I would suggest using it after the issue is fixed. I didn't expect that I would take this two-layer approach that I used in the very beginning of the LLM era back. LOL
Author
Owner

@rick-github commented on GitHub (Aug 7, 2025):

Image

<!-- gh-comment-id:3163148074 --> @rick-github commented on GitHub (Aug 7, 2025): [<img width="546" height="58" alt="Image" src="https://github.com/user-attachments/assets/190eca3a-a01a-4e7a-8213-c6a0554ecd31" />](https://discord.com/channels/1128867683291627614/1402425163903013036/1402450640654831739)
Author
Owner

@andreys42 commented on GitHub (Aug 7, 2025):

+1 here
I guess absence of StructuredOutput support is signigicant drawback now

<!-- gh-comment-id:3163486720 --> @andreys42 commented on GitHub (Aug 7, 2025): +1 here I guess absence of StructuredOutput support is signigicant drawback now
Author
Owner

@Mohammadtvk commented on GitHub (Aug 7, 2025):

+1
this feature is very important

<!-- gh-comment-id:3164093381 --> @Mohammadtvk commented on GitHub (Aug 7, 2025): +1 this feature is very important
Author
Owner

@dontriskit commented on GitHub (Aug 8, 2025):

same issue with vLLM

resolved with official vllm docs

<!-- gh-comment-id:3166409626 --> @dontriskit commented on GitHub (Aug 8, 2025): same issue with vLLM --- resolved with official vllm docs
Author
Owner

@Koki-Itai commented on GitHub (Aug 8, 2025):

same issue

<!-- gh-comment-id:3167024296 --> @Koki-Itai commented on GitHub (Aug 8, 2025): same issue
Author
Owner

@tttturtle-russ commented on GitHub (Aug 8, 2025):

same issue here, it's an important feature.

<!-- gh-comment-id:3168600426 --> @tttturtle-russ commented on GitHub (Aug 8, 2025): same issue here, it's an important feature.
Author
Owner

@ddudek commented on GitHub (Aug 11, 2025):

As a better workaround, you should put the schema in "developer" role, e.g.:

[
  {
    "role": "system",
    "content": "
You are helpful coding expert that outputs JSON response.
  },
  {
    "role": "developer",
    "content": "
        # Instructions
        Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.
    JSON schema: {'$defs': ... <your schema here> }
"
  },
  {
    "role": "user",
    "content": "## The task
...
"
  }
]

The above works pretty good, although the model also outputs "analysis" and "commentary" streams, e.g.:

<|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... >

<|start|>assistant<|channel|>final<|message|>{"my_field": "some content"}

So adding this to the system prompt again improves the output:

"Reasoning: low
# Valid channels: final."

Full example:

[
  {
    "role": "system",
    "content": "
You are helpful coding expert that outputs JSON response.

Reasoning: low
# Valid channels: final.
  },
  {
    "role": "developer",
    "content": "
        # Instructions
        Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.
    JSON schema: {'$defs': ... <your schema here> }
"
  },
  {
    "role": "user",
    "content": "## The task
...
"
  }
]

Output:
<|channel|>final<|message|>{"my_field": "some content"}
still needs removing "<|channel|>final<|message|>" but gives very stable behavior.

This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well.

<!-- gh-comment-id:3173445658 --> @ddudek commented on GitHub (Aug 11, 2025): As a better workaround, you should put the schema in "developer" role, e.g.: ``` [ { "role": "system", "content": " You are helpful coding expert that outputs JSON response. }, { "role": "developer", "content": " # Instructions Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included. JSON schema: {'$defs': ... <your schema here> } " }, { "role": "user", "content": "## The task ... " } ] ``` The above works pretty good, although the model also outputs "analysis" and "commentary" streams, e.g.: ``` <|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... > <|start|>assistant<|channel|>final<|message|>{"my_field": "some content"} ``` So adding this to the **system prompt** again improves the output: ``` "Reasoning: low # Valid channels: final." ``` Full example: ``` [ { "role": "system", "content": " You are helpful coding expert that outputs JSON response. Reasoning: low # Valid channels: final. }, { "role": "developer", "content": " # Instructions Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included. JSON schema: {'$defs': ... <your schema here> } " }, { "role": "user", "content": "## The task ... " } ] ``` Output: ```<|channel|>final<|message|>{"my_field": "some content"}``` still needs removing "<|channel|>final<|message|>" but gives very stable behavior. This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well.
Author
Owner

@youngbinkim0 commented on GitHub (Aug 11, 2025):

same issue with vLLM

resolved with official vllm docs

can you link to the official vLLM doc referred? running into the same issue @dontriskit

<!-- gh-comment-id:3174515634 --> @youngbinkim0 commented on GitHub (Aug 11, 2025): > ## same issue with vLLM > resolved with official vllm docs can you link to the official vLLM doc referred? running into the same issue @dontriskit
Author
Owner

@youngbinkim0 commented on GitHub (Aug 11, 2025):

As a better workaround, you should put the schema in "developer" role, e.g.:

[
  {
    "role": "system",
    "content": "
You are helpful coding expert that outputs JSON response.
  },
  {
    "role": "developer",
    "content": "
        # Instructions
        Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.
    JSON schema: {'$defs': ... <your schema here> }
"
  },
  {
    "role": "user",
    "content": "## The task
...
"
  }
]

The above works pretty good, although the model also outputs "analysis" and "commentary" streams, e.g.:

<|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... >

<|start|>assistant<|channel|>final<|message|>{"my_field": "some content"}

So adding this to the system prompt again improves the output:

"Reasoning: low
# Valid channels: final."

Full example:

[
  {
    "role": "system",
    "content": "
You are helpful coding expert that outputs JSON response.

Reasoning: low
# Valid channels: final.
  },
  {
    "role": "developer",
    "content": "
        # Instructions
        Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.
    JSON schema: {'$defs': ... <your schema here> }
"
  },
  {
    "role": "user",
    "content": "## The task
...
"
  }
]

Output: <|channel|>final<|message|>{"my_field": "some content"} still needs removing "<|channel|>final<|message|>" but gives very stable behavior.

This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well.

While this does work for many schemas, it's not the same as enforcing structured outputs. As referred in the structured output section of the cookbook:

"This prompt alone will, however, only influence the model’s behavior but doesn’t guarantee the full adherence to the schema. For this you still need to construct your own grammar and enforce the schema during sampling."

https://cookbook.openai.com/articles/openai-harmony#structured-output

<!-- gh-comment-id:3174525970 --> @youngbinkim0 commented on GitHub (Aug 11, 2025): > As a better workaround, you should put the schema in "developer" role, e.g.: > > ``` > [ > { > "role": "system", > "content": " > You are helpful coding expert that outputs JSON response. > }, > { > "role": "developer", > "content": " > # Instructions > Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included. > JSON schema: {'$defs': ... <your schema here> } > " > }, > { > "role": "user", > "content": "## The task > ... > " > } > ] > ``` > > The above works pretty good, although the model also outputs "analysis" and "commentary" streams, e.g.: > > ``` > <|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... > > > <|start|>assistant<|channel|>final<|message|>{"my_field": "some content"} > ``` > > So adding this to the **system prompt** again improves the output: > > ``` > "Reasoning: low > # Valid channels: final." > ``` > > Full example: > > ``` > [ > { > "role": "system", > "content": " > You are helpful coding expert that outputs JSON response. > > Reasoning: low > # Valid channels: final. > }, > { > "role": "developer", > "content": " > # Instructions > Respond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included. > JSON schema: {'$defs': ... <your schema here> } > " > }, > { > "role": "user", > "content": "## The task > ... > " > } > ] > ``` > > Output: `<|channel|>final<|message|>{"my_field": "some content"}` still needs removing "<|channel|>final<|message|>" but gives very stable behavior. > > This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well. While this does work for many schemas, it's not the same as enforcing structured outputs. As referred in the structured output section of the cookbook: "This prompt alone will, however, only influence the model’s behavior but doesn’t guarantee the full adherence to the schema. For this you still need to construct your own grammar and enforce the schema during sampling." https://cookbook.openai.com/articles/openai-harmony#structured-output
Author
Owner

@lachlansleight commented on GitHub (Aug 12, 2025):

Yeah, if a model does not support JSON output format schema, then it is functionally a fun toy to play with, but not actually useful as an agent.

Agreed - since identifying this as an issue I basically put GPT-OSS down and haven't touched it since. It's interesting to see how it differs from other agents, but without JSON output it's completely useless for anything other than basic chat applications.

I didn't realise how harmful harmony would be to the update of GPT-OSS. It's so bad that I almost want to put on my tin foil hat and wonder whether Open AI is trying to intentionally harm the open-weight community by fragmenting the ecosystem with a complex, difficult-to-implement response format.

<!-- gh-comment-id:3179650003 --> @lachlansleight commented on GitHub (Aug 12, 2025): > Yeah, if a model does not support JSON output format schema, then it is functionally a fun toy to play with, but not actually useful as an agent. Agreed - since identifying this as an issue I basically put GPT-OSS down and haven't touched it since. It's interesting to see how it differs from other agents, but without JSON output it's completely useless for anything other than basic chat applications. I didn't realise how harmful harmony would be to the update of GPT-OSS. It's so bad that I almost want to put on my tin foil hat and wonder whether Open AI is trying to intentionally harm the open-weight community by fragmenting the ecosystem with a complex, difficult-to-implement response format.
Author
Owner

@ParthSareen commented on GitHub (Aug 12, 2025):

Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!

<!-- gh-comment-id:3181220084 --> @ParthSareen commented on GitHub (Aug 12, 2025): Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!
Author
Owner

@sheneman commented on GitHub (Aug 13, 2025):

@ParthSareen : Thank you so much for your attention to the issue of Structured outputs and thinking models in Ollama, including gpt-oss. This issue been a roadblock for my organization using Ollama for awhile. We love Ollama but have been considering alternatives because of this limitation with effective use of thinking models.

I assume the issue with gpt-oss is at least related to open issue #523.

But I assume the problem is compounded with gpt-oss because of the use of the Harmony response format.

Again, thank you and the Ollama team for prioritizing this!

<!-- gh-comment-id:3183751374 --> @sheneman commented on GitHub (Aug 13, 2025): @ParthSareen : **Thank you so much** for your attention to the issue of **_Structured outputs and thinking models_** in Ollama, including gpt-oss. This issue been a roadblock for my organization using Ollama for awhile. We love Ollama but have been considering alternatives because of this limitation with effective use of thinking models. I assume the issue with gpt-oss is at least related to open issue [#523](https://github.com/ollama/ollama-python/issues/523). But I assume the problem is compounded with gpt-oss because of the use of the Harmony response format. Again, thank you and the Ollama team for prioritizing this!
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 13, 2025):

Yeah progress, the GBNF generator is now pretty stable and I wrote a function to turn chat history into Harmony. The model is quite smart (for a local model) but also unhinged (although it might be that I have not yet tuned llama.cpp well). It tries to put reasoning output in URL params of tools, they really need to find a way to make it think without emitting those tokens. It can handle web browsing tasks using Playwright but I keep overflowing the context window so now I am writing a full node based infrastructure where we can have nodes consolidate / summarise the history and have tool calls be able to be more context aware and veto certain steps.

I got that idea from Haystack but their tooling is hardly type-safe (it was good as a proof of concept) and it was only after a tonne of complaining that they fixed the issue that by importing Haystack, half your python file lines light up bright red with compiler errors. I am still baffled by the fact that everyone is using old Python syntax, type-safety as an afterthought. Like I am new to Python and somehow it feels like I have to write almost everything because too many of the libraries are missing stubs or have serious issues. I mean, is it so hard for, OpenAI, Firecrawl-py, etc that if you give the function a Pydantic class, that it give back an instance of that class, and if you hand it a dict, you get back a dict? If you are streaming then you can use a small state machine to parse half-completed json and you use Pydantic alias generator so that the JSON is camelCase (JSON is inherently born from JS and this is best practice) but when you await the finished Pydantic class instance in Python, the fields are in snake_case as is idiomatic for Python.

Python still has poor generic templating (things like cannot specify the generic type of the function you are calling without it parsing it as array indexing, and it does not allow each overload to have its own code, because it has no idea which one will be used until runtime) and inheritance support (it only enforces checking if you mark as @final, and there is no way of requiring an instance passed in having been marked as final), but it is enough to do the trick. Well, at least it has come a long way, right? Pydantic is a life-saver, although under the hood it has some code I am not pleased with, and it had failed to implement certain cases, and the field aliasing is a nightmare - but I found a good workaround - use an alias generator and a switch statement inside of the generator with a case for each field.

Golden rule of LLMs, because they are extremely obtuse and get distracted easily (I believe next-token prediction to be merely a stepping stone). Use highly constrained schemas by rooting under the hood of Pydantic and cut down the number of choices it has and filter out noisy data (although there was a research paper recently "Let me speak freely", which argues against this, but does not seem to have used GBNF as a constraint method). The most frustrating thing is that I should not need to prompt an LLM to understand causality and the passage of time. The models try to emit all of the tool calls in one round, making bad assumptions about what the future state will be. Making them follow instructions is near impossible. But this gpt-oss:20b is more promising. We will have the hardware to run gpt-oss:120b in a couple of weeks, but my opinion is it will likely be a dud, not worth the 6x RAM and compute required compared to the 20b version.

I personally hope that all of these things get sorted. I mean there is no agent library I can download which has good general-purpose performance out of the box. I hope to change that. But OpenAI has just muddied the water with Harmony (ironic), there is no universally standardised LLM interface in Python, and the fact that I have to implement JSON schema support on my own is just wild. Again, the industry has billions of dollars and somehow they did not add support for JSON schema in llama.cpp when it's only a little over 300 lines of code. And I don't see how this code couldn't have been in there before, rendering the issue with gpt-oss to be only a matter of converting chat history to Harmony. Maybe they have kept a lot of features closed-source. Anyway, I want to be dealing with LLM concepts, not Python concepts. There are things like MCP (relatively clear) and A2A (can't even figure out at what level of abstraction is it meant to be at) and neither of these things have standardised the LLM interface.

One thing is clear, though: GBNF is the way. I see limitless possibilities with this, provided I can continue writing working compilers for it. It is also food for thought for how I was trying to train my own model (not next-token predictor) in my own time: I had issues formalising grammatical concepts for the training process, and this could be the thing I need. Ditching next-token should allow the inlining of classical functions (latch float inputs to closest binary state, one bit per input) to give LLMs extroadinary mathematical abilities without needing to execute Python or other script parsers. I would also wager similar techniques but for neural nets could solve the RSA problem. I also was struggling to understand how the Ollama JSON response format was implemented - it seemed like a mix of prompting, two-shot examples, and re-prompting. But GBNF, from what I can tell, eliminates illegal next tokens from the probability list (compiled into, meaning that it is relatively fail-safe and does not waste time having to go back and re-generate.

If anyone knows of other gems like GBNF, I would love to hear - I may be overlooking other great solutions.

<!-- gh-comment-id:3184544782 --> @nicholas-johnson-techxcel commented on GitHub (Aug 13, 2025): Yeah progress, the GBNF generator is now pretty stable and I wrote a function to turn chat history into Harmony. The model is quite smart (for a local model) but also unhinged (although it might be that I have not yet tuned llama.cpp well). It tries to put reasoning output in URL params of tools, they really need to find a way to make it think without emitting those tokens. It can handle web browsing tasks using Playwright but I keep overflowing the context window so now I am writing a full node based infrastructure where we can have nodes consolidate / summarise the history and have tool calls be able to be more context aware and veto certain steps. I got that idea from Haystack but their tooling is hardly type-safe (it was good as a proof of concept) and it was only after a tonne of complaining that they fixed the issue that by importing Haystack, half your python file lines light up bright red with compiler errors. I am still baffled by the fact that everyone is using old Python syntax, type-safety as an afterthought. Like I am new to Python and somehow it feels like I have to write almost everything because too many of the libraries are missing stubs or have serious issues. I mean, is it so hard for, OpenAI, Firecrawl-py, etc that if you give the function a Pydantic class, that it give back an instance of that class, and if you hand it a dict, you get back a dict? If you are streaming then you can use a small state machine to parse half-completed json and you use Pydantic alias generator so that the JSON is camelCase (JSON is inherently born from JS and this is best practice) but when you await the finished Pydantic class instance in Python, the fields are in snake_case as is idiomatic for Python. Python still has poor generic templating (things like cannot specify the generic type of the function you are calling without it parsing it as array indexing, and it does not allow each overload to have its own code, because it has no idea which one will be used until runtime) and inheritance support (it only enforces checking if you mark as @final, and there is no way of requiring an instance passed in having been marked as final), but it is enough to do the trick. Well, at least it has come a long way, right? Pydantic is a life-saver, although under the hood it has some code I am not pleased with, and it had failed to implement certain cases, and the field aliasing is a nightmare - but I found a good workaround - use an alias generator and a switch statement inside of the generator with a case for each field. Golden rule of LLMs, because they are extremely obtuse and get distracted easily (I believe next-token prediction to be merely a stepping stone). Use highly constrained schemas by rooting under the hood of Pydantic and cut down the number of choices it has and filter out noisy data (although there was a research paper recently "Let me speak freely", which argues against this, but does not seem to have used GBNF as a constraint method). The most frustrating thing is that I should not need to prompt an LLM to understand causality and the passage of time. The models try to emit all of the tool calls in one round, making bad assumptions about what the future state will be. Making them follow instructions is near impossible. But this gpt-oss:20b is more promising. 
We will have the hardware to run gpt-oss:120b in a couple of weeks, but my opinion is it will likely be a dud, not worth the 6x RAM and compute required compared to the 20b version. I personally hope that all of these things get sorted. I mean there is no agent library I can download which has good general-purpose performance out of the box. I hope to change that. But OpenAI has just muddied the water with Harmony (ironic), there is no universally standardised LLM interface in Python, and the fact that I have to implement JSON schema support on my own is just wild. Again, the industry has billions of dollars and somehow they did not add support for JSON schema in llama.cpp when it's only a little over 300 lines of code. And I don't see how this code couldn't have been in there before, rendering the issue with gpt-oss to be only a matter of converting chat history to Harmony. Maybe they have kept a lot of features closed-source. Anyway, I want to be dealing with LLM concepts, not Python concepts. There are things like MCP (relatively clear) and A2A (can't even figure out at what level of abstraction is it meant to be at) and neither of these things have standardised the LLM interface. One thing is clear, though: GBNF is the way. I see limitless possibilities with this, provided I can continue writing working compilers for it. It is also food for thought for how I was trying to train my own model (not next-token predictor) in my own time: I had issues formalising grammatical concepts for the training process, and this could be the thing I need. Ditching next-token should allow the inlining of classical functions (latch float inputs to closest binary state, one bit per input) to give LLMs extroadinary mathematical abilities without needing to execute Python or other script parsers. I would also wager similar techniques but for neural nets could solve the RSA problem. I also was struggling to understand how the Ollama JSON response format was implemented - it seemed like a mix of prompting, two-shot examples, and re-prompting. But GBNF, from what I can tell, eliminates illegal next tokens from the probability list (compiled into, meaning that it is relatively fail-safe and does not waste time having to go back and re-generate. If anyone knows of other gems like GBNF, I would love to hear - I may be overlooking other great solutions.
Author
Owner

@rick-github commented on GitHub (Aug 13, 2025):

I also was struggling to understand how the Ollama JSON response format was implemented

Ollama use GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and only apply the GBNF grammer during content generation.

<!-- gh-comment-id:3184590697 --> @rick-github commented on GitHub (Aug 13, 2025): > I also was struggling to understand how the Ollama JSON response format was implemented Ollama use GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and only apply the GBNF grammer during content generation.
Author
Owner

@Croups commented on GitHub (Aug 13, 2025):

same issue here, I noticed that sometimes it returns null while using it via pydantic-ai, I tested it without defining a structured output schema, it is generating the answer with tags here is a sample :

AgentRunResult(output='{"analysis":"The user says 'hi how are you', which is a greeting and question about how I am. The user wants to know what ChatGPT is. So respond with a friendly greeting and explanation.<|channel|>commentary:"} ')

This is why when you define a model, pydantic can't parse it.

<!-- gh-comment-id:3185021224 --> @Croups commented on GitHub (Aug 13, 2025): same issue here, I noticed that sometimes it returns null while using it via pydantic-ai, I tested it without defining a structured output schema, it is generating the answer with tags here is a sample : AgentRunResult(output='{"analysis":"The user says \'hi how are you\', which is a greeting and question about how I am. The user wants to know what ChatGPT is. So respond with a friendly greeting and explanation.<|channel|>commentary:"} ') This is why when you define a model, pydantic can't parse it.
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 15, 2025):

I also was struggling to understand how the Ollama JSON response format was implemented

Ollama use GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and only apply the GBNF grammer during content generation.

Can't it reason silently and not emit these symbols? Besides, I have been trying to disable reasoning as it just adds latency, and I can easily add reasoning fields to the json output of non-reasoning models if and when it makes sense for the application. Sometimes I have found it creates a better agent, other times it is just wasting electricity and time.

<!-- gh-comment-id:3190491870 --> @nicholas-johnson-techxcel commented on GitHub (Aug 15, 2025): > > I also was struggling to understand how the Ollama JSON response format was implemented > > Ollama use GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and only apply the GBNF grammer during content generation. Can't it reason silently and not emit these symbols? Besides, I have been trying to disable reasoning as it just adds latency, and I can easily add reasoning fields to the json output of non-reasoning models if and when it makes sense for the application. Sometimes I have found it creates a better agent, other times it is just wasting electricity and time.
Author
Owner

@rick-github commented on GitHub (Aug 15, 2025):

Models don't have an internal monologue or subconscious, all they do is probabilistically generate tokens. "Reasoning" models are trained to generate "thinking" tokens as a way to guide the generation of tokens in the response phase, but it's just tokens all the way down.

<!-- gh-comment-id:3191207273 --> @rick-github commented on GitHub (Aug 15, 2025): Models don't have an internal monologue or subconscious, all they do is probabilistically generate tokens. "Reasoning" models are trained to generate "thinking" tokens as a way to guide the generation of tokens in the response phase, but it's just tokens all the way down.
Author
Owner

@adamoutler commented on GitHub (Aug 15, 2025):

Confirmed. Same issue. Any model except gpt-oss seems to work with Structured Outputs. gpt-oss returns a blank. I hope to see an adapter later in Ollama soon.

<!-- gh-comment-id:3192394511 --> @adamoutler commented on GitHub (Aug 15, 2025): Confirmed. Same issue. Any model except `gpt-oss` seems to work with Structured Outputs. `gpt-oss` returns a blank. I hope to see an adapter later in Ollama soon.
Author
Owner

@adamoutler commented on GitHub (Aug 15, 2025):

Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point.

<!-- gh-comment-id:3192468285 --> @adamoutler commented on GitHub (Aug 15, 2025): Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point.
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 20, 2025):

Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point.

I did this in Python and llama.cpp but I since found in the Ollama source code a JSON=>GBNF compiler. You just need to hit llama.cpp with grammar=gbnf and then it works, but because it is a reasoning model, it then becomes a bit unhinged and tries to insert reasoning into json fields instead of keeping reasoning internally.

All Ollama had to do is use the grammar just like it already does and it would work to the extent which I get from llama.cpp (we still need to stop it from reasoning when we use think=False which it ignores) but for some reason they seemed to have made an exception for this model and hence they broke it. If I get some time I can look at their code and give a patch.

This also begs the question: if Ollama is a wrapper around llama.cpp then it could just become a python library which adds features to llama.cpp (basically a llama.cpp client library) and the actual Ollama server become a heap of scripts for running llama.cpp as a service and automatically pulling models down for it.

<!-- gh-comment-id:3203890207 --> @nicholas-johnson-techxcel commented on GitHub (Aug 20, 2025): > Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point. I did this in Python and llama.cpp but I since found in the Ollama source code a JSON=>GBNF compiler. You just need to hit llama.cpp with `grammar=gbnf` and then it works, but because it is a reasoning model, it then becomes a bit unhinged and tries to insert reasoning into json fields instead of keeping reasoning internally. All Ollama had to do is use the grammar just like it already does and it would work to the extent which I get from llama.cpp (we still need to stop it from reasoning when we use `think=False` which it ignores) but for some reason they seemed to have made an exception for this model and hence they broke it. If I get some time I can look at their code and give a patch. This also begs the question: if Ollama is a wrapper around llama.cpp then it could just become a python library which adds features to llama.cpp (basically a llama.cpp client library) and the actual Ollama server become a heap of scripts for running llama.cpp as a service and automatically pulling models down for it.
Author
Owner

@mpauly commented on GitHub (Aug 20, 2025):

@nicholas-johnson-techxcel With regards to llama.cpp: there is an open issue for structured outputs in llama.cpp and things are mostly working.
Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama

<!-- gh-comment-id:3207428303 --> @mpauly commented on GitHub (Aug 20, 2025): @nicholas-johnson-techxcel With regards to llama.cpp: there is an [open issue](https://github.com/ggml-org/llama.cpp/issues/15276#issuecomment-3201937062) for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama
Author
Owner

@ParthSareen commented on GitHub (Aug 20, 2025):

Hey @nicholas-johnson-techxcel @mpauly that's not how it works - we're not consuming any of the llama.cpp changes for structured outputs - although we do use GBNF.

You can't turn thinking "off" for this model - those tokens have to get generated as this model follows the Harmony format.

@adamoutler As mentioned above I have started working on this. It's not a trivial change unfortunately and needs some moving around of where our parsers current live. Thanks for your patience friends!

Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!

<!-- gh-comment-id:3208186557 --> @ParthSareen commented on GitHub (Aug 20, 2025): Hey @nicholas-johnson-techxcel @mpauly that's not how it works - we're not consuming any of the llama.cpp changes for structured outputs - although we do use GBNF. You can't turn thinking "off" for this model - those tokens have to get generated as this model follows the [Harmony](https://github.com/openai/harmony) format. @adamoutler As mentioned above I have started working on this. It's not a trivial change unfortunately and needs some moving around of where our parsers current live. Thanks for your patience friends! > Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 25, 2025):

@nicholas-johnson-techxcel With regards to llama.cpp: there is an open issue for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama

I thought that llama.cpp did not have the structured output field, that you have to give GBNF, and it most certainly is working already, I now just compile JSON to GBNF and reluctantly make the root element to be <thinking>.*</thinking>{json} or just <thinking></thinking>{json} if I want to disable reasoning, and I just capture the thinking tags in a state machine, emit chunk messages with role="thinking" for those (and handle any split messages) and either put the structured messages through a JSON stream parser, or accumulate them and then parse for tool calling.

The way I see it, the issue is Ollama.

If anyone is wondering, forcing it to output <thinking></thinking> does seem to properly disable reasoning - despite those saying it cannot be - it stops it from trying to cram reasoning into JSON fields, and this results in significant decreases in latency.

This model still might be one of my favourites for local use, but it has still not caught up to 4o.

<!-- gh-comment-id:3219264325 --> @nicholas-johnson-techxcel commented on GitHub (Aug 25, 2025): > [@nicholas-johnson-techxcel](https://github.com/nicholas-johnson-techxcel) With regards to llama.cpp: there is an [open issue](https://github.com/ggml-org/llama.cpp/issues/15276#issuecomment-3201937062) for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama I thought that llama.cpp did not have the structured output field, that you have to give GBNF, and it most certainly is working already, I now just compile JSON to GBNF and reluctantly make the root element to be `<thinking>.*</thinking>{json}` or just `<thinking></thinking>{json}` if I want to disable reasoning, and I just capture the thinking tags in a state machine, emit chunk messages with role="thinking" for those (and handle any split messages) and either put the structured messages through a JSON stream parser, or accumulate them and then parse for tool calling. The way I see it, the issue is Ollama. If anyone is wondering, forcing it to output `<thinking></thinking>` does seem to properly disable reasoning - despite those saying it cannot be - it stops it from trying to cram reasoning into JSON fields, and this results in significant decreases in latency. This model still might be one of my favourites for local use, but it has still not caught up to 4o.
Author
Owner

@CL415 commented on GitHub (Aug 26, 2025):

In case somebody needs to force GPT-OSS to output valid JSON despite Ollama's shortcomings like I had, I found some success using Pydantic AI .run_sync, with the Ollama server as provider, although using the retrial parameter since sometimes OSS does not comply on the first shot.

<!-- gh-comment-id:3223006183 --> @CL415 commented on GitHub (Aug 26, 2025): In case somebody needs to force GPT-OSS to output valid JSON despite Ollama's shortcomings like I had, I found some success using Pydantic AI `.run_sync`, with the Ollama server as `provider`, although using the retrial parameter since sometimes OSS does not comply on the first shot.
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Aug 27, 2025):

Okay in terms of progress on my side. It was working extremely well except I had to clamp the number of analysis/thinking chars to stop it rambling. But it seems that if I wrap the JSON-GBNF with GBNF for the harmony format, it no longer needs clamping. But I cannot force it with GBNF to emit tokens like <|end|> which is an issue, because then feeding it into the openai-harmony library encoder, it does not strip the harmony frames from it properly. It's not actually that hard to process streaming chunks using a state machine, but still, it would be best doing it properly. Anyone have any ideas on forcing <|end|> to be emitted with GBNF?

<!-- gh-comment-id:3226206654 --> @nicholas-johnson-techxcel commented on GitHub (Aug 27, 2025): Okay in terms of progress on my side. It was working extremely well except I had to clamp the number of analysis/thinking chars to stop it rambling. But it seems that if I wrap the JSON-GBNF with GBNF for the harmony format, it no longer needs clamping. But I cannot force it with GBNF to emit tokens like <|end|> which is an issue, because then feeding it into the openai-harmony library encoder, it does not strip the harmony frames from it properly. It's not actually that hard to process streaming chunks using a state machine, but still, it would be best doing it properly. Anyone have any ideas on forcing <|end|> to be emitted with GBNF?
Author
Owner

@erennyuksell commented on GitHub (Aug 29, 2025):

any reliable solution?

<!-- gh-comment-id:3236450633 --> @erennyuksell commented on GitHub (Aug 29, 2025): any reliable solution?
Author
Owner

@steenharsted commented on GitHub (Aug 29, 2025):

I’m experiencing the same issue with gpt-oss:20b usingchat_ollama() and chat_structured() from ellmer. The model consistently returns truncated or non-JSON responses. This error persists even after:

  • Explicit JSON schema enforcement
  • Prompting “reply only in JSON”
  • Minimizing prompt size
  • Switching to other models (which work)

The output appears to be malformed almost every time.

Are there any plans to improve structured output compliance for gpt-oss:20b?

Thanks

<!-- gh-comment-id:3236562075 --> @steenharsted commented on GitHub (Aug 29, 2025): I’m experiencing the same issue with `gpt-oss:20b` using`chat_ollama()` and `chat_structured()` from `ellmer`. The model consistently returns truncated or non-JSON responses. This error persists even after: - Explicit JSON schema enforcement - Prompting “reply only in JSON” - Minimizing prompt size - Switching to other models (which work) The output appears to be malformed almost every time. Are there any plans to improve structured output compliance for `gpt-oss:20b`? Thanks
Author
Owner

@josemita87 commented on GitHub (Aug 29, 2025):

Same issue here with structured outputs...

<!-- gh-comment-id:3236568441 --> @josemita87 commented on GitHub (Aug 29, 2025): Same issue here with structured outputs...
Author
Owner

@rick-github commented on GitHub (Aug 29, 2025):

> Are there any plans to improve structured output compliance for gpt-oss:20b?

https://github.com/ollama/ollama/issues/11691#issuecomment-3181220084

<!-- gh-comment-id:3236570458 --> @rick-github commented on GitHub (Aug 29, 2025): > Are there any plans to improve structured output compliance for `gpt-oss:20b`? https://github.com/ollama/ollama/issues/11691#issuecomment-3181220084
Author
Owner

@rakadam commented on GitHub (Sep 1, 2025):

I have the same problem. I looked at the code, and Ollama is designed so that the HarmonyParser runs at a high level, while the sampler runs in the C++ code with some Go code as glue. It is not possible to connect them, so the sampler cannot know when it is supposed to apply the grammar. Since the grammar is only valid inside the message, and the Harmony formatting sits outside the message, this is a big problem.

One not terribly insane solution, already mentioned in this thread: implement a minimalistic Harmony parser in the Go sampler glue code so it knows when to enable grammar constraining. Alternatively, the HarmonyParser could basically be called in both layers.

<!-- gh-comment-id:3240513568 --> @rakadam commented on GitHub (Sep 1, 2025): I have the same problem. I looked at the code and Ollama was designed in a way that the HarmonyParser runs on high level, while the sampler runs in the cpp code, with some go code for glue. And it is not possible to connect them, so the sampler cannot know when it is supposed to apply the grammar or not. Since the grammar is only valid _inside_ the message, and Harmony formatting is outside the message, this is a big problem. One not terribly insane solution, already mentioned in this thread: implementing a minimalistic Harmony parser in the go sampler glue code, so it knows when to enable the grammar constraining. Or this could be calling HarmonyParser basically in both layers.
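For illustration only, a minimal Python sketch of that idea: a tiny Harmony-aware gate that tells the sampler when the JSON grammar should be active. The real change would live in Ollama's Go/C++ glue code; the marker strings and the final-channel detection below are simplified assumptions, not Ollama's actual implementation:

```python
class GrammarGate:
    """Track Harmony framing and report whether grammar constraining should be on."""

    FINAL_HEADER = "<|channel|>final<|message|>"   # simplified: ignores <|constrain|> variants
    CLOSERS = ("<|end|>", "<|return|>")

    def __init__(self) -> None:
        self.tail = ""
        self.constrain = False

    def feed(self, piece: str) -> bool:
        """Feed the decoded text of the latest token; True while inside the final message body."""
        if self.constrain and piece in self.CLOSERS:
            self.constrain = False      # message body closed, stop constraining
        self.tail = (self.tail + piece)[-64:]
        if not self.constrain and self.tail.endswith(self.FINAL_HEADER):
            self.constrain = True       # entering the final message body
            self.tail = ""
        return self.constrain
```

The sampler would consult feed() after each decoded token and only apply the JSON grammar while it returns True, which is roughly what "calling HarmonyParser in both layers" would achieve.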
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Sep 2, 2025):

Okay got it done. The steps are:

  • Use llamacpp /completions (not /v1/completions - this is not the same endpoint)
  • With prompt as list[int] which is list of output tokens from enc.render_conversation_for_completion(Conversation.from_messages(messages), Role.ASSISTANT) from openai_harmony library
  • tokens=True in body
  • Write a gbnf compiler which takes a json schema as input
  • Rename the "root" rule to something else
  • gbnf["root"] = f'thinking-block "{tok_start}" "{tok_assistant}" "{tok_channel}" "final" "{tok_constrain}" "json" "{tok_message}" json-root "{tok_end}"' where the tok_X are like "<|start|>", etc
  • thinking-block is basically the same thing except with analysis in the channel name, and any number of characters which are not "<" as the thought content
  • Send the request
  • Parse response tokens through StreamableParser.process() one by one
  • You have a new channel every time you find the token=200002 - use this to split up messages - read StreamableParser.current_channel to know if
  • We thought there was a bug in StreamableParser.last_content_delta: it appeared to be contaminated with Harmony tags that should have been filtered out, so we used current_content and diffed it against the previous iteration to get the channel delta. Never mind, though: I have since switched back to last_content_delta and it has stopped doing that.

Finally the model has stopped being unhinged. It is extremely fast and consistent as an agent. This is gpt-oss:20b; I still doubt that gpt-oss:120b will justify its size with performance, but I guess we will find out in a while once we have the 128GB MacBook Pro. Mine is only 64GB.

<!-- gh-comment-id:3244671027 --> @nicholas-johnson-techxcel commented on GitHub (Sep 2, 2025): Okay got it done. The steps are: - Use llamacpp `/completions` (not `/v1/completions` - this is not the same endpoint) - With prompt as list[int] which is list of output tokens from `enc.render_conversation_for_completion(Conversation.from_messages(messages), Role.ASSISTANT)` from openai_harmony library - `tokens=True` in body - Write a gbnf compiler which takes a json schema as input - Rename the "root" rule to something else - `gbnf["root"] = f'thinking-block "{tok_start}" "{tok_assistant}" "{tok_channel}" "final" "{tok_constrain}" "json" "{tok_message}" json-root "{tok_end}"'` where the tok_X are like "<|start|>", etc - `thinking-block` is basically the same thing except with `analysis` in the channel name, and any number of characters which are not "<" as the thought content - Send the request - Parse response tokens through `StreamableParser.process()` one by one - You have a new channel every time you find the token=200002 - use this to split up messages - read `StreamableParser.current_channel` to know if - We have a bug in `StreamableParser.last_content_delta` where we cannot use it because it is contaminated with Harmony tags which should have been filtered out so instead we use `current_content` and diff that from last iteration to get the channel delta. Actually nevermind, turns out I have been using last_content_delta now, and it has stopped doing that. Finally the model has stopped being unhinged. It is extremely fast and consistent as an agent. `gpt-oss:20b` and I still doubt the `gpt-oss:120b` will justify its size with performance, but I guess we will find out in a while once we have the 128GB Macbook Pro. Mine is only 64GB.
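A condensed sketch of the Harmony side of those steps, using the openai_harmony Python package; the llama.cpp request itself is omitted here, and completion_tokens stands in for whatever token IDs the server returns:

```python
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    StreamableParser,
    load_harmony_encoding,
)

enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

# Render the conversation into prompt token IDs (these are what gets POSTed to
# llama.cpp /completions together with the harmony-wrapped GBNF, per the steps above).
convo = Conversation.from_messages(
    [Message.from_role_and_content(Role.USER, "Hello, how are you?")]
)
prompt_tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)

# Token IDs returned by the server; empty placeholder for this sketch.
completion_tokens: list[int] = []

parser = StreamableParser(enc, role=Role.ASSISTANT)
thinking, final = [], []
for tok in completion_tokens:
    parser.process(tok)
    delta = parser.last_content_delta or ""
    if parser.current_channel == "analysis":
        thinking.append(delta)   # reasoning trace
    elif parser.current_channel == "final":
        final.append(delta)      # the JSON that the grammar constrained

print("".join(final))
```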
Author
Owner

@adamoutler commented on GitHub (Sep 2, 2025):

> Re: @nicholas-johnson-techxcel
> ...
>
>   • Send the request
>   • Parse response tokens through StreamableParser.process() one by one
>   • You have a new channel every time you find the token=200002 - use this to split up messages - read StreamableParser.current_channel to know if
>
> ...

This caught my attention. I asked ChatGPT more about harmony format.

| Token ID | Role / Function                  |
|----------|----------------------------------|
| 199998   | Beginning of sequence (BOS)      |
| 199999   | Padding (PAD)                    |
| 200000   | End of text (EOT)                |
| 200001   | Reserved special (unused)        |
| 200002   | End of sequence / Return (EOS)   |
<!-- gh-comment-id:3244953457 --> @adamoutler commented on GitHub (Sep 2, 2025): > Re: @nicholas-johnson-techxcel > ... > * Send the request > * Parse response tokens through `StreamableParser.process()` one by one > * You have a new channel every time you find the token=200002 - use this to split up messages - read `StreamableParser.current_channel` to know if > > ... This caught my attention. I asked ChatGPT more about harmony format. | Token ID | Role / Function | |----------|----------------------------------| | 199998 | Beginning of sequence (BOS) | | 199999 | Padding (PAD) | | 200000 | End of text (EOT) | | 200001 | Reserved special (unused) | | 200002 | End of sequence / Return (EOS) |
Author
Owner

@MarioRicoIbanez commented on GitHub (Sep 3, 2025):

+1 having the same issue

<!-- gh-comment-id:3248532858 --> @MarioRicoIbanez commented on GitHub (Sep 3, 2025): +1 having the same issue
Author
Owner

@inf-bud commented on GitHub (Sep 3, 2025):

+1 having the same issue

<!-- gh-comment-id:3249122091 --> @inf-bud commented on GitHub (Sep 3, 2025): +1 having the same issue
Author
Owner

@shahidazim commented on GitHub (Sep 3, 2025):

+1 having the same issue

<!-- gh-comment-id:3250790300 --> @shahidazim commented on GitHub (Sep 3, 2025): +1 having the same issue
Author
Owner

@ParthSareen commented on GitHub (Sep 3, 2025):

Hey everyone! This is currently being worked on - trying to get it to y'all asap. https://github.com/ollama/ollama/pull/12052

<!-- gh-comment-id:3250795684 --> @ParthSareen commented on GitHub (Sep 3, 2025): Hey everyone! This is currently being worked on - trying to get it to y'all asap. https://github.com/ollama/ollama/pull/12052
Author
Owner

@sheneman commented on GitHub (Sep 3, 2025):

@ParthSareen Thank you SO much!

<!-- gh-comment-id:3250880082 --> @sheneman commented on GitHub (Sep 3, 2025): @ParthSareen Thank you SO much!
Author
Owner

@MarioRicoIbanez commented on GitHub (Sep 4, 2025):

@ParthSareen Thanks!

<!-- gh-comment-id:3252381334 --> @MarioRicoIbanez commented on GitHub (Sep 4, 2025): @ParthSareen Thanks!
Author
Owner

@kiwamizamurai commented on GitHub (Sep 4, 2025):

@ParthSareen so nice

<!-- gh-comment-id:3253111982 --> @kiwamizamurai commented on GitHub (Sep 4, 2025): @ParthSareen so nice
Author
Owner

@Seyid-cmd commented on GitHub (Sep 5, 2025):

@ParthSareen thanks

<!-- gh-comment-id:3257370259 --> @Seyid-cmd commented on GitHub (Sep 5, 2025): @ParthSareen thanks
Author
Owner

@ParthSareen commented on GitHub (Sep 6, 2025):

You guys can use this branch until I get it into main: https://github.com/ollama/ollama/tree/parth/gpt-oss-structured-outputs 😁

Would also love to know what you use structured outputs for if you do give the branch a shot

<!-- gh-comment-id:3260191601 --> @ParthSareen commented on GitHub (Sep 6, 2025): You guys can use this branch until I get it into main: https://github.com/ollama/ollama/tree/parth/gpt-oss-structured-outputs 😁 Would also love to know what you use structured outputs for if you do give the branch a shot
Author
Owner

@adamoutler commented on GitHub (Sep 6, 2025):

> Would also love to know what you use structured outputs for if you do give the branch a shot

Analyzing test results and reacting to binary/enum decisions.

How does it work? What did you do with the thinking?

<!-- gh-comment-id:3260811595 --> @adamoutler commented on GitHub (Sep 6, 2025): > Would also love to know what you use structured outputs for if you do give the branch a shot Analyzing test results and reacting to binary/enum decisions. How does it work? What did you do with the thinking?
Author
Owner

@asabla commented on GitHub (Sep 6, 2025):

Mostly, when interacting with LLMs, I want to avoid writing too much fuzzy validation code (e.g. making sure all the needed data is there). Structured output is basically a very convenient shortcut for doing so. On top of that, most agentic frameworks for building reliable workflows use structured output under the hood for the same reasons.

Haven't had the time to test out the feature branch yet, but I'll get back to you when I've done so @ParthSareen

<!-- gh-comment-id:3261686248 --> @asabla commented on GitHub (Sep 6, 2025): Mostly when interacting with LLMs I want to avoid writing too much fuzzy validation code (e.g making sure all needed data is there). Structured output is basically a very convenient shortcut for doing so. On top of that, most agentic frameworks for building reliable workflows, is using structured output under the hood for the same reasons. Haven't had the time to test out the feature branch yet, but I'll get back to you when I've done so @ParthSareen
Author
Owner

@sheneman commented on GitHub (Sep 6, 2025):

@ParthSareen The fix you implemented appears to generally work! Structured outputs with gpt-oss are working for me as expected and are present in the content field of the response. Reasoning traces are located in the "thinking" field. This is much improved behavior, and I am very grateful for your help! THANK YOU.

I did have a couple of observations, as there are still some inconsistencies between responses from gpt-oss and other thinking models:

  1. gpt-oss now provides separate reasoning traces, even if you specify "think": False. This is not a horrible default behavior, but technically it is incorrect and differs from other thinking models (e.g. qwen3), which honor the "think" boolean for controlling thinking output as described here: https://ollama.com/blog/thinking

  2. Other thinking models (qwen3) don't behave in the same way as gpt-oss:
    a. If you set "thinking": False, there will be no thinking trace (correct for qwen, fails for gpt-oss)
    b. If you use thinking mode and structured outputs with qwen3, it still will not emit a thinking trace (BUG)

<!-- gh-comment-id:3263238687 --> @sheneman commented on GitHub (Sep 6, 2025): @ParthSareen **_The fix you implemented appears to generally work_**! Structured outputs with gpt-oss are working for me as expected and are present in the content field of the response. Reasoning traces are located in the "thinking" field. This is very improved behavior, and I am _very_ grateful for your help! **THANK YOU**. I did have a couple observations, as there still are some inconsistencies with responses from gpt-oss compared to other thinking models: 1. gpt-oss now provides separate reasoning traces, **_even if you specify "think": False._** This is not a horrible default behavior, but technically it is incorrect and different than other thinking models (e.g. qwen3) which honor the "think" boolean for controlling thinking output as described here: [https://ollama.com/blog/thinking](https://ollama.com/blog/thinking) 2. Other thinking models (qwen3) don't behave in the same way as gpt-oss: a. If you set "thinking": False, there will be no thinking trace (correct for qwen, fails for gpt-oss) b. If you use thinking mode **_and_** structured outputs with qwen3, it still will **_not_** emit a thinking trace (BUG) <img width="688" height="546" alt="Image" src="https://github.com/user-attachments/assets/d19c2f37-11ad-4797-b861-81b6cd63ce9d" /> <img width="700" height="494" alt="Image" src="https://github.com/user-attachments/assets/860a5838-32f0-464b-93b6-3ec474d65d56" />
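A quick way to reproduce observation 1 against the native /api/chat endpoint, assuming the branch build is running locally; the think flag and the message.thinking field used below are the ones described in the thinking blog post linked above:

```python
import requests

schema = {
    "type": "object",
    "properties": {"response": {"type": "string"}},
    "required": ["response"],
}

r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gpt-oss:20b",
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "format": schema,
        "think": False,   # honored by qwen3; gpt-oss still emits a reasoning trace
        "stream": False,
    },
    timeout=300,
)
message = r.json()["message"]
print("thinking:", message.get("thinking"))  # non-empty for gpt-oss even with think=False
print("content:", message["content"])        # the schema-constrained JSON
```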
Author
Owner

@adamoutler commented on GitHub (Sep 7, 2025):

I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.

<!-- gh-comment-id:3263313782 --> @adamoutler commented on GitHub (Sep 7, 2025): I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.
Author
Owner

@ParthSareen commented on GitHub (Sep 7, 2025):

@sheneman @adamoutler is correct. The thinking cannot be turned off for gpt-oss - you can only do low, medium, and high. And currently my PR only supports gpt-oss as a trial. Going to do thinking models as a whole next!

<!-- gh-comment-id:3264037852 --> @ParthSareen commented on GitHub (Sep 7, 2025): @sheneman @adamoutler is correct. the thinking cannot be turned off for gpt-oss - you can only do `low` `medium`, and `high`. And currently my PR only supports gpt-oss as a trial. Going to do thinking models as a whole next!
Author
Owner

@sheneman commented on GitHub (Sep 7, 2025):

@adamoutler @ParthSareen Thank you! While you can't actually turn off thinking in gpt-oss, you could set thinking to "low" and then suppress or mask the thinking trace. This would maintain response-format compatibility with other thinking models. I could also see why you would prefer to output the thinking trace since it's being generated anyway. It's easy enough to ignore if needed, so not a huge deal either way.

And Thank you @ParthSareen for now attacking the issue of structured outputs X thinking mode in the other models!!! With that, Ollama becomes so much more compelling for our organization!

<!-- gh-comment-id:3264116471 --> @sheneman commented on GitHub (Sep 7, 2025): @adamoutler @ParthSareen Thank you! While you can't actually turn off thinking in gpt-oss, you _could_ set thinking to "low" and then suppress or mask the thinking trace. This would maintain response format compatibility with other thinking models. I could also see why you would prefer to output the thinking trace since its being generated anyway. It's easy enough to ignore if needed, so not a huge deal either way. And **Thank you** @ParthSareen for now attacking the issue of structured outputs X thinking mode in the other models!!! With that, Ollama becomes so much more compelling for our organization!
Author
Owner

@nicholas-johnson-techxcel commented on GitHub (Sep 8, 2025):

> I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.

You can force it to emit an empty thinking tag using GBNF if you wish to save time.

<!-- gh-comment-id:3264478540 --> @nicholas-johnson-techxcel commented on GitHub (Sep 8, 2025): > I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though. You can force it to emit an empty thinking tag using GBNF if you wish to save time.
Author
Owner

@ParthSareen commented on GitHub (Sep 8, 2025):

> You can force it to emit an empty thinking tag using GBNF if you wish to save time.

You could, but you'd be breaking the format the model was trained on. From experience, the model is very sensitive to breaking the format, which results in poor outputs. So your mileage may vary with that.

<!-- gh-comment-id:3267288766 --> @ParthSareen commented on GitHub (Sep 8, 2025): > You can force it to emit an empty thinking tag using GBNF if you wish to save time. You could but you're breaking the format the model was trained on. From experience the model is very sensitive to breaking the format which results in poor outputs. So your mileage may vary with that.
Author
Owner

@vishalgoel2 commented on GitHub (Sep 14, 2025):

I tested the fix PR branch for structured outputs and it does improve things — simple structured outputs work now.

However, I’m running into mixed results when using it with browser-use + gpt-oss:20b. With the release version of Ollama, it fails consistently with the familiar

Invalid JSON: expected value at line 1 column 1 [type=json_invalid]

On the fix branch, sometimes it works, but other times I see warnings like this in the logs:

level=WARN source=harmonyparser.go:429 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=browser.extract_structured_data

and then browser-use errors out with

 ("1 validation error for AgentOutput\n  Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid", 502)

So it looks like the PR handles some structured output cases, but not all. Not sure yet if browser-use is passing the tool schema in a way Ollama doesn’t expect, or if the fix still misses some scenarios.

<!-- gh-comment-id:3289904118 --> @vishalgoel2 commented on GitHub (Sep 14, 2025): I tested the fix PR branch for structured outputs and it does improve things — simple structured outputs work now. However, I’m running into mixed results when using it with [`browser-use`](https://github.com/browser-use/browser-use) + `gpt-oss:20b`. With the release version of Ollama, it fails consistently with the familiar ``` Invalid JSON: expected value at line 1 column 1 [type=json_invalid] ``` On the fix branch, sometimes it works, but other times I see warnings like this in the logs: ``` level=WARN source=harmonyparser.go:429 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=browser.extract_structured_data ``` and then `browser-use` errors out with ``` ("1 validation error for AgentOutput\n Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]\n For further information visit https://errors.pydantic.dev/2.11/v/json_invalid", 502) ``` So it looks like the PR handles some structured output cases, but not all. Not sure yet if `browser-use` is passing the tool schema in a way Ollama doesn’t expect, or if the fix still misses some scenarios.
Author
Owner

@trebor commented on GitHub (Sep 20, 2025):

i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json schema in the format field.

<!-- gh-comment-id:3315185715 --> @trebor commented on GitHub (Sep 20, 2025): i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json scheme in the format field.
Author
Owner

@ParthSareen commented on GitHub (Sep 20, 2025):

> i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json schema in the format field.

Hi @trebor it's not released yet

<!-- gh-comment-id:3315186530 --> @ParthSareen commented on GitHub (Sep 20, 2025): > i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json scheme in the format field. Hi @trebor it's not released yet
Author
Owner

@srshkmr commented on GitHub (Sep 24, 2025):

Hi @ParthSareen, any ETA on the release? Are there changes required on the PR?

<!-- gh-comment-id:3326947186 --> @srshkmr commented on GitHub (Sep 24, 2025): Hi @ParthSareen any ETA on the release? is there changes required on the PR?
Author
Owner

@MarioRicoIbanez commented on GitHub (Sep 29, 2025):

Any news on when it will be released?

<!-- gh-comment-id:3346091473 --> @MarioRicoIbanez commented on GitHub (Sep 29, 2025): Any news on when it will be released?
Author
Owner

@AlexanderKozhevin commented on GitHub (Oct 2, 2025):

funny thing, structured output does work on Groq cloud

<!-- gh-comment-id:3359845010 --> @AlexanderKozhevin commented on GitHub (Oct 2, 2025): funny thing, structured output does work on Groq cloud
Author
Owner

@ParthSareen commented on GitHub (Oct 2, 2025):

Had to make some updates to how we ran it. Just put up another PR. Aiming for next release.

<!-- gh-comment-id:3359852913 --> @ParthSareen commented on GitHub (Oct 2, 2025): Had to make some updates to how we ran it. Just put up another PR. Aiming for next release.
Author
Owner

@bogzbonny commented on GitHub (Oct 12, 2025):

haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)

<!-- gh-comment-id:3394006595 --> @bogzbonny commented on GitHub (Oct 12, 2025): haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)
Author
Owner

@ParthSareen commented on GitHub (Oct 12, 2025):

> haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)

Hmm it should be working... can you try running ollama run gpt-oss --format json hello!

and see if it shows thinking + the final output? if so it might be some weird client behavior

<!-- gh-comment-id:3394009427 --> @ParthSareen commented on GitHub (Oct 12, 2025): > haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen) Hmm it should be working... can you try running `ollama run gpt-oss --format json hello!` and see if it shows thinking + the final output? if so it might be some weird client behavior
Author
Owner

@vansatchen commented on GitHub (Oct 12, 2025):

> Hmm it should be working... can you try running ollama run gpt-oss --format json hello!
>
> and see if it shows thinking + the final output? if so it might be some weird client behavior

ollama run gpt-oss:20b --format json hello!
We need to respond to ":", basically greet, friendly.Hello! 👋 How can I help you today?Thinking...
We responded.We are done.
...done thinking.

Hey there! What's on your mind today? 😊 "}Error: error parsing tool call: raw='Hey there! What’s on your mind today? 😊 <|constrain|>  <|constrain|><|constrain|>.} ', err=invalid character 'H' looking for beginning of value

ollama -v
ollama version is 0.12.5

<!-- gh-comment-id:3394319282 --> @vansatchen commented on GitHub (Oct 12, 2025): > Hmm it should be working... can you try running `ollama run gpt-oss --format json hello!` > > and see if it shows thinking + the final output? if so it might be some weird client behavior ollama run gpt-oss:20b --format json hello! We need to respond to ":", basically greet, friendly.Hello! 👋 How can I help you today?Thinking... We responded.We are done. ...done thinking. Hey there! What's on your mind today? 😊✨ "}Error: error parsing tool call: raw='Hey there! What’s on your mind today? 😊 <|constrain|>  <|constrain|><|constrain|>.} ', err=invalid character 'H' looking for beginning of value ollama -v ollama version is 0.12.5
Author
Owner

@sheneman commented on GitHub (Oct 12, 2025):

So just to be clear, the fix for this issue has not yet been merged to main, as of 0.12.5?

<!-- gh-comment-id:3394677804 --> @sheneman commented on GitHub (Oct 12, 2025): So just to be clear, the fix for this issue has not yet been merged to main, as of 0.12.5?
Author
Owner

@ParthSareen commented on GitHub (Oct 12, 2025):

Ah I gave the wrong query @vansatchen @bogzbonny. Run just ollama run gpt-oss --format json and then type something to the model.

I haven't updated the generate endpoint yet there's some refactoring to do. Let me know if this doesn't work.

<!-- gh-comment-id:3394899420 --> @ParthSareen commented on GitHub (Oct 12, 2025): Ah I gave the wrong query @vansatchen @bogzbonny. Run just `ollama run gpt-oss --format json` and then type something to the model. I haven't updated the generate endpoint yet there's some refactoring to do. Let me know if this doesn't work.
Author
Owner

@vansatchen commented on GitHub (Oct 12, 2025):

> Ah I gave the wrong query @vansatchen @bogzbonny. Run just ollama run gpt-oss --format json and then type something to the model.
>
> I haven't updated the generate endpoint yet there's some refactoring to do. Let me know if this doesn't work.

ollama run gpt-oss --format json
>>> John Dohn 26 yo
Thinking...
The user: "John Dohn 26 yo". Likely they want to talk about health? The user might be asking for medical advice, perhaps about being 
26-year-old male named John Dohn. Maybe they want to know about his health, fitness, sleep, nutrition, mental health. Or it's a user 
profile snippet. The user hasn't asked a specific question. We need to respond appropriately. Usually, we ask clarifying question or 
ask what they need. The user could be prompting for an assessment. The system guidelines: cannot provide medical advice. But can 
provide general wellness tips, encourage professional help. So we can ask: "What can I help you with regarding John Dohn? Are you 
looking for health tips?" Provide general wellness info. Let's do that.
...done thinking.

{"response":"It looks like you’re mentioning a 26‑year‑old male named John Dohn. Could you let me know what you’d like help with? For 
example, are you looking for general wellness and lifestyle advice, or is there a specific concern or goal you have in mind? I’m 
happy to offer general information and resources—just keep in mind that I can’t give personalized medical advice or replace a 
professional consultation."}

>>> Send a message (/? for help)

<!-- gh-comment-id:3394916418 --> @vansatchen commented on GitHub (Oct 12, 2025): > Ah I gave the wrong query [@vansatchen](https://github.com/vansatchen) [@bogzbonny](https://github.com/bogzbonny). Run just `ollama run gpt-oss --format json` and then type something to the model. > > I haven't updated the generate endpoint yet there's some refactoring to do. Let me know if this doesn't work. ``` ollama run gpt-oss --format json >>> John Dohn 26 yo Thinking... The user: "John Dohn 26 yo". Likely they want to talk about health? The user might be asking for medical advice, perhaps about being 26-year-old male named John Dohn. Maybe they want to know about his health, fitness, sleep, nutrition, mental health. Or it's a user profile snippet. The user hasn't asked a specific question. We need to respond appropriately. Usually, we ask clarifying question or ask what they need. The user could be prompting for an assessment. The system guidelines: cannot provide medical advice. But can provide general wellness tips, encourage professional help. So we can ask: "What can I help you with regarding John Dohn? Are you looking for health tips?" Provide general wellness info. Let's do that. ...done thinking. {"response":"It looks like you’re mentioning a 26‑year‑old male named John Dohn. Could you let me know what you’d like help with? For example, are you looking for general wellness and lifestyle advice, or is there a specific concern or goal you have in mind? I’m happy to offer general information and resources—just keep in mind that I can’t give personalized medical advice or replace a professional consultation."} >>> Send a message (/? for help) ```
Author
Owner

@bogzbonny commented on GitHub (Oct 12, 2025):

@ParthSareen Okay cool, appreciated. I tried it and got similar output to @vansatchen. I'm not sure how to feed a schema from the CLI, but within ollama-rs it appears to be using the generate endpoint, hence I think I'm still blocked on the endpoint refactor you mentioned to get this working.

(also https://github.com/ollama/ollama/pull/12460 was merged into 0.12.5 @sheneman if you look at the commit history)

<!-- gh-comment-id:3395024144 --> @bogzbonny commented on GitHub (Oct 12, 2025): @ParthSareen Okay cool, appreciated. I tried it and got similar output to @vansatchen I'm not sure how to feed a schema from the CLI but within ollama-rs it appears to be using the generate endpoints HENCE I think I'm still blocked on that endpoint refactor you mentioned get this operating. (also https://github.com/ollama/ollama/pull/12460 was merged into 0.12.5 @sheneman if you look at the commit history)
Author
Owner

@trebor commented on GitHub (Oct 16, 2025):

i have been testing ollama 0.12.5 with the most recent gpt-oss:20b, see below for specific examples. is this the expected behavior? is the change maybe still percolating through the system? am i calling it wrong?

curl commands i used to test:

curl 'http://localhost:11434/api/generate' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'

and am still seeing an empty response. for completeness:

{"model":"gpt-oss:20b","created_at":"2025-10-16T22:39:42.787141Z","response":"","done":true,"done_reason":"stop","context":[200006,17360,200008,3575,553,17554,162016,11,261,4410,6439,2359,22203,656,7788,17527,558,87447,100594,25,220,1323,19,12,3218,198,6576,3521,25,220,1323,20,12,702,12,1125,279,30377,289,25,14093,279,2,13888,18403,25,8450,11,49159,11,1721,13,21030,2804,413,7360,395,1753,3176,13,200007,200006,1428,200008,47312,261,2086,198,200007,200006,173781,16,220],"total_duration":639591959,"load_duration":152973209,"prompt_eval_count":71,"prompt_eval_duration":279207292,"eval_count":3,"eval_duration":50198457}%

if i use qwen:14b, for example:

curl 'http://localhost:11434/api/generate' --data-raw '{"model":"qwen3:14b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'

i see what i would expect:

{"model":"qwen3:14b","created_at":"2025-10-16T22:45:54.903555Z","response":"8\n\n","done":true,"done_reason":"stop","context":[151644,872,198,27052,264,1372,198,151645,198,151644,77091,198,23,271],"total_duration":594441292,"load_duration":85178125,"prompt_eval_count":12,"prompt_eval_duration":394819125,"eval_count":3,"eval_duration":89476292}%

<!-- gh-comment-id:3413155345 --> @trebor commented on GitHub (Oct 16, 2025): i have been testing ollama 0.12.5 with the most recent gpt-oss:20b, see below for specific examples. is this the expected behavior? is the change maybe still percolating through the system? am i calling it wrong? curl commands i used to test: `curl 'http://localhost:11434/api/generate' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'` and am still seeing an empty response. for completeness: `{"model":"gpt-oss:20b","created_at":"2025-10-16T22:39:42.787141Z","response":"","done":true,"done_reason":"stop","context":[200006,17360,200008,3575,553,17554,162016,11,261,4410,6439,2359,22203,656,7788,17527,558,87447,100594,25,220,1323,19,12,3218,198,6576,3521,25,220,1323,20,12,702,12,1125,279,30377,289,25,14093,279,2,13888,18403,25,8450,11,49159,11,1721,13,21030,2804,413,7360,395,1753,3176,13,200007,200006,1428,200008,47312,261,2086,198,200007,200006,173781,16,220],"total_duration":639591959,"load_duration":152973209,"prompt_eval_count":71,"prompt_eval_duration":279207292,"eval_count":3,"eval_duration":50198457}%` if i use qwen:14b, for example: `curl 'http://localhost:11434/api/generate' --data-raw '{"model":"qwen3:14b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'` i see what i would expect: `{"model":"qwen3:14b","created_at":"2025-10-16T22:45:54.903555Z","response":"8\n\n","done":true,"done_reason":"stop","context":[151644,872,198,27052,264,1372,198,151645,198,151644,77091,198,23,271],"total_duration":594441292,"load_duration":85178125,"prompt_eval_count":12,"prompt_eval_duration":394819125,"eval_count":3,"eval_duration":89476292}%`
Author
Owner

@ParthSareen commented on GitHub (Oct 16, 2025):

hi @trebor sorry for the lack of documentation at the moment. It should work for the /chat endpoint. Need to do some cleanup on /generate before we can support it there

<!-- gh-comment-id:3413160708 --> @ParthSareen commented on GitHub (Oct 16, 2025): hi @trebor sorry for the lack of documentation at the moment. It should work for the `/chat` endpoint. Need to do some cleanup on `/generate` before we can support it there
Author
Owner

@trebor commented on GitHub (Oct 16, 2025):

oh got it, thank you!

<!-- gh-comment-id:3413161757 --> @trebor commented on GitHub (Oct 16, 2025): oh got it, thank you!
Author
Owner

@ParthSareen commented on GitHub (Oct 16, 2025):

Hey folks it should be out! Closing this issue. It'll work as expected with the /chat endpoint. /generate will come at some point but might be a bit. Just wanted to unblock everyone!

<!-- gh-comment-id:3413162849 --> @ParthSareen commented on GitHub (Oct 16, 2025): Hey folks it should be out! Closing this issue. It'll work as expected with the `/chat` endpoint. `/generate` will come at some point but might be a bit. Just wanted to unblock everyone!
Author
Owner

@dhicks commented on GitHub (Oct 16, 2025):

Could I suggest leaving this open until the issue has been resolved for /generate as well?

<!-- gh-comment-id:3413168288 --> @dhicks commented on GitHub (Oct 16, 2025): Could I suggest leaving this open until the issue has been resolved for `/generate` as well?
Author
Owner

@trebor commented on GitHub (Oct 16, 2025):

btw: here is a minimal example curl that worked for me:

curl 'http://localhost:11434/api/chat' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"messages":[{"content": "choose a number", "role": "user"}]}'

huge thanks to @ParthSareen!

<!-- gh-comment-id:3413205363 --> @trebor commented on GitHub (Oct 16, 2025): btw: here is a minimal example curl that worked for me: `curl 'http://localhost:11434/api/chat' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"messages":[{"content": "choose a number", "role": "user"}]}'` huge thanks to @ParthSareen!
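The same call from Python, assuming the official ollama client package (0.4 or newer), which accepts a JSON schema in the format argument:

```python
import ollama

resp = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "choose a number"}],
    format={"type": "integer", "minimum": 1, "maximum": 10},
)
# Newer clients return a typed response object; older ones also support dict-style
# access via resp["message"]["content"].
print(resp.message.content)  # e.g. "7"
```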
Author
Owner

@jacksimpsoncartesian commented on GitHub (Oct 31, 2025):

So glad I found this - thought I was doing something wrong when I was getting the wrong structured outputs. Any word on whether this is likely to be fixed?

<!-- gh-comment-id:3470810744 --> @jacksimpsoncartesian commented on GitHub (Oct 31, 2025): So glad I found this - thought I was doing something wrong when I was getting the wrong structured outputs. Any word on whether this is likely to be fixed?
Author
Owner

@sheneman commented on GitHub (Nov 23, 2025):

Hello @ParthSareen - Has this fix been addressed in /generate and pulled into the main branch? Thank you for your consideration.

<!-- gh-comment-id:3568030561 --> @sheneman commented on GitHub (Nov 23, 2025): Hello @ParthSareen - Has this fix been addressed in /generate and pulled into the main branch? Thank you for your consideration.
Author
Owner

@chakka-guna-sekhar-venkata-chennaiah commented on GitHub (Nov 26, 2025):

@sheneman
hey hi, but it's working for me when I use the import statement from langchain_ollama import ChatOllama, where I call the model as

llm = ChatOllama(
    base_url="https://ollama-testing-gpt-oss-20b-433688334338.europe-west1.run.app",
    model="gpt-oss:20b",
    temperature=0
)

llm_structured = llm.with_structured_output(SystemHierarchyResult, include_raw=True)

it worked for me. In the response I'm getting JSON like

{
  'raw': xxx,
  'response_metadata' : xxx,
  'parsed': xxx,
  'parsed_error' : None
}
<!-- gh-comment-id:3578911754 --> @chakka-guna-sekhar-venkata-chennaiah commented on GitHub (Nov 26, 2025): @sheneman hey hi, but for its workign when i used the improt statement `from langchain_ollama import ChatOllama` where i called model as ``` llm = ChatOllama( base_url="https://ollama-testing-gpt-oss-20b-433688334338.europe-west1.run.app", model="gpt-oss:20b", temperature=0 ) llm_structured = llm.with_structured_output(SystemHierarchyResult, include_raw=True) ``` its worked for me. In the reponse im getting json as ``` { 'raw': xxx, 'response_metadata' : xxx, 'parsed': xxx, 'parsed_error' : None } ```
Author
Owner

@bogzbonny commented on GitHub (Nov 27, 2025):

still doesn't work for me (using ollama-rs)

<!-- gh-comment-id:3587179887 --> @bogzbonny commented on GitHub (Nov 27, 2025): still doesn't work for me with (using `ollama-rs`)
Author
Owner

@4IbWNsis3S commented on GitHub (Dec 7, 2025):

> still doesn't work for me (using ollama-rs)

The OpenAI harmony format used in OSS 20b and 120b breaks Ollama's response handling. It's been over two months since 20b and 120b were released, and it's still broken in the current release, 0.13.1.

<!-- gh-comment-id:3623099823 --> @4IbWNsis3S commented on GitHub (Dec 7, 2025): > still doesn't work for me with (using `ollama-rs`) OpenAI harmony format used in OSS 20b and 120b breaks Ollama response handling. It's been over two-months since 20b and 120b were released and it's still broken in current/0.13.1