[GH-ISSUE #11691] Structured output with OpenAI SDK and gpt-oss:20b not working #7736

@dontriskit commented on GitHub (Aug 8, 2025):

same issue with vLLM

resolved with official vllm docs

GiteaMirror commented

@Koki-Itai commented on GitHub (Aug 8, 2025):

same issue

@tttturtle-russ commented on GitHub (Aug 8, 2025):

same issue here, it's an important feature.

2026-04-12 19:51:48 -05:00

@ddudek commented on GitHub (Aug 11, 2025):

As a better workaround, you should put the schema in the "developer" role, e.g.:

[
  {
    "role": "system",
    "content": "You are helpful coding expert that outputs JSON response."
  },
  {
    "role": "developer",
    "content": "# Instructions\nRespond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.\nJSON schema: {'$defs': ... <your schema here> }"
  },
  {
    "role": "user",
    "content": "## The task\n..."
  }
]

The above works pretty well, although the model also outputs "analysis" and "commentary" streams, e.g.:

<|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... >

<|start|>assistant<|channel|>final<|message|>{"my_field": "some content"}

So adding this to the system prompt again improves the output:

"Reasoning: low
# Valid channels: final."

Full example:

[
  {
    "role": "system",
    "content": "You are helpful coding expert that outputs JSON response.\n\nReasoning: low\n# Valid channels: final."
  },
  {
    "role": "developer",
    "content": "# Instructions\nRespond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.\nJSON schema: {'$defs': ... <your schema here> }"
  },
  {
    "role": "user",
    "content": "## The task\n..."
  }
]

Output:
<|channel|>final<|message|>{"my_field": "some content"}
You still need to strip the leading "<|channel|>final<|message|>", but this gives very stable behavior.

This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well.
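The prompt layout and the prefix stripping described above can be sketched in Python. This is only a minimal illustration: `build_messages` and `extract_final_json` are hypothetical helper names, and the end-of-message tokens `<|return|>` and `<|end|>` are an assumption about what a server might leave in the text, so adjust them to what you actually receive.

```python
import json

# Marker that precedes the final answer in the outputs quoted above.
FINAL_MARKER = "<|channel|>final<|message|>"
# Tokens that may trail the final message (assumption, not seen in the samples above).
END_TOKENS = ("<|return|>", "<|end|>")


def build_messages(schema: dict, task: str) -> list[dict]:
    """Build the system/developer/user messages with the schema in the developer role."""
    return [
        {
            "role": "system",
            "content": "You are helpful coding expert that outputs JSON response.\n\n"
                       "Reasoning: low\n# Valid channels: final.",
        },
        {
            "role": "developer",
            "content": "# Instructions\nRespond in JSON format. Only output valid JSON, "
                       "do not include any explanations or markdown formatting. "
                       "Ensure all required fields are included.\n"
                       "JSON schema: " + json.dumps(schema),
        },
        {"role": "user", "content": task},
    ]


def extract_final_json(raw: str) -> dict:
    """Parse the JSON that follows the last final-channel marker in raw model output."""
    # rpartition returns the whole string as the tail when the marker is absent.
    text = raw.rpartition(FINAL_MARKER)[2]
    # Drop anything from the first end-of-message token onwards.
    for token in END_TOKENS:
        text = text.split(token, 1)[0]
    return json.loads(text.strip())
```

The messages list can be passed to any OpenAI-compatible chat client; the parser deliberately ignores everything before the last `final` marker, so leftover analysis text does not break `json.loads`.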

@youngbinkim0 commented on GitHub (Aug 11, 2025):

> same issue with vLLM
>
> resolved with official vllm docs

Can you link to the official vLLM doc you are referring to? I'm running into the same issue, @dontriskit.

@youngbinkim0 commented on GitHub (Aug 11, 2025):

> As a better workaround, you should put the schema in the "developer" role, e.g.:
>
> [
>   {
>     "role": "system",
>     "content": "You are helpful coding expert that outputs JSON response."
>   },
>   {
>     "role": "developer",
>     "content": "# Instructions\nRespond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.\nJSON schema: {'$defs': ... <your schema here> }"
>   },
>   {
>     "role": "user",
>     "content": "## The task\n..."
>   }
> ]
>
> The above works pretty well, although the model also outputs "analysis" and "commentary" streams, e.g.:
>
> <|channel|>analysis<|message|>We need JSON output following schema: My class with my_field string. Contains <blah blah for another 4 paragraphs... >
>
> <|start|>assistant<|channel|>final<|message|>{"my_field": "some content"}
>
> So adding this to the system prompt again improves the output:
>
> "Reasoning: low
> # Valid channels: final."
>
> Full example:
>
> [
>   {
>     "role": "system",
>     "content": "You are helpful coding expert that outputs JSON response.\n\nReasoning: low\n# Valid channels: final."
>   },
>   {
>     "role": "developer",
>     "content": "# Instructions\nRespond in JSON format. Only output valid JSON, do not include any explanations or markdown formatting. Ensure all required fields are included.\nJSON schema: {'$defs': ... <your schema here> }"
>   },
>   {
>     "role": "user",
>     "content": "## The task\n..."
>   }
> ]
>
> Output: <|channel|>final<|message|>{"my_field": "some content"} still needs removing "<|channel|>final<|message|>" but gives very stable behavior.
>
> This is nicely documented in the cookbook https://cookbook.openai.com/articles/openai-harmony#developer-message-format and looks like the model follows this very well.

While this does work for many schemas, it's not the same as enforcing structured outputs. As noted in the structured output section of the cookbook:

"This prompt alone will, however, only influence the model’s behavior but doesn’t guarantee the full adherence to the schema. For this you still need to construct your own grammar and enforce the schema during sampling."

https://cookbook.openai.com/articles/openai-harmony#structured-output
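Because the prompt only influences the model, a caller who cannot enforce a grammar during sampling is left with checking the reply and retrying. The following is a rough sketch of that fallback, not a substitute for constrained sampling: `field_problems` and `generate_checked` are hypothetical names, the check covers only required top-level fields and their Python types rather than full JSON Schema, and `generate` stands in for whatever function calls your model.

```python
import json
from typing import Callable


def field_problems(data: dict, required: dict[str, type]) -> list[str]:
    """List missing or mistyped top-level fields (a tiny subset of JSON Schema)."""
    problems = []
    for name, expected in required.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


def generate_checked(
    generate: Callable[[], str],
    required: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    """Call generate() until it returns JSON that passes the field check."""
    for _ in range(max_attempts):
        raw = generate()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON at all, try again
        if isinstance(data, dict) and not field_problems(data, required):
            return data
    raise ValueError(f"no valid reply after {max_attempts} attempts")
```

Unlike grammar enforcement, this can still fail after every attempt, and each retry costs a full generation.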

@lachlansleight commented on GitHub (Aug 12, 2025):

> Yeah, if a model does not support JSON output format schema, then it is functionally a fun toy to play with, but not actually useful as an agent.

Agreed - since identifying this as an issue I basically put GPT-OSS down and haven't touched it since. It's interesting to see how it differs from other agents, but without JSON output it's completely useless for anything other than basic chat applications.

I didn't realise how harmful Harmony would be to the uptake of GPT-OSS. It's so bad that I almost want to put on my tin foil hat and wonder whether OpenAI is trying to intentionally harm the open-weight community by fragmenting the ecosystem with a complex, difficult-to-implement response format.

@ParthSareen commented on GitHub (Aug 12, 2025):

Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!

@sheneman commented on GitHub (Aug 13, 2025):

@ParthSareen: Thank you so much for your attention to the issue of structured outputs and thinking models in Ollama, including gpt-oss. This issue has been a roadblock for my organization using Ollama for a while. We love Ollama but have been considering alternatives because of this limitation on the effective use of thinking models.

I assume the issue with gpt-oss is at least related to open issue #523 (https://github.com/ollama/ollama-python/issues/523).

But I assume the problem is compounded with gpt-oss because of the use of the Harmony response format.

Again, thank you and the Ollama team for prioritizing this!

@nicholas-johnson-techxcel commented on GitHub (Aug 13, 2025):

Yeah progress, the GBNF generator is now pretty stable and I wrote a function to turn chat history into Harmony. The model is quite smart (for a local model) but also unhinged (although it might be that I have not yet tuned llama.cpp well). It tries to put reasoning output in URL params of tools, they really need to find a way to make it think without emitting those tokens. It can handle web browsing tasks using Playwright but I keep overflowing the context window so now I am writing a full node based infrastructure where we can have nodes consolidate / summarise the history and have tool calls be able to be more context aware and veto certain steps.

I got that idea from Haystack, but their tooling is hardly type-safe (it was good as a proof of concept), and it was only after a tonne of complaining that they fixed the issue where importing Haystack made half the lines in your Python file light up bright red with compiler errors. I am still baffled by the fact that everyone is using old Python syntax, with type-safety as an afterthought. I am new to Python and somehow it feels like I have to write almost everything myself, because too many of the libraries are missing stubs or have serious issues. I mean, is it so hard for OpenAI, Firecrawl-py, etc. to give back an instance of a Pydantic class when you give the function that class, and a dict when you hand it a dict? If you are streaming, you can use a small state machine to parse half-completed JSON, and you can use a Pydantic alias generator so that the JSON is camelCase (JSON was born from JS and this is best practice) while the fields of the finished Pydantic class instance you await in Python are in snake_case, as is idiomatic for Python.

Python still has poor generic templating (things like cannot specify the generic type of the function you are calling without it parsing it as array indexing, and it does not allow each overload to have its own code, because it has no idea which one will be used until runtime) and inheritance support (it only enforces checking if you mark as @final, and there is no way of requiring an instance passed in having been marked as final), but it is enough to do the trick. Well, at least it has come a long way, right? Pydantic is a life-saver, although under the hood it has some code I am not pleased with, and it had failed to implement certain cases, and the field aliasing is a nightmare - but I found a good workaround - use an alias generator and a switch statement inside of the generator with a case for each field.

Golden rule of LLMs, because they are extremely obtuse and get distracted easily (I believe next-token prediction to be merely a stepping stone): use highly constrained schemas by rooting around under the hood of Pydantic, cut down the number of choices the model has, and filter out noisy data (a recent research paper, "Let me speak freely", argues against this, but it does not seem to have used GBNF as a constraint method). The most frustrating thing is that I should not need to prompt an LLM to understand causality and the passage of time. The models try to emit all of the tool calls in one round, making bad assumptions about what the future state will be. Making them follow instructions is near impossible. But this gpt-oss:20b is more promising. We will have the hardware to run gpt-oss:120b in a couple of weeks, but my opinion is that it will likely be a dud, not worth the 6x RAM and compute required compared to the 20b version.

I personally hope that all of these things get sorted. I mean there is no agent library I can download which has good general-purpose performance out of the box. I hope to change that. But OpenAI has just muddied the water with Harmony (ironic), there is no universally standardised LLM interface in Python, and the fact that I have to implement JSON schema support on my own is just wild. Again, the industry has billions of dollars and somehow they did not add support for JSON schema in llama.cpp when it's only a little over 300 lines of code. And I don't see how this code couldn't have been in there before, rendering the issue with gpt-oss to be only a matter of converting chat history to Harmony. Maybe they have kept a lot of features closed-source. Anyway, I want to be dealing with LLM concepts, not Python concepts. There are things like MCP (relatively clear) and A2A (can't even figure out at what level of abstraction is it meant to be at) and neither of these things have standardised the LLM interface.

One thing is clear, though: GBNF is the way. I see limitless possibilities with this, provided I can continue writing working compilers for it. It is also food for thought for how I was trying to train my own model (not a next-token predictor) in my own time: I had issues formalising grammatical concepts for the training process, and this could be the thing I need. Ditching next-token prediction should allow the inlining of classical functions (latch float inputs to the closest binary state, one bit per input) to give LLMs extraordinary mathematical abilities without needing to execute Python or other script parsers. I would also wager that similar techniques, but for neural nets, could solve the RSA problem. I was also struggling to understand how the Ollama JSON response format was implemented - it seemed like a mix of prompting, two-shot examples, and re-prompting. But GBNF, from what I can tell, eliminates illegal next tokens from the probability list (compiled in), meaning that it is relatively fail-safe and does not waste time having to go back and re-generate.

If anyone knows of other gems like GBNF, I would love to hear - I may be overlooking other great solutions.
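The idea of eliminating illegal next tokens from the probability list can be illustrated with a toy mask over logits. This is a simplified sketch and not the llama.cpp or Ollama implementation: `mask_and_normalise` is a hypothetical name, the vocabulary is three made-up tokens, and the allowed set is passed in by hand, whereas a real grammar engine derives it from its parser state at each step.

```python
import math


def mask_and_normalise(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Give disallowed tokens zero probability, then softmax over the rest."""
    # Tokens the grammar forbids get a logit of -inf, i.e. probability 0.
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    top = max(masked.values())
    if top == -math.inf:
        raise ValueError("grammar allows none of the candidate tokens")
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {token: math.exp(score - top) for token, score in masked.items()}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


# The model prefers "Sure", but at the start of a JSON object only "{" is legal.
probabilities = mask_and_normalise({"{": 1.0, "Sure": 5.0, '"': 0.5}, allowed={"{"})
```

Because the forbidden tokens can never be sampled, nothing has to be regenerated afterwards, which matches the "relatively fail-safe" behaviour described above.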

@rick-github commented on GitHub (Aug 13, 2025):

> I also was struggling to understand how the Ollama JSON response format was implemented

Ollama uses GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and the GBNF grammar should only be applied during content generation.
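One way to picture "only apply the grammar during content generation" is a small state machine over the token stream that switches the constraint on once the final channel opens. The sketch below is a toy under stated assumptions, not Ollama's runner code: `constrained_positions` is a hypothetical name, and it assumes each Harmony marker and the word `final` arrive as single tokens.

```python
END_TOKENS = ("<|end|>", "<|return|>")


def constrained_positions(tokens: list[str]) -> list[bool]:
    """For each token, report whether the grammar would be active when sampling it."""
    flags = []
    active = False
    for index, token in enumerate(tokens):
        # The flag describes the state *before* this token is sampled.
        flags.append(active)
        if token in END_TOKENS:
            # The message is over, so stop constraining.
            active = False
        elif token == "<|message|>" and tokens[max(0, index - 2):index] == ["<|channel|>", "final"]:
            # The final channel just opened: constrain everything that follows.
            active = True
    return flags


stream = [
    "<|channel|>", "analysis", "<|message|>", "We", "need", "JSON", "<|end|>",
    "<|start|>", "assistant", "<|channel|>", "final", "<|message|>",
    "{", '"my_field"', "}",
]
flags = constrained_positions(stream)
```

In this stream the analysis tokens are sampled freely and only the three tokens after the final `<|message|>` fall under the grammar.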

@Croups commented on GitHub (Aug 13, 2025):

Same issue here. I noticed that it sometimes returns null when used via pydantic-ai. I tested it without defining a structured output schema, and it generates the answer with tags. Here is a sample:

AgentRunResult(output='{"analysis":"The user says \'hi how are you\', which is a greeting and question about how I am. The user wants to know what ChatGPT is. So respond with a friendly greeting and explanation.<|channel|>commentary:"} ')

This is why Pydantic can't parse the output when you define a model.

@nicholas-johnson-techxcel commented on GitHub (Aug 15, 2025):

> > I also was struggling to understand how the Ollama JSON response format was implemented
>
> Ollama uses GBNF to implement structured outputs. The problem with applying it to reasoning models is that it currently also constrains the reasoning phase to the GBNF grammar, which compromises quality. Ideally the model should be allowed to consider the full gamut of probabilistic generation during reasoning, and the GBNF grammar should only be applied during content generation.

Can't it reason silently and not emit these symbols? Besides, I have been trying to disable reasoning as it just adds latency, and I can easily add reasoning fields to the JSON output of non-reasoning models if and when it makes sense for the application. Sometimes I have found it creates a better agent, other times it is just wasting electricity and time.

@rick-github commented on GitHub (Aug 15, 2025):

Models don't have an internal monologue or subconscious, all they do is probabilistically generate tokens. "Reasoning" models are trained to generate "thinking" tokens as a way to guide the generation of tokens in the response phase, but it's just tokens all the way down.

@adamoutler commented on GitHub (Aug 15, 2025):

Confirmed. Same issue. Any model except gpt-oss seems to work with Structured Outputs. gpt-oss returns a blank. I hope to see an adapter in Ollama soon.

@adamoutler commented on GitHub (Aug 15, 2025):

Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point.

@nicholas-johnson-techxcel commented on GitHub (Aug 20, 2025):

> Is this being worked on? That thinking section on the json... It's a likely culprit and a good starting point.

I did this in Python and llama.cpp, but I have since found a JSON=>GBNF compiler in the Ollama source code. You just need to hit llama.cpp with grammar=gbnf and then it works, but because it is a reasoning model, it then becomes a bit unhinged and tries to insert reasoning into JSON fields instead of keeping the reasoning internal.

All Ollama had to do was use the grammar just like it already does, and it would work to the extent that I get from llama.cpp (we still need to stop it from reasoning when we use think=False, which it ignores), but for some reason they seem to have made an exception for this model and hence broke it. If I get some time I can look at their code and submit a patch.

This also begs the question: if Ollama is a wrapper around llama.cpp then it could just become a python library which adds features to llama.cpp (basically a llama.cpp client library) and the actual Ollama server become a heap of scripts for running llama.cpp as a service and automatically pulling models down for it.
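For readers wondering what "hit llama.cpp with grammar=gbnf" might look like, here is a hedged sketch of a request to a llama.cpp server's `/completion` endpoint with a hand-written GBNF grammar for a single string field. The base URL, the helper names, and the grammar itself are assumptions made for illustration; the author's generated grammar and Harmony prompt conversion are not shown in the thread.

```python
import json
import urllib.request

# Hand-written GBNF for an object with one string field named my_field (illustrative only).
GRAMMAR = r'''
root   ::= "{" ws "\"my_field\"" ws ":" ws string ws "}"
string ::= "\"" [^"\\]* "\""
ws     ::= [ \t\n]*
'''.strip()


def build_completion_payload(prompt: str, grammar: str = GRAMMAR, n_predict: int = 256) -> dict:
    """Build the JSON body for a grammar-constrained completion request."""
    return {"prompt": prompt, "grammar": grammar, "n_predict": n_predict}


def complete(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Send the request to a running llama.cpp server (assumed to listen at base_url)."""
    body = json.dumps(build_completion_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["content"]
```

Note that a grammar applied from the first token also constrains any reasoning the model wants to emit, which is the behaviour the comment above describes as reasoning leaking into JSON fields.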

@mpauly commented on GitHub (Aug 20, 2025):

@nicholas-johnson-techxcel With regards to llama.cpp: there is an open issue for structured outputs in llama.cpp and things are mostly working.
Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama

@mpauly commented on GitHub (Aug 20, 2025): @nicholas-johnson-techxcel With regards to llama.cpp: there is an [open issue](https://github.com/ggml-org/llama.cpp/issues/15276#issuecomment-3201937062) for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama

GiteaMirror commented

@ParthSareen commented on GitHub (Aug 20, 2025):

Hey @nicholas-johnson-techxcel @mpauly that's not how it works - we're not consuming any of the llama.cpp changes for structured outputs - although we do use GBNF.

You can't turn thinking "off" for this model - those tokens have to get generated as this model follows the Harmony format.

@adamoutler As mentioned above I have started working on this. It's not a trivial change unfortunately and needs some moving around of where our parsers current live. Thanks for your patience friends!

Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!

@ParthSareen commented on GitHub (Aug 20, 2025): Hey @nicholas-johnson-techxcel @mpauly that's not how it works - we're not consuming any of the llama.cpp changes for structured outputs - although we do use GBNF. You can't turn thinking "off" for this model - those tokens have to get generated as this model follows the [Harmony](https://github.com/openai/harmony) format. @adamoutler As mentioned above I have started working on this. It's not a trivial change unfortunately and needs some moving around of where our parsers current live. Thanks for your patience friends! > Hey folks! Just came across this issue. We currently do not support structured outputs with this model or other thinking models. Working on a change later this week to bring some of the token parsing down to the runner level. With that we'd be able to do grammar sampling and token parsing closer together to know when to start the constrained sampling. Sorry for the delay!

GiteaMirror commented

2026-04-12 19:51:53 -05:00

@nicholas-johnson-techxcel commented on GitHub (Aug 25, 2025):

@nicholas-johnson-techxcel With regards to llama.cpp: there is an open issue for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama

I thought that llama.cpp did not have the structured output field, that you have to give GBNF, and it most certainly is working already, I now just compile JSON to GBNF and reluctantly make the root element to be <thinking>.*</thinking>{json} or just <thinking></thinking>{json} if I want to disable reasoning, and I just capture the thinking tags in a state machine, emit chunk messages with role="thinking" for those (and handle any split messages) and either put the structured messages through a JSON stream parser, or accumulate them and then parse for tool calling.

The way I see it, the issue is Ollama.

If anyone is wondering, forcing it to output <thinking></thinking> does seem to properly disable reasoning - despite those saying it cannot be - it stops it from trying to cram reasoning into JSON fields, and this results in significant decreases in latency.

This model still might be one of my favourites for local use, but it has still not caught up to 4o.

@nicholas-johnson-techxcel commented on GitHub (Aug 25, 2025): > [@nicholas-johnson-techxcel](https://github.com/nicholas-johnson-techxcel) With regards to llama.cpp: there is an [open issue](https://github.com/ggml-org/llama.cpp/issues/15276#issuecomment-3201937062) for structured outputs in llama.cpp and things are mostly working. Those changes would need to be merged into llama.cpp, and could then eventually trickle down/be ported to ollama I thought that llama.cpp did not have the structured output field, that you have to give GBNF, and it most certainly is working already, I now just compile JSON to GBNF and reluctantly make the root element to be `<thinking>.*</thinking>{json}` or just `<thinking></thinking>{json}` if I want to disable reasoning, and I just capture the thinking tags in a state machine, emit chunk messages with role="thinking" for those (and handle any split messages) and either put the structured messages through a JSON stream parser, or accumulate them and then parse for tool calling. The way I see it, the issue is Ollama. If anyone is wondering, forcing it to output `<thinking></thinking>` does seem to properly disable reasoning - despite those saying it cannot be - it stops it from trying to cram reasoning into JSON fields, and this results in significant decreases in latency. This model still might be one of my favourites for local use, but it has still not caught up to 4o.

GiteaMirror commented

@CL415 commented on GitHub (Aug 26, 2025):

In case somebody needs to force GPT-OSS to output valid JSON despite Ollama's shortcomings like I had, I found some success using Pydantic AI .run_sync, with the Ollama server as provider, although using the retrial parameter since sometimes OSS does not comply on the first shot.

@CL415 commented on GitHub (Aug 26, 2025): In case somebody needs to force GPT-OSS to output valid JSON despite Ollama's shortcomings like I had, I found some success using Pydantic AI `.run_sync`, with the Ollama server as `provider`, although using the retrial parameter since sometimes OSS does not comply on the first shot.

GiteaMirror commented

2026-04-12 19:51:53 -05:00

@nicholas-johnson-techxcel commented on GitHub (Aug 27, 2025):

Okay in terms of progress on my side. It was working extremely well except I had to clamp the number of analysis/thinking chars to stop it rambling. But it seems that if I wrap the JSON-GBNF with GBNF for the harmony format, it no longer needs clamping. But I cannot force it with GBNF to emit tokens like <|end|> which is an issue, because then feeding it into the openai-harmony library encoder, it does not strip the harmony frames from it properly. It's not actually that hard to process streaming chunks using a state machine, but still, it would be best doing it properly. Anyone have any ideas on forcing <|end|> to be emitted with GBNF?

@nicholas-johnson-techxcel commented on GitHub (Aug 27, 2025): Okay in terms of progress on my side. It was working extremely well except I had to clamp the number of analysis/thinking chars to stop it rambling. But it seems that if I wrap the JSON-GBNF with GBNF for the harmony format, it no longer needs clamping. But I cannot force it with GBNF to emit tokens like <|end|> which is an issue, because then feeding it into the openai-harmony library encoder, it does not strip the harmony frames from it properly. It's not actually that hard to process streaming chunks using a state machine, but still, it would be best doing it properly. Anyone have any ideas on forcing <|end|> to be emitted with GBNF?

GiteaMirror commented

2026-04-12 19:51:53 -05:00

@erennyuksell commented on GitHub (Aug 29, 2025):

any reliable solution?

@erennyuksell commented on GitHub (Aug 29, 2025): any reliable solution?

GiteaMirror commented

2026-04-12 19:51:54 -05:00

@steenharsted commented on GitHub (Aug 29, 2025):

I’m experiencing the same issue with gpt-oss:20b usingchat_ollama() and chat_structured() from ellmer. The model consistently returns truncated or non-JSON responses. This error persists even after:

Explicit JSON schema enforcement
Prompting “reply only in JSON”
Minimizing prompt size
Switching to other models (which work)

The output appears to be malformed almost every time.

Are there any plans to improve structured output compliance for gpt-oss:20b?

Thanks

@steenharsted commented on GitHub (Aug 29, 2025): I’m experiencing the same issue with `gpt-oss:20b` using`chat_ollama()` and `chat_structured()` from `ellmer`. The model consistently returns truncated or non-JSON responses. This error persists even after: - Explicit JSON schema enforcement - Prompting “reply only in JSON” - Minimizing prompt size - Switching to other models (which work) The output appears to be malformed almost every time. Are there any plans to improve structured output compliance for `gpt-oss:20b`? Thanks

GiteaMirror commented

2026-04-12 19:51:54 -05:00

@josemita87 commented on GitHub (Aug 29, 2025):

Same issue here with structured outputs...

@josemita87 commented on GitHub (Aug 29, 2025): Same issue here with structured outputs...

GiteaMirror commented

2026-04-12 19:51:55 -05:00

@rick-github commented on GitHub (Aug 29, 2025):

Are there any plans to improve structured output compliance for gpt-oss:20b?

https://github.com/ollama/ollama/issues/11691#issuecomment-3181220084

@rick-github commented on GitHub (Aug 29, 2025): > Are there any plans to improve structured output compliance for `gpt-oss:20b`? https://github.com/ollama/ollama/issues/11691#issuecomment-3181220084

GiteaMirror commented

2026-04-12 19:51:55 -05:00

@rakadam commented on GitHub (Sep 1, 2025):

I have the same problem. I looked at the code and Ollama was designed in a way that the HarmonyParser runs on high level, while the sampler runs in the cpp code, with some go code for glue. And it is not possible to connect them, so the sampler cannot know when it is supposed to apply the grammar or not. Since the grammar is only valid inside the message, and Harmony formatting is outside the message, this is a big problem.

One not terribly insane solution, already mentioned in this thread: implementing a minimalistic Harmony parser in the go sampler glue code, so it knows when to enable the grammar constraining. Or this could be calling HarmonyParser basically in both layers.

@rakadam commented on GitHub (Sep 1, 2025): I have the same problem. I looked at the code and Ollama was designed in a way that the HarmonyParser runs on high level, while the sampler runs in the cpp code, with some go code for glue. And it is not possible to connect them, so the sampler cannot know when it is supposed to apply the grammar or not. Since the grammar is only valid _inside_ the message, and Harmony formatting is outside the message, this is a big problem. One not terribly insane solution, already mentioned in this thread: implementing a minimalistic Harmony parser in the go sampler glue code, so it knows when to enable the grammar constraining. Or this could be calling HarmonyParser basically in both layers.

GiteaMirror commented

2026-04-12 19:51:55 -05:00

@nicholas-johnson-techxcel commented on GitHub (Sep 2, 2025):

Okay got it done. The steps are:

Use llamacpp /completions (not /v1/completions - this is not the same endpoint)
With prompt as list[int] which is list of output tokens from enc.render_conversation_for_completion(Conversation.from_messages(messages), Role.ASSISTANT) from openai_harmony library
tokens=True in body
Write a gbnf compiler which takes a json schema as input
Rename the "root" rule to something else
gbnf["root"] = f'thinking-block "{tok_start}" "{tok_assistant}" "{tok_channel}" "final" "{tok_constrain}" "json" "{tok_message}" json-root "{tok_end}"' where the tok_X are like "<|start|>", etc
thinking-block is basically the same thing except with analysis in the channel name, and any number of characters which are not "<" as the thought content
Send the request
Parse response tokens through StreamableParser.process() one by one
You have a new channel every time you find the token=200002 - use this to split up messages - read StreamableParser.current_channel to know if
We have a bug in StreamableParser.last_content_delta where we cannot use it because it is contaminated with Harmony tags which should have been filtered out so instead we use current_content and diff that from last iteration to get the channel delta. Actually nevermind, turns out I have been using last_content_delta now, and it has stopped doing that.

Finally the model has stopped being unhinged. It is extremely fast and consistent as an agent. gpt-oss:20b and I still doubt the gpt-oss:120b will justify its size with performance, but I guess we will find out in a while once we have the 128GB Macbook Pro. Mine is only 64GB.

@nicholas-johnson-techxcel commented on GitHub (Sep 2, 2025): Okay got it done. The steps are: - Use llamacpp `/completions` (not `/v1/completions` - this is not the same endpoint) - With prompt as list[int] which is list of output tokens from `enc.render_conversation_for_completion(Conversation.from_messages(messages), Role.ASSISTANT)` from openai_harmony library - `tokens=True` in body - Write a gbnf compiler which takes a json schema as input - Rename the "root" rule to something else - `gbnf["root"] = f'thinking-block "{tok_start}" "{tok_assistant}" "{tok_channel}" "final" "{tok_constrain}" "json" "{tok_message}" json-root "{tok_end}"'` where the tok_X are like "<|start|>", etc - `thinking-block` is basically the same thing except with `analysis` in the channel name, and any number of characters which are not "<" as the thought content - Send the request - Parse response tokens through `StreamableParser.process()` one by one - You have a new channel every time you find the token=200002 - use this to split up messages - read `StreamableParser.current_channel` to know if - We have a bug in `StreamableParser.last_content_delta` where we cannot use it because it is contaminated with Harmony tags which should have been filtered out so instead we use `current_content` and diff that from last iteration to get the channel delta. Actually nevermind, turns out I have been using last_content_delta now, and it has stopped doing that. Finally the model has stopped being unhinged. It is extremely fast and consistent as an agent. `gpt-oss:20b` and I still doubt the `gpt-oss:120b` will justify its size with performance, but I guess we will find out in a while once we have the 128GB Macbook Pro. Mine is only 64GB.

GiteaMirror commented

2026-04-12 19:51:56 -05:00

@adamoutler commented on GitHub (Sep 2, 2025):

Re: @nicholas-johnson-techxcel
...

Send the request

Parse response tokens through StreamableParser.process() one by one

You have a new channel every time you find the token=200002 - use this to split up messages - read StreamableParser.current_channel to know if

...

This caught my attention. I asked ChatGPT more about harmony format.

Token ID	Role / Function
199998	Beginning of sequence (BOS)
199999	Padding (PAD)
200000	End of text (EOT)
200001	Reserved special (unused)
200002	End of sequence / Return (EOS)

@adamoutler commented on GitHub (Sep 2, 2025): > Re: @nicholas-johnson-techxcel > ... > * Send the request > * Parse response tokens through `StreamableParser.process()` one by one > * You have a new channel every time you find the token=200002 - use this to split up messages - read `StreamableParser.current_channel` to know if > > ... This caught my attention. I asked ChatGPT more about harmony format. | Token ID | Role / Function | |----------|----------------------------------| | 199998 | Beginning of sequence (BOS) | | 199999 | Padding (PAD) | | 200000 | End of text (EOT) | | 200001 | Reserved special (unused) | | 200002 | End of sequence / Return (EOS) |

GiteaMirror commented

2026-04-12 19:51:57 -05:00

@MarioRicoIbanez commented on GitHub (Sep 3, 2025):

+1 having the same issue

@MarioRicoIbanez commented on GitHub (Sep 3, 2025): +1 having the same issue

GiteaMirror commented

2026-04-12 19:51:57 -05:00

@inf-bud commented on GitHub (Sep 3, 2025):

+1 having the same issue

@inf-bud commented on GitHub (Sep 3, 2025): +1 having the same issue

GiteaMirror commented

2026-04-12 19:51:57 -05:00

@shahidazim commented on GitHub (Sep 3, 2025):

+1 having the same issue

@shahidazim commented on GitHub (Sep 3, 2025): +1 having the same issue

GiteaMirror commented

2026-04-12 19:51:58 -05:00

@ParthSareen commented on GitHub (Sep 3, 2025):

Hey everyone! This is currently being worked on - trying to get it to y'all asap. https://github.com/ollama/ollama/pull/12052

@ParthSareen commented on GitHub (Sep 3, 2025): Hey everyone! This is currently being worked on - trying to get it to y'all asap. https://github.com/ollama/ollama/pull/12052

GiteaMirror commented

2026-04-12 19:51:59 -05:00

@sheneman commented on GitHub (Sep 3, 2025):

@ParthSareen Thank you SO much!

@sheneman commented on GitHub (Sep 3, 2025): @ParthSareen Thank you SO much!

GiteaMirror commented

2026-04-12 19:51:59 -05:00

@MarioRicoIbanez commented on GitHub (Sep 4, 2025):

@ParthSareen Thanks!

@MarioRicoIbanez commented on GitHub (Sep 4, 2025): @ParthSareen Thanks!

GiteaMirror commented

2026-04-12 19:52:00 -05:00

@kiwamizamurai commented on GitHub (Sep 4, 2025):

@ParthSareen so nice

@kiwamizamurai commented on GitHub (Sep 4, 2025): @ParthSareen so nice

GiteaMirror commented

2026-04-12 19:52:00 -05:00

@Seyid-cmd commented on GitHub (Sep 5, 2025):

@ParthSareen thanks

@Seyid-cmd commented on GitHub (Sep 5, 2025): @ParthSareen thanks

GiteaMirror commented

2026-04-12 19:52:02 -05:00

@ParthSareen commented on GitHub (Sep 6, 2025):

You guys can use this branch until I get it into main: https://github.com/ollama/ollama/tree/parth/gpt-oss-structured-outputs 😁

Would also love to know what you use structured outputs for if you do give the branch a shot

@ParthSareen commented on GitHub (Sep 6, 2025): You guys can use this branch until I get it into main: https://github.com/ollama/ollama/tree/parth/gpt-oss-structured-outputs 😁 Would also love to know what you use structured outputs for if you do give the branch a shot

GiteaMirror commented

2026-04-12 19:52:02 -05:00

@adamoutler commented on GitHub (Sep 6, 2025):

Would also love to know what you use structured outputs for if you do give the branch a shot

Analyzing test results and reacting to binary/enum decisions.

How does it work? What did you do with the thinking?

@adamoutler commented on GitHub (Sep 6, 2025): > Would also love to know what you use structured outputs for if you do give the branch a shot Analyzing test results and reacting to binary/enum decisions. How does it work? What did you do with the thinking?

GiteaMirror commented

2026-04-12 19:52:03 -05:00

@asabla commented on GitHub (Sep 6, 2025):

Mostly when interacting with LLMs I want to avoid writing too much fuzzy validation code (e.g making sure all needed data is there). Structured output is basically a very convenient shortcut for doing so. On top of that, most agentic frameworks for building reliable workflows, is using structured output under the hood for the same reasons.

Haven't had the time to test out the feature branch yet, but I'll get back to you when I've done so @ParthSareen

@asabla commented on GitHub (Sep 6, 2025): Mostly when interacting with LLMs I want to avoid writing too much fuzzy validation code (e.g making sure all needed data is there). Structured output is basically a very convenient shortcut for doing so. On top of that, most agentic frameworks for building reliable workflows, is using structured output under the hood for the same reasons. Haven't had the time to test out the feature branch yet, but I'll get back to you when I've done so @ParthSareen

GiteaMirror commented

2026-04-12 19:52:05 -05:00

@sheneman commented on GitHub (Sep 6, 2025):

@ParthSareen The fix you implemented appears to generally work! Structured outputs with gpt-oss are working for me as expected and are present in the content field of the response. Reasoning traces are located in the "thinking" field. This is very improved behavior, and I am very grateful for your help! THANK YOU.

I did have a couple observations, as there still are some inconsistencies with responses from gpt-oss compared to other thinking models:

gpt-oss now provides separate reasoning traces, even if you specify "think": False. This is not a horrible default behavior, but technically it is incorrect and different than other thinking models (e.g. qwen3) which honor the "think" boolean for controlling thinking output as described here: https://ollama.com/blog/thinking
Other thinking models (qwen3) don't behave in the same way as gpt-oss:
a. If you set "thinking": False, there will be no thinking trace (correct for qwen, fails for gpt-oss)
b. If you use thinking mode and structured outputs with qwen3, it still will not emit a thinking trace (BUG)

@sheneman commented on GitHub (Sep 6, 2025): @ParthSareen **_The fix you implemented appears to generally work_**! Structured outputs with gpt-oss are working for me as expected and are present in the content field of the response. Reasoning traces are located in the "thinking" field. This is very improved behavior, and I am _very_ grateful for your help! **THANK YOU**. I did have a couple observations, as there still are some inconsistencies with responses from gpt-oss compared to other thinking models: 1. gpt-oss now provides separate reasoning traces, **_even if you specify "think": False._** This is not a horrible default behavior, but technically it is incorrect and different than other thinking models (e.g. qwen3) which honor the "think" boolean for controlling thinking output as described here: [https://ollama.com/blog/thinking](https://ollama.com/blog/thinking) 2. Other thinking models (qwen3) don't behave in the same way as gpt-oss: a. If you set "thinking": False, there will be no thinking trace (correct for qwen, fails for gpt-oss) b. If you use thinking mode **_and_** structured outputs with qwen3, it still will **_not_** emit a thinking trace (BUG) <img width="688" height="546" alt="Image" src="https://github.com/user-attachments/assets/d19c2f37-11ad-4797-b861-81b6cd63ce9d" /> <img width="700" height="494" alt="Image" src="https://github.com/user-attachments/assets/860a5838-32f0-464b-93b6-3ec474d65d56" />

GiteaMirror commented

2026-04-12 19:52:08 -05:00

@adamoutler commented on GitHub (Sep 7, 2025):

I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.

@adamoutler commented on GitHub (Sep 7, 2025): I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.

GiteaMirror commented

2026-04-12 19:52:09 -05:00

@ParthSareen commented on GitHub (Sep 7, 2025):

@sheneman @adamoutler is correct. the thinking cannot be turned off for gpt-oss - you can only do low medium, and high. And currently my PR only supports gpt-oss as a trial. Going to do thinking models as a whole next!

@ParthSareen commented on GitHub (Sep 7, 2025): @sheneman @adamoutler is correct. the thinking cannot be turned off for gpt-oss - you can only do `low` `medium`, and `high`. And currently my PR only supports gpt-oss as a trial. Going to do thinking models as a whole next!

GiteaMirror commented

2026-04-12 19:52:09 -05:00

@sheneman commented on GitHub (Sep 7, 2025):

@adamoutler @ParthSareen Thank you! While you can't actually turn off thinking in gpt-oss, you could set thinking to "low" and then suppress or mask the thinking trace. This would maintain response format compatibility with other thinking models. I could also see why you would prefer to output the thinking trace since its being generated anyway. It's easy enough to ignore if needed, so not a huge deal either way.

And Thank you @ParthSareen for now attacking the issue of structured outputs X thinking mode in the other models!!! With that, Ollama becomes so much more compelling for our organization!

@sheneman commented on GitHub (Sep 7, 2025): @adamoutler @ParthSareen Thank you! While you can't actually turn off thinking in gpt-oss, you _could_ set thinking to "low" and then suppress or mask the thinking trace. This would maintain response format compatibility with other thinking models. I could also see why you would prefer to output the thinking trace since its being generated anyway. It's easy enough to ignore if needed, so not a huge deal either way. And **Thank you** @ParthSareen for now attacking the issue of structured outputs X thinking mode in the other models!!! With that, Ollama becomes so much more compelling for our organization!

GiteaMirror commented

2026-04-12 19:52:09 -05:00

@nicholas-johnson-techxcel commented on GitHub (Sep 8, 2025):

I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though.

You can force it to emit an empty thinking tag using GBNF if you wish to save time.

@nicholas-johnson-techxcel commented on GitHub (Sep 8, 2025): > I believe with the harmony format, thinking is never turned off which is the problem we are experiencing here. I'm pretty sure this is not fixable on this particular model. It may be on the other one though. You can force it to emit an empty thinking tag using GBNF if you wish to save time.

GiteaMirror commented

2026-04-12 19:52:10 -05:00

@ParthSareen commented on GitHub (Sep 8, 2025):

You can force it to emit an empty thinking tag using GBNF if you wish to save time.

You could but you're breaking the format the model was trained on. From experience the model is very sensitive to breaking the format which results in poor outputs. So your mileage may vary with that.

@ParthSareen commented on GitHub (Sep 8, 2025): > You can force it to emit an empty thinking tag using GBNF if you wish to save time. You could but you're breaking the format the model was trained on. From experience the model is very sensitive to breaking the format which results in poor outputs. So your mileage may vary with that.

GiteaMirror commented

2026-04-12 19:52:10 -05:00

@vishalgoel2 commented on GitHub (Sep 14, 2025):

I tested the fix PR branch for structured outputs and it does improve things — simple structured outputs work now.

However, I’m running into mixed results when using it with browser-use + gpt-oss:20b. With the release version of Ollama, it fails consistently with the familiar

Invalid JSON: expected value at line 1 column 1 [type=json_invalid]

On the fix branch, sometimes it works, but other times I see warnings like this in the logs:

level=WARN source=harmonyparser.go:429 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=browser.extract_structured_data

and then browser-use errors out with

 ("1 validation error for AgentOutput\n  Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]\n    For further information visit https://errors.pydantic.dev/2.11/v/json_invalid", 502)

So it looks like the PR handles some structured output cases, but not all. Not sure yet if browser-use is passing the tool schema in a way Ollama doesn’t expect, or if the fix still misses some scenarios.

@vishalgoel2 commented on GitHub (Sep 14, 2025): I tested the fix PR branch for structured outputs and it does improve things — simple structured outputs work now. However, I’m running into mixed results when using it with [`browser-use`](https://github.com/browser-use/browser-use) + `gpt-oss:20b`. With the release version of Ollama, it fails consistently with the familiar ``` Invalid JSON: expected value at line 1 column 1 [type=json_invalid] ``` On the fix branch, sometimes it works, but other times I see warnings like this in the logs: ``` level=WARN source=harmonyparser.go:429 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=browser.extract_structured_data ``` and then `browser-use` errors out with ``` ("1 validation error for AgentOutput\n Invalid JSON: EOF while parsing a value at line 1 column 0 [type=json_invalid, input_value='', input_type=str]\n For further information visit https://errors.pydantic.dev/2.11/v/json_invalid", 502) ``` So it looks like the PR handles some structured output cases, but not all. Not sure yet if `browser-use` is passing the tool schema in a way Ollama doesn’t expect, or if the fix still misses some scenarios.

GiteaMirror commented

2026-04-12 19:52:11 -05:00

@trebor commented on GitHub (Sep 20, 2025):

i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json scheme in the format field.

@trebor commented on GitHub (Sep 20, 2025): i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json scheme in the format field.

GiteaMirror commented

2026-04-12 19:52:11 -05:00

@ParthSareen commented on GitHub (Sep 20, 2025):

i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json scheme in the format field.

Hi @trebor it's not released yet

@ParthSareen commented on GitHub (Sep 20, 2025): > i'm on ollama v0.12.0 and still seeing the issue. the query takes time, but returns with a zero-length response field. i'm happy to include payload and response text if that is helpful, but it is the typical prompt and json scheme in the format field. Hi @trebor it's not released yet

GiteaMirror commented

2026-04-12 19:52:12 -05:00

@srshkmr commented on GitHub (Sep 24, 2025):

Hi @ParthSareen any ETA on the release? is there changes required on the PR?

@srshkmr commented on GitHub (Sep 24, 2025): Hi @ParthSareen any ETA on the release? is there changes required on the PR?

GiteaMirror commented

2026-04-12 19:52:12 -05:00

@MarioRicoIbanez commented on GitHub (Sep 29, 2025):

Any news on when it will be released?

@MarioRicoIbanez commented on GitHub (Sep 29, 2025): Any news on when it will be released?

GiteaMirror commented

2026-04-12 19:52:13 -05:00

@AlexanderKozhevin commented on GitHub (Oct 2, 2025):

funny thing, structured output does work on Groq cloud

@AlexanderKozhevin commented on GitHub (Oct 2, 2025): funny thing, structured output does work on Groq cloud

GiteaMirror commented

2026-04-12 19:52:13 -05:00

@ParthSareen commented on GitHub (Oct 2, 2025):

Had to make some updates to how we ran it. Just put up another PR. Aiming for next release.

@ParthSareen commented on GitHub (Oct 2, 2025): Had to make some updates to how we ran it. Just put up another PR. Aiming for next release.

GiteaMirror commented

2026-04-12 19:52:14 -05:00

@bogzbonny commented on GitHub (Oct 12, 2025):

haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)

@bogzbonny commented on GitHub (Oct 12, 2025): haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)

GiteaMirror commented

@ParthSareen commented on GitHub (Oct 12, 2025):

> haven't gotten it to work with ollama-rs calling on ollama 0.12.5 - I'm assuming https://github.com/ollama/ollama/pull/12460 doesn't actually fully resolve this issue but is only a stepping stone? (@ParthSareen)

Hmm it should be working... can you try running `ollama run gpt-oss --format json hello!`

and see if it shows thinking + the final output? if so it might be some weird client behavior

@vansatchen commented on GitHub (Oct 12, 2025):

> Hmm it should be working... can you try running `ollama run gpt-oss --format json hello!`
>
> and see if it shows thinking + the final output? if so it might be some weird client behavior

ollama run gpt-oss:20b --format json hello!
We need to respond to ":", basically greet, friendly.Hello! 👋 How can I help you today?Thinking...
We responded.We are done.
...done thinking.

Hey there! What's on your mind today? 😊✨ "}Error: error parsing tool call: raw='Hey there! What’s on your mind today? 😊 <|constrain|> <|constrain|><|constrain|>.} ', err=invalid character 'H' looking for beginning of value

ollama -v
ollama version is 0.12.5

@sheneman commented on GitHub (Oct 12, 2025):

So just to be clear, the fix for this issue has not yet been merged to main, as of 0.12.5?

@ParthSareen commented on GitHub (Oct 12, 2025):

Ah I gave the wrong query @vansatchen @bogzbonny. Run just `ollama run gpt-oss --format json` and then type something to the model.

I haven't updated the generate endpoint yet; there's some refactoring to do. Let me know if this doesn't work.

@vansatchen commented on GitHub (Oct 12, 2025):

> Ah I gave the wrong query @vansatchen @bogzbonny. Run just `ollama run gpt-oss --format json` and then type something to the model.
>
> I haven't updated the generate endpoint yet; there's some refactoring to do. Let me know if this doesn't work.

ollama run gpt-oss --format json
>>> John Dohn 26 yo
Thinking...
The user: "John Dohn 26 yo". Likely they want to talk about health? The user might be asking for medical advice, perhaps about being 
26-year-old male named John Dohn. Maybe they want to know about his health, fitness, sleep, nutrition, mental health. Or it's a user 
profile snippet. The user hasn't asked a specific question. We need to respond appropriately. Usually, we ask clarifying question or 
ask what they need. The user could be prompting for an assessment. The system guidelines: cannot provide medical advice. But can 
provide general wellness tips, encourage professional help. So we can ask: "What can I help you with regarding John Dohn? Are you 
looking for health tips?" Provide general wellness info. Let's do that.
...done thinking.

{"response":"It looks like you’re mentioning a 26‑year‑old male named John Dohn. Could you let me know what you’d like help with? For 
example, are you looking for general wellness and lifestyle advice, or is there a specific concern or goal you have in mind? I’m 
happy to offer general information and resources—just keep in mind that I can’t give personalized medical advice or replace a 
professional consultation."}

>>> Send a message (/? for help)

@bogzbonny commented on GitHub (Oct 12, 2025):

@ParthSareen Okay cool, appreciated. I tried it and got similar output to @vansatchen. I'm not sure how to feed a schema from the CLI, but ollama-rs appears to be using the generate endpoint, hence I think I'm still blocked on the endpoint refactor you mentioned to get this operating.

(also https://github.com/ollama/ollama/pull/12460 was merged into 0.12.5 @sheneman if you look at the commit history)

@trebor commented on GitHub (Oct 16, 2025):

i have been testing ollama 0.12.5 with the most recent gpt-oss:20b, see below for specific examples. is this the expected behavior? is the change maybe still percolating through the system? am i calling it wrong?

curl commands i used to test:

curl 'http://localhost:11434/api/generate' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'

and am still seeing an empty response. for completeness:

{"model":"gpt-oss:20b","created_at":"2025-10-16T22:39:42.787141Z","response":"","done":true,"done_reason":"stop","context":[200006,17360,200008,3575,553,17554,162016,11,261,4410,6439,2359,22203,656,7788,17527,558,87447,100594,25,220,1323,19,12,3218,198,6576,3521,25,220,1323,20,12,702,12,1125,279,30377,289,25,14093,279,2,13888,18403,25,8450,11,49159,11,1721,13,21030,2804,413,7360,395,1753,3176,13,200007,200006,1428,200008,47312,261,2086,198,200007,200006,173781,16,220],"total_duration":639591959,"load_duration":152973209,"prompt_eval_count":71,"prompt_eval_duration":279207292,"eval_count":3,"eval_duration":50198457}%

if i use qwen3:14b, for example:

curl 'http://localhost:11434/api/generate' --data-raw '{"model":"qwen3:14b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"prompt":"choose a number\n"}'

i see what i would expect:

{"model":"qwen3:14b","created_at":"2025-10-16T22:45:54.903555Z","response":"8\n\n","done":true,"done_reason":"stop","context":[151644,872,198,27052,264,1372,198,151645,198,151644,77091,198,23,271],"total_duration":594441292,"load_duration":85178125,"prompt_eval_count":12,"prompt_eval_duration":394819125,"eval_count":3,"eval_duration":89476292}%

@ParthSareen commented on GitHub (Oct 16, 2025):

hi @trebor sorry for the lack of documentation at the moment. It should work for the `/chat` endpoint. Need to do some cleanup on `/generate` before we can support it there
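
For clients that currently build `/api/generate` bodies, one possible stopgap is to translate the request into the `/api/chat` shape before sending it. The following is only a sketch: `generate_to_chat_payload` and `SHARED_KEYS` are hypothetical names, and it assumes that carrying over `model`, `stream`, `format`, `options`, and `keep_alive` is enough for the use case (generate-only fields such as `context` or `raw` are dropped).

```python
import json

# Fields assumed to mean the same thing on both endpoints (hypothetical whitelist).
SHARED_KEYS = ("model", "stream", "format", "options", "keep_alive")


def generate_to_chat_payload(generate_payload: dict) -> dict:
    """Map an /api/generate-style body onto an /api/chat-style body."""
    chat_payload = {k: generate_payload[k] for k in SHARED_KEYS if k in generate_payload}

    messages = []
    # An optional top-level "system" string becomes a system message.
    if generate_payload.get("system"):
        messages.append({"role": "system", "content": generate_payload["system"]})
    # The prompt becomes a single user message.
    messages.append({"role": "user", "content": generate_payload.get("prompt", "")})

    chat_payload["messages"] = messages
    return chat_payload


# The /api/generate body from the curl command earlier in this thread.
generate_body = {
    "model": "gpt-oss:20b",
    "stream": False,
    "format": {"type": "integer", "minimum": 1, "maximum": 10},
    "prompt": "choose a number\n",
}

chat_body = generate_to_chat_payload(generate_body)
print(json.dumps(chat_body))
```

The printed body can then be POSTed to `http://localhost:11434/api/chat` instead of `/api/generate`.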

@trebor commented on GitHub (Oct 16, 2025):

oh got it, thank you!

@ParthSareen commented on GitHub (Oct 16, 2025):

Hey folks it should be out! Closing this issue. It'll work as expected with the `/chat` endpoint. `/generate` will come at some point but might be a bit. Just wanted to unblock everyone!

@dhicks commented on GitHub (Oct 16, 2025):

Could I suggest leaving this open until the issue has been resolved for `/generate` as well?

@trebor commented on GitHub (Oct 16, 2025):

btw: here is a minimal example curl that worked for me:

curl 'http://localhost:11434/api/chat' --data-raw '{"model":"gpt-oss:20b","stream":false,"format":{"type":"integer","minimum":1,"maximum":10},"messages":[{"content": "choose a number", "role": "user"}]}'

huge thanks to @ParthSareen!
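
The same request can be sketched in Python using only the standard library. The helper names (`build_chat_request`, `parse_chat_response`, `chat_structured`) are made up for illustration. The sketch assumes a local Ollama server on the default port and a non-streaming `/api/chat` response whose structured result arrives as a JSON string in `message.content`. An empty `content` is treated as an error, since that was the failure symptom reported earlier in this thread.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_request(model: str, prompt: str, schema: dict) -> dict:
    """Build a non-streaming /api/chat body with a JSON schema in `format`."""
    return {
        "model": model,
        "stream": False,
        "format": schema,
        "messages": [{"role": "user", "content": prompt}],
    }


def parse_chat_response(body: dict):
    """Decode the structured value from message.content."""
    content = body.get("message", {}).get("content", "")
    if not content:
        # Mirrors the zero-length response symptom described in this issue.
        raise ValueError("model returned empty content")
    return json.loads(content)


def chat_structured(model: str, prompt: str, schema: dict,
                    url: str = OLLAMA_CHAT_URL, timeout: float = 120.0):
    """POST the request and return the decoded structured value."""
    data = json.dumps(build_chat_request(model, prompt, schema)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_chat_response(json.load(response))
```

With a server running, `chat_structured("gpt-oss:20b", "choose a number", {"type": "integer", "minimum": 1, "maximum": 10})` would be expected to return an integer. Checking the value against the schema bounds on the client side is still a reasonable precaution.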
