[GH-ISSUE #11994] think=None Not formatted correctly when using gpt-oss:20b in API #85652

Open
opened 2026-05-10 00:43:03 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @civilwargeeky on GitHub (Aug 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11994

What is the issue?

According to the description of https://github.com/ollama/ollama/pull/10584, leaving "think" blank or sending "think=None" through the API should result in legacy behavior: tags like <think> and </think> should be left untouched in the message.content field, and message.thinking will be None.

This seems to work for models like Qwen3 (tested with qwen3:14b-q4_K_M), but not for gpt-oss:20b. gpt-oss:20b still returns the "thinking" and "content" fields separately.
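For reference, a minimal sketch of what a client typically does with the legacy format described above, assuming the model emits a single inline `<think>...</think>` pair inside `message.content`. The helper name `split_legacy_thinking` is hypothetical and is not part of Ollama or its Python client.

```python
import re
from typing import Optional, Tuple

# Matches one inline <think>...</think> block as it appears in legacy-style content.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_legacy_thinking(content: str) -> Tuple[Optional[str], str]:
    """Split an inline <think> block out of a content string (hypothetical helper).

    Returns (thinking, answer); thinking is None when no tag pair is present.
    """
    match = _THINK_RE.search(content)
    if match is None:
        return None, content
    thinking = match.group(1).strip()
    # Everything outside the matched tag pair is treated as the visible answer.
    answer = (content[: match.start()] + content[match.end():]).strip()
    return thinking, answer
```

With gpt-oss there are no such tags in the content to split on, which is why the same client code behaves differently for that model.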

This is different from https://github.com/ollama/ollama/issues/11751, as I don't want it to stop thinking; I just want the response to be formatted appropriately.

This is important to me because I want to early-stop after thinking (by providing an additional stop sequence), then pre-fill the assistant's response before continuing with constrained output (using format=MyJson).

I am using the latest Docker image and Python API version.

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Other, Intel

Ollama version

0.11.5 through docker (run through apptainer) - Python API 0.5.3

GiteaMirror added the bug label 2026-05-10 00:43:03 -05:00

@drifkin commented on GitHub (Aug 20, 2025):

hmm, so gpt-oss behaves pretty differently from other thinking models in that the thinking content isn't contained "within" the normal content, but instead is a sort of "peer" message type.

A few thoughts for the use case you're talking about: if you added the additional stop token like so:

FROM "gpt-oss:20b"

PARAMETER stop "<|end|>"

then the parsed thinking content should come back and will stop just before the model would output non-thinking content. We also support both thinking and non-thinking prefill in the API, so you should be able to make a request where the assistant message is the final message (rather than a user message), and you can provide the thinking content you got from the partial generation, along with whatever content you want to use as the assistant's prefill. If I'm understanding what you're trying to do correctly, I think this should work.
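The two-step flow described above could be sketched as the following request builders. This is only an illustration under stated assumptions: it passes the stop sequence per request through `options.stop` instead of a Modelfile, and it assumes that `format` can be combined with a trailing assistant message, which the thread does not confirm. The function names are hypothetical.

```python
from typing import Any, Dict, Optional

MODEL = "gpt-oss:20b"


def thinking_only_request(user_prompt: str) -> Dict[str, Any]:
    """Step 1: ask for a reply that stops just before non-thinking content."""
    return {
        "model": MODEL,
        "stream": False,
        "messages": [{"role": "user", "content": user_prompt}],
        # Assumption: a per-request stop sequence acts like `PARAMETER stop`.
        "options": {"stop": ["<|end|>"]},
    }


def prefill_request(
    user_prompt: str,
    thinking: str,
    prefill: str,
    schema: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Step 2: resend the thinking plus a content prefill as the final message."""
    request: Dict[str, Any] = {
        "model": MODEL,
        "stream": False,
        "messages": [
            {"role": "user", "content": user_prompt},
            # The trailing assistant message carries both the earlier thinking
            # and the text the assistant's answer should start with.
            {"role": "assistant", "thinking": thinking, "content": prefill},
        ],
    }
    if schema is not None:
        # Assumption: constrained output can be requested alongside prefill.
        request["format"] = schema
    return request
```

The thinking text passed to `prefill_request` would come from the `message.thinking` field of the first response.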

Alternatively, maybe you want even more control, in which case using raw in api/generate might be what you're after; the main catch is that you would become responsible for the full prompt. I added https://github.com/ollama/ollama/pull/11875 recently to help debug templates, but you could "abuse" that to render the template as normal, and then handle everything else yourself. That option is very much subject to break/change in the future, but could be used to experiment. But this seems a lot more complicated than using the prefill.
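A rough sketch of the raw option, assuming the caller has already produced the fully rendered prompt by some other means. Nothing here renders the gpt-oss template; the function name is hypothetical and the prompt string used in any call is the caller's responsibility.

```python
from typing import Any, Dict


def raw_generate_request(full_prompt: str, model: str = "gpt-oss:20b") -> Dict[str, Any]:
    """Build an api/generate body that bypasses server-side templating."""
    return {
        "model": model,
        # With raw enabled the prompt is sent to the model as-is, so it must
        # already contain every special token the model expects.
        "prompt": full_prompt,
        "raw": True,
        "stream": False,
    }
```

Because no template is applied, the output also comes back unparsed, so separating thinking from content is left to the client.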


@civilwargeeky commented on GitHub (Aug 21, 2025):

Hi drifkin, thanks for the response. I see how that would complicate the legacy thinking behavior for GPT-OSS: there would be no visible tokens to separate the thinking from the response, the way </think> does.

Thanks to your suggestion, I was able to early-stop after thinking (I hadn't realized that gpt-oss used a different message type in its template).

I also hadn't realized that Ollama allows for prefill options, which may make this easier for me! But I don't see it anywhere in the documentation, or in api/types.go or api/client.go. I do see prefill elements in the GPT-OSS template, but no info on how to populate them?

This still doesn't match the legacy behavior of any other model though, complicating my testing, and as no models other than Deepseek, Qwen3, and GPT-OSS support the new thinking separation, I think for now I will just skip GPT-OSS in my tests : \

Thanks for the help!


@drifkin commented on GitHub (Aug 22, 2025):

yeah I don't think it's currently documented anywhere, would definitely like to improve that soon. I think that's due to it being more of a convention that our official templates tend to have, rather than a core feature, but it's pretty useful. You just provide a trailing assistant message like so:

curl http://localhost:11434/api/chat -d '{
  "stream": false,
  "model": "gpt-oss:20b",
  "think": "low",
  "messages": [
    { "role": "user", "content": "Tell me a story in two sentences" },
    { "role": "assistant", "thinking": "Need 2 sentences story.", "content": "This is a surprising story about potatoes:" }
  ]
}'

This generally works with other models too, but in gpt-oss I made it possible for prefill to work with thinking only as well (i.e., only providing the beginning of thinking, but then letting the model think more before outputting content after). The feature definitely deserves documentation and another pass on making sure other models are consistent.
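A Python sketch of the thinking-only variant, using only the standard library. It assumes that leaving `content` out of the trailing assistant message is how the thinking-only case is expressed; the thread does not spell out the exact shape. The function names are hypothetical, and `post_chat` assumes a server at the same local address as the curl example.

```python
import json
import urllib.request
from typing import Any, Dict


def thinking_prefill_request(user_prompt: str, thinking_start: str) -> Dict[str, Any]:
    """Build a chat body whose last message only starts the model's thinking."""
    return {
        "model": "gpt-oss:20b",
        "stream": False,
        "think": "low",
        "messages": [
            {"role": "user", "content": user_prompt},
            # Assumption: no "content" key means the model keeps thinking
            # from this text before it writes any answer.
            {"role": "assistant", "thinking": thinking_start},
        ],
    }


def post_chat(payload: Dict[str, Any], base_url: str = "http://localhost:11434") -> Dict[str, Any]:
    """POST a body to api/chat and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as `post_chat(thinking_prefill_request("Tell me a story in two sentences", "Need 2 sentences story."))` would then return a reply whose `message` holds the `thinking` and `content` fields.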

That makes sense re: legacy testing since first-class thinking support is new. Let me know if you run into any other issues or have thoughts around prefill etc


Reference: github-starred/ollama#85652