[PR #14288] server: structured outputs for thinking models in /api/generate #45851

Open
opened 2026-04-25 01:28:09 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14288
Author: @veeceey
Created: 2/17/2026
Status: 🔄 Open

Base: main ← Head: fix/issue-11691-structured-output


📝 Commits (2)

  • 63cffe7 server: add structured outputs support for thinking models in /api/generate
  • e2ca382 Merge branch 'main' into fix/issue-11691-structured-output

📊 Changes

3 files changed (+202 additions, -60 deletions)


📝 middleware/openai_test.go (+26 -0)
📝 openai/openai_test.go (+40 -0)
📝 server/routes.go (+136 -60)

📄 Description

What

Adds the double-request structured outputs approach to the /api/generate endpoint, matching the existing implementation in /api/chat.

Why

The /api/chat endpoint got structured output support for thinking models (gpt-oss, etc.) in #12460, but /api/generate was left behind. This meant that any client using the generate endpoint with a format constraint on a thinking model would get either garbage output or empty responses — the format grammar was being applied during the thinking phase, corrupting the output.

Several users in #11691 confirmed that ollama-rs and other clients that go through /api/generate are still broken, even though /api/chat works.

How

Same pattern as the chat handler:

  1. When a format constraint is set and the model has thinking capability, the first completion request runs without the format constraint so the model can think freely.
  2. Once the parser detects that thinking is done and content is starting, the request is cancelled.
  3. The prompt is rebuilt with the thinking content appended as an assistant message (using chatPrompt), and for Harmony models the <|end|><|start|>assistant<|channel|>final<|message|> header is added.
  4. A second completion request runs with the format constraint active, producing valid structured output.

This only activates when:

  • A format constraint is set (req.Format != nil)
  • The model supports thinking (model.CapabilityThinking)
  • A builtin parser or thinking state parser is active
  • We're in the chat-like flow (have messages to rebuild the prompt)

For non-thinking models or raw mode, behavior is unchanged — the format is applied directly as before.

Tests

  • Added test for json_schema response format conversion in openai/openai_test.go
  • Added test for json_schema request handling in middleware/openai_test.go
  • All existing tests pass: go test ./server/ ./openai/ ./middleware/

Fixes #11691


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 01:28:09 -05:00
Reference: github-starred/ollama#45851