[GH-ISSUE #5356] allow for num_ctx parameter in the openai API compatibility #49865

Closed
opened 2026-04-28 13:14:22 -05:00 by GiteaMirror · 13 comments

Originally created by @PabloRMira on GitHub (Jun 28, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5356

Originally assigned to: @ParthSareen on GitHub.

The OpenAI compatibility module does not allow setting the context window size (`num_ctx`) dynamically via an API call; instead, we have to adjust the Modelfile each time we want to use a different context window.

Therefore it would be great to have this in the OpenAI compatibility layer. I can also try a PR for this.
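
For concreteness, this is the kind of call I have in mind with the `openai` Python client; the `options` pass-through in `extra_body` is hypothetical and not currently supported:

```python
# Hypothetical sketch: what passing num_ctx through the OpenAI-compatible
# endpoint could look like. The "options" key in extra_body is NOT currently
# honored by Ollama; it only illustrates the request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",  # illustrative model name
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"options": {"num_ctx": 8192}},  # hypothetical pass-through
)
```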

Thanks a lot for this wonderful project! :-)

GiteaMirror added the feature request label 2026-04-28 13:14:22 -05:00

@lazarust commented on GitHub (Jul 15, 2024):

+1 to this from me! This would be helpful.


@Atakey commented on GitHub (Jul 30, 2024):

+1


@alexander-potemkin commented on GitHub (Aug 6, 2024):

There seems to be a pull request for that already. Is there any help needed to get it merged?


@hawktang commented on GitHub (Aug 27, 2024):

+1


@hawktang commented on GitHub (Aug 27, 2024):

We need a way to set `num_ctx` directly.


@pdevine commented on GitHub (Sep 15, 2024):

Unfortunately OpenAI's API doesn't have a way to do this, and we can't modify the `num_ctx` parameter directly with their API. I did write up a [doc](https://github.com/ollama/ollama/blob/main/docs/openai.md#setting-the-context-size) which explains how to accomplish this, though. Hopefully this makes sense. I'll close the issue, but feel free to keep commenting.
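
In short, the doc's approach is to bake the context size into a derived model, which you then address by name from the OpenAI client. A minimal sketch, assuming a `llama3` base model and driving the `ollama` CLI from Python (model names are illustrative):

```python
# Sketch of the Modelfile approach from the doc above: derive a new model
# with a larger context window via `ollama create`, then call the new model
# ("llama3-32k" here) from the OpenAI-compatible endpoint as usual.
import subprocess
import tempfile

modelfile = "FROM llama3\nPARAMETER num_ctx 32768\n"

with tempfile.NamedTemporaryFile("w", suffix=".Modelfile", delete=False) as f:
    f.write(modelfile)
    path = f.name

subprocess.run(["ollama", "create", "llama3-32k", "-f", path], check=True)
```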


@kunchenguid commented on GitHub (Jan 5, 2025):

Can we use a custom header to specify this? Having to create a Modelfile makes it quite difficult for tool builders to make their tools work out of the box for end users.


@ParthSareen commented on GitHub (Apr 16, 2025):

should be closed with https://github.com/ollama/ollama/pull/8938


@Pablo1107 commented on GitHub (Nov 4, 2025):

I think this issue was wrongly closed: the solution was to add an env var for specifying `num_ctx`, rather than adding the parameter to the OpenAI compat API. Maybe we can re-open this issue?


@ParthSareen commented on GitHub (Nov 4, 2025):

> I think this issue was wrongly closed: the solution was to add an env var for specifying `num_ctx`, rather than adding the parameter to the OpenAI compat API. Maybe we can re-open this issue?

What's the issue with using the env var?
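
For reference, the env var is set when the server starts and applies server-wide. A minimal sketch, assuming the `OLLAMA_CONTEXT_LENGTH` variable from that PR:

```python
# Sketch: start `ollama serve` with a larger default context window.
# OLLAMA_CONTEXT_LENGTH is the variable referenced above (#8938); it applies
# to the whole server, not to individual requests or models.
import os
import subprocess

env = dict(os.environ, OLLAMA_CONTEXT_LENGTH="8192")
subprocess.Popen(["ollama", "serve"], env=env)
```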


@Pablo1107 commented on GitHub (Nov 5, 2025):

> > I think this issue was wrongly closed: the solution was to add an env var for specifying `num_ctx`, rather than adding the parameter to the OpenAI compat API. Maybe we can re-open this issue?
>
> What's the issue with using the env var?

Some applications that integrate with Ollama cannot set this parameter when asking it to load a model, making onboarding a little more difficult than it needs to be.

Apart from that, the feature [is already available on the main API](https://docs.ollama.com/faq#how-can-i-specify-the-context-window-size%3F), so why would it not be on the OpenAI compat API as well?
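
For example, the native endpoint takes it per request. A minimal sketch (model name illustrative):

```python
# Per-request context window on the native API, as described in the FAQ.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",            # illustrative
        "prompt": "Why is the sky blue?",
        "stream": False,                # single JSON response
        "options": {"num_ctx": 8192},   # honored here, per request
    },
)
print(resp.json()["response"])
```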


@maks commented on GitHub (Feb 22, 2026):

> > I think this issue was wrongly closed: the solution was to add an env var for specifying `num_ctx`, rather than adding the parameter to the OpenAI compat API. Maybe we can re-open this issue?
>
> What's the issue with using the env var?

@ParthSareen how exactly would I use the env var to set different max context sizes for different models that I have loaded simultaneously?


@geoffsdesk commented on GitHub (Apr 4, 2026):

### Real-world impact data: silent `num_ctx` truncation destroys large-context workloads

Adding benchmark data that quantifies the impact of this issue. We're building a domain-expertise skill (~90K tokens of structured knowledge) for GKE upgrade operations. We benchmark it with an eval suite of 40 evaluations / 348 graded assertions, testing with-skill vs without-skill.

**The setup:** Gemma 4 (27B, Q4_K_M) via Ollama, with `num_ctx: 131072` passed in the `options` field.

#### Results using `/v1/chat/completions` (broken)

| Metric | Score |
|--------|-------|
| With Skill | 42.2% (147/348) |
| Without Skill | 36.8% (128/348) |
| **Delta** | **+5.4%** |

The skill provided almost no lift. We initially attributed this to Gemma 4 being less capable than Claude Sonnet.

#### Results after switching to native `/api/chat` (fixed)

| Metric | Score |
|--------|-------|
| With Skill | 69.5% (242/348) |
| Without Skill | 31.3% (109/348) |
| **Delta** | **+38.2%** |

Same model, same skill, same hardware. The only change was the API endpoint.

#### Why this is hard to detect

- **No error returned.** HTTP 200, valid JSON, plausible-sounding content.
- **No truncation warning.** Nothing in the response indicates the context was silently cut from ~90K tokens to ~4K.
- **Degraded output looks like a less capable model**, not a misconfiguration. Without a structured benchmark, you'd never know.

The evals that test topics appearing *later* in the skill document were hit hardest (0-12% with the broken endpoint, 62-87% with native), confirming the input was truncated rather than the model failing to reason over it.

#### The workaround

We switched from `/v1/chat/completions` to `/api/chat`, which properly respects the `options.num_ctx` field:

```python
import requests

base_url = "http://localhost:11434"  # default Ollama address
messages = [{"role": "user", "content": "hello"}]  # illustrative; real runs send the ~90K-token skill prompt

# Native endpoint - options are respected
requests.post(f"{base_url}/api/chat", json={
    "model": "gemma4",
    "messages": messages,
    "stream": False,
    "options": {
        "num_ctx": 131072,   # context window actually honored here
        "num_predict": 8192,
    },
})
```

This works, but the OpenAI-compatible endpoint should either (a) respect `num_ctx` when passed in `options`/`extra_body`, or (b) return an error or warning rather than silently truncating. Silent data loss is the worst failure mode for any API.

Related: #2963 (PR #11249 in review to expose native params through the OpenAI endpoint)

Reference: github-starred/ollama#49865