[GH-ISSUE #2963] Add ability to provide options in OpenAI compatibility endpoints #48332

Open
opened 2026-04-28 07:46:56 -05:00 by GiteaMirror · 6 comments

Originally created by @pseudotensor on GitHub (Mar 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2963

It seems one can only set the system prompt and hyperparameters like temperature as part of the model config file. I'm using the OpenAI API, and Ollama ignores the system prompt and such hyperparameters when they are passed in the request. AFAIK there's no good reason for this.

Am I missing something?
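For concreteness, a minimal sketch of the kind of request being described here, using the OpenAI Python SDK pointed at Ollama's compatibility endpoint (the model name and option values are placeholders; per this issue, the Ollama-specific options are not honored):

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the api_key is unused by Ollama but
# required by the SDK.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama2",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Hello"},
    ],
    temperature=0.1,
    # Ollama-specific options sent as extra body fields -- per this issue,
    # the /v1 compatibility layer does not read them.
    extra_body={"options": {"num_ctx": 16384, "seed": 123}},
)
print(resp.choices[0].message.content)
```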

GiteaMirror added the compatibility, feature request, api labels 2026-04-28 07:46:59 -05:00

@j-schreuder commented on GitHub (Mar 16, 2024):

Somehow I thought this was already a thing and had been spending quite a few hours trying different models, prompt fine-tuning, etc. instead. It turns out prompts are simply being clipped: the context window still defaults to 2048 even with models supporting 16K+, because request options are ignored. Hoping it'll be a quick feature to resolve, since it's already supported on the other endpoints.


@AndreasKarasenko commented on GitHub (Jul 19, 2024):

Don't mean to be pushy, but is there any news on this?


@tisfeng commented on GitHub (Mar 29, 2025):

Hello, we need this feature. Is anyone going to push this forward?


@basnijholt commented on GitHub (Jun 30, 2025):

I've opened [PR #11249](https://github.com/ollama/ollama/pull/11249/) that implements `options` support for the OpenAI API endpoints. This adds support for `think` and `keep_alive` parameters and establishes a foundation for future parameter exposure. Would appreciate any feedback!
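For illustration, a sketch of how those two parameters might be passed from the OpenAI Python client once such support lands; the exact request shape is defined by the PR, so the field names and placement shown here (top-level extra body fields) are an assumption:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen3",  # placeholder; a model with thinking support
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    # Assumed shape: the PR is described as adding `think` and `keep_alive`,
    # but whether they are accepted as top-level fields like this is not
    # confirmed here.
    extra_body={"think": True, "keep_alive": "10m"},
)
```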


@flange-ipb commented on GitHub (Jul 21, 2025):

**Edit: I'm sorry, this is wrong.** The `"params"` keyword does not work for Ollama's OpenAI compatibility endpoints. On the other hand, this example is valid for Open WebUI's [chat completions endpoint](https://docs.openwebui.com/getting-started/api-endpoints#-chat-completions) (OpenAI client's `base_url` is `http://localhost:3000/api`).

---

I just discovered that the magic keyword to be passed via OpenAI's `extra_body` is `"params"`.

Example:

```python
from openai import OpenAI

# Open WebUI's OpenAI-compatible endpoint (see the edit above); the api_key
# is a placeholder for an Open WebUI API key.
sync_client = OpenAI(base_url="http://localhost:3000/api", api_key="sk-...")

MODEL = "llama3"            # placeholder model name
CONTEXT_LENGTH = 16384      # placeholder context size
prompt = "Hello"

sync_client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": prompt,
        },
    ],
    extra_body={
        "params": {"num_ctx": CONTEXT_LENGTH, "seed": 123, "temperature": 0.1},
    },
)
```

@ShyamKadari commented on GitHub (Apr 22, 2026):

Adding a production reproduction case for `num_ctx` specifically, since [#11249](https://github.com/ollama/ollama/pull/11249) currently scopes to `think` and `keep_alive` but not `num_ctx`.

**Environment:** Ollama 0.21.0, gemma3:27b-it-q4_K_M (128K native context capability), RTX 4090 Laptop 16 GB, OpenAI SDK via `self._client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")`.

**Reproduction:**

```bash
# Native /api/generate honors num_ctx ✓
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:27b-it-q4_K_M",
  "prompt": "hi",
  "options": {"num_ctx": 32768}
}'
# Ollama log: KvSize:32768

# OpenAI-compat /v1/chat/completions ignores num_ctx ✗
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "gemma3:27b-it-q4_K_M",
  "messages": [{"role":"user","content":"hi"}],
  "options": {"num_ctx": 32768}
}'
# Ollama log: KvSize:4096 (modelfile default)
```

Silent failure mode: every prompt >4096 tokens gets truncated without any error surfaced to the OpenAI client. A 30,698-char prompt (~7,675 tokens) becomes `limit=4096 prompt=30698 keep=4 new=4096` in Ollama's log — the model generates from the truncated input as if nothing happened.

**Impact on our RAG pipeline:**

- Our data-quality RAG system assembles 30-50K char prompts from retrieved schema/rules/profile chunks
- With the OpenAI-compat endpoint dropping `num_ctx`, >99% of our RAG context is silently discarded
- Discovered only after shipping comprehensive observability that captured `total_prompt_chars` sent vs what Ollama actually processed. Would have gone unnoticed indefinitely otherwise (a sketch of such a guard is below).
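As an aside, a minimal sketch of the kind of client-side guard the last bullet describes; the chars-per-token ratio is a rough assumption, not a measured value:

```python
def check_prompt_fits(prompt: str, num_ctx: int, chars_per_token: float = 4.0) -> None:
    """Warn when a prompt likely exceeds the server-side context window.

    Rough heuristic only: real token counts depend on the tokenizer, so this
    guards against silent truncation rather than measuring it exactly.
    """
    est_tokens = int(len(prompt) / chars_per_token)
    if est_tokens > num_ctx:
        print(
            f"WARNING: prompt is ~{est_tokens} tokens but num_ctx={num_ctx}; "
            "Ollama will truncate it without surfacing an error."
        )

# The case from the comment above: a 30,698-char prompt against a 4096 window.
check_prompt_fits("x" * 30_698, num_ctx=4096)
```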

**Current workaround:** `PARAMETER num_ctx 32768` baked into the modelfile via `ollama create`. It works, but it's per-machine state that new deploys/CI runners have to replicate, and it gets overwritten by `ollama pull`. We automated it as `setup-ollama-models.sh` in our repo as a stopgap.
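For reference, a minimal sketch of what such a setup step could look like; the derived model name and script structure are illustrative, not taken from the actual `setup-ollama-models.sh`:

```python
# Hypothetical stopgap: bake num_ctx into a derived model so the OpenAI-compat
# endpoint inherits it from the modelfile defaults. Must be re-run whenever the
# base model is pulled again.
import pathlib
import subprocess
import tempfile

MODELFILE = """\
FROM gemma3:27b-it-q4_K_M
PARAMETER num_ctx 32768
"""

def create_derived_model(name: str = "gemma3-27b-32k") -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "Modelfile"
        path.write_text(MODELFILE)
        # `ollama create <name> -f <Modelfile>` registers the derived model.
        subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)

if __name__ == "__main__":
    create_derived_model()
```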

**Why we can't just switch to `/api/chat`:** Our client library targets OpenAI-spec providers across the stack (OpenAI, Anthropic via OpenRouter, Azure, Together, vLLM, Ollama). Forking Ollama onto a separate code path for this one bug breaks provider portability for every other provider.

**Suggestion for [#11249](https://github.com/ollama/ollama/pull/11249):** `num_ctx` is a MUCH higher-value option to expose than `think` or `keep_alive` for production RAG/long-context users. It would resolve this issue entirely. Would be great to see it added to the same PR if possible, or a close follow-up. Happy to test against a build.

Reference: github-starred/ollama#48332