[GH-ISSUE #7814] Flag to prevent infinite generation in Ollama API #4999

Closed
opened 2026-04-12 16:04:00 -05:00 by GiteaMirror · 7 comments

Originally created by @gwpl on GitHub (Nov 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7814

Problem statement: It seems that "max tokens" is currently a per-model parameter rather than a standard Ollama API parameter that is guaranteed to always work. This makes it harder for integrations that keep adopting new models to provide a failsafe against infinite inference.

I am playing with SmolLM2 models and they end up in infinite generation loops fairly often...

I see that setting a maximum token limit is currently model-dependent.

I wondered if we could have some failsafe flags that limit generation either by token count or by resources (CPU time?),
so the server stops computation after a certain limit.
Since the goal is a failsafe, it does not have to be exact (e.g. one sets 128000 tokens and it generates 132000 tokens... that's fine, as long as the server starts stopping inference as soon as it realizes the threshold was crossed, to prevent infinite inference...)
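
For what it's worth, both kinds of limit can be roughly approximated client-side today with standard shell tools. This is only a sketch: the chunk count is a crude proxy for tokens (the streamed NDJSON carries roughly one token per line), and it assumes the server cancels generation once the streaming client disconnects:

```console
$ # wall-clock failsafe: kill the request after 60 seconds
$ timeout 60 curl -sN localhost:11434/api/generate \
    -d '{"model":"smollm2:135m","prompt":"why is the sky blue?"}'

$ # soft token failsafe: close the stream after ~200 chunks (inexact by design)
$ curl -sN localhost:11434/api/generate \
    -d '{"model":"smollm2:135m","prompt":"why is the sky blue?"}' | head -n 200
```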

GiteaMirror added the feature request label 2026-04-12 16:04:00 -05:00

@rick-github commented on GitHub (Nov 24, 2024):

`max_tokens` is a standard OpenAI API parameter that should always work, and should be independent of model:

```console
$ curl -s localhost:11434/v1/chat/completions -d '{"model":"smollm2:135m","messages":[{"role":"user","content":"why is the sky blue?"}],"max_tokens":10}' | jq
{
  "id": "chatcmpl-383",
  "object": "chat.completion",
  "created": 1732450769,
  "model": "smollm2:135m",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The color blue has been one of the most recognizable"
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 36,
    "completion_tokens": 10,
    "total_tokens": 46
  }
}
```

Under what conditions does this not work correctly for you?


@gwpl commented on GitHub (Nov 24, 2024):

Yes! It should!

That's why maybe it's a short path to making it official, and part of the requirements that developers can rely on and build upon?

Currently I don't see `max_tokens` in https://github.com/ollama/ollama/blob/a820d2b2673f7f8035e3a2a6f93c83af465f841c/docs/api.md

It should probably also be checked against regressions at some point with a test, probably somewhere in `main/server/*_test.go`.
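
As a rough illustration, a black-box version of that check can be sketched in shell (assumptions: a server running on localhost:11434, `smollm2:135m` pulled, and `jq` installed; a real regression test would live in Go next to the existing server tests):

```console
$ # assert that num_predict actually caps generation
$ curl -s localhost:11434/api/generate \
    -d '{"model":"smollm2:135m","prompt":"why is the sky blue?","options":{"num_predict":10},"stream":false}' \
    | jq -e '.eval_count <= 10 and .done_reason == "length"' > /dev/null \
    && echo PASS || echo FAIL
```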


@rick-github commented on GitHub (Nov 24, 2024):

The doc you link to is for the ollama API, not the OpenAI API. For the OpenAI API, the goal is to be as [close](https://github.com/ollama/ollama/issues/7125) to the official standard as practical, so `max_tokens` is official and developers can rely on it. For the ollama API, the field you want is [`num_predict`](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values:~:text=tfs_z%201-,num_predict):

```console
$ curl -s localhost:11434/api/generate -d '{"model":"smollm2:135m","prompt":"why is the sky blue?","options":{"num_predict":10},"stream":false}' | jq 'del(.context)'
{
  "model": "smollm2:135m",
  "created_at": "2024-11-24T12:30:39.499798524Z",
  "response": "The sky appears blue because of tiny particles in our",
  "done": true,
  "done_reason": "length",
  "total_duration": 39702355,
  "load_duration": 12255233,
  "prompt_eval_count": 36,
  "prompt_eval_duration": 1000000,
  "eval_count": 10,
  "eval_duration": 25000000
}
```

Note that the description of the default value is incorrect; there is an open [PR](https://github.com/ollama/ollama/pull/7693) to fix that.
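
A related note: the closest existing mechanism to a persistent cap is baking `num_predict` into a derived model via a Modelfile (a sketch; the model name `smollm2-capped` is just an example, and it assumes `smollm2:135m` is already pulled):

```console
$ cat > Modelfile <<'EOF'
FROM smollm2:135m
PARAMETER num_predict 128
EOF
$ ollama create smollm2-capped -f Modelfile
$ ollama run smollm2-capped "why is the sky blue?"
```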


@gwpl commented on GitHub (Nov 26, 2024):

I read you.

I clarified in the description that, since this is the Ollama project, we are talking about the Ollama API: adding this parameter/feature to the Ollama API.


@rick-github commented on GitHub (Nov 26, 2024):

[`num_predict`](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values:~:text=tfs_z%201-,num_predict)


@gwpl commented on GitHub (Nov 26, 2024):

Thank you for considering my feature request. I would like to emphasize the importance of this change and provide additional rationale for it.

https://github.com/ollama/ollama/blob/2cd11ae365a9423578069457312dce6b9e1e5a37/docs/api.md

The current API documentation does not guarantee the `num_predict` parameter as a standard parameter for LLM or token-based models. This lack of guarantee makes it challenging for builders who rely on this parameter for functionality, as it cannot be depended upon consistently.

However, I suspect that the `num_predict` parameter is always present for all token-based models, but it is not explicitly documented or enforced. To address this, I propose the following actions:

1. **Documentation Update**: Clearly document the `num_predict` parameter for all token-based models in the API documentation. This will provide builders with the confidence that this parameter is always available and can be reliably used.
2. **Implementation of Failsafe Flags**: Introduce failsafe flags to limit generation based on tokens or resources (e.g., CPU time). These flags will enable the server to stop computation once a specified threshold is crossed, providing a more reliable failsafe mechanism (a client-side approximation is sketched after this list).
3. **Unit Testing**: Add unit tests to validate the presence and functionality of the `num_predict` parameter for all token-based models. This will ensure that the parameter is always present and functioning as expected, preventing regressions that could impact production systems reliant on this parameter.
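
To make the failsafe-flags idea concrete, here is a hypothetical client-side wrapper that packages a wall-clock limit and a soft token limit behind a single call, roughly the way built-in flags might behave (a sketch only: `capped_generate` is not an existing Ollama feature, and the chunk count only approximates the token count):

```console
$ # $1 = wall-clock limit in seconds, $2 = soft chunk limit, $3 = request body
$ capped_generate() {
>   timeout "$1" curl -sN localhost:11434/api/generate -d "$3" | head -n "$2"
> }
$ capped_generate 60 200 '{"model":"smollm2:135m","prompt":"why is the sky blue?"}'
```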

By enforcing the `num_predict` parameter in the documentation and through unit tests, we can ensure its reliability and prevent future issues. This change will enhance the robustness of the Ollama API and provide a more consistent and reliable experience for developers.

I appreciate your consideration of this request and look forward to your positive response. If there are any specific concerns or further requirements, please let me know.


@rick-github commented on GitHub (Nov 26, 2024):

1. The [Parameters](https://github.com/ollama/ollama/blob/main/docs/api.md#parameters) section of the API doc refers readers to the [parameters](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values) section of the Modelfile, which documents `num_predict`.
2. [`num_predict`](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values:~:text=tfs_z%201-,num_predict)
3. https://github.com/ollama/ollama/blob/71e6a0d0d181e3be45f3e47a677d088479d73c76/parser/parser_test.go#L483