[GH-ISSUE #8786] Prediction aborted due to token repeat limit reached error in granite3.1-dense:8b #5705

Closed
opened 2026-04-12 17:00:02 -05:00 by GiteaMirror · 10 comments
Owner

Originally created by @ALLMI78 on GitHub (Feb 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8786

What is the issue?

I am using the ollama API (v0.5.7) with the granite3.1-dense:8b-instruct-q6_K model. Although the model generally performs well, I occasionally encounter an error where the response returns a JSON output containing multiple `<fim_prefix>` tokens instead of a valid answer.

response >{"model":"granite3.1-dense:8b-instruct-q6_K","created_at":"2025-02-03T10:02:35.4537802Z","message":{"role":"assistant","content":"\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e\u003cfim_prefix\u003e"},"done":false}<

In the logs, I see:

```
time=2025-02-03T11:02:35.503+01:00 level=DEBUG source=server.go:816 msg="prediction aborted, token repeat limit reached"
[GIN] 2025/02/03 - 11:02:35 | 200 |   18.2363314s |       127.0.0.1 | POST     "/api/chat"
```

I suspect this error might be related to exceeding the context length (currently set at 32768 tokens), but what exactly does "token repeat" mean? Since (as far as I know) Ollama does not provide a built-in method to count tokens before sending a request, I am unable to trim or control the context length dynamically, which may be causing the system to abort the prediction.
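
For reference, the final (non-streaming) response does report `prompt_eval_count` and `eval_count`, which at least lets a client see how many tokens the prompt and reply consumed after the fact; a minimal sketch using the ollama Python client:

```python
#!/usr/bin/env python3
# Minimal sketch (assumes the ollama Python client): the final, non-streaming
# response reports how many tokens the prompt and the reply consumed.
import ollama

response = ollama.chat(
    model="granite3.1-dense:8b-instruct-q6_K",
    messages=[{"role": "user", "content": "Say hello."}],
    options={"num_ctx": 32768},
)
# prompt_eval_count = prompt tokens evaluated (may be lower if part of the
# prompt was served from cache); eval_count = tokens generated.
print(response["prompt_eval_count"], response["eval_count"])
```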

  • Are there recommended workarounds or configuration adjustments (e.g., trimming the conversation history, parameter tuning) to mitigate this issue?
  • Would it be possible to implement or expose a token counting mechanism to avoid exceeding the limit?
  • Is there any additional debug information or logging that might help pinpoint the root cause?

Any guidance or suggestions to resolve this would be greatly appreciated!

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7

EDIT: exceeding the context length is not the problem; I got this error with granite3.2 at roughly 18k tokens of context used with a 32k context window.

GiteaMirror added the bug label 2026-04-12 17:00:02 -05:00
Author
Owner

@rick-github commented on GitHub (Feb 3, 2025):

`token repeat limit reached` is about the output tokens, not the input tokens. ollama has detected that a repeating pattern is being generated, which generally indicates the model has lost coherence. This is a condition models get into occasionally; it can be triggered by exceeding the context buffer, but that's not the only cause. If it is a buffer size issue you will see lines about `shifting` in the log.

If you can give some context about the type of query you are sending, there might be some specific advice. Generally, you could try increasing the context buffer, setting `num_predict` to control the number of output tokens, reducing `temperature` to control variability, or adjusting your prompting.
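
Those options can be passed per request; a minimal sketch with illustrative values, using the Python client (the same option names apply to the raw HTTP API):

```python
import ollama

response = ollama.chat(
    model="granite3.1-dense:8b-instruct-q6_K",
    messages=[{"role": "user", "content": "..."}],  # your prompt here
    options={
        "num_ctx": 32768,     # context buffer size
        "num_predict": 512,   # hard cap on generated tokens
        "temperature": 0.2,   # lower variability
    },
)
print(response["message"]["content"])
```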

Author
Owner

@ALLMI78 commented on GitHub (Feb 3, 2025):

Hello Rick,

Thanks for your quick response. I'm currently testing different parameters, but I can't say for sure yet whether any of them make a difference. My temperature was set to 0.4, and I just increased it to 0.6 to see what happens. Do you think lowering it might be a better approach?

What impact could the following parameters have?

```cpp
options.repeat_last_n = 64;
options.repeat_penalty = 1.1;
```

One important note: I only experience this issue with Granite3.1-dense models. When I use Llama 3.1, Tulu, or Qwen, the problem does not occur.

Regarding the "context about the type of query": Two LLMs perform analyses and need to generate a signal by inserting numbers into a template. Unfortunately, I can't provide more details, but I might be close to the 32k context limit when the error occurs. In the beginning, when the conversation history is still short, I believe the issue does not appear.

Looking forward to your thoughts!

EDIT:

  • no "shifting" in the log
  • changes of the temperature from 0.4 to 0.6 dit not help
  • 1 truncate but at a different time and in another run

```
time=2025-02-03T14:46:05.922+01:00 level=DEBUG source=prompt.go:77 msg="truncating input messages which exceed context length" truncated=3
time=2025-02-03T14:56:13.646+01:00 level=DEBUG source=server.go:816 msg="prediction aborted, token repeat limit reached"
[GIN] 2025/02/03 - 14:56:13 | 200 | 18.3095295s | 127.0.0.1 | POST "/api/chat"
```

Author
Owner

@rick-github commented on GitHub (Feb 3, 2025):

Lower temperature would be better.

It's interesting that your output is composed of `<fim_prefix>` tokens when the template doesn't support FIM. Does your prompt ask the model to do that, or is it spontaneous? Has the template been modified? Are you using the chat or generate endpoint? I poked around a bit with granite3.1-dense:8b-instruct-q6_K and was unable to trigger the behaviour you see.

Author
Owner

@ALLMI78 commented on GitHub (Feb 3, 2025):

I’m using the chat endpoint. I can test again with a lower temperature.

But I need to be careful not to use the wrong terms. When I said "template," I meant that my LLMs perform analyses and then have to generate a "SIGNAL."

Example of my template or SIGNAL:
`####[SIGNAL_START] OPTIONA=INT; OPTIONB=INT; OPTIONC=INT; VALUEA=%.2f; VALUEB=%.2f; [SIGNAL_END]####`

The LLMs are tasked with filling in specific numerical values based on their analysis and outputting them as a SIGNAL. And they do that very well, until the error...

This predefined structure for a signal is what I referred to as a "template." However, I now realize that there are also model templates (related to the models), but I have no experience with those. I wasn’t referring to them and haven’t changed anything there. That was my mistake—sorry for the confusion.

Author
Owner

@rick-github commented on GitHub (Feb 3, 2025):

So your client is something like this:

```python
#!/usr/bin/env python3

import ollama
import argparse

prompt = """
Analyse the following data and return a SIGNAL in the format specified.  Only the SIGNAL should be returned, no explanatory text.  You are to determine the colour, shape and mass of an object.  Use the following values for the attributes:

Shape:
  ROUND = 1
  SQUARE = 2
Colour:
  RED = 1
  GREEN = 2
  BLUE = 3

The SIGNAL to be returned must be in the following format:
####[SIGNAL_START] SHAPE=INT; COLOUR=INT; WEIGHT=%.2f; [SIGNAL_END]####

Here is the data:
{data}
"""

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", default="granite3.1-dense:8b")
parser.add_argument("-t", "--temperature", default=0.4)
parser.add_argument("-c", "--context", default=2048)
parser.add_argument("data", nargs='*')
args = parser.parse_args()

for d in args.data:
  response = ollama.chat(
      model=args.model,
      messages=[{"role":"user","content":prompt.format(data=d)}],
      options={"temperature":args.temperature, "num_ctx":args.context},
  )
  print(response["message"]["content"])
```

```console
$ ./8786.py 'the red ball weighs 4 kilos' 'the blue box weighs 10 and a half kilos'
####[SIGNAL_START] SHAPE=1; COLOUR=1; WEIGHT=4.00; [SIGNAL_END]####
####[SIGNAL_START] SHAPE=2; COLOUR=3; WEIGHT=10.50; [SIGNAL_END]####
```

Have you tried using structured outputs? The addition of a schema may make the model adhere more closely to the required output.

```python
#!/usr/bin/env python3

import ollama
import argparse
import json
from pydantic import BaseModel, Field
from decimal import Decimal

prompt = """
Analyse the following data and return a SIGNAL in the format specified.  You are to determine the colour, shape and mass of an object.  Use the following values for the attributes:

Shape:
  ROUND = 1
  SQUARE = 2
Colour:
  RED = 1
  GREEN = 2
  BLUE = 3

Here is the data:
{data}
"""

class Signal(BaseModel):
  SHAPE: int = Field(..., description="Shape of the object")
  COLOUR: int = Field(..., description="Colour of the object")
  WEIGHT: Decimal = Field(..., description="Weight of the object", decimal_places=2)

  def __str__(self):
    return f"####[SIGNAL_START] SHAPE={self.SHAPE}; COLOUR={self.COLOUR}; WEIGHT={self.WEIGHT:.2f}; [SIGNAL_END]####"


parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", default="granite3.1-dense:8b")
parser.add_argument("-t", "--temperature", default=0.4)
parser.add_argument("-c", "--context", default=2048)
parser.add_argument("data", nargs='*')
args = parser.parse_args()

for d in args.data:
  response = ollama.chat(
      model=args.model,
      messages=[{"role":"user","content":prompt.format(data=d)}],
      options={"temperature":args.temperature, "num_ctx":args.context},
      format=Signal.model_json_schema(),
  )
  signal = Signal.model_validate_json(response["message"]["content"])
  print(signal)
```

```console
$ ./8786-structured.py 'the red ball weighs 4 kilos' 'the blue box weighs 10 and a half kilos'
####[SIGNAL_START] SHAPE=1; COLOUR=1; WEIGHT=4.00; [SIGNAL_END]####
####[SIGNAL_START] SHAPE=2; COLOUR=3; WEIGHT=10.50; [SIGNAL_END]####
```
Author
Owner

@ALLMI78 commented on GitHub (Feb 3, 2025):

Hello Rick,

Thank you so much for taking the time to recreate this issue. If you're trying to trigger the error, it would make sense to set the context length to 32k and send more data in your request to get close to that limit. But yes, my setup (in MQL5) is structured similarly, with the difference that I send the response from LLMA to LLMB and vice versa. I let both models discuss and generate analyses while they control and refine each other's outputs, with additional instructions from me. I keep the last 3 messages (U>A>U>next assistant answer) in the context window, but some are long...

I've seen "structured outputs" and tools before, but I'm still unsure if I want to rebuild my system around them. My current purely text-based solution, where I manually parse the responses, runs 90% stable and allows me to use all models. I'm hesitant because I don't know if all models fully support tools and structured outputs yet. Additionally, since my client runs in MQL5, I need to be careful with implementation—I can't use universal solutions like in Python.

Unfortunately, I had to remove Granite for now, as I couldn't find a solution. Parameter changes didn't help. I've replaced Granite with DeepSeek-Qwen2.5, and the system is running quite well with it. Something about the Granite-Dense models is causing issues in my setup, but I'm not sure if you'll be able to reproduce it.

Author
Owner

@rick-github commented on GitHub (Feb 3, 2025):

No, I wasn't trying to trigger the error, just wanted to get a feel for the use case and offer a workaround. I've already tried and failed to trigger the issue, so it may be specific to the data that you are feeding the model. It's an unfortunate fact that each model has its quirks and sometimes returns unexpected results. In those cases it's sometimes easier to switch to a different model rather than trying to untie the Gordian knot of model weights, as you have done.

All models support structured outputs, but, as with tools, how well they adhere to the schema comes down to how the model was trained. Some will be better than others. I understand that re-tooling is an extra burden for unknown results.

In the absence of a clear trigger, I don't think we can make much headway, and switching models is the best solution.

Author
Owner

@ALLMI78 commented on GitHub (Feb 3, 2025):

Thanks for your awesome work here ;)

Author
Owner

@ALLMI78 commented on GitHub (Feb 9, 2025):

Same problem with the granite-3.2-8b-instruct-preview model,

tested with hf.co/AaronFeng753/granite-3.2-8b-instruct-preview-Q8_0-GGUF:Q8_0.

And the context length is not the problem; I had this error at 18k tokens with a 32k context window...

Author
Owner

@rick-github commented on GitHub (Feb 9, 2025):

In the absence of more details there's not much that can be done.

Reference: github-starred/ollama#5705