[GH-ISSUE #7931] Phi3 model starts responding with crazy things after thousands of calls. #30837

Closed
opened 2026-04-22 10:46:45 -05:00 by GiteaMirror · 6 comments

Originally created by @TizDu on GitHub (Dec 4, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7931

What is the issue?

I am currently using Ollama server 0.4.1 and the phi3:3.8b model.

In my Python script I am using OllamaLLM via langchain_ollama.

For each user query (messages) I am re-creating the LLM with OllamaLLM and then calling llm.invoke, to be sure there is no history carried over.

Everything works well, but after some utterances (around 3000 queries) the model starts returning totally wrong results, no longer following the system prompt.

Any idea what could be wrong? Is there a limit with Phi3-mini or something like that?

I do not understand this, since I am creating a fresh LLM for each query.

Here is a code snippet:

        from langchain_ollama import OllamaLLM

        llm = OllamaLLM(model="phi3:3.8b", format="json", temperature=0)
        output = llm.invoke(messages, temperature=0, chat_history=[])

OS

Linux

GPU

No response

CPU

No response

Ollama version

0.4.1

GiteaMirror added the bug label 2026-04-22 10:46:45 -05:00

@rick-github commented on GitHub (Dec 4, 2024):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging, although this sounds like the sort of thing you need to set OLLAMA_DEBUG=1 to catch. How big are your messages?
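
On a Linux install managed by systemd, OLLAMA_DEBUG=1 is normally set in the service unit and the logs read with journalctl -u ollama. For a manually started server, a minimal sketch (assuming the ollama binary is on PATH) could look like:

    import os
    import subprocess

    # Launch the Ollama server with debug logging turned on so that
    # per-request details show up in its log output. This call blocks
    # until the server process exits.
    env = dict(os.environ, OLLAMA_DEBUG="1")
    subprocess.run(["ollama", "serve"], env=env, check=True)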


@TizDu commented on GitHub (Dec 4, 2024):

Messages are pretty short, for instance: 'explain what rare earth elements are and list some examples'

But suddenly the model does not return what is asked in the system prompt, and almost all following messages also give wrong results.

I will try to re-run them with a new server and OLLAMA_DEBUG=1. But currently I have no logs folder under ~/.ollama/


@TizDu commented on GitHub (Dec 5, 2024):

Looking in the log I do see the following:

time=2024-12-04T07:52:46.811-05:00 level=DEBUG source=server.go:812 msg="prediction aborted, token repeat limit reached"

In fact my message is not long (100 characters) but I do have a system prompt of almost 2000 characters.

Can it be that I am reaching a limit of phi3?

Also, from the log I have the impression that even though I create a new llm (OllamaLLM) for each message, it is not actually creating a new one,

since between messages I do not see the model information (starting with llama_model_loader).

Is there a way to destroy/close the llm created with OllamaLLM?
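
As a rough sanity check (assumptions: about 4 characters per token for English text, and Ollama's default 2048-token context window), a 2000-character system prompt plus a 100-character message is only on the order of 500 tokens:

    # Back-of-the-envelope prompt-size estimate (assumptions: ~4 chars per token,
    # default num_ctx of 2048 tokens).
    system_prompt_chars = 2000
    message_chars = 100
    approx_prompt_tokens = (system_prompt_chars + message_chars) / 4
    print(approx_prompt_tokens)          # 525.0
    print(approx_prompt_tokens < 2048)   # True: well inside the default window

So prompt size alone should not be hitting a limit, which is consistent with the next comment.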


@rick-github commented on GitHub (Dec 5, 2024):

A 2000-character system prompt and a 100-character prompt should still fit within the default context window. If there were server logs available, that could be checked.

OllamaLLM is just being created in your client; if you want to destroy the LLM in the server, set keep_alive (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-keep-a-model-loaded-in-memory-or-make-it-unload-immediately) to zero.
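
The FAQ approach comes down to sending keep_alive with the request itself; a minimal sketch against a local server (assuming the default port 11434 and the requests package) could look like:

    import requests

    # keep_alive=0 asks the server to unload the model as soon as this request
    # has been answered, so nothing stays resident between calls.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "phi3:3.8b",
            "prompt": "explain what rare earth elements are and list some examples",
            "format": "json",
            "stream": False,
            "keep_alive": 0,
        },
        timeout=120,
    )
    print(resp.json()["response"])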


@TizDu commented on GitHub (Dec 5, 2024):

I just learned about that 'keep_alive' option in OllamaLLM; that probably explains why, even though I create a new llm (OllamaLLM) for each message, I do not see the model information every time in the server logfile.

Something else: even if my message is short, the system prompt (that I added to the model via a Modelfile) is around 2000 characters.

So I am wondering whether I am hitting a limit at some point and should force a 'reload' of the model on the server.

I could not see any function forcing such a refresh in https://api.python.langchain.com/en/latest/ollama/llms/langchain_ollama.llms.OllamaLLM.html#ollamallm

Any idea on how to do that?

If I use the Ollama application I do see a '/clear' option; is that what I am looking for?
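
For reference, '/clear' in the interactive ollama CLI only resets that CLI session's context, so it would not affect requests coming from a Python client. From langchain_ollama, the same keep_alive knob mentioned above is exposed as a constructor field in recent versions (worth checking that the installed version actually accepts it); a sketch:

    from langchain_ollama import OllamaLLM

    # keep_alive=0 tells the Ollama server to unload the model right after the
    # request completes; the next invoke() then loads it fresh, which shows up
    # in the server log as a new llama_model_loader section.
    llm = OllamaLLM(model="phi3:3.8b", format="json", temperature=0, keep_alive=0)
    output = llm.invoke("explain what rare earth elements are and list some examples")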


@rick-github commented on GitHub (Dec 5, 2024):

ollama is supposed to be stateless; when you send requests to ollama they don't accumulate and exceed a limit, so there should be no need to reload the LLM. If something is going wrong then that would be a bug. If there were server logs available, that could be checked.
