[GH-ISSUE #13208] Cache Question #55244

Closed
opened 2026-04-29 08:36:08 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @sammyvoncheese on GitHub (Nov 23, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13208

What is the issue?

I use the python API to send requests to the Ollama server.

Is there any way to flush the server cache via the API, or to force an API request to skip the cache?

I run text generation in loops with small differences in the prompts, and I see a failure rate of about 1-2 in 10.
The model returns unexpected results that look like something partially cached.

I want to test with the cache off/disabled to eliminate it as a possible point of failure while debugging, without having to resort to a server restart.

Relevant log output


OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.13.0

GiteaMirror added the bug, needs more info labels 2026-04-29 08:36:09 -05:00

@rick-github commented on GitHub (Nov 23, 2025):

Ollama does prompt caching; anything after the change in the prompt should be invalidated in the cache. There's no option for cache flushing, the only option is to reload the model by setting `keep_alive:0` in the API request.

If there's a problem with caching, an example of input/output and the [server log](https://docs.ollama.com/troubleshooting) may help in debugging.
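For example, with the ollama Python package a reload can be forced per request (a minimal sketch assuming a default local server; the model tag and prompt are illustrative placeholders):

```python
import ollama

# keep_alive=0 asks the server to unload the model as soon as this request
# finishes, so the next call starts from a freshly loaded model and an empty
# prompt cache. Model tag and prompt are placeholders, not the actual script.
response = ollama.generate(
    model="gemma3:27b",
    prompt="Create chapter events for chapter 3 using the story outline in context.",
    keep_alive=0,
)
print(response["response"])
```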


@sammyvoncheese commented on GitHub (Nov 23, 2025):

What part of the logs would you like to see (good vs bad request)?

Here is a bit more detail.

In this example a story outline is provided as context, and the model is asked to generate text for a specific chapter.

I send in a prompt like this (this is waaaay oversimplified for this discussion but should explain the details without the data).

**System Prompt Content:** "You are and do blah blah blah" (does not change between calls)

**User Prompt Content:**
"### Additional Context: {Story outline} Starts Here:
<story outline which includes 10 titled chapters and their summaries>
### Additional Context: {Story outline} Ends Here:
Create chapter events for chapter {loop-count} using the story outline in context.
The format is markdown"

The prompts are sent in a loop (10×, once for each chapter). Only the one character, {loop-count}, changes. The size of the system/user prompt buffer is unchanged across the 10 calls.

I use a 100k context in the API, but for these tests the prompts are under 20k. I have also tested with a much larger context (64k) and see the exact same behavior. I have tried a range of temperature settings from 0 to 1 with no real impact.
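For reference, a minimal sketch of the loop described above using the ollama Python package (model tag, prompt text, and option values are illustrative placeholders, not the actual script):

```python
import ollama

SYSTEM_PROMPT = "You are and do blah blah blah"        # unchanged between calls
OUTLINE = "<story outline with 10 titled chapters>"     # unchanged between calls

for chapter in range(1, 11):
    # Only the chapter number changes between iterations; everything before it
    # is identical, so the server's prompt cache should cover the shared prefix.
    user_prompt = (
        "### Additional Context: {Story outline} Starts Here:\n"
        f"{OUTLINE}\n"
        "### Additional Context: {Story outline} Ends Here:\n"
        f"Create chapter events for chapter {chapter} using the story outline in context.\n"
        "The format is markdown"
    )
    response = ollama.chat(
        model="gemma3:27b",                              # illustrative model tag
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        options={"num_ctx": 100_000, "temperature": 0.7},  # illustrative values
    )
    print(response["message"]["content"])
```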

What I observe is that the first 1 or 2 calls in the loop successfully extract the chapter title and details from the outline and construct formatted output as expected. Then at some point the output starts to change: the model returns the wrong chapter information and the formatting changes. The result is duplicated chapters, hallucinated titles, or the wrong title for the chapter. Sometimes the model will recover and pick back up, but usually the output is significantly different from the first prompt.

This occurs with all the models I have tested from the Ollama site (for these tests they were all in the Gemma3 family, 12b and 27b, with quants from 4 to 16). I have seen similar results with gpt-oss:20b, Qwen3 models, granite3/4, and others.

Thanks for listening.


@rick-github commented on GitHub (Nov 23, 2025):

The only caching that should happen here is up to the loop-count character; everything from that character onwards should be invalidated on each subsequent call. Are the API calls serialized or in parallel? If you set `OLLAMA_DEBUG=1` and post the resulting full log it may help in debugging. The contents of the prompts won't be logged, just metadata (character count, token count, size of cache, etc).
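If it helps, one way to capture that from a script is to launch a server instance with the variable set and redirect its output to a file (a sketch assuming `ollama` is on PATH and no other server instance is running; setting the variable in the shell or system environment and restarting the Ollama app works just as well):

```python
import os
import subprocess

# Launch an Ollama server with debug logging enabled and capture its log.
# Any already-running server (e.g. the Windows tray app) must be stopped
# first, otherwise the port will already be in use.
env = dict(os.environ, OLLAMA_DEBUG="1")
with open("ollama-server.log", "w") as log:
    subprocess.Popen(["ollama", "serve"], env=env, stdout=log, stderr=subprocess.STDOUT)
```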


@sammyvoncheese commented on GitHub (Nov 28, 2025):

**Tested the following.**

  1. Tested with `keep_alive` set to 0 so the model reloads on each call (not ideal for larger models). This reduced the hallucinations to 1 in 30 (which seems much better than 3 in 10).
  2. Tested keeping the model loaded via `keep_alive=60m` and added a unique watermark, "Watermark:{timestamp}" (resolves to the current timestamp, so it's unique), to the start of the user prompt in an attempt to force the whole user prompt to skip the cache. This continued to have a similar ~30% failure rate. (Both variants are sketched below.)
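A rough sketch of the two variants, again with the ollama Python package (model tag and message content are placeholders):

```python
import time
import ollama

messages = [
    {"role": "system", "content": "You are and do blah blah blah"},
    {"role": "user", "content": "Create chapter events for chapter 3 ..."},  # placeholder
]

# Variant 1: force a model reload on every call. This empties the prompt
# cache at the cost of reloading the weights each time.
resp1 = ollama.chat(model="gemma3:27b", messages=messages, keep_alive=0)

# Variant 2: keep the model resident for an hour, but prepend a unique
# watermark so a cached prefix can never match a previous request.
watermark = f"Watermark:{time.time()}\n"
messages_wm = [
    messages[0],
    {"role": "user", "content": watermark + messages[1]["content"]},
]
resp2 = ollama.chat(model="gemma3:27b", messages=messages_wm, keep_alive="60m")
```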

I'll grab logs next.

**Side Note:** one issue was caused by an older version of a model.
It turns out I had an earlier version of gemma3:27b that I was testing with (on another computer). That model was pretty old (probably downloaded the day it was released), and it had a hard-coded 4k context in the model file which was overriding my 120k ctx set via the API (this was truncating my 20-60k prompts). I pulled the model again and this issue was resolved.
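One way to spot that kind of baked-in limit without restarting anything is the show endpoint, which returns the Modelfile parameters and model metadata (a sketch using requests against a default local server; the exact response keys can vary between Ollama versions):

```python
import requests

# Ask the local server to describe the model; a hard-coded context window
# shows up as a num_ctx entry in the returned Modelfile parameters.
resp = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "gemma3:27b"},   # illustrative model tag
    timeout=30,
)
info = resp.json()
print(info.get("parameters"))        # Modelfile parameters, e.g. "num_ctx 4096"
print(info.get("details"))           # family, parameter size, quantization
```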

Reference: github-starred/ollama#55244