[GH-ISSUE #2023] Enable Prompt Caching by Default #1167

Closed
opened 2026-04-12 10:57:00 -05:00 by GiteaMirror · 10 comments

Originally created by @BruceMacD on GitHub (Jan 16, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2023

Originally assigned to: @BruceMacD on GitHub.

I had to disable prompt caching due to requests getting stuck: #1994

We should bring this back when we have a mitigation for the inference issue:
https://github.com/ggerganov/llama.cpp/issues/4989

GiteaMirror added the feature request label 2026-04-12 10:57:00 -05:00

@jmorganca commented on GitHub (May 6, 2024):

Prompt caching is now on by default


@chigkim commented on GitHub (Jul 9, 2024):

How can I disable prompt caching? Is there a way to turn it off with an environment variable?


@NeuralNotwerk commented on GitHub (Jul 9, 2024):

I'm going to second @chigkim - how do I disable this? My first inference request is the fastest my system produces; every subsequent request slows down until I sustain a tokens-per-second rate of about 60% of the original first request. I've got a feeling there's something wonky with the caching somewhere. (224 CPU cores, 2 TB of memory, 8xH100 8gb GPUs - it's not a resource issue, as vmem is less than 50% used, and it's not a thermal issue, as temps barely jump to 45°C.)


@iganev commented on GitHub (Feb 4, 2025):

I am going to second the second. I am noticing suspiciously identical responses for vastly different requests. Summarizing 2 different texts couldn't have possibly yielded the same thing. There's either some caching involved or a crashed executor that isn't handled properly.


@jessegross commented on GitHub (Feb 4, 2025):

If the requests are different, it won't trigger prompt caching.

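For context, prompt caching in llama.cpp-based servers generally means reusing the KV cache for the longest token prefix shared with the previous request; only the differing tail gets re-evaluated, and generated text itself is never cached. A minimal sketch of that idea (illustrative only, not Ollama's actual implementation):

```python
# Illustrative sketch of prefix-based prompt caching (not Ollama's real code).
# The cache remembers the tokens of the previous prompt; a new prompt can only
# reuse the part of the KV cache covered by the longest common token prefix.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest shared token prefix between two prompts."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PromptCache:
    def __init__(self) -> None:
        self.cached_tokens: list[int] = []

    def plan(self, new_tokens: list[int]) -> tuple[int, list[int]]:
        """Return (reused, to_evaluate): how many tokens the cache covers and
        which tokens still need a forward pass."""
        reused = common_prefix_len(self.cached_tokens, new_tokens)
        self.cached_tokens = new_tokens
        return reused, new_tokens[reused:]

# Two unrelated prompts share no prefix, so nothing is reused -- which is why
# prompt caching alone should not make different requests return identical text.
cache = PromptCache()
print(cache.plan([1, 2, 3, 4]))  # (0, [1, 2, 3, 4]) - cold cache
print(cache.plan([1, 2, 9, 9]))  # (2, [9, 9])       - prefix [1, 2] reused
```

Reusing a prefix only skips recomputation of the KV cache for those tokens; it does not change what the model generates for the tokens that are evaluated.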

@mknwebsolutions commented on GitHub (Mar 5, 2025):

Also wondering if there's an ability to turn off the cache layer to free up resources when the prompt requests are always going to be unique.


@jdblack commented on GitHub (Mar 14, 2025):

> If the requests are different, it won't trigger prompt caching.

I'm sorry if this comes off as a bit of a jumble, as I'm at a loss as to how to explain this coherently. I could provide a Ruby script that shows the behavior, if that would be useful?

I'm definitely getting identical responses from /api/generate despite the leading portion of the prompt changing. In my use case, I'm asking gemma3 (via ollama 0.6 of course) to extract a list of facts from news articles. The format of the generation is along the lines of:

```
instructions to extract facts from articles
the_article_title (40-200 characters)
article_date
article_contents (5k-30k characters)
```

However, I quickly found that I kept getting the same response back even after changing the instructions.

The length of the test article is nearly 10k characters, so I suspect it's overwhelming the initial part of the prompt.

Here are some sample instructions I tried using:

```ruby
ollama("You are a expect in sorting facts.  First, identify as 'entities' all of the  people, places, products, organziations and products.   Then,  for each entity, find as many fact as possible, and day and time they occurred: \n\n#{a.title}\n#{a.publishedAt}\n#{a.content}  ", format: format)
ollama("Find as many facts as possible!  \n\n#{a.title}\n#{a.publishedAt}\n#{a.content}  ", format: format)
ollama("Extract at least 5 facts from this article  \n\n#{a.title}\n#{a.publishedAt}\n#{a.content}", format: format)
ollama("Say hello  \n\n#{a.content}  ", format: format)
```

Format is a JSON blob that describes the JSON I want back:

```json
{"type":"array","items":{"type":"object","properties":{"fact":{"type":"string"},"date":{"type":"string"},"entities":{"type":"array","items":{"type":"string"}}},"required":["fact","date","entities"]}}
```

In all cases, I get the exact same return, even for the "say hello" one in which I don't even pass the date at all!

```ruby
[{"fact" => "The article emphasizes the limitations of DCF (Discounted Cash Flow) valuation, arguing that 80% of the calculated ‘value’ relies on uncertain terminal assumptions.",
  "date" => "October 26, 2023",
  "entities" => ["DCF valuation", "Amazon", "Tesla", "Apple", "S&P 500"]}]
```

Is there an option I can pass to /api/generate that skips the cache, or something I can tune to make the cache matching exact rather than approximate?
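For anyone trying to reproduce this outside Ruby, a rough equivalent of the calls above sent straight to /api/generate might look like the sketch below (the model name and article text are placeholders; recent Ollama versions accept a JSON schema in the `format` field):

```python
# Hedged sketch: roughly the same request as the Ruby helper above, sent
# directly to /api/generate. Model name and article text are placeholders.
import json
import requests

schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "fact": {"type": "string"},
            "date": {"type": "string"},
            "entities": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["fact", "date", "entities"],
    },
}

article = "..."  # 5k-30k characters of article text
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3",
        "prompt": f"Extract at least 5 facts from this article\n\n{article}",
        "format": schema,   # constrain the output to the JSON schema
        "stream": False,
    },
    timeout=600,
)
print(json.loads(resp.json()["response"]))
```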


@kevin-pw commented on GitHub (Mar 28, 2025):

> In all cases, I get the exact same return, even for the "say hello" one in which I don't even pass the date at all!

@jdblack Your issue is probably unrelated to prompt caching. It sounds like your context length is set too short so the beginning of your prompt gets cut off. Ollama uses a default context length of 2048 tokens.

Try setting a larger context length with `num_ctx`, for example:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 65536
  }
}'
```

Also, see docs: https://github.com/ollama/ollama/blob/main/docs/faq.md
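One way to check whether truncation rather than caching is the culprit: with `"stream": false`, the /api/generate response includes a `prompt_eval_count` field reporting how many prompt tokens were actually processed. A hedged sketch along those lines (the model name is a placeholder):

```python
# Hedged sketch: raise num_ctx and inspect prompt_eval_count to see how much
# of the prompt the model actually processed. Model name is a placeholder.
import requests

NUM_CTX = 65536

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Why is the sky blue?",
        "options": {"num_ctx": NUM_CTX},
        "stream": False,
    },
    timeout=600,
).json()

# A prompt_eval_count pinned near num_ctx suggests the prompt filled (and may
# have overflowed) the context window.
print(f"prompt tokens evaluated: {resp.get('prompt_eval_count')} (num_ctx={NUM_CTX})")
```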


@jdblack commented on GitHub (Mar 28, 2025):

> Try setting a larger context length with `num_ctx`, for example:
>
> ```shell
> curl http://localhost:11434/api/generate -d '{
>   "model": "llama3.2",
>   "prompt": "Why is the sky blue?",
>   "options": {
>     "num_ctx": 65536
>   }
> }'
> ```

Sorry, I must have commented twice and forgotten. I wrote a much larger description elsewhere and the small context was indeed the problem.

Thank you for following up!


@micseydel commented on GitHub (May 7, 2025):

> I am noticing suspiciously identical responses for vastly different requests. Summarizing 2 different texts couldn't have possibly yielded the same thing.

I've failed to reproduce it, but I hit the same thing this week. My code's HTTP request had a bug where it didn't include the `images` field for a llava request, and the resulting hallucination had specific details from the *prior* image I'd successfully sent over the web API. A screenshot of text was described as having a cat, with details specific to the prior image (which did have a cat).

I'm not filing a new bug since I can't reproduce it, but if anyone has ideas for how to reproduce this bug or links to similar issues, I'm open to suggestions.

@NeuralNotwerk and @iganev , do either of you have any updates on this? If you moved away from Ollama and are using Apple Silicon, I'd be curious what you moved to.

ETA: my code has no implementation for back-and-forth chats, just a simple one-off without streaming, written to a Markdown note. The context window using the web API should be clear between runs.

ETA2: I tried expanding the context window, which doesn't seem to have made a difference. I also looked at the logs, but nothing stands out. My current hypothesis is that it's not a context/caching issue; someone else mentioned it might be a crashed executor that wasn't cleaned up, but I'm not familiar enough with Ollama's implementation to dig in further. If anyone has tips on how I could investigate further, I'm happy to, but I'm gonna call it here unless someone has advice.

ETA3: I've scoured the logs now; there's no evidence of a crashed executor. I also went through my own app logs and git history, and I can't find any evidence that this buggy behavior is on my side. I did have to fix issues on my side, but after auditing everything, I cannot account for the carry-over between prompts. I was switching between llava:7b, llava:13b and llava:34b, but the issue happened between consecutive runs of llava:34b. For transparency, the correct output was

> The image shows a cat lying down, with its body partially visible and one paw extended outwards in what appears to be a relaxed or slightly playful position. The cat is surrounded by blankets that are piled up, providing a comfortable and soft spot for the cat to rest. There's also a red towel beneath the blankets on which the cat is resting its head. The setting looks like an indoor environment with warm lighting.

followed by

> It appears to be an image of a cat sitting down with its paw extended as if it's about to interact or play with something. The cat seems relaxed and comfortable, possibly indoors given the blurred background which suggests a domestic setting. However, without more context, it's difficult to determine what specific activity the cat is engaged in.

when no image was provided. Although the beginning shows clear carryover, the end of the response is also typical of hallucinations (which happened because my code didn't provide the text screenshot; once I realized and fixed my bug, I eventually realized this behavior still couldn't be accounted for). For the two responses above, the first prompt was `What is in this picture? ![[75a34d58-008a-4ff0-9b64-e6089cb6b6e6.png|320]]` and the second was `What is in this picture? ![[Pasted image 20250505144200.png]]`.
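For anyone who wants to attempt a reproduction of the carry-over described above, a minimal sketch might look like the following (the model name and image path are placeholders; the second request deliberately omits the `images` field, mimicking the buggy client):

```python
# Hedged sketch of a reproduction attempt: send one llava request with an
# image, then one without, and look for details of the first image leaking
# into the second response. Model name and image path are placeholders.
import base64
import requests

URL = "http://localhost:11434/api/generate"
MODEL = "llava:34b"

with open("cat.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

first = requests.post(URL, json={
    "model": MODEL,
    "prompt": "What is in this picture?",
    "images": [img_b64],  # image included
    "stream": False,
}, timeout=600).json()["response"]

second = requests.post(URL, json={
    "model": MODEL,
    "prompt": "What is in this picture?",
    # "images" intentionally omitted, mimicking the buggy client request
    "stream": False,
}, timeout=600).json()["response"]

print("with image:   ", first[:120])
print("without image:", second[:120])
# If the second response describes the first image, the carry-over reproduced.
```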

Reference: github-starred/ollama#1167