[GH-ISSUE #1573] Enable prompt cache #47374

Closed
opened 2026-04-28 03:38:38 -05:00 by GiteaMirror · 9 comments

Originally created by @K0IN on GitHub (Dec 17, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1573

I use ollama in an automated way, so I end up sending the same prompt all the time.

That's why I thought we might allow ollama to use prompt_cache.

https://github.com/ggerganov/llama.cpp/blob/f7f468a97dceec2f8fe8b1ed7a2091083446ebc7/common/common.cpp#L1508C22-L1508C38

Or is there already a way to control this / does ollama cache multiple prompts anyway?


@TheDudeFromCI commented on GitHub (Dec 19, 2023):

I think this would be extremely useful, especially on slower devices. Right now it takes me around 5 minutes to evaluate the prompt before any text is generated, because of hardware limitations. Having to wait for the prompt to be re-evaluated after each newly generated line takes up over 90% of the total time spent generating text.

If the cache could perhaps be returned the way the `context` tokens are, or even just a code returned to tell the model to read the cache from a specific temp file, it would greatly improve overall performance.
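
For what it's worth, the `context` field that `/api/generate` already returns can be fed back into the next request so the server resumes from that state instead of re-reading everything. A minimal sketch, assuming a local Ollama on the default port and a `llama2` model (adjust the model name to whatever you have pulled):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def generate(prompt, context=None, model="llama2"):
    """Call /api/generate, optionally passing back the `context` tokens
    returned by a previous response so the server can resume from them."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if context is not None:
        payload["context"] = context
    resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()

# First call pays the full prompt-evaluation cost.
first = generate("Summarize the plot of Hamlet in one sentence.")
print(first["response"])

# Passing the returned context back lets the server continue from that
# state rather than re-evaluating the whole prompt from scratch.
second = generate("Now do the same for Macbeth.", context=first.get("context"))
print(second["response"])
```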


@djmaze commented on GitHub (Dec 19, 2023):

I imagine a memory-based, fixed-size LRU cache that stores prompt evaluations on a session-by-session basis. The data of the least recently used session would be evicted first.

That said, I don't know the ollama internals. Maybe it does not even have the concept of a client session?
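
Something like the following is what I have in mind; ollama doesn't expose such a structure today, so the names here (`SessionPromptCache`, the session ids) are purely hypothetical and only illustrate the eviction behaviour:

```python
from collections import OrderedDict

class SessionPromptCache:
    """Hypothetical fixed-size LRU cache mapping a session id to its
    evaluated prompt state; the least recently used session is evicted."""

    def __init__(self, max_sessions=8):
        self.max_sessions = max_sessions
        self._entries = OrderedDict()

    def get(self, session_id):
        state = self._entries.get(session_id)
        if state is not None:
            self._entries.move_to_end(session_id)  # mark as most recently used
        return state

    def put(self, session_id, state):
        self._entries[session_id] = state
        self._entries.move_to_end(session_id)
        while len(self._entries) > self.max_sessions:
            self._entries.popitem(last=False)  # drop the least recently used session
```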


@AndreiSva commented on GitHub (Dec 20, 2023):

yes please!!


@K0IN commented on GitHub (Dec 20, 2023):

The link in my issue might be wrong; as far as I can tell, ollama uses the server example from llama.cpp, which has its own cache_prompt flag, see https://github.com/ggerganov/llama.cpp/tree/master/examples/server#api-endpoints.

Also see the comment on n_predict.

I hacked something together to control this flag from the ollama API, but I don't see any difference so far.

I like @TheDudeFromCI's idea; it is nice to keep the API stateless.
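
For anyone who wants to poke at the underlying server directly: assuming a llama.cpp server example running locally on its default port 8080, the /completion endpoint takes cache_prompt per request, so repeated requests that share a prefix should only evaluate that prefix once. A rough sketch (the prompt text and n_predict value are arbitrary):

```python
import requests

LLAMA_SERVER = "http://localhost:8080/completion"  # llama.cpp server example, default port

SHARED_PREFIX = "You are a terse assistant that answers in one sentence.\n"

def complete(user_text):
    payload = {
        "prompt": SHARED_PREFIX + user_text,
        "n_predict": 64,        # cap the number of generated tokens
        "cache_prompt": True,   # ask the server to reuse the cached prompt prefix
    }
    resp = requests.post(LLAMA_SERVER, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["content"]

# The shared prefix only needs to be evaluated once; later calls that start
# with the same text can reuse the cached KV state.
print(complete("What is prompt caching?"))
print(complete("Why does it speed up repeated prompts?"))
```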


@K0IN commented on GitHub (Dec 20, 2023):

This might be just right; see the comments in https://github.com/ggerganov/llama.cpp/issues/4329


@K0IN commented on GitHub (Dec 20, 2023):

@djmaze feel free to test out my PR and give some feedback :)


@jmorganca commented on GitHub (Jan 25, 2024):

This should be fixed in https://github.com/ollama/ollama/pull/2190, but feel free to re-open @K0IN


@samos123 commented on GitHub (Feb 13, 2025):

How do I enable prefix caching?

Edit: looks like it's enabled by default.
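
One way to check, assuming a local Ollama on the default port: send the same long prompt twice and compare the prompt_eval_count / prompt_eval_duration fields in the responses; on the second run far fewer (or zero) prompt tokens should need re-evaluation if the prefix was cached.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def prompt_eval_stats(prompt, model="llama2"):
    """Return how many prompt tokens were evaluated and how long it took (ns)."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    return data.get("prompt_eval_count"), data.get("prompt_eval_duration")

long_prompt = "Repeat after me: prefix caching test. " * 50
print("first run: ", prompt_eval_stats(long_prompt))
print("second run:", prompt_eval_stats(long_prompt))  # should show far fewer evaluated tokens
```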


@sairajv commented on GitHub (Mar 3, 2025):

@jmorganca Does ollama have prefix caching capability? I have searched the issues but cannot find a definitive answer.

Reference: github-starred/ollama#47374