feat: (Optionally) Enable Caching for Claude models #1881

Closed
opened 2025-11-11 14:55:32 -06:00 by GiteaMirror · 13 comments
Owner

Originally created by @AspireOne on GitHub (Aug 25, 2024).

Is your feature request related to a problem? Please describe.
Anthropic has introduced caching for all of their models. This is especially useful in chats (read: the whole use-case for Open WebUI), because all of the previous conversation can be cached, cutting costs by up to 90%. This is a big deal.

Describe the solution you'd like
Enable caching by default for Anthropic Claude models on both the OpenRouter and Anthropic endpoints, or add a switch in the model's advanced params settings to toggle caching (ideally in a universal fashion, because more and more models will inevitably implement caching).

Additional context
Prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
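For reference, the Anthropic Messages API enables caching per request via `cache_control` markers on content blocks, as described in the docs above. A minimal sketch of a request body with the system prompt cached — the payload shape follows the docs, but the helper function name is mine:

```python
def build_cached_request(model: str, system_prompt: str, messages: list) -> dict:
    """Build an Anthropic Messages API body with the system prompt
    marked as a cache breakpoint via cache_control."""
    return {
        "model": model,
        "max_tokens": 1024,
        # System prompt as a content block with an ephemeral cache marker;
        # Anthropic caches the prompt prefix up to and including this block.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }

body = build_cached_request(
    "claude-3-5-sonnet-20240620",
    "You are a helpful assistant. " * 100,  # caching requires a long-enough prefix
    [{"role": "user", "content": "Hello"}],
)
```

Note that prompts below a per-model minimum length (e.g. 1024 tokens for Sonnet) are not cached even when marked.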


@thiswillbeyourgithub commented on GitHub (Aug 25, 2024):

Why would it be enabled by default? Caching increases token cost, so there are definitely situations where caching is not desirable.


@AspireOne commented on GitHub (Aug 25, 2024):

Why would it be enabled by default? Caching increases token cost, so there are definitely situations where caching is not desirable.

  • Cache write is 20% pricier ($3 -> $3.75)
  • Cache hit is 90% cheaper ($3 -> $0.30)
  1. If a part of the query has already been sent before (e.g. the entire chat history), it is a cache hit.
  2. In a chat session, all of the previous messages are a cache hit. When you have 20 messages in a chat, all of them (including the previous 19) are always re-sent as a new query and priced at the full $3/MTok. With caching enabled, however, the previous 19 messages are CACHED, are not re-processed on Anthropic's end, and are priced at just $0.30/MTok (excluding the latest one, which was JUST sent). Consider how this accumulates over a back-and-forth chat.
  3. The longer the chat goes on, the closer the savings get to 90%. Just 2 messages are enough for it to pay off, and the ratio only improves after that.
  4. Finally, because the default mode of Open WebUI is chatting, it absolutely makes the most sense to turn caching on by default.

The only situation where it does not pay off is when you send just ONE message in a chat and never continue it. Then the price of the INPUT tokens increases from $3/MTok to $3.75/MTok. How many users open a chat and send only one message? I would say a minority.
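The arithmetic above can be checked with a quick model. Using illustrative Sonnet rates ($3/MTok base input, $3.75/MTok cache write, $0.30/MTok cache hit) and assuming every turn re-sends the full history:

```python
def chat_input_cost(turns: int, tokens_per_msg: int = 1000, cached: bool = False) -> float:
    """Cumulative input-token cost (USD) of a chat where each turn
    re-sends the entire history, with or without prompt caching."""
    BASE, WRITE, HIT = 3.00, 3.75, 0.30  # $/MTok, illustrative Sonnet rates
    cost, history = 0.0, 0
    for _ in range(turns):
        if cached:
            # history is read from cache; only the new message is a cache write
            cost += (history * HIT + tokens_per_msg * WRITE) / 1e6
        else:
            # everything is billed at the base input rate
            cost += (history + tokens_per_msg) * BASE / 1e6
        history += tokens_per_msg
    return cost
```

With these numbers, a single turn is strictly worse with caching ($0.00375 vs $0.00300), but a 10-turn chat drops from $0.165 to $0.051 — roughly a 69% saving — and the ratio approaches 90% as the history dominates.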


@thiswillbeyourgithub commented on GitHub (Aug 25, 2024):

Okay, thanks for that. I don't agree with the one-message case being a minority, though. It happens in about 60% of my chats, I'd say. I just have many, many one-off questions; just like people use Google, they type a query and don't even click the links because Google's instant answer is enough.

Also, my system prompt is super long, so I would actually turn it on for my case anyway in my LiteLLM setup.


@asdf8675309 commented on GitHub (Aug 26, 2024):

We have some prompts this would be very helpful with, especially those with long context windows and ongoing discussions about the document or program in question.


@Algorithm5838 commented on GitHub (Sep 6, 2024):

Note that OpenRouter now supports Claude prompt caching.


@thiswillbeyourgithub commented on GitHub (Sep 6, 2024):

Note that OpenRouter now supports Claude prompt caching.

Do you have a link for how to do it? I couldn't find it the last few times I checked. More generally, I have no idea where to get news from OpenRouter, if you happen to know.


@Algorithm5838 commented on GitHub (Sep 6, 2024):

Here are their docs: https://openrouter.ai/docs/prompt-caching
They also have a Discord server: https://discord.gg/ReGHfT7R
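Per the OpenRouter docs linked above, caching for Claude models uses the same `cache_control` marker, attached to text parts inside OpenAI-style messages. A hedged sketch of such a message list (the helper name is mine):

```python
def openrouter_cached_messages(system_prompt: str, user_text: str) -> list:
    """OpenAI-format messages for OpenRouter with the system prompt
    marked as an Anthropic cache breakpoint."""
    return [
        {
            "role": "system",
            # Multipart content so cache_control can be attached to the text part
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_text},
    ]

msgs = openrouter_cached_messages("Long system prompt... " * 50, "Hi")
```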


@crizCraig commented on GitHub (Nov 30, 2024):

One wrinkle to this is that the cache is only kept for 5 minutes, though reads reset the 5-minute TTL. There is also a maximum of 4 cache breakpoints you can add to a single request. Still very useful for large chats, I think.
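Given that 4-breakpoint limit, a middleware would typically tag only the last few messages so the cached prefix tracks the conversation tail. A sketch, assuming plain string message contents (the function name is mine):

```python
import copy

def mark_cache_breakpoints(messages: list, max_breakpoints: int = 4) -> list:
    """Return a copy of the messages with the last `max_breakpoints`
    messages converted to content blocks carrying cache_control markers."""
    out = copy.deepcopy(messages)
    for msg in out[-max_breakpoints:]:
        # Promote a plain string to a content-block list first
        if isinstance(msg["content"], str):
            msg["content"] = [{"type": "text", "text": msg["content"]}]
        # The breakpoint goes on the final content block of the message
        msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out

tagged = mark_cache_breakpoints(
    [{"role": "user", "content": f"msg {i}"} for i in range(6)]
)
```

Here only the last 4 of the 6 messages carry markers; earlier messages stay untouched but are still covered, since each breakpoint caches the entire prefix before it.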


@crizCraig commented on GitHub (Nov 30, 2024):

Interestingly, OpenAI now does this automatically, but costs are only reduced by 50%.


@IN-Neil commented on GitHub (Dec 16, 2024):

Bumping this; my main model is Claude. I moved from LibreChat to Open WebUI recently and the Anthropic API costs have skyrocketed. I didn't realize how much LibreChat's automatic prompt caching was saving me until I moved over. It's really limiting my use.


@crizCraig commented on GitHub (Dec 17, 2024):

I actually have been seeing in my recent requests (in the last couple of weeks) that Claude seems to be caching automatically, i.e. cache_creation_input_tokens and cache_read_input_tokens are set and the prompt_tokens are lower. So maybe we don't have to do anything! I track usage by proxying the model outputs, so I'm not sure how you can check in Open WebUI, but perhaps you can see it just from your costs going down.

Edit: The above was wrong. Claude does not cache automatically; the API layer was setting the cache_control header.
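Whether caching is actually in effect can be verified from the `usage` block of an Anthropic response, which reports `cache_creation_input_tokens` and `cache_read_input_tokens` alongside `input_tokens`. A small checker over that dict (the helper name is mine):

```python
def cache_summary(usage: dict) -> dict:
    """Summarize cache behavior from an Anthropic Messages API usage block."""
    read = usage.get("cache_read_input_tokens", 0)      # billed at the hit rate
    written = usage.get("cache_creation_input_tokens", 0)  # billed at the write rate
    uncached = usage.get("input_tokens", 0)             # billed at the base rate
    total = read + written + uncached
    return {
        "cache_hit_ratio": read / total if total else 0.0,
        "caching_active": (read + written) > 0,
    }

summary = cache_summary({
    "input_tokens": 10,
    "cache_creation_input_tokens": 200,
    "cache_read_input_tokens": 790,
})
```

If both cache fields stay at zero across turns, the cache_control markers are not reaching the API.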


@crizCraig commented on GitHub (Jan 10, 2025):

I'm using prompt caching in my hosted Open WebUI, PolyChat, and it has reduced costs by 66%. Here's the code for it. Since I implemented this in the API layer, not in Open WebUI itself, it's not ready as a PR, but it is a self-contained module with tests.

Related: https://github.com/open-webui/open-webui/discussions/7873
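An API-layer approach like the one described can also encode the trade-off discussed earlier in the thread: skip caching on a one-off first message (where the write premium cannot pay off) and enable it once there is history. A hypothetical filter such a proxy might apply before injecting cache_control:

```python
def should_cache(payload: dict) -> bool:
    """Heuristic for a proxy layer: only request caching when the chat
    has prior history (or a reusable system prompt), since a one-off
    message pays the write premium with no chance of a later hit."""
    messages = payload.get("messages", [])
    return len(messages) > 1 or bool(payload.get("system"))

one_off = {"messages": [{"role": "user", "content": "quick question"}]}
ongoing = {"messages": [
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
]}
```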


@crizCraig commented on GitHub (Jan 17, 2025):

Another thing to be mindful of here is RAG requests (#8661), which don't persist and so should not be cached.


Reference: github-starred/open-webui#1881