Mirror of https://github.com/open-webui/open-webui.git (synced 2026-03-25 04:24:30 -05:00)
feat: (Optionally) Enable Caching for Claude models #1881
Originally created by @AspireOne on GitHub (Aug 25, 2024).
Is your feature request related to a problem? Please describe.
Anthropic has introduced caching for all of their models. This is especially useful in chats (read: the whole use-case for Open WebUI), because all of the previous conversation can be cached, cutting costs by up to 90%. This is a big deal.
Describe the solution you'd like
Enable caching by default for Anthropic Claude models on both the OpenRouter and Anthropic endpoints, or add a toggle in the model's advanced params settings to enable caching (ideally in a universal fashion, because inevitably more and more models will implement caching).
Additional context
Prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
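For context, Anthropic's prompt caching works by marking content blocks with a `cache_control` field; everything up to and including the marked block is eligible for caching. A minimal sketch of what such a request payload could look like (the model name here is illustrative, and this is a hand-built dict, not Open WebUI code; see the docs linked above for the authoritative format):

```python
# Sketch of an Anthropic Messages API payload with prompt caching enabled.
# The "cache_control": {"type": "ephemeral"} marker asks the API to cache
# the prefix ending at that block. Model name is illustrative.
def build_cached_payload(system_prompt: str, history: list[dict]) -> dict:
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        # Cache the (potentially very long) system prompt across turns.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": history,
    }

payload = build_cached_payload(
    "You are a helpful assistant. <long instructions>",
    [{"role": "user", "content": "Hello"}],
)
```

On the next turn, the same system prompt hits the cache and is billed at the much cheaper cache-read rate instead of the full input rate.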
@thiswillbeyourgithub commented on GitHub (Aug 25, 2024):
Why would it be enabled by default? Caching increases token cost, so there are definitely situations where caching is not desirable.
@AspireOne commented on GitHub (Aug 25, 2024):
The only situation where it does not pay off is when you send ONE message in a chat and then don't continue it. In that case the price of the INPUT tokens is increased from $3 to $3.75. How many users open a chat and send just one message? I would say a minority.
@thiswillbeyourgithub commented on GitHub (Aug 25, 2024):
Okay, thanks for that. I don't agree that one-message chats are a minority, though. That happens in about 60% of my chats, I'd say. I just have many one-off questions, the same way people use Google: they type a query and don't even follow the links because Google's instant answer is enough.
Also, my system prompt is super long, so I would actually turn it on for my case anyway in my LiteLLM.
@asdf8675309 commented on GitHub (Aug 26, 2024):
We have some prompts this would be very helpful with, especially those with long context windows and ongoing discussions about the document or program in question.
@Algorithm5838 commented on GitHub (Sep 6, 2024):
Note that OpenRouter now supports Claude prompt caching.
@thiswillbeyourgithub commented on GitHub (Sep 6, 2024):
Do you have a link for how to do it? I couldn't find it the last few times I checked. More generally, I have no idea where to get news from OpenRouter, if you happen to know.
@Algorithm5838 commented on GitHub (Sep 6, 2024):
Here are their docs: https://openrouter.ai/docs/prompt-caching
They also have a Discord server: https://discord.gg/ReGHfT7R
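Per the OpenRouter docs linked above, caching for Claude models goes through the same OpenAI-style chat endpoint, with `cache_control` attached to a multipart text content block. A sketch of what such a request body could look like (model slug and helper name are illustrative):

```python
# Sketch of an OpenAI-style request body for OpenRouter with Anthropic
# prompt caching: the large document is a separate text part carrying
# cache_control, so it is the part that gets cached. Model slug is
# illustrative.
def openrouter_cached_message(question: str, big_document: str) -> dict:
    return {
        "model": "anthropic/claude-3.5-sonnet",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "text",
                        "text": big_document,
                        "cache_control": {"type": "ephemeral"},
                    },
                ],
            }
        ],
    }
```

This is the same `cache_control` marker as in Anthropic's native API, just embedded in OpenAI-compatible message parts.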
@crizCraig commented on GitHub (Nov 30, 2024):
One wrinkle to this is that the cache is only kept for 5 minutes, though reads reset the 5-minute TTL. There is also a maximum of 4 cache sections you can add to a single request. Still very useful for large chats, I think.
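Given the four-breakpoint limit, a chat frontend has to choose where to place the markers. A hypothetical helper (not Open WebUI's actual behavior) that keeps markers on the newest messages, a common rolling strategy so the growing conversation prefix stays cached across turns:

```python
MAX_CACHE_BREAKPOINTS = 4  # Anthropic's per-request limit on cache_control markers

def apply_rolling_breakpoints(messages: list[dict]) -> list[dict]:
    """Attach cache_control to the last few messages so the conversation
    prefix stays cached across turns.  Older messages carry no marker,
    keeping the request under the 4-breakpoint limit."""
    out = []
    n = len(messages)
    for i, msg in enumerate(messages):
        block = {"type": "text", "text": msg["content"]}
        # Only the most recent MAX_CACHE_BREAKPOINTS messages get a marker.
        if i >= n - MAX_CACHE_BREAKPOINTS:
            block["cache_control"] = {"type": "ephemeral"}
        out.append({"role": msg["role"], "content": [block]})
    return out
```

Because cache hits require an exact prefix match, marking the most recent messages means each new turn can reuse the cache entry written by the previous one.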
@IN-Neil commented on GitHub (Dec 16, 2024):
Bumping this; my main model is Claude. I moved from LibreChat to Open WebUI recently and my Anthropic API costs have skyrocketed. I didn't realize how much LibreChat's automatic prompt caching was saving me until I moved over. It's really limiting my use.
@crizCraig commented on GitHub (Dec 17, 2024):
I actually have been seeing in my recent requests (in the last couple of weeks) that Claude seems to be caching automatically, i.e. `cache_creation_input_tokens` and `cache_read_input_tokens` are set and the `prompt_tokens` are lower. So maybe we don't have to do anything! I track usage by proxying the model outputs, so I'm not sure how you can check in Open WebUI, but perhaps you can see it just by your costs going down.
Edit: The above was wrong. Claude is not caching automatically. The API was setting the `cache_control` header.
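One way to verify caching is working is to inspect the `usage` object on each response, as described above. A hedged sketch of estimating savings from those fields, using Anthropic's documented pricing multipliers (cache writes are billed at 1.25x the base input rate, cache reads at 0.1x); the $3/MTok base price is illustrative, and the helper name is hypothetical:

```python
def estimate_input_cost(usage: dict, base_price_per_mtok: float = 3.00) -> dict:
    """Estimate input-token cost from an Anthropic-style usage object.
    Cache writes cost 1.25x the base input rate and cache reads 0.1x
    (per Anthropic's prompt-caching pricing docs)."""
    plain = usage.get("input_tokens", 0)
    writes = usage.get("cache_creation_input_tokens", 0)
    reads = usage.get("cache_read_input_tokens", 0)
    cost = (plain + 1.25 * writes + 0.1 * reads) * base_price_per_mtok / 1_000_000
    # What the same tokens would have cost with no caching at all.
    no_cache = (plain + writes + reads) * base_price_per_mtok / 1_000_000
    return {"cost_usd": cost, "without_caching_usd": no_cache}
```

With a long conversation where most tokens are cache reads, the estimated cost approaches 10% of the uncached figure, which lines up with the "up to 90%" savings claim in the original request.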
@crizCraig commented on GitHub (Jan 10, 2025):
I'm using prompt caching in my hosted Open WebUI, PolyChat, and it has reduced costs by 66%. Here's the code for it. Since I implemented this at the API layer, not as part of Open WebUI, it's not ready as a PR, but it's a self-contained module with tests.
Related: https://github.com/open-webui/open-webui/discussions/7873
@crizCraig commented on GitHub (Jan 17, 2025):
Another thing to be mindful of here is RAG requests (#8661), which don't persist and so should not be cached.