mirror of
https://github.com/open-webui/open-webui.git
synced 2026-06-09 03:01:34 -05:00
[GH-ISSUE #21085] feat: Bridge third-party library cache env vars to DATA_DIR for pip installations #73979
Originally created by @BaseBlank on GitHub (Feb 1, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/21085
Problem Description
When deploying via Docker, the Dockerfile (lines 88–102) explicitly sets third-party library cache environment variables so that all data lives under /app/backend/data/:
ENV WHISPER_MODEL_DIR="/app/backend/data/cache/whisper/models"
ENV SENTENCE_TRANSFORMERS_HOME="/app/backend/data/cache/embedding/models"
ENV TIKTOKEN_CACHE_DIR="/app/backend/data/cache/tiktoken"
ENV HF_HOME="/app/backend/data/cache/embedding/models"
This bridging logic does not exist in the Python code. When installing via pip install open-webui, even if you set DATA_DIR to a custom path, third-party library caches still scatter outside it, into each library's default per-user location (e.g. ~/.cache/huggingface/hub/, %APPDATA%\nltk_data, ~/.cache/torch/).
So DATA_DIR works as a single knob for everything the project itself manages (database, uploads, vector_db, whisper models, audit logs — all clean). But it doesn't control third-party library caches, which is what the Dockerfile compensates for but pip users have to figure out on their own.
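A quick way to see the gap in a pip environment is to check which of the Dockerfile's cache variables are actually set in the running process. The following is a minimal sketch; `CACHE_ENV_VARS` and `report_cache_env` are hypothetical names used only for illustration, and the variable list is taken from the ENV lines quoted above.

```python
import os

# Cache variables the Dockerfile sets, per the ENV lines quoted above.
CACHE_ENV_VARS = (
    "WHISPER_MODEL_DIR",
    "SENTENCE_TRANSFORMERS_HOME",
    "TIKTOKEN_CACHE_DIR",
    "HF_HOME",
)


def report_cache_env(environ=None):
    """Return {variable: value or None} for each cache variable.

    `environ` defaults to the real process environment; passing a plain
    dict makes the helper easy to try out without touching os.environ.
    """
    env = os.environ if environ is None else environ
    return {name: env.get(name) for name in CACHE_ENV_VARS}


if __name__ == "__main__":
    for name, value in report_cache_env().items():
        shown = value if value is not None else "<unset: library default applies>"
        print(f"{name} = {shown}")
```

Any variable reported as unset is one where the library falls back to its own default location rather than a path under DATA_DIR.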
Also, retrieval/loaders/datalab_marker.py:244 has a hard-coded Docker path: marker_output_dir = os.path.join("/app/backend/data/uploads", "marker_output")
This should probably use UPLOAD_DIR instead.
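As a rough sketch of what that change could look like: the snippet below rebuilds stand-ins for DATA_DIR and UPLOAD_DIR so that it runs on its own. In the project itself, UPLOAD_DIR would presumably be imported from the existing configuration instead. The `uploads` subdirectory name is inferred from the hard-coded path, and the `marker_output_dir` helper is hypothetical.

```python
import os
from pathlib import Path

# Assumption: local stand-ins for the project's DATA_DIR / UPLOAD_DIR settings,
# defined here only so the snippet is self-contained.
DATA_DIR = Path(os.environ.get("DATA_DIR", "./data")).resolve()
UPLOAD_DIR = DATA_DIR / "uploads"


def marker_output_dir(upload_dir=UPLOAD_DIR):
    """Build the marker output path under the configured upload directory
    instead of the Docker-only "/app/backend/data/uploads" prefix."""
    return os.path.join(str(upload_dir), "marker_output")
```

With this shape, a pip user who sets DATA_DIR gets marker output under their own uploads directory, while Docker users end up with the same path as before as long as DATA_DIR points at /app/backend/data.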
Desired Solution you'd like
When FROM_INIT_PY is true (pip installation mode), set default values for third-party cache env vars based on DATA_DIR, mirroring what the Dockerfile already does. Something along the lines of:
In env.py or config.py, after DATA_DIR and CACHE_DIR are resolved:
_env_defaults = {
    "SENTENCE_TRANSFORMERS_HOME": str(CACHE_DIR / "embedding" / "models"),
    "HF_HOME": str(CACHE_DIR / "embedding" / "models"),
    "TIKTOKEN_CACHE_DIR": str(CACHE_DIR / "tiktoken"),
    # ... other third-party cache variables
}
for key, default in _env_defaults.items():
    # Only fill in a default; never override a value the user set explicitly.
    if key not in os.environ:
        os.environ[key] = default
This way, DATA_DIR becomes a true single knob for pip users too — set one variable, and everything (project data + third-party caches) goes to the right place. Users who already set these env vars explicitly would not be affected.
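To make the "explicit values are not affected" behavior concrete, here is a self-contained sketch of the same idea as a function that can be exercised against a plain dict. The name `apply_cache_env_defaults`, its `enabled` flag (standing in for the FROM_INIT_PY check), and the assumption that the cache directory is `DATA_DIR / "cache"` (mirroring the Dockerfile layout) are all illustrative, not the project's actual API.

```python
import os
from pathlib import Path


def apply_cache_env_defaults(data_dir, environ=None, enabled=True):
    """Fill in unset third-party cache variables with paths under data_dir.

    Returns the names of the variables that were actually set, so callers
    can log what was bridged. `enabled` is a hypothetical stand-in for the
    FROM_INIT_PY (pip installation mode) check.
    """
    env = os.environ if environ is None else environ
    if not enabled:
        return []

    # Assumption: cache lives at <data_dir>/cache, as in the Dockerfile paths.
    cache_dir = Path(data_dir) / "cache"
    defaults = {
        "SENTENCE_TRANSFORMERS_HOME": cache_dir / "embedding" / "models",
        "HF_HOME": cache_dir / "embedding" / "models",
        "TIKTOKEN_CACHE_DIR": cache_dir / "tiktoken",
    }

    applied = []
    for key, path in defaults.items():
        # Leave any value the user already exported untouched.
        if key not in env:
            env[key] = str(path)
            applied.append(key)
    return applied
```

Calling `apply_cache_env_defaults("/srv/owui", environ={"HF_HOME": "/mnt/models"})` leaves HF_HOME as it was and fills in only the other two variables. Since libraries may read these variables when they are first imported, a hook like this would likely need to run before those imports happen.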
Also fix the hard-coded path in datalab_marker.py:244.
Alternatives Considered
Users can manually set all the env vars themselves before running open-webui serve. This works, but requires reading through the Dockerfile to discover which variables to set — the Python code and docs don't surface this information for pip deployments.
Additional Context
The existing DATA_DIR architecture is well-designed — one root variable, everything derived from it. This request is specifically about replicating the Dockerfile's env-var bridging (lines 88–102) in Python code so that pip users get the same unified data directory behavior that Docker users already enjoy.