[GH-ISSUE #21085] feat: Bridge third-party library cache env vars to DATA_DIR for pip installations #73979

Closed
opened 2026-05-13 06:33:45 -05:00 by GiteaMirror · 0 comments

Originally created by @BaseBlank on GitHub (Feb 1, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/21085

Check Existing Issues

  • I have searched for all existing open AND closed issues and discussions for similar requests. I have found none that is comparable to my request.

Verify Feature Scope

  • I have read through and understood the scope definition for feature requests in the Issues section. I believe my feature request meets the definition and belongs in the Issues section instead of the Discussions.

Problem Description

When deploying via Docker, the Dockerfile (lines 88–102) explicitly sets third-party library cache environment variables so that all data lives under /app/backend/data/:

ENV WHISPER_MODEL_DIR="/app/backend/data/cache/whisper/models"
ENV SENTENCE_TRANSFORMERS_HOME="/app/backend/data/cache/embedding/models"
ENV TIKTOKEN_CACHE_DIR="/app/backend/data/cache/tiktoken"
ENV HF_HOME="/app/backend/data/cache/embedding/models"

This bridging logic does not exist in the Python code. When installing via pip install open-webui, even if you set DATA_DIR to a custom path, the following third-party library caches still scatter to the user's home directory (e.g. ~/.cache/huggingface/hub/, %APPDATA%\nltk_data, ~/.cache/torch/):

  • HuggingFace Hub models — HF_HOME is not set in Python code at all
  • Sentence Transformers — retrieval/utils.py:1213 reads SENTENCE_TRANSFORMERS_HOME and passes it to snapshot_download(), but the evaluations router (routers/evaluations.py:66) calls SentenceTransformer(EMBEDDING_MODEL_NAME) without passing cache_folder, relying solely on the library reading the env var
  • tiktoken — config.py:2903 reads TIKTOKEN_CACHE_DIR and computes a fallback of f"{CACHE_DIR}/tiktoken", but this Python variable is never referenced anywhere else in the codebase and is never written back to os.environ. The computed fallback is dead code — tiktoken only sees the env var if the user sets it externally
  • NLTK data — start.sh:14 calls nltk.download('punkt_tab'), which downloads to ~/nltk_data by default; no env var is set
  • PyTorch cache — defaults to ~/.cache/torch/

So DATA_DIR works as a single knob for everything the project itself manages (database, uploads, vector_db, whisper models, audit logs — all clean). But it doesn't control third-party library caches, which is what the Dockerfile compensates for but pip users have to figure out on their own.
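The tiktoken bullet above can be illustrated with a minimal sketch. The function names are hypothetical and are not part of Open WebUI. They only contrast reading an env var with a computed fallback against also writing that fallback back to the environment, which is the step described as missing.

```python
from typing import MutableMapping


def resolve_only(environ: MutableMapping[str, str], cache_dir: str) -> str:
    """Read TIKTOKEN_CACHE_DIR with a fallback, leaving environ untouched.

    This mirrors the pattern described for config.py: the fallback lives
    only in a Python variable, so a library that reads the environment
    never sees it.
    """
    return environ.get("TIKTOKEN_CACHE_DIR", f"{cache_dir}/tiktoken")


def resolve_and_export(environ: MutableMapping[str, str], cache_dir: str) -> str:
    """Same lookup, but publish the fallback so libraries can see it."""
    value = resolve_only(environ, cache_dir)
    # setdefault keeps any value the user already exported.
    environ.setdefault("TIKTOKEN_CACHE_DIR", value)
    return value


if __name__ == "__main__":
    env: dict = {}
    print(resolve_only(env, "/data/cache"), "TIKTOKEN_CACHE_DIR" in env)
    print(resolve_and_export(env, "/data/cache"), "TIKTOKEN_CACHE_DIR" in env)
```

In a real process `environ` would be `os.environ`; a plain dict is used here so the sketch has no side effects.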

Also, retrieval/loaders/datalab_marker.py:244 has a hard-coded Docker path: marker_output_dir = os.path.join("/app/backend/data/uploads", "marker_output")
This should probably use UPLOAD_DIR instead.
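A possible shape for that fix, as a sketch only: the helper name is hypothetical, and it assumes UPLOAD_DIR is the project's already-resolved uploads directory (a string or a Path derived from DATA_DIR).

```python
import os
from pathlib import Path
from typing import Union


def marker_output_dir(upload_dir: Union[str, Path]) -> str:
    """Build the marker output path from the configured uploads directory.

    Hypothetical helper: it replaces the literal "/app/backend/data/uploads"
    with whatever UPLOAD_DIR resolves to in the running installation.
    """
    return os.path.join(str(upload_dir), "marker_output")


if __name__ == "__main__":
    # Example with an assumed custom uploads directory.
    print(marker_output_dir(Path("/srv/owui/uploads")))
```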

Desired Solution you'd like

When FROM_INIT_PY is true (pip installation mode), set default values for third-party cache env vars based on DATA_DIR, mirroring what the Dockerfile already does. Something along the lines of:

In env.py or config.py, after DATA_DIR and CACHE_DIR are resolved:
_env_defaults = {
    "SENTENCE_TRANSFORMERS_HOME": str(CACHE_DIR / "embedding" / "models"),
    "HF_HOME": str(CACHE_DIR / "embedding" / "models"),
    "TIKTOKEN_CACHE_DIR": str(CACHE_DIR / "tiktoken"),
    # etc.
}
# Only fill in defaults; values the user exported explicitly are kept.
for key, default in _env_defaults.items():
    if key not in os.environ:
        os.environ[key] = default

This way, DATA_DIR becomes a true single knob for pip users too — set one variable, and everything (project data + third-party caches) goes to the right place. Users who already set these env vars explicitly would not be affected.
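To make the precedence rule concrete, here is a self-contained sketch of the same idea as a function that can be exercised against a plain dict. The function name is hypothetical. The `NLTK_DATA` and `TORCH_HOME` entries are assumptions added for illustration: both are environment variables that NLTK and PyTorch honor, but the subdirectory names are invented here and are not taken from the Dockerfile.

```python
from pathlib import Path
from typing import Dict, MutableMapping, Union


def bridge_cache_env(
    cache_dir: Union[str, Path], environ: MutableMapping[str, str]
) -> Dict[str, str]:
    """Fill in cache env vars under cache_dir; return only the ones applied."""
    base = Path(cache_dir)
    defaults = {
        # These three mirror the Dockerfile ENV lines quoted above.
        "SENTENCE_TRANSFORMERS_HOME": base / "embedding" / "models",
        "HF_HOME": base / "embedding" / "models",
        "TIKTOKEN_CACHE_DIR": base / "tiktoken",
        # Assumed additions; the directory names are illustrative only.
        "NLTK_DATA": base / "nltk",
        "TORCH_HOME": base / "torch",
    }
    applied: Dict[str, str] = {}
    for key, path in defaults.items():
        if key not in environ:  # an explicit user setting always wins
            environ[key] = str(path)
            applied[key] = str(path)
    return applied


if __name__ == "__main__":
    env = {"HF_HOME": "/custom/hf"}
    for name, value in bridge_cache_env("/data/cache", env).items():
        print(f"{name}={value}")
    print("HF_HOME kept as", env["HF_HOME"])
```

A call like this would likely need to run before the affected libraries are imported, since some of them may read their cache location once at import time.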

Also fix the hard-coded path in datalab_marker.py:244.

Alternatives Considered

Users can manually set all the env vars themselves before running open-webui serve. This works, but requires reading through the Dockerfile to discover which variables to set — the Python code and docs don't surface this information for pip deployments.
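For reference, the manual workaround could look roughly like the sketch below, which prints shell `export` lines for a chosen data directory. The helper name is hypothetical, and the layout assumes the cache root is `<DATA_DIR>/cache`, matching the Dockerfile paths quoted earlier. POSIX-style paths are used on purpose so the output is the same on every platform.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def export_lines(data_dir: str) -> List[str]:
    """Return shell export statements for DATA_DIR and the bridged caches."""
    root = PurePosixPath(data_dir)
    cache = root / "cache"  # assumed layout, mirroring /app/backend/data/cache
    variables = {
        "DATA_DIR": root,
        "SENTENCE_TRANSFORMERS_HOME": cache / "embedding" / "models",
        "HF_HOME": cache / "embedding" / "models",
        "TIKTOKEN_CACHE_DIR": cache / "tiktoken",
    }
    # shlex.quote protects paths that contain spaces or shell metacharacters.
    return [
        f"export {name}={shlex.quote(str(path))}"
        for name, path in variables.items()
    ]


if __name__ == "__main__":
    print("\n".join(export_lines("/srv/owui")))
```

The printed lines can be pasted into a shell profile or sourced before running `open-webui serve`.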

Additional Context

The existing DATA_DIR architecture is well-designed — one root variable, everything derived from it. This request is specifically about replicating the Dockerfile's env-var bridging (lines 88–102) in Python code so that pip users get the same unified data directory behavior that Docker users already enjoy.


Reference: github-starred/open-webui#73979