The catch-all /{path:path} proxy forwards any request to the upstream OpenAI-compatible API with the admin's API key and no access control. This is an intentional proxy but should be opt-in.
Adds ENABLE_OPENAI_API_PASSTHROUGH env var (defaults to False). When disabled, the catch-all returns 403. No other routers (Ollama, responses) have catch-all proxies.
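A minimal sketch of the gate, assuming a FastAPI router and that the flag is read from the environment at startup; the exact wiring in the codebase may differ:

    import os

    from fastapi import APIRouter, HTTPException, Request

    router = APIRouter()

    # Off by default; admins must opt in explicitly.
    ENABLE_OPENAI_API_PASSTHROUGH = (
        os.environ.get("ENABLE_OPENAI_API_PASSTHROUGH", "false").lower() == "true"
    )


    @router.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE"])
    async def proxy(path: str, request: Request):
        if not ENABLE_OPENAI_API_PASSTHROUGH:
            raise HTTPException(status_code=403, detail="Passthrough is disabled")
        ...  # forward to the upstream OpenAI-compatible API with the admin key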
The model name from user input was interpolated directly into Azure deployment URL paths without validation. A user could send a model name like '../../management/foo' to traverse the URL path and hit unintended Azure endpoints with the admin's API key.
Adds _sanitize_model_for_url that rejects path separators and traversal sequences, and percent-encodes the name. Applied at convert_to_azure_payload (covers chat completions + proxy) and the responses endpoint's direct URL construction.
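A sketch in the spirit of _sanitize_model_for_url; the concrete rejection rules in the actual helper may differ:

    from urllib.parse import quote

    from fastapi import HTTPException


    def _sanitize_model_for_url(model: str) -> str:
        # Reject anything that could alter the URL path: separators and
        # traversal sequences. Rejecting '%' as well is this sketch's own
        # extra guard against double-encoding, not necessarily the real rule.
        if any(token in model for token in ("/", "\\", "..", "%")):
            raise HTTPException(status_code=400, detail="Invalid model name")
        # Percent-encode so the name is inert as a single path segment.
        return quote(model, safe="")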
The /responses proxy endpoint only required authentication via
get_verified_user but did not check per-model access grants. This
allowed any authenticated user to access any model through this
endpoint, bypassing the access control system.
Extract a shared check_model_access helper into utils/access_control
and replace all inline access control blocks across openai.py and
ollama.py (7 locations) with calls to this helper, sketched below.
This eliminates code duplication and prevents future policy drift
between endpoints.
CWE-862: Missing Authorization
CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:N/A:N (6.5 Medium)
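A hypothetical shape for the extracted helper, reusing the Models.get_model_by_id() and has_access() calls these commits name; the real signature, imports, and error responses may differ:

    from fastapi import HTTPException


    def check_model_access(user, model: dict) -> None:
        # Admins bypass per-model grants; everyone else needs an explicit
        # read grant recorded in the model's access_control policy.
        if user.role == "admin":
            return
        model_info = Models.get_model_by_id(model["id"])
        if model_info is None or not has_access(
            user.id, type="read", access_control=model_info.access_control
        ):
            raise HTTPException(status_code=403, detail="Model not found")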
Azure offers two URL formats: the legacy deployment-based format
(/openai/deployments/{model}/...) and the newer v1 format
(/openai/v1/...) where the model stays in the payload body and no
api-version query parameter is needed.
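For illustration, the two shapes look roughly like this (resource, model, and api-version values are placeholders):

    # deployment-based (legacy): model in the path, api-version required
    https://{resource}.openai.azure.com/openai/deployments/{model}/chat/completions?api-version=2024-02-01

    # v1: model stays in the JSON body, no api-version query parameter
    https://{resource}.openai.azure.com/openai/v1/chat/completions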
Previously, the code always ran convert_to_azure_payload which
rewrites the URL to the deployment format, causing 404 errors for
users with v1-style base URLs. Now, when the base URL contains
'/openai/v1', we skip deployment URL construction and route
directly (sketched below).
Applied consistently across all three Azure routing paths:
generate_chat_completion, /responses proxy, and generic proxy.
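A condensed sketch of the branch, assuming convert_to_azure_payload returns the rewritten deployment URL alongside the payload (its actual signature may differ):

    def route_azure_request(base_url: str, payload: dict):
        if "/openai/v1" in base_url:
            # v1 format: no URL rewriting; the model stays in the payload
            # body and no api-version query parameter is appended.
            return f"{base_url}/chat/completions", payload
        # Legacy format: rewrite to the deployment-based URL, moving the
        # model name into the path.
        return convert_to_azure_payload(base_url, payload)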
fix: reduce TTFT by caching model lookups in chat completion
Skip expensive get_all_models() calls when models are already cached
in app.state. This significantly reduces Time To First Token (TTFT)
for chat completions and embeddings requests.
Previously, every request called get_all_models() which fetches model
lists from all configured backends. Now we check the cache first and
only call get_all_models() on cache miss, as sketched after the
endpoint list below.
Affected endpoints:
- openai: generate_chat_completion, embeddings
- ollama: embed, embeddings
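A sketch of the cache-first pattern; the cache attribute name (app.state.MODELS here) and the get_all_models signature are assumptions:

    async def get_cached_models(request, user):
        models = getattr(request.app.state, "MODELS", None)
        if not models:
            # Cache miss: pay the full fan-out to every configured backend.
            await get_all_models(request, user=user)
            models = request.app.state.MODELS
        # Cache hit: no backend round-trips before the first token.
        return models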
Fixes #20069
Co-authored-by: Michael <42099345+mickeytheseal@users.noreply.github.com>
Each access to request.app.state.config.<KEY> triggers a synchronous
Redis GET. In get_all_models_responses() and get_merged_models(), the
config keys OPENAI_API_BASE_URLS, OPENAI_API_KEYS, and
OPENAI_API_CONFIGS were read on every loop iteration, in some cases
resulting in 200-300 Redis round-trips for OPENAI_API_BASE_URLS alone.
Read each config value once into a local variable at the start of the
function and reuse it throughout.
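A before/after sketch (surrounding code simplified; the per-index config lookup shape is an assumption):

    def get_all_models_responses_sketch(request):
        # Before: each attribute access inside the loop could issue a Redis GET.
        # for idx in range(len(request.app.state.config.OPENAI_API_BASE_URLS)):
        #     url = request.app.state.config.OPENAI_API_BASE_URLS[idx]

        # After: one read per key, reused for every iteration.
        base_urls = request.app.state.config.OPENAI_API_BASE_URLS
        api_keys = request.app.state.config.OPENAI_API_KEYS
        api_configs = request.app.state.config.OPENAI_API_CONFIGS

        for idx, url in enumerate(base_urls):
            key = api_keys[idx] if idx < len(api_keys) else ""
            cfg = api_configs.get(str(idx), {})
            ...  # build the per-connection request from url, key, cfg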
Remove Depends(get_session) from the /chat/completions endpoint to prevent database connections from being held during the entire duration of LLM calls (30-60+ seconds for streaming responses).
Previously, the database session was acquired at request start and held until the streaming response completed. Under concurrent load, this exhausted the connection pool, causing QueuePool timeout errors for other database operations.
The fix allows Models.get_model_by_id() and has_access() to manage their own short-lived sessions internally, releasing the connection immediately after the quick authorization checks complete, before the slow external LLM API call begins.
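A sketch of the signature change, reusing the get_verified_user, Models, and has_access helpers named in these commits; route and parameter details are simplified:

    from fastapi import APIRouter, Depends, HTTPException, Request

    router = APIRouter()


    # Before: db=Depends(get_session) pinned a pooled connection for the
    # whole request, including the long streaming phase.
    #
    # After: no session dependency; the authorization helpers open and close
    # their own sessions before the upstream call starts.
    @router.post("/chat/completions")
    async def generate_chat_completion(
        request: Request, form_data: dict, user=Depends(get_verified_user)
    ):
        model = Models.get_model_by_id(form_data["model"])  # short-lived session
        if model is None or not has_access(
            user.id, type="read", access_control=model.access_control
        ):
            raise HTTPException(status_code=403, detail="Access denied")
        ...  # stream from the upstream LLM; no DB connection held here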