[PR #22645] [CLOSED] fix+feat: Knowledge file processing — async Docling, live status UI, error visibility, delete-race guard #26794

Closed
opened 2026-04-20 06:42:58 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/22645
Author: @jannefleischer
Created: 3/13/2026
Status: Closed

Base: dev ← Head: feat-knowledgestatus


📝 Commits (1)

  • 8ef7bd5 feat+fix: Fixed handling of long-running task when adding knowledge-files; added a tooltip to the spinner with statusses, where available (currently: only docling-serve)

📊 Changes

8 files changed (+333 additions, -34 deletions)

View changed files

📝 backend/open_webui/config.py (+6 -0)
📝 backend/open_webui/main.py (+2 -0)
📝 backend/open_webui/retrieval/loaders/main.py (+111 -17)
📝 backend/open_webui/retrieval/utils.py (+7 -2)
📝 backend/open_webui/routers/retrieval.py (+61 -0)
📝 src/lib/apis/files/index.ts (+6 -1)
📝 src/lib/components/workspace/Knowledge/KnowledgeBase.svelte (+60 -8)
📝 src/lib/components/workspace/Knowledge/KnowledgeBase/Files.svelte (+80 -6)

📄 Description

Summary

This PR bundles several tightly coupled improvements to the knowledge-base file pipeline. They share backend infrastructure and a common motivation: the UI had no awareness of what happened after upload, and two separate race conditions could cause files to silently disappear.

Addresses #22571 and #22573; several related issues discovered along the way are also fixed.


Issues addressed

#22571 — Silent file drop when Ollama model TTL expires during Docling processing (IndexError)

When Ollama unloads its embedding model (default TTL: 5 min) while a large file is being processed by Docling, all parallel embedding batch requests receive 503 Service Unavailable. The response body did not have the shape the embedding code expected, which caused an IndexError: list index out of range. That exception aborted process_file() without being surfaced, so the file disappeared from the UI with no error shown.

Fix in retrieval/utils.py: The batch-aggregation loop that previously silently skipped non-list results now raises a descriptive ValueError (including response type and a hint to check the log). This ensures the exception propagates to process_file(), which marks the file as failed with the actual error detail, making the failure visible in the UI (see UI changes below).

Note: PR #22589 separately fixes the generate_ollama_batch_embeddings function to re-raise instead of returning None. The change here is defense-in-depth for similar silencing patterns elsewhere in the batch pipeline.

#22573 — Files not linked to knowledge collection when dropped simultaneously (FILE_NOT_PROCESSED race condition)

When multiple files were drag-dropped simultaneously with a slow extraction backend (Docling, Marker, MinerU), the frontend called POST /knowledge/{id}/file/add immediately after the upload returned — before Docling had actually finished. The backend correctly rejected the call with 400 FILE_NOT_PROCESSED, the error was silently swallowed, and the file's embeddings (written later by the background task) were permanently orphaned with no knowledge_file row.

Root cause: The old DoclingLoader used the synchronous /v1/convert/file endpoint. With large queues, nginx would drop the connection after ~120 s with a gateway timeout, the SSE stream closed early, and the frontend called file/add on an unprocessed file.

Fix in loaders/main.py: DoclingLoader now uses the async /v1/convert/file/async endpoint. process_file() polls /v1/status/poll/{task_id} until the job completes (or DOCLING_SERVE_TIMEOUT is exceeded). The SSE stream from GET /files/{id}/process/status stays open for the entire polling loop, so uploadFile() on the frontend only resolves — and addFileHandler is only called — once the file is fully processed. The race window is closed.


Changes by component

backend/open_webui/retrieval/loaders/main.py — DoclingLoader async rewrite

  • Switched from POST /v1/convert/file (synchronous, drops on gateway timeout) to POST /v1/convert/file/async + GET /v1/status/poll/{task_id} + GET /v1/result/{task_id}.
  • Long-poll window: 30 s per request. If a server ignores ?wait= and answers immediately, a guard delays the next request so that polling never happens more often than once per 30 s.
  • New timeout parameter: optional total seconds to wait before raising. Maps to the new DOCLING_SERVE_TIMEOUT config.
  • New status_callback parameter: optional callable(dict) invoked after submit (with task_id + initial task_position) and after each poll (with updated task_position). Used by process_file() to persist queue state into file.data for the UI tooltip.
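
The submit / poll / fetch flow in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the HTTP calls are replaced by injected callables (`submit`, `poll`, `fetch_result`), the `task_status` field name and its `"success"` / `"failure"` values are assumptions, and `clock` / `sleep` are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Optional

POLL_WINDOW_S = 30.0  # long-poll window per request, as described above


def wait_for_task(
    submit: Callable[[], dict],
    poll: Callable[[str, float], dict],
    fetch_result: Callable[[str], dict],
    timeout: Optional[float] = None,
    status_callback: Optional[Callable[[dict], None]] = None,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Submit a job, long-poll until it finishes, then fetch the result."""
    started = clock()
    task = submit()
    task_id = task["task_id"]
    if status_callback:
        # Report the task id and initial queue position right after submit.
        status_callback({"task_id": task_id, "task_position": task.get("task_position")})

    while True:
        poll_started = clock()
        status = poll(task_id, POLL_WINDOW_S)
        if status_callback:
            status_callback({"task_position": status.get("task_position")})

        state = status.get("task_status")  # assumed field name and values
        if state == "success":
            return fetch_result(task_id)
        if state == "failure":
            raise RuntimeError(f"task {task_id} failed")

        # timeout=None means: keep waiting indefinitely.
        if timeout is not None and clock() - started >= timeout:
            raise TimeoutError(f"task {task_id} not finished after {timeout} s")

        # If the server ignored the wait hint and answered early,
        # sleep for the rest of the window before polling again.
        elapsed = clock() - poll_started
        if elapsed < POLL_WINDOW_S:
            sleep(POLL_WINDOW_S - elapsed)
```

Passing the clock and sleep functions in keeps the timing rule (at most one poll per window, optional overall deadline) separate from the transport, which is a design choice of this sketch rather than something the PR states.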

backend/open_webui/config.py / main.py / routers/retrieval.py — DOCLING_SERVE_TIMEOUT

  • New PersistentConfig entry DOCLING_SERVE_TIMEOUT: integer seconds, default None (wait forever). Set via env var or Admin UI → Documents → Docling.
  • Exposed in GET /retrieval/config and POST /retrieval/config/update so it is readable/writable from the Admin Settings panel.
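
As a rough illustration of the "integer seconds, default None" semantics, an environment-style string could be mapped to the timeout value like this. The helper name `parse_timeout` is hypothetical, and rejecting non-positive values is an assumption of this sketch; the PR does not say how such values are treated.

```python
from typing import Optional


def parse_timeout(raw: Optional[str]) -> Optional[int]:
    """Turn an env-style string into seconds; None means wait indefinitely."""
    if raw is None or raw.strip() == "":
        return None  # unset or empty: no overall deadline
    seconds = int(raw)  # raises ValueError for non-numeric input
    if seconds <= 0:
        # Assumption of this sketch: zero or negative timeouts are rejected.
        raise ValueError(f"timeout must be a positive number of seconds, got {seconds}")
    return seconds
```

A caller might use it as `parse_timeout(os.environ.get("DOCLING_SERVE_TIMEOUT"))`.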

backend/open_webui/routers/retrieval.py — process_file enhancements

Processing status + start time: Immediately before loader.load(), sets {"status": "processing", "started_at": <unix timestamp>} in file.data. Applies to all extraction engines — not just Docling.

Docling status callback: A closure _docling_status_callback(data) opens a fresh DB session and merges data into file.data. Passed to Loader() as DOCLING_STATUS_CALLBACK kwarg, forwarded by Loader._get_loader() to DoclingLoader. Other engines receive None and are unaffected.
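
The merge behaviour of the status callback might look roughly like the sketch below. A plain dictionary stands in for the files table and the fresh DB session, so `FileStore` and `make_status_callback` are hypothetical names; silently ignoring a row that no longer exists is also an assumption.

```python
from typing import Callable, Dict

# Stand-in for the files table: file_id -> row holding a "data" dict.
FileStore = Dict[str, dict]


def make_status_callback(store: FileStore, file_id: str) -> Callable[[dict], None]:
    """Build a closure that merges loader status updates into one file's data."""

    def _status_callback(data: dict) -> None:
        row = store.get(file_id)
        if row is None:
            return  # file was deleted in the meantime; nothing to update
        # Merge instead of overwrite, so earlier keys such as started_at survive.
        row["data"] = {**row.get("data", {}), **data}

    return _status_callback
```

Because the closure captures `file_id`, the loader only needs to know that it was handed a `callable(dict)`.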

Delete-race guard: After save_docs_to_vector_db() returns True but before writing status: completed, checks via a fresh DB session whether the file row still exists. If the user deleted the file while Docling/embedding was running (a window that is now unbounded with the async polling approach):

  • Shared knowledge collection (form_data.collection_name set): calls VECTOR_DB_CLIENT.delete(filter={"file_id": ...}) to remove only this file's chunks.
  • Standalone private collection: calls VECTOR_DB_CLIENT.delete_collection(...).
  • Returns early with {"status": False, "reason": "file_deleted_during_processing"}.

A fresh session is used so the long-since-committed outer session's read cache doesn't mask the deletion.
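
The guard's decision logic can be sketched as below, under stated assumptions: `InMemoryVectorDB` is a toy stand-in for the vector store, `file_exists` and `mark_completed` are hypothetical callables replacing the fresh-session lookup and the status write, and the `file-<id>` name for the standalone collection is an assumption.

```python
from typing import Callable, Dict, List, Optional


class InMemoryVectorDB:
    """Toy stand-in for the vector store: collection name -> list of chunks."""

    def __init__(self) -> None:
        self.collections: Dict[str, List[dict]] = {}

    def delete(self, collection_name: str, filter: dict) -> None:
        # Remove only the chunks whose metadata matches every filter key.
        chunks = self.collections.get(collection_name, [])
        self.collections[collection_name] = [
            c for c in chunks if any(c.get(k) != v for k, v in filter.items())
        ]

    def delete_collection(self, collection_name: str) -> None:
        self.collections.pop(collection_name, None)


def finish_processing(
    file_id: str,
    shared_collection: Optional[str],
    file_exists: Callable[[str], bool],
    vector_db: InMemoryVectorDB,
    mark_completed: Callable[[str], None],
) -> dict:
    """Run after embeddings were saved, before the file is marked completed."""
    if not file_exists(file_id):
        if shared_collection:
            # Shared knowledge collection: drop only this file's chunks.
            vector_db.delete(shared_collection, filter={"file_id": file_id})
        else:
            # Standalone collection (name format is an assumption here).
            vector_db.delete_collection(f"file-{file_id}")
        return {"status": False, "reason": "file_deleted_during_processing"}
    mark_completed(file_id)
    return {"status": True}
```

Checking existence before the final status write is what prevents a deleted file from leaving orphaned chunks behind.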

backend/open_webui/retrieval/utils.py — batch embedding error propagation

The batch-results aggregation loop now raises ValueError instead of silently skipping batches that returned a non-list value. Error message includes the actual return type and a pointer to the log lines above for the root cause.
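
A minimal sketch of this "raise instead of skip" pattern, assuming each batch is expected to return a list of embedding vectors. The function name `aggregate_batches` and the exact message wording are illustrative, not taken from the PR.

```python
from typing import Any, Iterable, List


def aggregate_batches(batch_results: Iterable[Any]) -> List[List[float]]:
    """Flatten per-batch embedding lists, failing loudly on a bad batch."""
    embeddings: List[List[float]] = []
    for index, result in enumerate(batch_results):
        if not isinstance(result, list):
            # Previously such a batch would have been skipped without notice.
            raise ValueError(
                f"Embedding batch {index} returned {type(result).__name__} "
                "instead of a list; check the log lines above for the root cause."
            )
        embeddings.extend(result)
    return embeddings
```

Raising here lets the caller mark the file as failed with a concrete message instead of continuing with a partial set of embeddings.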

src/lib/apis/files/index.ts — uploadFile progress callback

Added optional onProgress?: (data: { status: string; error?: string }) => void parameter. Called for every SSE event while the file is being processed, forwarding the server-side file.data patch to the caller.

src/lib/components/workspace/Knowledge/KnowledgeBase.svelte — polling, silent refresh, live progress

  • Polling loop: Reactive $: block monitors fileItems and manages a 15 s setInterval (pollingInterval). Starts when any file has status: pending | processing | uploading. Stops when none remain. Cleared in onDestroy to prevent memory leaks.
  • Silent refresh: getItemsPage gains a silent flag. When true (used by the polling interval), the current list is preserved during the fetch so files don't flash away every 15 s.
  • Live progress via SSE: Both upload paths (uploadFileHandler for local files, the URL processor) now pass an onProgress callback to uploadFile. The callback patches item.data in fileItems directly, so the tooltip reflects processing, task_id, and task_position in real time — before the polling interval even fires.
  • Search debounce guard: Added knowledgeId !== null to the reactive debounce block so the initial evaluation (before onMount sets knowledgeId) cannot schedule a non-silent refresh that blanks the list 300 ms after first load.
  • Null-safe {#key}: Changed {#key selectedFile.id} to {#key selectedFile?.id} to prevent a JS error when selectedFile is null.

src/lib/components/workspace/Knowledge/KnowledgeBase/Files.svelte — visual status states

Per-row reactive constants:

| Constant | Condition |
|---|---|
| isInFlight | file.status === 'uploading' OR file.data.status ∈ {pending, processing} |
| isFailed | file.data.status === 'failed' |
| statusTooltip | HTML from getStatusTooltip(file) |

Visual states:

| State | Icon | Row colour |
|---|---|---|
| uploading / pending / processing | <Spinner> | default |
| failed | <DocumentPage> | text-red-500 dark:text-red-400 |
| completed | <DocumentPage> | default |

HTML tooltip (in-flight and failed files, tippy.js + DOMPurify):

  • Bold status heading (red for failures)
  • "Started: N minutes ago" from data.started_at, or "Uploaded: …" from created_at for pending files
  • "Queue position: N" from data.task_position (Docling only)
  • "Task ID: a3f9c1b2…" — first 8 chars of data.task_id (Docling only; useful for cross-referencing docling-serve logs)
  • Error message in red from data.error (failed files only)

Click guard: In-flight files and placeholder items without a DB id return early on click, preventing attempts to open an unprocessed file.

Structured debug logging: console.log on file click now emits a structured object with id, name, status, started_at as ISO string, task_id, task_position, and error.


Files changed

| File | Nature of change |
|---|---|
| backend/open_webui/config.py | Add DOCLING_SERVE_TIMEOUT PersistentConfig |
| backend/open_webui/main.py | Import and register DOCLING_SERVE_TIMEOUT on app.state |
| backend/open_webui/retrieval/loaders/main.py | DoclingLoader async rewrite; timeout + status_callback params |
| backend/open_webui/retrieval/utils.py | Raise on non-list batch result instead of silently skipping |
| backend/open_webui/routers/retrieval.py | process_file: status/started_at, callback, delete-race guard; expose DOCLING_SERVE_TIMEOUT in config API |
| src/lib/apis/files/index.ts | onProgress callback on uploadFile |
| src/lib/components/workspace/Knowledge/KnowledgeBase.svelte | Polling, silent refresh, live SSE progress, debounce guard, null-safe {#key} |
| src/lib/components/workspace/Knowledge/KnowledgeBase/Files.svelte | Status tooltip, spinner/red states, click guard, structured logging |

What is NOT changed

  • No new API endpoints. Status data piggybacks on the existing data field of FileModelResponse.
  • No DB schema migrations.
  • No changes to other extraction engines (Tika, Marker, MinerU, Document Intelligence) — they benefit from started_at automatically but receive no callback.
  • No i18n strings added (tooltip content is English; can be wrapped in $i18n.t() in a follow-up if desired).
  • The delete endpoint (remove_file_from_knowledge_by_id) is unchanged.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-20 06:42:59 -05:00
Reference: github-starred/open-webui#26794