[PR #22645] [CLOSED] fix+feat: Knowledge file processing — async Docling, live status UI, error visibility, delete-race guard #26794

GiteaMirror · 2026-04-20T06:42:58-05:00

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/22645
Author: @jannefleischer
Created: 3/13/2026
Status: ❌ Closed

Base: dev ← Head: feat-knowledgestatus

📝 Commits (1)

8ef7bd5 feat+fix: Fixed handling of long-running task when adding knowledge-files; added a tooltip to the spinner with statusses, where available (currently: only docling-serve)

📊 Changes

8 files changed (+333 additions, -34 deletions)

View changed files

📝 backend/open_webui/config.py (+6 -0)
📝 backend/open_webui/main.py (+2 -0)
📝 backend/open_webui/retrieval/loaders/main.py (+111 -17)
📝 backend/open_webui/retrieval/utils.py (+7 -2)
📝 backend/open_webui/routers/retrieval.py (+61 -0)
📝 src/lib/apis/files/index.ts (+6 -1)
📝 src/lib/components/workspace/Knowledge/KnowledgeBase.svelte (+60 -8)
📝 src/lib/components/workspace/Knowledge/KnowledgeBase/Files.svelte (+80 -6)

📄 Description

Summary

This PR bundles several tightly coupled improvements to the knowledge-base file pipeline. They share backend infrastructure and a common motivation: the UI had no awareness of what happened after upload, and two separate race conditions could cause files to silently disappear.

Addresses #22571 and #22573; several related issues discovered along the way are also fixed.

Issues addressed

#22571 — Silent file drop when Ollama model TTL expires during Docling processing (`IndexError`)

When Ollama unloads its embedding model (default TTL: 5 min) while Docling is still processing a large file, all parallel embedding batch requests receive 503 Service Unavailable. The unexpected response body caused an IndexError: list index out of range that crashed process_file() silently: the file disappeared from the UI with no error shown.

Fix in retrieval/utils.py: The batch-aggregation loop that previously silently skipped non-list results now raises a descriptive ValueError (including response type and a hint to check the log). This ensures the exception propagates to process_file(), which marks the file as failed with the actual error detail, making the failure visible in the UI (see UI changes below).

Note: PR #22589 separately fixes the generate_ollama_batch_embeddings function to re-raise instead of returning None. The change here is defense-in-depth for similar silencing patterns elsewhere in the batch pipeline.

#22573 — Files not linked to knowledge collection when dropped simultaneously (`FILE_NOT_PROCESSED` race condition)

When multiple files were drag-dropped simultaneously with a slow extraction backend (Docling, Marker, MinerU), the frontend called POST /knowledge/{id}/file/add immediately after the upload returned — before Docling had actually finished. The backend correctly rejected the call with 400 FILE_NOT_PROCESSED, the error was silently swallowed, and the file's embeddings (written later by the background task) were permanently orphaned with no knowledge_file row.

Root cause: The old DoclingLoader used the synchronous /v1/convert/file endpoint. With large queues, nginx would drop the connection after ~120 s with a gateway timeout, the SSE stream closed early, and the frontend called file/add on an unprocessed file.

Fix in loaders/main.py: DoclingLoader now uses the async /v1/convert/file/async endpoint. process_file() polls /v1/status/poll/{task_id} until the job completes (or DOCLING_SERVE_TIMEOUT is exceeded). The SSE stream from GET /files/{id}/process/status stays open for the entire polling loop, so uploadFile() on the frontend only resolves — and addFileHandler is only called — once the file is fully processed. The race window is closed.

Changes by component

`backend/open_webui/retrieval/loaders/main.py` — DoclingLoader async rewrite

Switched from POST /v1/convert/file (synchronous, drops on gateway timeout) to POST /v1/convert/file/async + GET /v1/status/poll/{task_id} + GET /v1/result/{task_id}.
Long-poll window: 30 s per request. A guard against servers that ignore ?wait= ensures we never poll faster than once per 30 s.
New timeout parameter: optional total seconds to wait before raising. Maps to the new DOCLING_SERVE_TIMEOUT config.
New status_callback parameter: optional callable(dict) invoked after submit (with task_id + initial task_position) and after each poll (with updated task_position). Used by process_file() to persist queue state into file.data for the UI tooltip.
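
The submit-then-poll flow described above can be sketched as a small loop. This is a minimal sketch, not the PR's actual code: `wait_for_task`, the injected `poll` callable (standing in for `GET /v1/status/poll/{task_id}`), the `clock`/`sleep` hooks, and the `task_status` values `"success"` and `"failure"` are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def wait_for_task(
    poll: Callable[[], dict],
    *,
    wait: float = 30.0,
    timeout: Optional[float] = None,
    status_callback: Optional[Callable[[dict], None]] = None,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the task reaches a terminal state or `timeout` seconds pass.

    `poll` is a hypothetical stand-in for one long-poll HTTP request.
    """
    started = clock()
    while True:
        poll_started = clock()
        status = poll()

        # Report queue state after every poll so the UI can show it.
        if status_callback is not None:
            status_callback({"task_position": status.get("task_position")})

        state = status.get("task_status")  # assumed field name and values
        if state == "success":
            return status
        if state == "failure":
            raise RuntimeError(f"conversion failed: {status}")

        # timeout=None means wait forever.
        if timeout is not None and clock() - started >= timeout:
            raise TimeoutError(f"task not finished after {timeout} s")

        # Guard: if the server answered early (ignored ?wait=), sleep out
        # the rest of the window so we never poll faster than once per `wait`.
        remaining = wait - (clock() - poll_started)
        if remaining > 0:
            sleep(remaining)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; in production the defaults (`time.monotonic`, `time.sleep`) would apply.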

`backend/open_webui/config.py` / `main.py` / `routers/retrieval.py` — `DOCLING_SERVE_TIMEOUT`

New PersistentConfig entry DOCLING_SERVE_TIMEOUT: integer seconds, default None (wait forever). Set via env var or Admin UI → Documents → Docling.
Exposed in GET /retrieval/config and POST /retrieval/config/update so it is readable/writable from the Admin Settings panel.
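
To illustrate the "integer seconds, default None" semantics, here is a hedged sketch of reading such a value from the environment. The helper name `read_optional_timeout` and the rule that non-positive values are rejected are assumptions; the PR's actual `PersistentConfig` wiring is not shown in this description.

```python
import os
from typing import Mapping, Optional


def read_optional_timeout(
    env: Mapping[str, str] = os.environ,
    name: str = "DOCLING_SERVE_TIMEOUT",
) -> Optional[int]:
    """Return the timeout in seconds, or None (wait forever) when unset."""
    raw = env.get(name, "").strip()
    if not raw:
        return None  # unset or empty: no overall deadline
    value = int(raw)  # raises ValueError on non-integer input
    if value <= 0:
        # Assumption for this sketch: a non-positive deadline is a config error.
        raise ValueError(f"{name} must be a positive integer, got {value}")
    return value
```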

`backend/open_webui/routers/retrieval.py` — `process_file` enhancements

Processing status + start time: Immediately before loader.load(), sets {"status": "processing", "started_at": <unix timestamp>} in file.data. Applies to all extraction engines — not just Docling.

Docling status callback: A closure _docling_status_callback(data) opens a fresh DB session and merges data into file.data. Passed to Loader() as DOCLING_STATUS_CALLBACK kwarg, forwarded by Loader._get_loader() to DoclingLoader. Other engines receive None and are unaffected.
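
The merge behaviour of the status callback can be sketched with a plain dictionary standing in for the files table. Everything here is illustrative: `make_status_callback` and the `store` mapping are hypothetical, and the real closure opens a fresh DB session instead of touching a dict. Skipping the update when the row is missing is also an assumption of this sketch.

```python
from typing import Callable, Dict


def make_status_callback(file_id: str, store: Dict[str, dict]) -> Callable[[dict], None]:
    """Build a callback that merges status patches into the file's `data`.

    `store` is a hypothetical in-memory stand-in for the files table.
    """

    def _callback(data: dict) -> None:
        row = store.get(file_id)
        if row is None:
            return  # row is gone (e.g. deleted meanwhile); nothing to update
        # Shallow merge: new keys are added, existing keys are overwritten.
        row["data"] = {**row.get("data", {}), **data}

    return _callback
```

Because the merge is shallow, a later patch carrying only `task_position` leaves `status`, `started_at`, and `task_id` intact.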

Delete-race guard: After save_docs_to_vector_db() returns True but before writing status: completed, checks via a fresh DB session whether the file row still exists. If the user deleted the file while Docling/embedding was running (a window that is now unbounded with the async polling approach):

Shared knowledge collection (form_data.collection_name set): calls VECTOR_DB_CLIENT.delete(filter={"file_id": ...}) to remove only this file's chunks.
Standalone private collection: calls VECTOR_DB_CLIENT.delete_collection(...).
Returns early with {"status": False, "reason": "file_deleted_during_processing"}.

A fresh session is used so the long-since-committed outer session's read cache doesn't mask the deletion.
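
The decision logic of the guard can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `finalize_file`, `RecordingVectorDB`, and the `file_exists` / `mark_completed` callables are hypothetical, and the collection names are passed in as parameters because the description does not spell them out.

```python
from typing import Callable, Optional


class RecordingVectorDB:
    """In-memory stand-in for the vector DB client; records cleanup calls."""

    def __init__(self) -> None:
        self.calls: list = []

    def delete(self, collection_name: str, filter: dict) -> None:
        self.calls.append(("delete", collection_name, filter))

    def delete_collection(self, collection_name: str) -> None:
        self.calls.append(("delete_collection", collection_name))


def finalize_file(
    file_id: str,
    shared_collection: Optional[str],
    private_collection: str,
    file_exists: Callable[[str], bool],
    vector_db: RecordingVectorDB,
    mark_completed: Callable[[str], None],
) -> dict:
    """Run after embeddings are saved, before marking the file completed."""
    # `file_exists` should query through a fresh session so a stale read
    # cache cannot hide a deletion that happened during processing.
    if not file_exists(file_id):
        if shared_collection:
            # Shared knowledge collection: remove only this file's chunks.
            vector_db.delete(collection_name=shared_collection, filter={"file_id": file_id})
        else:
            # Standalone private collection: drop it entirely.
            vector_db.delete_collection(collection_name=private_collection)
        return {"status": False, "reason": "file_deleted_during_processing"}

    mark_completed(file_id)
    return {"status": True}
```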

`backend/open_webui/retrieval/utils.py` — batch embedding error propagation

The batch-results aggregation loop now raises ValueError instead of silently skipping batches that returned a non-list value. Error message includes the actual return type and a pointer to the log lines above for the root cause.
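
A minimal sketch of the raise-instead-of-skip pattern, assuming each batch result is either a list of embedding vectors or some unexpected value such as `None`. The function name `aggregate_batches` and the exact message wording are hypothetical.

```python
from typing import Any, List


def aggregate_batches(batch_results: List[Any]) -> List[List[float]]:
    """Flatten per-batch embedding lists, failing loudly on a bad batch."""
    embeddings: List[List[float]] = []
    for index, result in enumerate(batch_results):
        if not isinstance(result, list):
            # Previously such batches were skipped silently; raising lets the
            # caller mark the file as failed with a meaningful error.
            raise ValueError(
                f"Embedding batch {index} returned {type(result).__name__} "
                "instead of a list; see the preceding log lines for the root cause."
            )
        embeddings.extend(result)
    return embeddings
```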

`src/lib/apis/files/index.ts` — `uploadFile` progress callback

Added optional onProgress?: (data: { status: string; error?: string }) => void parameter. Called for every SSE event while the file is being processed, forwarding the server-side file.data patch to the caller.

`src/lib/components/workspace/Knowledge/KnowledgeBase.svelte` — polling, silent refresh, live progress

Polling loop: Reactive $: block monitors fileItems and manages a 15 s setInterval (pollingInterval). Starts when any file has status: pending | processing | uploading. Stops when none remain. Cleared in onDestroy to prevent memory leaks.
Silent refresh: getItemsPage gains a silent flag. When true (used by the polling interval), the current list is preserved during the fetch so files don't flash away every 15 s.
Live progress via SSE: Both upload paths (uploadFileHandler for local files, the URL processor) now pass an onProgress callback to uploadFile. The callback patches item.data in fileItems directly, so the tooltip reflects processing, task_id, and task_position in real time — before the polling interval even fires.
Search debounce guard: Added knowledgeId !== null to the reactive debounce block so the initial evaluation (before onMount sets knowledgeId) cannot schedule a non-silent refresh that blanks the list 300 ms after first load.
Null-safe {#key}: Changed {#key selectedFile.id} to {#key selectedFile?.id} to prevent a JS error when selectedFile is null.

`src/lib/components/workspace/Knowledge/KnowledgeBase/Files.svelte` — visual status states

Per-row reactive constants:

| Constant | Condition |
|---|---|
| `isInFlight` | `file.status === 'uploading'` OR `file.data.status` ∈ `{pending, processing}` |
| `isFailed` | `file.data.status === 'failed'` |
| `statusTooltip` | HTML from `getStatusTooltip(file)` |

Visual states:

| State | Icon | Row colour |
|---|---|---|
| `uploading / pending / processing` | `<Spinner>` | default |
| `failed` | `<DocumentPage>` | `text-red-500 dark:text-red-400` |
| `completed` | `<DocumentPage>` | default |

HTML tooltip (in-flight and failed files, tippy.js + DOMPurify):

Bold status heading (red for failures)
"Started: N minutes ago" from data.started_at, or "Uploaded: …" from created_at for pending files
"Queue position: N" from data.task_position (Docling only)
"Task ID: a3f9c1b2…" — first 8 chars of data.task_id (Docling only; useful for cross-referencing docling-serve logs)
Error message in red from data.error (failed files only)

Click guard: In-flight files and placeholder items without a DB id return early on click, preventing attempts to open an unprocessed file.

Structured debug logging: console.log on file click now emits a structured object with id, name, status, started_at as ISO string, task_id, task_position, and error.

Files changed

| File | Nature of change |
|---|---|
| `backend/open_webui/config.py` | Add `DOCLING_SERVE_TIMEOUT` PersistentConfig |
| `backend/open_webui/main.py` | Import and register `DOCLING_SERVE_TIMEOUT` on `app.state` |
| `backend/open_webui/retrieval/loaders/main.py` | DoclingLoader async rewrite; `timeout` + `status_callback` params |
| `backend/open_webui/retrieval/utils.py` | Raise on non-list batch result instead of silently skipping |
| `backend/open_webui/routers/retrieval.py` | `process_file`: status/started_at, callback, delete-race guard; expose `DOCLING_SERVE_TIMEOUT` in config API |
| `src/lib/apis/files/index.ts` | `onProgress` callback on `uploadFile` |
| `src/lib/components/workspace/Knowledge/KnowledgeBase.svelte` | Polling, silent refresh, live SSE progress, debounce guard, null-safe `{#key}` |
| `src/lib/components/workspace/Knowledge/KnowledgeBase/Files.svelte` | Status tooltip, spinner/red states, click guard, structured logging |

What is NOT changed

No new API endpoints. Status data piggybacks on the existing data field of FileModelResponse.
No DB schema migrations.
No changes to other extraction engines (Tika, Marker, MinerU, Document Intelligence) — they benefit from started_at automatically but receive no callback.
No i18n strings added (tooltip content is English; can be wrapped in $i18n.t() in a follow-up if desired).
The delete endpoint (remove_file_from_knowledge_by_id) is unchanged.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


GiteaMirror added the pull-request label 2026-04-20 06:42:59 -05:00

GiteaMirror closed this issue

2026-04-20 06:42:59 -05:00


Reference: github-starred/open-webui#26794
