[GH-ISSUE #19264] issue: Uploaded file hash remains in database even when OCR fails, causing false duplicate detection #122138

Closed
opened 2026-05-21 00:36:13 -05:00 by GiteaMirror · 4 comments

Originally created by @flefevre on GitHub (Nov 18, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/19264

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

v0.6.36

Ollama Version (if applicable)

No response

Operating System

Ubuntu 22.04

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
      • Start with the initial platform/version/OS and dependencies used,
      • Specify exact install/launch/configure commands,
      • List URLs visited, user input (incl. example values/emails/passwords if needed),
      • Describe all options and toggles enabled or changed,
      • Include any files or environmental changes,
      • Identify the expected and actual result at each stage,
      • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

If OCR fails, the file should either:

  • Not be stored at all, or
  • Be stored with a clear "failed OCR" status that doesn't block future uploads

Actual Behavior

When a file uploaded to the platform fails OCR (returns no text), its hash is still stored in the database. This creates an inconsistency where:

  • The file does not appear in the knowledge base listing (as expected, since OCR failed)
  • However, subsequent upload attempts for the same file incorrectly report a "duplicate file" error

Steps to Reproduce

  • Upload a file that fails OCR (e.g., a corrupted or unsupported file type)
  • Observe that the file doesn't appear in the knowledge base listing
  • Attempt to upload the same file again
  • Notice the system reports it as a duplicate, even though it wasn't properly added

Logs & Screenshots

Current Behavior:

  • The file hash is stored in the database
  • The knowledge base listing doesn't show the file
  • Subsequent uploads are blocked by the duplicate detection system

Additional Context:
This appears to be a data consistency issue where the database state doesn't match the UI state. The duplicate detection system is working as intended (checking hashes), but the initial upload process isn't properly handling failed OCR cases.
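
The inconsistency described above can be modeled with a minimal sketch. Everything in it is hypothetical and is not Open WebUI code: the `NaiveStore` class, the `upload` method, the `failing_ocr` helper, and the choice of SHA-256 are assumptions made only to illustrate how recording the hash before text extraction finishes leaves an orphaned entry.

```python
import hashlib


class NaiveStore:
    """Hypothetical in-memory stand-in for the file table (not Open WebUI code)."""

    def __init__(self):
        self.hashes = set()   # every content hash ever recorded
        self.listing = []     # files visible in the knowledge base listing

    def upload(self, name, data, extract_text):
        # SHA-256 is an assumption; any content hash shows the same effect.
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.hashes:
            raise ValueError("duplicate file")
        self.hashes.add(digest)      # hash is recorded before processing
        text = extract_text(data)
        if not text:
            return False             # OCR failed, but the hash is never rolled back
        self.listing.append(name)
        return True


def failing_ocr(data):
    """Simulates an OCR step that returns no text."""
    return ""


store = NaiveStore()
first = store.upload("scan.pdf", b"unreadable bytes", failing_ocr)

try:
    store.upload("scan.pdf", b"unreadable bytes", failing_ocr)
    second = "accepted"
except ValueError as exc:
    second = str(exc)

print(first, store.listing, second)
```

In this model the first upload fails and the listing stays empty, yet the second attempt is rejected as a duplicate, which matches the reported symptom.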

Additional Information

Possible Solutions:

  • Add a flag in the database to mark files with failed OCR
  • Modify the upload process to only store hashes for successfully processed files
  • Update the duplicate detection to ignore files marked as failed

Severity: Medium (affects user experience but doesn't cause data loss)
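
The first and third suggestions above could be combined roughly as follows. This is a sketch under stated assumptions, not the fix that landed in dev: `Status`, `FileStore`, `is_duplicate`, and `upload` are hypothetical names, and SHA-256 is assumed as the content hash.

```python
import hashlib
from enum import Enum


class Status(Enum):
    COMPLETED = "completed"
    FAILED = "failed"


class FileStore:
    """Hypothetical store that keeps a processing status next to each hash."""

    def __init__(self):
        self.records = {}  # content hash -> Status

    def is_duplicate(self, digest):
        # Only successfully processed files count as duplicates.
        return self.records.get(digest) is Status.COMPLETED

    def upload(self, data, extract_text):
        digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
        if self.is_duplicate(digest):
            raise ValueError("duplicate file")
        text = extract_text(data)
        # A failed attempt is recorded as FAILED and may be overwritten by a retry.
        self.records[digest] = Status.COMPLETED if text else Status.FAILED
        return self.records[digest]


store = FileStore()
data = b"scanned page"

r1 = store.upload(data, lambda d: "")                 # OCR returns no text
r2 = store.upload(data, lambda d: "recognized text")  # retry is allowed

try:
    store.upload(data, lambda d: "recognized text")
    blocked = False
except ValueError:
    blocked = True                                    # real duplicate is rejected

print(r1, r2, blocked)
```

Keeping the failed record, rather than deleting it, preserves a trace of the failed attempt for the UI while still letting the user retry the same file.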

GiteaMirror added the bug label 2026-05-21 00:36:13 -05:00

@tjbck commented on GitHub (Nov 18, 2025):

This report is missing a few key details, please share your entire document settings as a screenshot.


@flefevre commented on GitHub (Nov 19, 2025):

Here is a screenshot of the admin dashboard for "document":

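[Image: https://github.com/user-attachments/assets/4a2b784d-8a12-47ba-b145-3b31164b434c]
[Image: https://github.com/user-attachments/assets/0f2ff484-5332-44f9-8b3f-eb4d4a780874]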

Do you need more information?
We are using Milvus as the vector database:

  - name: MILVUS_INDEX_TYPE
    value: 'HNSW'
  - name: MILVUS_METRIC_TYPE
    value: 'COSINE'
  - name: MILVUS_HNSW_M
    value: '16'
  - name: MILVUS_HNSW_EFCONSTRUCTION
    value: '100'

@Classic298 commented on GitHub (Jan 9, 2026):

fixed in dev


@flefevre commented on GitHub (Jan 9, 2026):

Thanks for the work done

Reference: github-starred/open-webui#122138