mirror of
https://github.com/open-webui/open-webui.git
synced 2026-06-04 07:47:12 -05:00
[GH-ISSUE #19264] issue: Uploaded file hash remains in database even when OCR fails, causing false duplicate detection #73431
Originally created by @flefevre on GitHub (Nov 18, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/19264
Installation Method
Docker
Open WebUI Version
v0.6.36
Ollama Version (if applicable)
No response
Operating System
Ubuntu 22.04
Browser (if applicable)
No response
Expected Behavior
If OCR fails, the file should either:
Not be stored at all, or
Be stored with a clear "failed OCR" status that doesn't block future uploads
Actual Behavior
When an uploaded file fails OCR (the extraction returns no text), its hash is still stored in the database. This creates an inconsistency where:
The file does not appear in the knowledge base listing (as expected, since OCR failed)
However, subsequent upload attempts for the same file incorrectly report a "duplicate file" error
Steps to Reproduce
Upload a file that fails OCR (e.g., a corrupted file or one of an unsupported type)
Observe that the file doesn't appear in the knowledge base listing
Attempt to upload the same file again
Notice the system reports it as a duplicate, even though it wasn't properly added
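The steps above can be modeled with a small in-memory sketch. This is not Open WebUI's actual code: `NaiveStore`, `failing_ocr`, and `DuplicateFileError` are hypothetical names, and the assumption (taken from the behavior described in this report) is that the hash is recorded before text extraction is known to have succeeded.

```python
import hashlib


class DuplicateFileError(Exception):
    """Raised when an upload's content hash is already recorded."""


class NaiveStore:
    """Hypothetical store that records the hash before processing succeeds."""

    def __init__(self):
        self.hashes = set()   # stands in for the stored file hashes
        self.listing = []     # stands in for the knowledge base listing

    def upload(self, name, data, extract_text):
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.hashes:
            raise DuplicateFileError(name)
        self.hashes.add(digest)       # hash is persisted first
        text = extract_text(data)
        if not text.strip():
            return False              # OCR failed, but the hash stays behind
        self.listing.append(name)
        return True


def failing_ocr(data):
    """Hypothetical extractor that always returns no text."""
    return ""


store = NaiveStore()
payload = b"\x00corrupted"

first = store.upload("scan.pdf", payload, failing_ocr)
try:
    store.upload("scan.pdf", payload, failing_ocr)
    second = "accepted"
except DuplicateFileError:
    second = "duplicate"
```

After the first call the listing is still empty, yet the second call for the same bytes is rejected as a duplicate, which matches the inconsistency reported here.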
Logs & Screenshots
Current Behavior:
The file hash is stored in the database
The knowledge base listing doesn't show the file
Subsequent uploads are blocked by the duplicate detection system
Additional Context:
This appears to be a data consistency issue where the database state doesn't match the UI state. The duplicate detection system is working as intended (checking hashes), but the initial upload process isn't properly handling failed OCR cases.
Additional Information
Possible Solutions:
Add a flag in the database to mark files with failed OCR
Modify the upload process to only store hashes for successfully processed files
Update the duplicate detection to ignore files marked as failed
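A rough sketch of how the first and third ideas could fit together, assuming a per-file status flag. All names here (`Status`, `FileRecord`, `Store`, `DuplicateFileError`) are hypothetical and do not reflect Open WebUI's schema or API; the in-memory dict only stands in for the database.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PROCESSED = "processed"
    FAILED_OCR = "failed_ocr"


@dataclass
class FileRecord:
    name: str
    status: Status


class DuplicateFileError(Exception):
    """Raised only when a successfully processed copy already exists."""


class Store:
    """Hypothetical store keyed by content hash, with a status per record."""

    def __init__(self):
        self.records = {}  # sha256 hex digest -> FileRecord

    def upload(self, name, data, extract_text):
        digest = hashlib.sha256(data).hexdigest()
        existing = self.records.get(digest)
        # Only a processed file counts as a duplicate; failed ones may be retried.
        if existing is not None and existing.status is Status.PROCESSED:
            raise DuplicateFileError(name)
        text = extract_text(data)
        status = Status.PROCESSED if text.strip() else Status.FAILED_OCR
        self.records[digest] = FileRecord(name, status)
        return status

    def listing(self):
        # The listing shows only files that were processed successfully.
        return [r.name for r in self.records.values() if r.status is Status.PROCESSED]


store = Store()
payload = b"scanned page"

r1 = store.upload("scan.pdf", payload, lambda d: "")            # OCR fails
r2 = store.upload("scan.pdf", payload, lambda d: "some text")   # retry succeeds
try:
    store.upload("scan.pdf", payload, lambda d: "some text")
    r3 = "accepted"
except DuplicateFileError:
    r3 = "duplicate"
```

With this shape, a failed upload leaves a visible `FAILED_OCR` record without blocking a retry, and duplicate detection still rejects a file that was already processed. The second idea (not storing anything on failure) would be simpler but loses the record of the failed attempt.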
Severity: Medium (affects user experience but doesn't cause data loss)
@tjbck commented on GitHub (Nov 18, 2025):
This report is missing a few key details, please share your entire document settings as a screenshot.
@flefevre commented on GitHub (Nov 19, 2025):
Here is a screenshot of the admin dashboard "Documents" settings.
Do you need more information?
We are using Milvus as the vector database.
@Classic298 commented on GitHub (Jan 9, 2026):
Fixed in dev.
@flefevre commented on GitHub (Jan 9, 2026):
Thanks for the work done