mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 19:08:59 -05:00
[GH-ISSUE #7338] enh: proposal to address scanned PDF handling in OpenWebUI RAG #53376
Originally created by @kukjun on GitHub (Nov 25, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/7338
Problem
I recently encountered an issue when uploading scanned PDF files, which resulted in the following error message:
Upon investigating, I found that PyPDF cannot extract text from image-based PDFs, which lack an embedded text layer. This limitation prevents scanned PDFs from being used with the OpenWebUI RAG (Retrieval-Augmented Generation) feature.
To address this, I propose integrating OCR (Optical Character Recognition) libraries such as EasyOCR or Tesseract. The solution would involve the following approach:
1. PyPDF attempts to process the PDF as usual.
2. If PyPDF fails to extract text (indicating an image-based PDF), the system then uses an OCR library to extract text from the embedded images.
This fallback mechanism would ensure that scanned PDFs can also be utilized within OpenWebUI, significantly enhancing its functionality.
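To make the proposal concrete, here is a minimal sketch of the fallback, not a finished implementation. The function names (`has_text`, `extract_with_fallback`, `pypdf_extract`, `ocr_extract`) and the `min_chars` threshold are hypothetical and are not part of OpenWebUI. The OCR path assumes Tesseract through `pytesseract`, and it renders whole pages with `pdf2image` (which needs Poppler installed) rather than pulling out the embedded images individually, which is a simplification of the approach described above.

```python
from typing import Callable, List

PageTexts = List[str]
Extractor = Callable[[str], PageTexts]


def has_text(pages: PageTexts, min_chars: int = 1) -> bool:
    """Return True if the pages hold at least min_chars non-whitespace characters."""
    total = sum(len("".join(page.split())) for page in pages)
    return total >= min_chars


def extract_with_fallback(
    path: str,
    primary: Extractor,
    fallback: Extractor,
    min_chars: int = 1,  # hypothetical threshold, would need tuning
) -> PageTexts:
    """Try the primary extractor; run the fallback only if it yields no usable text."""
    pages = primary(path)
    if has_text(pages, min_chars):
        return pages
    return fallback(path)


def pypdf_extract(path: str) -> PageTexts:
    """Text-layer extraction with pypdf (imported lazily so the sketch loads without it)."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def ocr_extract(path: str) -> PageTexts:
    """OCR extraction: render each page to an image, then run Tesseract on it."""
    from pdf2image import convert_from_path
    import pytesseract

    return [pytesseract.image_to_string(image) for image in convert_from_path(path)]
```

With those pieces, a loader could call `extract_with_fallback(path, pypdf_extract, ocr_extract)`. Passing the extractors in as arguments keeps the OCR dependency optional and would let EasyOCR be swapped in for Tesseract without touching the fallback logic. The sketch only treats empty output as a failure; exceptions raised by PyPDF would need separate handling.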
Before implementing and submitting a PR, I would like to confirm whether this approach aligns with the direction of OpenWebUI's development and if there are alternative suggestions for handling scanned PDFs.
I look forward to your feedback and insights! ☺️
If you agree with this approach, I will proceed to implement the solution and submit a PR accordingly.