[GH-ISSUE #7338] enh: proposal to address scanned PDF handling in OpenWebUI RAG #53376

Closed
opened 2026-05-05 14:40:24 -05:00 by GiteaMirror · 0 comments

Originally created by @kukjun on GitHub (Nov 25, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/7338

Problem

I recently encountered an issue when uploading scanned PDF files, which resulted in the following error message:

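[image: screenshot of the error message, https://github.com/user-attachments/assets/5f94678c-5134-488c-b74b-a293c78d567b]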

Upon investigating, I found that PyPDF cannot extract text from image-based PDFs, because they lack an embedded text layer. This limitation prevents scanned PDFs from being used with the OpenWebUI RAG (Retrieval-Augmented Generation) feature.

To address this, I propose integrating OCR (Optical Character Recognition) libraries such as EasyOCR or Tesseract. The solution would involve the following approach:

1. First, PyPDF attempts to process the PDF as usual.
2. If PyPDF fails to extract text (indicating an image-based PDF), the system then uses an OCR library to extract text from the embedded images.

This fallback mechanism would ensure that scanned PDFs can also be utilized within OpenWebUI, significantly enhancing its functionality.
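
The fallback described above could be sketched roughly as follows. This is only an illustration under some assumptions, not the actual OpenWebUI loader: the function names (`extract_text_pypdf`, `extract_text_ocr`, `extract_with_fallback`) and the `min_chars` threshold are hypothetical, "fails to extract text" is interpreted as "returns only whitespace", and the OCR step renders whole pages with pdf2image and reads them with pytesseract rather than pulling out the embedded images individually. pdf2image needs Poppler and pytesseract needs the Tesseract binary to be installed.

```python
from typing import Callable

# A text extractor takes a file path and returns the extracted text.
TextExtractor = Callable[[str], str]


def extract_text_pypdf(path: str) -> str:
    """Read the embedded text layer with pypdf.

    For image-only (scanned) PDFs this typically yields an empty string.
    """
    from pypdf import PdfReader  # third-party; imported lazily

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text_ocr(path: str) -> str:
    """Render every page to an image and run Tesseract OCR on it.

    Assumption: pdf2image + pytesseract are used here for illustration;
    EasyOCR or another engine could be swapped in.
    """
    from pdf2image import convert_from_path  # third-party; needs Poppler
    import pytesseract  # third-party; needs the Tesseract binary

    page_images = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(img) for img in page_images)


def extract_with_fallback(
    path: str,
    primary: TextExtractor = extract_text_pypdf,
    fallback: TextExtractor = extract_text_ocr,
    min_chars: int = 1,  # hypothetical threshold for "text was found"
) -> str:
    """Try the primary extractor first; use the fallback if it finds no text."""
    text = primary(path)
    if len(text.strip()) >= min_chars:
        return text
    return fallback(path)
```

Calling `extract_with_fallback("scan.pdf")` would run OCR only when PyPDF returns no usable text, so PDFs that already have a text layer keep the current, faster path. Passing the extractors as parameters is a design choice that would let the OCR engine be made configurable.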

Before implementing and submitting a PR, I would like to confirm whether this approach aligns with the direction of OpenWebUI's development and if there are alternative suggestions for handling scanned PDFs.

If you agree with this approach, I will implement the solution and submit a PR.

I look forward to your feedback and insights! ☺️

Reference: github-starred/open-webui#53376