mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 10:58:17 -05:00
[PR #11240] [CLOSED] PR: fix PDF loader when using default Content Extraction Engine and OCR #45743
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/11240
Author: @rgaricano
Created: 3/5/2025
Status: ❌ Closed
Base: dev ← Head: dev
📝 Commits (10+)
9f77c64 Fix PDF loader when using default CEE and OCR
4ec5710 refac: assets
aac5e07 Merge branch 'dev' of https://github.com/rgaricano/open-webui into dev
7af2a2c Update main.py
197fafa Merge branch 'open-webui:dev' into dev
f27451a Update main.py fix PDF loader when using default Content Extraction Engine and OCR
01b6b04 Update requirements.txt for fix PDF default engine+OCR
80ebd80 Update pyproject.toml
17f1d49 Merge branch 'open-webui:dev' into dev
daae293 Merge branch 'open-webui:dev' into dev
📊 Changes
4 files changed (+10 additions, -11 deletions)
📝 backend/open_webui/retrieval/loaders/main.py (+2 -2)
📝 backend/open_webui/utils/pdf_generator.py (+2 -2)
📝 backend/requirements.txt (+4 -4)
📝 pyproject.toml (+2 -3)
📄 Description
Pull Request
Fix PDF loader when using default Content Extraction Engine and OCR
Checklist
dev branch.
Changelog Entry
Fix PDF loader when using default CEE and OCR
Description
Fix the PDF loader when using the default Content Extraction Engine with OCR enabled.
This commit fixes errors caused by an incorrect dimension calculation when reshaping images in the LangChain class PyPDFLoader.
Calling UnstructuredPDFLoader instead of PyPDFLoader fixes the errors encountered when processing PDF files that contain images.
(I previously tried PyMuPDF, which is fast and works well, but it is AGPL-licensed, so it wasn't approved.)
In this case I use the unstructured library, which is already in use and is Apache-licensed.
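The loader swap described above can be sketched roughly as follows. This is a minimal illustration, not the actual diff from this PR: `load_pdf` and its `loader_factory` parameter are hypothetical names introduced here, and the only library call assumed is `UnstructuredPDFLoader(file_path)` from langchain-community, which needs the unstructured[pdf] extra installed at runtime.

```python
from typing import Any, Callable, List, Optional


def load_pdf(
    file_path: str,
    loader_factory: Optional[Callable[[str], Any]] = None,
) -> List[Any]:
    """Load a PDF with UnstructuredPDFLoader unless another factory is given.

    `loader_factory` is a hypothetical hook (not part of open-webui) that
    lets callers substitute a different loader class, e.g. in tests.
    """
    if loader_factory is None:
        # Imported lazily so this module can be imported even when
        # langchain-community / unstructured[pdf] are not installed.
        from langchain_community.document_loaders import UnstructuredPDFLoader

        loader_factory = UnstructuredPDFLoader

    loader = loader_factory(file_path)
    # LangChain document loaders expose load(), returning a list of documents.
    return loader.load()
```

Keeping the import inside the function means a missing optional dependency only surfaces when a PDF is actually loaded, which may make the dependency errors mentioned in this description easier to isolate.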
unstructured[pdf] is an optional extra with these requirements:
Provides-Extra: pdf
Requires-Dist: onnx ; extra == 'pdf'
Requires-Dist: pdf2image ; extra == 'pdf'
Requires-Dist: pdfminer.six ; extra == 'pdf'
Requires-Dist: pikepdf ; extra == 'pdf'
Requires-Dist: pi-heif ; extra == 'pdf'
Requires-Dist: pypdf ; extra == 'pdf'
Requires-Dist: google-cloud-vision ; extra == 'pdf'
Requires-Dist: effdet ; extra == 'pdf'
Requires-Dist: unstructured-inference (>=0.8.7) ; extra == 'pdf'
Requires-Dist: unstructured.pytesseract (>=0.3.12) ; extra == 'pdf'
As I don't know whether all unstructured extras will be installed, I added a specific entry to requirements.txt: unstructured[pdf]
Note: testing gave errors. After the first I ran pip install unstructured_inference, and after the next I ran pip install pdf2image.
Please check this; some of these dependencies may not be desired.
unstructured_inference requirements
Requires-Dist: python-multipart
Requires-Dist: huggingface-hub
Requires-Dist: numpy (<2)
Requires-Dist: opencv-python (!=4.7.0.68)
Requires-Dist: onnx
Requires-Dist: onnxruntime (>=1.17.0)
Requires-Dist: matplotlib
Requires-Dist: torch
Requires-Dist: timm
Requires-Dist: transformers (>=4.25.1)
Requires-Dist: rapidfuzz
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: pypdfium2
Requires-Dist: pdfminer-six (==20240706)
With these steps it seems to work.
You may need to do some testing to make sure all dependencies are installed correctly; if they are not, you may need to list those libraries explicitly.
The licenses I reviewed (pdfminer and pdf2image), other than unstructured's own, are MIT.
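One way to do the dependency testing suggested above is to ask the installed-package metadata which distributions are present. The sketch below uses only the standard library; `missing_distributions` and `PDF_EXTRA_SAMPLE` are hypothetical names, and the sample list is just a subset of the requirements quoted in this description, not a verified minimal set.

```python
from importlib import metadata
from typing import Iterable, List


def missing_distributions(names: Iterable[str]) -> List[str]:
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            # version() raises PackageNotFoundError when the distribution is absent.
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Illustrative subset of the unstructured[pdf] requirements listed above.
PDF_EXTRA_SAMPLE = ["pdf2image", "pdfminer.six", "unstructured-inference"]

if __name__ == "__main__":
    print("Missing:", missing_distributions(PDF_EXTRA_SAMPLE))
```

Any name this reports as missing is a candidate for an explicit line in requirements.txt.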
Added
Changed
Replaced the call to the langchain-community PyPDFLoader class with the langchain-community UnstructuredPDFLoader class
Deprecated
Removed
Fixed
Errors when uploading PDF files with the default content extraction engine (internal) and the OCR feature
Security
Breaking Changes
Additional Information
https://github.com/open-webui/open-webui/discussions/11171
https://github.com/open-webui/open-webui/discussions/4458
Screenshots or Videos
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.