[PR #11634] [CLOSED] fix: Silent failures to load PDF documents. Replace pypdf with pymupdf for improved compatibility. #9564

GiteaMirror · 2025-11-11T18:25:47-06:00

GiteaMirror commented

2025-11-11 18:25:47 -06:00

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/11634
Author: @EliasOenal
Created: 3/13/2025
Status: ❌ Closed

Base: dev ← Head: main

📝 Commits (1)

cb6cb45 fix: Replace pypdf with pymupdf for improved compatibility.

📊 Changes

2 files changed (+3 additions, -3 deletions)

View changed files

📝 backend/open_webui/retrieval/loaders/main.py (+2 -2)
📝 backend/requirements.txt (+1 -1)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

[X] Target branch: Please verify that the pull request targets the dev branch.
[X] Description: Provide a concise description of the changes made in this pull request.
[X] Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
[-] Documentation: Have you updated relevant documentation Open WebUI Docs, or other documentation sources?
[-] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
[ ] Testing: Have you written and run sufficient tests for validating the changes?
[X] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
[X] Prefix: To clearly categorize this pull request, prefix the pull request title, using one of the following:
- BREAKING CHANGE: Significant changes that may affect compatibility
- build: Changes that affect the build system or external dependencies
- ci: Changes to our continuous integration processes or workflows
- chore: Refactor, cleanup, or other non-functional code changes
- docs: Documentation update or addition
- feat: Introduces a new feature or enhancement to the codebase
- fix: Bug fix or error correction
- i18n: Internationalization or localization changes
- perf: Performance improvement
- refactor: Code restructuring for better maintainability, readability, or scalability
- style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc.)
- test: Adding missing tests or correcting existing tests
- WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

The default PyPDFLoader silently fails to load many PDF documents, as discussed in #11171, #4458, and #6929. In each of these discussions, replacing it with PyMuPDFLoader was recommended as a solution. This PR implements that change, which significantly improves parsing across a wide variety of PDF documents. The discussions also mention various PDF documents that can be used to verify the fix.
During my tests I identified another, unrelated PDF parsing bug that is triggered when OCR is enabled on PDFs that embed grayscale images. It is a bug in langchain, also mentioned in #11171 and #4458, and it produces the message: error: "Cannot handle this data type: (1, 1, 1), |u1". While it is not directly related, I have opened a PR with langchain to further complement Open WebUI's PDF parsing: https://github.com/langchain-ai/langchain/pull/30261

Added

PyMuPDFLoader for robust PDF parsing of a much broader range of PDF documents.

Changed

Replaced PyPDFLoader with PyMuPDFLoader.
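
As a rough illustration of the swap (a sketch, not the PR's actual diff), the snippet below resolves the PDF loader class by name and imports it lazily. The names `LOADER_MODULE`, `PDF_LOADER_CLASS`, `get_pdf_loader_class`, and `load_pdf` are hypothetical. It assumes `langchain_community` and `pymupdf` are installed, and that `PyMuPDFLoader` accepts `extract_images` the same way `PyPDFLoader` does.

```python
import importlib

# Module in langchain-community that exports both PDF loaders.
LOADER_MODULE = "langchain_community.document_loaders"

# Hypothetical switch: before the change this would have been "PyPDFLoader".
PDF_LOADER_CLASS = "PyMuPDFLoader"


def get_pdf_loader_class(name=PDF_LOADER_CLASS):
    """Import the loader class lazily so this module loads without langchain."""
    module = importlib.import_module(LOADER_MODULE)
    return getattr(module, name)


def load_pdf(file_path, extract_images=False):
    """Return the list of Documents parsed from file_path."""
    loader_cls = get_pdf_loader_class()
    loader = loader_cls(file_path, extract_images=extract_images)
    return loader.load()
```

Calling `load_pdf("example.pdf")` (the path is a placeholder) would return one document per page if the loader behaves as assumed. Keeping the class name in a single constant makes it easy to switch back for comparison.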

Deprecated

none

Removed

PyPDFLoader

Fixed

Fixed silent failures to parse uploaded PDF documents (#11171, #4458, and #6929).
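
One way to check the fix against the sample PDFs from the linked discussions is to make an empty extraction loud instead of silent. The sketch below assumes that a silent failure shows up as pages with no extracted text, which the PR does not state explicitly. The helper names `non_empty_pages` and `check_extraction` are hypothetical.

```python
def non_empty_pages(page_texts):
    """Return indices of pages whose extracted text is not just whitespace."""
    return [i for i, text in enumerate(page_texts) if text and text.strip()]


def check_extraction(page_texts):
    """Raise instead of silently accepting a PDF that produced no text.

    Returns the number of pages that contain text.
    """
    pages_with_text = non_empty_pages(page_texts)
    if not pages_with_text:
        raise ValueError("PDF parsed without error but yielded no text")
    return len(pages_with_text)
```

With langchain loaders, the input would typically be built as `[doc.page_content for doc in loader.load()]`, so the same check can be run against both PyPDFLoader and PyMuPDFLoader on a given file.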

Security

none

Breaking Changes

none

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-11 18:25:47 -06:00

GiteaMirror closed this issue

2025-11-11 18:25:47 -06:00

GiteaMirror referenced this issue

2026-04-19 21:44:22 -05:00

[GH-ISSUE #9564] Model icons and deactivation in settings not updated in chat model dropdown. #15561

GiteaMirror referenced this issue

2026-04-25 05:09:57 -05:00

[GH-ISSUE #9564] Model icons and deactivation in settings not updated in chat model dropdown. #31089

GiteaMirror referenced this issue

2026-05-05 15:59:23 -05:00

[GH-ISSUE #9564] Model icons and deactivation in settings not updated in chat model dropdown. #54226