mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 19:08:59 -05:00
[PR #13085] [MERGED] fix: pass extractInlineImages header to Tika if PDF_EXTRACT_IMAGES is true #46142
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/13085
Author: @ayan4m1
Created: 4/20/2025
Status: ✅ Merged
Merged: 5/2/2025
Merged by: @tjbck
Base: dev ← Head: fix/tika-image-ocr
📝 Commits (1)
039dec6 fix: pass header to Tika if PDF_EXTRACT_IMAGES is true
📊 Changes
1 file changed (+3 additions, -0 deletions)
📝 backend/open_webui/retrieval/loaders/main.py (+3 -0)
📄 Description
Pull Request Checklist
Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.
Before submitting, make sure you've checked the following:
dev branch.
Changelog Entry
Description
When Tika processes PDFs using the Docker tika:latest-full image, inline images are not OCR'd (or at least not always).
To fix this, we can set an optional Tika HTTP header that forces processing the inline images using Tesseract. Because it will slow things down, I gated it behind PDF_EXTRACT_IMAGES. This fixes the issue where inline images were not being OCR'd even if PDF_EXTRACT_IMAGES was set to True.
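The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `build_tika_headers` and its parameters are hypothetical, and the header name `X-Tika-PDFextractInlineImages` is assumed from the PR title and Tika server's `X-Tika-PDF*` header convention.

```python
from typing import Dict, Optional


def build_tika_headers(
    mime_type: Optional[str] = None,
    extract_images: bool = False,
) -> Dict[str, str]:
    """Build request headers for a Tika server call (hypothetical helper).

    `extract_images` stands in for the PDF_EXTRACT_IMAGES setting.
    """
    headers: Dict[str, str] = {}
    if mime_type:
        headers["Content-Type"] = mime_type

    # Only ask Tika to extract (and OCR) inline images when the setting is on,
    # because inline image processing slows PDF extraction down.
    # Header name is an assumption; check it against your Tika version.
    if extract_images:
        headers["X-Tika-PDFextractInlineImages"] = "true"

    return headers


if __name__ == "__main__":
    print(build_tika_headers("application/pdf", extract_images=True))
    print(build_tika_headers("application/pdf", extract_images=False))
```

The resulting dictionary would then be passed as the `headers` argument of whatever HTTP request sends the file to the Tika server. With the setting off, the request is unchanged from before, so existing deployments keep their current performance.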
Added
N/A
Changed
N/A
Deprecated
N/A
Removed
N/A
Fixed
Security
N/A
Breaking Changes
N/A
Additional Information
This can be seen in #11377 and I can personally reproduce it using semiconductor datasheets with mixed text/images containing text.
Screenshots or Videos
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the CONTRIBUTOR_LICENSE_AGREEMENT, and I am providing my contributions under its terms.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.