mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-06 19:08:59 -05:00
[PR #23874] [CLOSED] fix: extract text from PDF URLs in fetch_url tool #43050
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/23874
Author: @gaurav0107
Created: 4/19/2026
Status: ❌ Closed
Base: dev ← Head: fix/fetch-url-pdf-handling
📝 Commits (1)
9d3d80b fix: extract text from PDF URLs in fetch_url tool
📊 Changes
3 files changed (+485 additions, -2 deletions)
📝 backend/open_webui/retrieval/utils.py (+109 -2)
📝 backend/open_webui/retrieval/web/utils.py (+33 -0)
➕ backend/tests/retrieval/test_pdf_handling.py (+343 -0)
📄 Description
Pull Request Checklist
- dev branch.
- pypdf is already pinned in requirements.txt (pypdf==6.7.5).
- dev.
Changelog Entry
Description
When fetch_url fetches a PDF URL, the content is returned as garbled binary text because BeautifulSoup's HTML parser corrupts the raw PDF bytes. This PR adds PDF-aware text extraction so that PDF URLs return clean, readable text.
Closes #23841
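The detection signals and the pypdf extraction step described in this PR can be sketched roughly as follows. This is an illustration only, not the code from the PR: the helper names `looks_like_pdf` and `extract_text` are hypothetical, and the order in which the three signals are checked is an assumption. Only the signals themselves (the `Content-Type: application/pdf` header, a case-insensitive `.pdf` extension, and the `%PDF` leading bytes) and the use of pypdf come from the description.

```python
import io
from typing import Optional
from urllib.parse import urlparse

PDF_MAGIC = b"%PDF"  # leading bytes of a PDF file


def looks_like_pdf(url: str, content_type: Optional[str], body: bytes) -> bool:
    """Hypothetical helper: return True if any of the three signals matches."""
    # 1. Content-Type header, ignoring parameters such as "; charset=...".
    if content_type and content_type.split(";")[0].strip().lower() == "application/pdf":
        return True
    # 2. URL path ending in ".pdf", case-insensitive (query string is ignored).
    if urlparse(url).path.lower().endswith(".pdf"):
        return True
    # 3. Fallback: the body starts with the %PDF magic bytes.
    return body.startswith(PDF_MAGIC)


def extract_text(data: bytes) -> str:
    """Hypothetical helper: extract text from raw PDF bytes with pypdf."""
    # Imported lazily so the detection helper works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(data))
    # extract_text() can yield an empty result for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)
```

A caller would check `looks_like_pdf(...)` first and pass the response bytes to `extract_text(...)` only on a match, leaving every other response on the existing HTML path.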
Added
- extract_text_from_pdf_bytes(): extracts text from raw PDF bytes using pypdf
- extract_pdf_from_url(): downloads a PDF and extracts text with SSRF protection, streaming size limits, and proper session management
- .pdf (case-insensitive)
- Content-Type: application/pdf header detection in async SafeWebBaseLoader._fetch()
- %PDF binary content fallback in get_content_from_url()
- backend/tests/retrieval/test_pdf_handling.py
Changed
- get_content_from_url() now detects and properly handles PDF content instead of piping binary through BeautifulSoup
- get_content_from_url() -> tuple[str, list[Document]]
Deprecated
Removed
Fixed
- fetch_url tool now returns extracted text instead of garbled binary when fetching PDF URLs (#23841)
Security
- validate_url() before any PDF download
- stream=True) with Content-Length header pre-check to reject oversized PDFs before buffering
- requests.Session used as context manager to prevent connection leaks
Breaking Changes
Additional Information
- pypdf is already a project dependency (requirements.txt: pypdf==6.7.5), used by the existing PyPDFLoader
- get_content_from_url → loader.load()) and async (SafeWebBaseLoader._fetch()) code paths
- PYTHONPATH=backend pytest backend/tests/retrieval/test_pdf_handling.py -v
Screenshots or Videos
N/A — backend-only change, no UI impact.
Contributor License Agreement
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.