Mirror of https://github.com/open-webui/open-webui.git (synced 2026-05-07 03:18:23 -05:00)
[PR #23869] [CLOSED] fix: extract text from PDF URLs in fetch_url tool #50465
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/23869
Author: @gaurav0107
Created: 4/18/2026
Status: ❌ Closed
Base: main ← Head: fix/fetch-url-pdf-handling

📝 Commits (1)
1c22f06 fix: extract text from PDF URLs in fetch_url tool

📊 Changes
3 files changed (+440 additions, -2 deletions)
📝 backend/open_webui/retrieval/utils.py (+101 -2)
📝 backend/open_webui/retrieval/web/utils.py (+26 -0)
➕ backend/tests/retrieval/test_pdf_handling.py (+313 -0)

📄 Description
Summary

Fixes #23841. When `fetch_url` fetches a PDF URL, the content is returned as garbled binary text because BeautifulSoup's HTML parser corrupts the raw PDF bytes.

This PR adds PDF-aware text extraction with three detection layers:

- URLs ending in `.pdf` (case-insensitive) bypass the web loader entirely, and their text is extracted directly with `pypdf`
- `SafeWebBaseLoader._fetch()` checks for the `Content-Type: application/pdf` header to catch dynamic download URLs
- Content that starts with `%PDF` is re-downloaded and extracted with `pypdf`

Security & resource controls

- `validate_url()` is called before any download
- Streamed download (`stream=True`) with a `Content-Length` header pre-check to reject oversized PDFs before buffering
- `requests.Session` used as a context manager (no connection leaks)

No new dependencies

`pypdf` is already pinned in `requirements.txt` (`pypdf==6.7.5`) and used by the existing `PyPDFLoader`.

Test plan

- `.pdf` URL fast-path routing (case-insensitive)
- `%PDF` binary content fallback detection

Run tests:

    PYTHONPATH=backend pytest backend/tests/retrieval/test_pdf_handling.py -v

🤖 Generated with Claude Code
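
For readers without access to the diff, here is a minimal sketch of how the three detection layers and the size pre-check described above could look. It is not the PR's code: every function name, the `MAX_PDF_BYTES` value, and the choice to keep counting bytes while buffering are assumptions made for illustration.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse

PDF_MAGIC = b"%PDF"
# Hypothetical cap: the PR description does not state the actual limit.
MAX_PDF_BYTES = 50 * 1024 * 1024


def url_has_pdf_extension(url: str) -> bool:
    """Layer 1: the URL path ends in .pdf, compared case-insensitively."""
    return urlparse(url).path.lower().endswith(".pdf")


def content_type_is_pdf(content_type: Optional[str]) -> bool:
    """Layer 2: the Content-Type header names application/pdf (parameters ignored)."""
    if not content_type:
        return False
    return content_type.split(";", 1)[0].strip().lower() == "application/pdf"


def body_starts_with_pdf_magic(body: bytes) -> bool:
    """Layer 3: the payload begins with the %PDF signature."""
    return body.startswith(PDF_MAGIC)


def read_capped(
    chunks: Iterable[bytes],
    content_length: Optional[str],
    max_bytes: int = MAX_PDF_BYTES,
) -> bytes:
    """Reject on the declared Content-Length first, then enforce the cap while buffering.

    The in-loop check is an assumption; the description only mentions the pre-check.
    """
    if content_length and content_length.isdigit() and int(content_length) > max_bytes:
        raise ValueError("declared Content-Length exceeds the size cap")
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) > max_bytes:
            raise ValueError("download exceeded the size cap")
    return bytes(buf)
```

With `requests`, a streamed response could feed this helper as `read_capped(resp.iter_content(chunk_size=8192), resp.headers.get("Content-Length"))`. The resulting bytes would then go to `pypdf` for text extraction.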
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.